1. Introduction
In recent years, Unmanned Aerial Vehicles (UAVs) have been widely used in environmental monitoring, logistics, and precision agriculture [1,2,3]. Despite their success, single-UAV systems are inherently limited by payload capacity, mission coverage, and the inability to execute multiple tasks simultaneously, which constrains mission efficiency and scalability. Consequently, multi-UAV collaboration has emerged as a promising solution [4], offering higher efficiency, greater coverage, and improved task flexibility, and it has quickly been applied to power line inspection [5], crop protection [6], and disaster relief operations [7].
In multi-UAV systems, efficient path planning is essential to ensure safe, coordinated, and high-performance operations. However, path planning in three-dimensional continuous environments is particularly challenging due to the presence of obstacles and the need for inter-UAV coordination. Researchers have explored various methods for UAV path planning, including cell decomposition, roadmap approaches, potential fields, and evolutionary computation (EC) algorithms [8].
Specifically, cell decomposition methods are easy to implement but become inefficient as the grid resolution increases. Roadmap methods, such as visibility graphs and probabilistic roadmaps, can create dense graphs in cluttered environments, consuming more memory and computational time. Potential field methods are fast but easily fall into local optima, making them unreliable in complex scenarios. More details are provided in Section 2.
As an important branch of evolutionary computation, swarm intelligence optimization algorithms have shown strong performance in path planning. By simulating interactions among simple agents, they enable parallel computation and self-organizing global optimization. Among them, ant colony optimization (ACO) [9], inspired by the foraging behavior of ants, is particularly effective for global path planning and has been extended in various ways, including bidirectional search [10], multi-ant colony systems [11], and hybrid pheromone strategies [12]. However, traditional ACO algorithms are mainly designed for discrete problems and struggle with the computational demands and precision requirements of 3D continuous UAV path planning. To address this, Socha et al. [13] proposed ACOR, which uses continuous probability density functions for solution sampling. Building on ACOR, several variants have been developed, such as MACOR (multi-operator) [14], MACOR-LS (with local search) [15], and ACOSRAR (with an adaptive waypoint repair method) [16]. These extensions preserve the benefits of pheromone-guided search while improving accuracy and suitability for high-dimensional continuous spaces.
Despite progress in ACO variants for path planning, several limitations remain in real-world scenarios. First, most methods focus on single-UAV or discrete environments, with relatively limited research on multi-UAV fine-grained path planning in 3D continuous space. Second, current approaches often depend on fixed, scene-specific strategies, resulting in poor adaptability to different environments. Third, due to limited environmental information in practice, planning algorithms often rely on trial-and-error learning, reducing search efficiency, increasing infeasible waypoints, and adding computational overhead.
To overcome these limitations of existing ACO variants (limited multi-UAV research in 3D continuous space, poor adaptability to diverse environments, and inefficient search), this paper proposes a reinforcement-learning-driven multi-strategy continuous ant colony optimization algorithm (QMSR-ACOR). The primary contributions of this paper are as follows.
- (1)
A realistic multi-UAV path planning model is developed in a continuous 3D space, incorporating collaboration constraints between UAVs. This provides a practical framework for addressing complex path planning problems in continuous environments.
- (2)
A novel multi-strategy framework for multi-UAV path planning is proposed. It designs four new constructor strategies combined with two established walk strategies, and introduces a Q-learning-based mechanism to dynamically select the best combination of strategies. This significantly enhances the generality and adaptability of the algorithm in diverse scenarios.
- (3)
An elite waypoint repair mechanism is introduced based on existing repair strategies. By reducing redundant blind searches and preventing UAV collisions, it improves search efficiency and ensures safe path planning.
- (4)
Comprehensive experimental validation is conducted, including comparative and ablation studies in multiple scenarios and dimensions. The results demonstrate the effectiveness, efficiency, and adaptability of QMSR-ACOR compared to existing methods.
The remainder of the paper is structured as follows: Section 2 reviews traditional and evolutionary path planning algorithms. Section 3 defines the problem and the corresponding constraints. Section 4 details the proposed algorithm. Section 5 presents the experiments, and Section 6 concludes the paper with future research directions.
4. Reinforcement-Learning-Driven Multi-Strategy Continuous Ant Colony Optimization
This section presents the core algorithmic framework and the enhancements proposed in this paper. First, Section 4.1 introduces the standard continuous ant colony optimization (ACOR), which serves as the foundation for the proposed approach. Section 4.2 then reviews the basic principles of the Q-learning algorithm, highlighting its key components: the state set, the action set, and the reward mechanism. Building upon these foundations, Section 4.3 proposes a novel reinforcement-learning-driven multi-strategy continuous ant colony optimization (QMSR-ACOR). This enhanced approach integrates multiple construction and walk strategies, a waypoint repair mechanism with elite selection, and a Q-learning-based strategy selection module to improve performance in complex continuous optimization tasks.
4.1. Standard Continuous Ant Colony Optimization
Socha et al. [13] extended ant colony optimization to continuous problems with ACOR, which uses probabilistic modeling to sample new solutions from continuous distributions, unlike the finite selections of discrete ACO.
Let us consider a general continuous optimization problem, defined in Equation (20), where D is the continuous search space and f is the objective function to be minimized. ACOR maintains an archive of the k elite solutions found so far, sorted by their objective values. Each solution receives a weight derived from a Gaussian function capturing its quality, as defined in Equation (21), where l regulates the rate of weight decay: a smaller l concentrates the search on the best solutions, while a larger l encourages wider exploration.
Next, the selection probability of each archived solution is derived by normalizing its weight, as given in Equation (22).
Then, ACOR randomly selects a guiding solution according to these probabilities and generates a new candidate by sampling each dimension from a Gaussian distribution, whose mean is the guiding solution's value in that dimension and whose standard deviation is based on the average distance to the other solutions in the archive.
Specifically, a new solution is sampled as shown in Equation (23), where the mean in dimension d is the corresponding component of the guiding solution and a scaling parameter controls convergence: a larger value yields a wider search, while a smaller value gives faster convergence but a higher risk of premature stagnation.
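For reference, the standard ACOR formulation of Socha and Dorigo, which Equations (20)–(23) presumably follow (the paper's own notation may differ), can be summarized as follows, with q the weight-decay parameter and ξ the convergence-speed parameter:

```latex
% General problem (cf. Eq. (20)) and ACOR sampling (cf. Eqs. (21)-(23))
\min_{\mathbf{x} \in D} f(\mathbf{x}), \qquad D \subseteq \mathbb{R}^{n}
\qquad
\omega_l = \frac{1}{qk\sqrt{2\pi}} \exp\!\left( -\frac{(l-1)^2}{2q^2k^2} \right)
\qquad
p_l = \frac{\omega_l}{\sum_{j=1}^{k} \omega_j}
\qquad
x_d^{\mathrm{new}} \sim \mathcal{N}\!\left( s_l^{d},\;
  \xi \sum_{j=1}^{k} \frac{\lvert s_j^{d} - s_l^{d} \rvert}{k-1} \right)
```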
4.2. Q-Learning Framework and Components
This section briefly reviews the basic Q-learning framework and presents its key components, including the state set, the action set, and the reward mechanism.
4.2.1. Q-Learning Framework
Q-learning is a model-free reinforcement learning algorithm that learns optimal policies in Markov Decision Processes (MDPs) by interacting with the environment, and it has been widely applied in various fields [39,40,41]. It updates action-value (Q) functions using immediate rewards and state transitions, gradually converging to an optimal policy. During the learning process, the agent observes states, takes actions, and receives rewards that guide future decisions. A well-designed reward function is crucial: positive rewards reinforce good actions, while negative rewards discourage poor ones.
Q-learning balances exploration and exploitation: it first explores to learn the environment and then exploits the learned knowledge to improve performance. Its core update rule is defined in Equation (25), where s_t denotes the current state, a_t is the action taken in that state, r_t is the immediate reward received after performing the action, and s_{t+1} is the resulting state. The learning rate α controls how much new information updates the Q value, and the discount factor γ weighs immediate against future rewards.
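For reference, the standard tabular Q-learning update, to which Equation (25) presumably corresponds, is:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \big[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big]
```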
4.2.2. State Set
In multi-UAV path planning, the fitness function measures the quality of a solution. To better represent the search status of the population, the state design includes both the average fitness and the current best fitness, computed as in Equations (26)–(28), where the fitness of the ith solution in generation t is compared with its fitness in the initial generation, and the best fitness values of generations t and 1 are compared likewise. The two resulting ratios represent the normalized performance at the population level and at the individual level, and their combination reflects the overall state of the search. This combined state value is discretized into four intervals of width 0.1, and the interval into which it falls determines the discrete state used by the Q-learning agent.
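A minimal sketch of this state computation is given below. Since Equations (26)–(28) are not reproduced here, the exact normalization (ratios of current to initial average and best fitness, their combination, and the interval boundaries) is an assumption, and the function and variable names are illustrative.

```python
import numpy as np

def search_state(fitness, init_fitness, width=0.1):
    """Hypothetical sketch of the state design (cf. Equations (26)-(28)):
    normalized population-level and best-individual indicators computed
    against the initial generation, combined and then discretized into four
    intervals of the given width. The exact formulas may differ."""
    fitness = np.asarray(fitness, dtype=float)
    init_fitness = np.asarray(init_fitness, dtype=float)
    s_pop = fitness.mean() / init_fitness.mean()   # population-level indicator
    s_best = fitness.min() / init_fitness.min()    # best-individual indicator
    s = 0.5 * (s_pop + s_best)                     # overall search state
    improvement = max(0.0, 1.0 - s)                # progress relative to generation 1
    return min(int(improvement // width), 3)       # one of four discrete states
```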
4.2.3. Action Set
The action set includes eight combinations formed from four constructor selection strategies and two walk strategies. These combinations represent different ways to build and explore solutions; detailed explanations of each strategy are provided in Section 4.3. The reinforcement learning agent dynamically selects among these combinations to adaptively optimize both construction and exploration, thereby improving search efficiency and solution quality.
In our Q-learning implementation, the action set consists of eight distinct actions, formed by combining four constructor selection strategies with two walk strategies. In the code, these actions are represented as an indexed list, where each index corresponds to a unique combination.
During the first half of the iterations, the agent selects actions randomly to encourage exploration. In the second half, it follows a greedy policy to exploit the learned knowledge. Each selected action determines:
- (1)
Constructor selection strategy—guides how the partial solution is expanded.
- (2)
Walk strategy—defines how the partial solution is extended or mutated to generate a new solution.
After executing the chosen action, the agent observes the resulting state and receives a reward. The Q-table is then updated according to the standard Q-learning update rule. This design allows the agent to balance exploration of alternative paths with exploitation of high-reward strategies.
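A minimal sketch of this indexed action list and the two-phase selection rule follows. The Q-table shape, the learning rate, and the discount factor are illustrative assumptions, and the strategy implementations themselves are omitted.

```python
import random

import numpy as np

# Hypothetical sketch of the indexed action list and Q-table described above.
# Strategy names follow the paper; all other values and helpers are illustrative.
CONSTRUCTORS = ["HMGSC", "ADGSC", "ADGMC", "EGMC"]
WALKS = ["Gaussian", "Levy"]
ACTIONS = [(c, w) for w in WALKS for c in CONSTRUCTORS]  # 8 (constructor, walk) pairs

N_STATES = 4                                  # discretized search states (Section 4.2.2)
q_table = np.zeros((N_STATES, len(ACTIONS)))

def select_action(state, iteration, max_iter):
    """Random exploration in the first half of the run, greedy exploitation afterwards."""
    if iteration < max_iter // 2:
        return random.randrange(len(ACTIONS))
    return int(np.argmax(q_table[state]))

def update_q(state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Standard tabular Q-learning update (cf. Equation (25)); alpha and gamma are assumed values."""
    td_target = reward + gamma * float(np.max(q_table[next_state]))
    q_table[state, action] += alpha * (td_target - q_table[state, action])
```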
4.2.4. Reward Mechanism
The reward function is based on the changes in the average fitness and the best fitness of the population between consecutive generations, guiding the agent to learn the optimal combination of strategies. The reward is calculated using Equation (
29).
where
measures the improvement of the best individual and
evaluates the overall improvement of the population.
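Equation (29) itself is not reproduced above; one simple additive form consistent with this description (the paper's exact weighting may differ) is:

```latex
r_t =
\underbrace{\frac{f_{\mathrm{best}}^{\,t-1} - f_{\mathrm{best}}^{\,t}}{f_{\mathrm{best}}^{\,t-1}}}_{\text{best-individual improvement}}
+
\underbrace{\frac{\bar{f}^{\,t-1} - \bar{f}^{\,t}}{\bar{f}^{\,t-1}}}_{\text{population-average improvement}}
```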
4.3. Reinforcement-Learning-Driven Multi-Strategy Continuous Ant Colony Optimization
To improve the effectiveness of ACOR on complex continuous optimization problems, this paper proposes a reinforcement-learning-driven multi-strategy continuous ant colony optimization (QMSR-ACOR). Specifically, we design four new constructor strategies, integrate two walk strategies from existing studies, and adopt a waypoint repair strategy that incorporates the proposed elite selection mechanism. Finally, we apply reinforcement learning to adaptively select the best combination of strategies.
4.3.1. Constructor Selection Strategies
In this section, we introduce four solution construction strategies.
- (1)
Hybrid strategy mean-guided single-individual constructor
In high-dimensional and complex optimization problems, guiding vectors generated by a single selection strategy often fail to balance search diversity and solution quality. To overcome this, we propose a hybrid strategy mean-guided constructor (HMGSC). It simultaneously applies tournament selection, roulette wheel selection, and random selection to choose one individual from each, and generates the guiding vector by averaging these three individuals, as shown in Equation (32), where the three terms are the individuals selected by tournament, roulette wheel, and random selection, respectively. After obtaining the guiding vector, the constructor generates a new solution using a walk strategy, as given in Equation (33).
In this mechanism, tournament selection focuses on high-quality individuals to strengthen local exploitation, roulette wheel selection retains a chance of selecting weaker individuals to preserve population diversity, and random selection introduces larger jumps to enhance global exploration. By averaging individuals from these three strategies, the method reduces the bias of any single selection approach. The pseudo code of HMGSC is presented in Algorithm 2.
Algorithm 2: Solution construction in HMGSC.
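A minimal sketch of the HMGSC guiding-vector construction (cf. Equation (32)) is shown below, assuming a minimization problem. The tournament size, the roulette weighting, and all function names are illustrative and not taken from the paper.

```python
import random

import numpy as np

def hmgsc_guide(archive, fitness, tour_size=3):
    """Sketch of the HMGSC guiding vector: the average of one individual chosen
    by tournament selection, one by roulette wheel selection, and one at random.
    `archive` is an (N, n) array of solutions; lower fitness is better."""
    n = len(archive)
    fitness = np.asarray(fitness, dtype=float)
    # Tournament selection: best of a small random subset (local exploitation).
    cand = random.sample(range(n), tour_size)
    x_tour = archive[min(cand, key=lambda i: fitness[i])]
    # Roulette wheel selection on inverted fitness (preserves diversity).
    weights = fitness.max() - fitness + 1e-12
    x_roul = archive[np.random.choice(n, p=weights / weights.sum())]
    # Pure random selection (global exploration).
    x_rand = archive[random.randrange(n)]
    return (x_tour + x_roul + x_rand) / 3.0
```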
- (2)
Adaptive differential-guided single-individual constructor
To better balance global exploration and local exploitation, we design an adaptive differential-guided single constructor (ADGSC). It extends the differential evolution (DE) strategy and incorporates a diversity-driven mechanism to adaptively adjust the perturbation strength, improving both solution quality and adaptability. In ADGSC, the guiding vector is generated from two individuals selected from the archive by roulette wheel selection, as given in Equation (34), where a perturbation factor controls the step size of the differential vector. Instead of using a fixed value, ADGSC adaptively samples this factor from a normal distribution with a mean of 0.5 and a standard deviation determined by the current population diversity D (see Equations (35) and (36)).
The diversity indicator D is calculated by Equation (37) over the value of each individual in each dimension, where N is the population size and n is the number of decision variables. Once the guiding vector is obtained, a new solution is generated using a walk strategy, as in Equation (33).
This mechanism increases perturbation when diversity is high to enhance global exploration, and reduces it when diversity is low to improve local exploitation. It helps maintain a dynamic balance between search efficiency and solution quality. The pseudo code of ADGSC is shown in Algorithm 3.
Algorithm 3: Solution construction in ADGSC.
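A minimal sketch of the ADGSC idea is given below, assuming a DE-style guiding vector of the form x1 + F·(x1 − x2) and a diversity indicator equal to the mean absolute deviation of the population; the exact forms of Equations (34)–(37), the diversity-to-sigma mapping, and all names are assumptions.

```python
import numpy as np

def diversity(pop):
    """Diversity indicator D (cf. Equation (37)): average absolute deviation of the
    population from its dimension-wise mean (a common choice; the paper's exact
    formula may differ)."""
    pop = np.asarray(pop, dtype=float)
    return float(np.mean(np.abs(pop - pop.mean(axis=0))))

def adgsc_guide(pop, fitness, rng=None):
    """Sketch of the ADGSC guiding vector: a DE-style combination of two
    roulette-selected individuals with an adaptive perturbation factor
    F ~ N(0.5, sigma), where sigma grows with population diversity."""
    rng = rng or np.random.default_rng()
    pop = np.asarray(pop, dtype=float)
    fitness = np.asarray(fitness, dtype=float)
    weights = fitness.max() - fitness + 1e-12          # roulette weights (minimization)
    p = weights / weights.sum()
    i, j = rng.choice(len(pop), size=2, replace=False, p=p)
    sigma = 0.1 + diversity(pop)                       # assumed mapping to the std. dev.
    f_factor = rng.normal(0.5, sigma)                  # adaptive perturbation factor
    return pop[i] + f_factor * (pop[i] - pop[j])
```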
- (3)
Adaptive differential-guided multi-individual constructor
To improve diversity, stability, and solution quality at different stages, we propose an adaptive differential-guided multi-individual constructor (ADGMC), building on ADGSC. Unlike ADGSC, which creates one guiding vector, ADGMC generates multiple guiding vectors and candidate solutions in parallel, enabling a more robust and multidirectional search. Specifically, in each construction phase, ADGMC generates three guiding vectors, each representing a distinct search direction:
First, the convergence-guided vector is constructed from the current best individual and a randomly selected individual from the population, as given in Equation (38). Second, the diversity-guided vector is generated from two individuals selected from the population by roulette wheel selection, as given in Equation (39). Third, the exploration-guided vector is constructed from two randomly selected individuals, as given in Equation (40).
All guiding vectors share the same adaptive mechanism for determining the perturbation factor as ADGSC (Equations (35) and (36)). Finally, the three guiding vectors are used in a random walk strategy to generate three candidate solutions, as given in Equation (41).
By integrating convergence, diversity, and exploration vectors, ADGMC achieves a balanced trade-off between local exploitation and global exploration, thereby enhancing the comprehensiveness and robustness of the solution construction process. The pseudo code of ADGMC is shown in Algorithm 4.
Algorithm 4: Solution construction in ADGMC.
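A minimal sketch of the three ADGMC guiding vectors (cf. Equations (38)–(40)) follows; the exact differential forms are assumptions consistent with the description, and the shared perturbation factor is assumed to come from the same adaptive mechanism as in ADGSC.

```python
import numpy as np

def adgmc_guides(pop, fitness, f_factor, rng=None):
    """Sketch of the convergence-, diversity-, and exploration-guided vectors."""
    rng = rng or np.random.default_rng()
    pop = np.asarray(pop, dtype=float)
    fitness = np.asarray(fitness, dtype=float)
    weights = fitness.max() - fitness + 1e-12
    p = weights / weights.sum()
    rand = lambda: pop[rng.integers(len(pop))]          # uniform random individual
    roul = lambda: pop[rng.choice(len(pop), p=p)]       # roulette-selected individual
    best = pop[int(np.argmin(fitness))]
    g_conv = best + f_factor * (best - rand())          # convergence-guided vector
    x1, x2 = roul(), roul()
    g_div = x1 + f_factor * (x1 - x2)                   # diversity-guided vector
    x3, x4 = rand(), rand()
    g_expl = x3 + f_factor * (x3 - x4)                  # exploration-guided vector
    return g_conv, g_div, g_expl
```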
- (4)
Elite genetic multi-individual constructor
Based on the “selection–recombination” idea in genetic algorithms, we design the elite genetic multi-individual constructor (EGMC). It uses tournament selection to improve robustness and accelerate convergence by promoting high-quality individuals. Compared with roulette or random selection, tournament selection relies on local competition, making it more effective in complex or dynamic environments. During each construction cycle, EGMC combines elite guidance and tournament selection to generate three guiding vectors, each responsible for directing the search in a different direction. The specific construction methods are as follows.
First, the convergence-guided vector is constructed from the current best individual and a randomly selected individual, as given in Equation (42). Second and third, the competitive-guided vectors are generated from four individuals chosen by tournament selection; each pair of relatively superior individuals is used to construct one guiding vector, as given in Equations (43) and (44).
These vectors emphasize the genetic propagation of high-quality individuals through local competition, promoting effective exploration of elite regions while avoiding excessive reliance on the global best individual. The pseudo code of EGMC is shown in Algorithm 5.
Algorithm 5: Solution construction in EGMC.
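A minimal sketch of the three EGMC guiding vectors (cf. Equations (42)–(44)) is given below; the differential forms and the tournament size are illustrative assumptions.

```python
import numpy as np

def egmc_guides(pop, fitness, f_factor, tour_size=3, rng=None):
    """Sketch of one convergence vector built from the current best individual
    and two competitive vectors built from tournament winners."""
    rng = rng or np.random.default_rng()
    pop = np.asarray(pop, dtype=float)
    fitness = np.asarray(fitness, dtype=float)

    def tournament():
        cand = rng.choice(len(pop), size=tour_size, replace=False)
        return pop[cand[int(np.argmin(fitness[cand]))]]

    best = pop[int(np.argmin(fitness))]
    rand = pop[rng.integers(len(pop))]
    g_conv = best + f_factor * (best - rand)             # convergence-guided vector
    t1, t2, t3, t4 = (tournament() for _ in range(4))
    g_comp1 = t1 + f_factor * (t1 - t2)                  # competitive-guided vector 1
    g_comp2 = t3 + f_factor * (t3 - t4)                  # competitive-guided vector 2
    return g_conv, g_comp1, g_comp2
```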
4.3.2. Random Walk Strategy
In ACOR, new solutions are sampled from a Gaussian distribution, leading to small-step random walks similar to Brownian motion. This often limits exploration and increases the risk of premature convergence, especially in the later stages. To address this, prior work introduced the Lévy flight [14], a stochastic process whose step lengths follow a heavy-tailed power-law distribution (commonly written as L(s) ∼ s^(−1−β) with 0 < β ≤ 2). Unlike Gaussian walks, Lévy flight allows occasional long jumps, enhancing global exploration and helping the algorithm escape local optima. Its effectiveness has been demonstrated in metaheuristics such as DE [44] and PSO [45]. The new solution is generated by Equation (45), where the step length is drawn from the Lévy distribution.
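Lévy-distributed steps are commonly generated with Mantegna's algorithm, sketched below; the value of β and the overall scaling in Equation (45) are not specified above and are assumed here.

```python
import numpy as np
from math import gamma, pi, sin

def levy_step(n_dims, beta=1.5, rng=None):
    """Lévy-distributed step lengths via Mantegna's algorithm."""
    rng = rng or np.random.default_rng()
    sigma_u = (gamma(1 + beta) * sin(pi * beta / 2) /
               (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma_u, n_dims)
    v = rng.normal(0.0, 1.0, n_dims)
    return u / np.abs(v) ** (1 / beta)

# Illustrative use (not the paper's exact Equation (45)): perturb a guiding vector.
# new_solution = guide + 0.01 * levy_step(guide.size) * (guide - best_solution)
```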
4.3.3. Elite Waypoint Repair Strategy
In path planning, obstacles and constraints divide the search space into feasible and infeasible regions, and generated waypoints may fall into infeasible areas, making a path invalid. To correct these invalid points efficiently, we adopt a waypoint repair mechanism [16], enhanced with an elite repair strategy that selectively repairs only part of the population to reduce computational cost.
In Figure 2, the first panel shows two UAVs starting from different points with initial paths crossing cylindrical obstacles. These initial trajectories are generated without considering obstacle constraints, resulting in infeasible segments where the UAVs would collide with obstacles. The second panel provides a top view of the path–obstacle intersections. Once these points are identified as infeasible, a two-step relocation strategy moves them to nearby feasible locations. Specifically, the third panel shows the infeasible waypoints of one UAV being shifted along the x-axis into nearby feasible intervals, while those of the other UAV are moved randomly along the y-axis into feasible intervals between the obstacles.
Finally, the fourth panel shows the outcome of the repair strategy. After the relocation of infeasible waypoints, both the original and repaired paths are reconnected using smooth curved segments, ensuring that the resulting trajectories are not only collision-free but also continuous and feasible for UAV navigation. This process improves the overall path quality and the reliability of UAV operations in complex environments with dense obstacles. The pseudo code is shown in Algorithm 6.
Algorithm 6: Elite waypoint repair strategy.
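A minimal sketch of the waypoint repair idea for cylindrical obstacles is shown below. The paper relocates infeasible points along the x- or y-axis into feasible intervals; this sketch uses a simpler radial push for brevity, the safety margin is an assumption, and in the elite variant only a fraction of the population would be repaired.

```python
import numpy as np

def repair_path(path, obstacles, margin=5.0, rng=None):
    """Detect waypoints inside cylindrical obstacles and relocate them nearby.
    `path` is an (n_wp, 3) array of (x, y, z); each obstacle is (cx, cy, radius, height)."""
    rng = rng or np.random.default_rng()
    repaired = np.asarray(path, dtype=float).copy()
    for i, (x, y, z) in enumerate(repaired):
        for cx, cy, radius, height in obstacles:
            if z <= height and (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                # Push the waypoint radially to just outside the obstacle footprint.
                d = np.hypot(x - cx, y - cy) + 1e-9
                scale = (radius + margin) / d
                repaired[i, 0] = cx + (x - cx) * scale
                repaired[i, 1] = cy + (y - cy) * scale
                # Small random offset so neighbouring repaired points do not coincide.
                repaired[i, :2] += rng.uniform(-1.0, 1.0, size=2)
                break
    return repaired
```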
4.3.4. Reinforcement-Learning-Driven Multi-Strategy Continuous Ant Colony Optimization for Path Planning
The proposed algorithm is built on the Q-learning framework, treating the population of ACOR as a reinforcement learning agent. During multi-UAV path planning, it adaptively optimizes solutions by selecting among multiple construction and walk strategies. Specifically, the action set consists of eight strategy combinations formed by pairing the four solution construction strategies with the two walk strategies: HMGSC + Gaussian Mutation, ADGSC + Gaussian Mutation, ADGMC + Gaussian Mutation, EGMC + Gaussian Mutation, HMGSC + Lévy Flight, ADGSC + Lévy Flight, ADGMC + Lévy Flight, and EGMC + Lévy Flight.
In early training, actions are randomly chosen to encourage exploration. Later, actions with the highest Q values are selected to exploit the learned experience. After each action, a reward is calculated based on changes in average and best fitness, and the Q values are updated accordingly. Then, the algorithm uniformly applies the elite waypoint repair strategy to correct invalid waypoints in the path, thereby improving the feasibility and robustness of the solution.
This learning mechanism helps the algorithm discover effective combinations of strategies, improving both search efficiency and path quality. It is especially suitable for complex, dynamic multi-UAV planning tasks. The pseudo code of QMSR-ACOR is shown in Algorithm 7.
Algorithm 7: Pseudo-code of QMSR-ACOR.
4.3.5. Time Complexity Analysis
To evaluate the efficiency of the proposed QMSR-ACOR in solving multi-UAV path planning problems, its time complexity is analyzed using big-O notation. In the following, N denotes the population size, n_w the number of waypoints per UAV, n_o the number of obstacles, n_u the number of UAVs, and T the maximum number of iterations.
The algorithm consists of two main stages. The first stage is population initialization, which generates initial solutions for all UAVs; its time complexity is approximately O(N · n_u · n_w), since each UAV in each individual requires generating n_w waypoints.
The second stage is the iterative optimization process, repeated for each iteration, which includes four main components:
- (1)
Solution construction and mutation: each ant constructs a new solution based on the selected strategies, and updating every coordinate using Gaussian mutation or Lévy flight has a complexity of O(N · n_u · n_w).
- (2)
Repair operation: for infeasible paths, each UAV's trajectory is repaired with respect to the obstacles, with a time complexity of O(N · n_u · n_w · n_o).
- (3)
Collision detection between UAVs: to avoid inter-UAV collisions, pairwise checks over all UAVs are performed for each path, with a complexity of O(N · n_u² · n_w). Since n_u is relatively small in practice, this term does not dominate the overall computation.
- (4)
Fitness evaluation and global best update: fitness evaluation involves computing path length, obstacle avoidance, and flight constraints for all UAVs (O(N · n_u · n_w)), while updating the global best solution requires sorting, with a complexity of O(N log N).
The Q-learning component only involves table lookups and updates, which are negligible compared to the ACOR operations.
In summary, the total time complexity of QMSR-ACOR is O(T · (N · n_u · n_w · (1 + n_o + n_u) + N log N)).
When the population size N is much larger than the numbers of waypoints n_w and obstacles n_o, and the number of UAVs n_u is small, the time complexity simplifies to approximately O(T · N log N), which is consistent with the traditional ACOR family and indicates that the integration of Q-learning and multi-UAV collision checking does not significantly increase the overall computational burden.
5. Experimental Results and Analysis
5.1. Experimental Setting
To evaluate the performance and robustness of QMSR-ACOR in multi-UAV path planning, two groups of experiments were carried out in a simulated environment built from digital elevation model (DEM) data to replicate realistic terrain.
The first set tested the algorithm on different scales and complexities of the problem by varying the number of UAVs (2 to 6) and obstacles (5 to 12).
Table A1 and Table A2 list the UAV IDs, start/end points, and environmental details. Obstacles have a fixed height of 450 m. QMSR-ACOR was compared with six advanced algorithms: ACOSRAR [16], EEFO [34], HLOA [35], MACOR [14], SPSO [27], and ACOR [13]. Three problem dimensions, each with eight cases, were tested independently 30 times. The parameter settings are shown in Table 1. Performance results and success rates are shown in Table 2, Table 3 and Table 4. Figure A1 shows the convergence curves (10D), and Figure A2 shows the optimal paths found by QMSR-ACOR. The second set of experiments analyzed the internal mechanisms of QMSR-ACOR by evaluating different combinations of its eight component strategies to assess their contribution to path quality and optimization performance.
All experiments used consistent settings: 30 independent runs per case, with average and best path costs recorded. Statistical significance was verified using the Wilcoxon rank sum test (* marks 5% significance). The maximum number of iterations was 200, and the population size was 400. The best results are highlighted in bold.
5.2. The Comparison to Different Algorithms
This section presents a comprehensive comparison of QMSR-ACOR with the six baseline algorithms on test problems of increasing dimensionality: 10D, 20D, and 30D. The evaluation metrics are the success rate and the average path cost, measured on eight benchmark cases for each dimension.
5.2.1. Comparison of Different Algorithms on 10D Cases
As shown in Table 2, we compared the seven algorithms on eight test cases in the 10D (30-parameter) path planning problem, using the success rate and the average route cost as metrics. In the simpler cases (Case 1–Case 4), most algorithms (MACOR, SPSO, EEFO, HLOA, ACOR) achieved a 100% success rate, indicating their effectiveness in low-complexity scenarios. However, their performance dropped significantly in the more complex cases (Case 5–Case 8), with some success rates falling to 0, highlighting weaknesses in solving high-dimensional constrained problems.
Table 2. Comparison results on 10D cases. For each case, the first row reports the average path cost (mean ± standard deviation) and the second row reports the success rate.

Case | MACOR [14] | SPSO [27] | EEFO [34] | HLOA [35] | ACOR [13] | ACOSRAR [16] | QMSR-ACOR
---|---|---|---|---|---|---|---
1 | 1598 ± 40 | 1783 ± 27 | 2134 ± 137 * | 2538 ± 121 * | 1776 ± 428 | 1521 ± 40 | 1690 ± 12 |
| 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
2 | 2891 ± 158 | 3152 ± 132 * | 3836 ± 253 * | 4547 ± 182 * | 3385 ± 800 * | 2867 ± 246 | 2462 ± 16 |
| 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
3 | 3240 ± 764 * | 3268 ± 134 * | 3249 ± 121 * | 4111 ± 237 * | 8219 ± 755 * | 3159 ± 415 * | 2516 ± 127 |
| 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
4 | 6872 ± 782 * | 4721 ± 230 * | 5950 ± 490 * | 7004 ± 523 * | 18,135 ± 681 * | 5452 ± 367 * | 4080 ± 177 |
| 1.00 | 1.00 | 1.00 | 1.00 | 0.80 | 1.00 | 1.00 |
5 | 18,215 ± 2434 * | 19,864 ± 1115 * | 18,585 ± 3105 * | 25,852 ± 2697 * | 26,309 ± 3693 * | 13,699 ± 1793 * | 8195 ± 516 |
| 0.80 | 0.60 | 0.80 | 0.20 | 0.20 | 1.00 | 1.00 |
6 | 22,390 ± 3217 * | 32,955 ± 2510 * | 43,937 ± 2888 * | - | - | 22,350 ± 1699 * | 13,920 ± 1707 |
| 0.20 | 0.17 | 0.13 | 0.00 | 0.00 | 1.00 | 1.00 |
7 | 21,360 ± 1849 * | 37,032 ± 1993 * | - | - | - | 46,927 ± 3259 * | 16,433 ± 2621 |
| 0.10 | 0.07 | 0.00 | 0.00 | 0.00 | 0.73 | 0.83 |
8 | 44,396 ± 5232 * | - | - | - | - | 54,711 ± 7340 * | 23,717 ± 2786 |
| 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | 0.57 | 0.70 |
In contrast, ACOSRAR and QMSR-ACOR maintained success rates above 50% in all cases, benefiting from their adaptive waypoint repair mechanism, which detects and corrects infeasible points during the search process. As the obstacle density increased, these two algorithms continued to find feasible paths, demonstrating strong robustness.
Between the two, QMSR-ACOR consistently achieved higher success rates, especially in the highly constrained environments (e.g., Case 7 and Case 8). It also outperformed ACOSRAR in path cost from Case 2 to Case 8, with the performance gap widening on the more difficult problems. This advantage comes from its Q-learning-based constructor selection, which adaptively chooses among four strategies (HMGSC, ADGSC, ADGMC, EGMC) to balance exploration and exploitation. Combined with the repair mechanism, QMSR-ACOR ensures both feasibility and path quality under challenging conditions.
As shown in Figure A1, the convergence curves for Case 1 to Case 8 of the 10D path planning problem show that QMSR-ACOR consistently achieves superior convergence performance across all scenarios. As environmental complexity increases, especially from Case 5 onwards due to more UAVs and denser obstacles, many algorithms tend to generate infeasible solutions in early iterations. The high penalty for obstacle collisions (up to 10,000) leads to a sharp increase in initial path costs, creating a noticeable ‘initial jump’ that degrades the quality of the initial solutions and slows overall convergence. Although SPSO, EEFO, and HLOA can produce good initial paths and show strong early-stage performance, their effectiveness drops significantly in complex environments; some even fail to find feasible solutions entirely, indicating that they are better suited to simpler scenarios.
5.2.2. Comparison of Different Algorithms on 20D Cases
As shown in Table 3, when the problem dimension increases from 10D to 20D (i.e., the number of parameters increases from 30 to 60), the search space and the uncertainty of the path planning problem grow substantially. This significantly raises the difficulty of finding feasible paths, and the path cost increases accordingly.
In these high-dimensional scenarios, MACOR achieves 100% success in the low- to medium-complexity cases (Case 1–Case 4), but its success rate drops dramatically in Cases 5–7 to 0.57, 0.23, and 0.06, and it fails completely in Case 8, indicating that its search mechanism tends to fall into infeasible regions and lacks the ability to escape local optima. SPSO maintains 100% success in the simple cases but decreases rapidly to 0.13, 0.10, 0.00, and 0.00 in the later cases, as its velocity–position update mechanism struggles to maintain direction in complex terrain and narrow feasible regions. EEFO and HLOA are designed, in principle, to balance exploration and exploitation, but prove unstable under high-dimensional constraints, with success rates below 0.2 or no feasible paths at all from Case 5 onwards, highlighting their difficulty in generating feasible paths.
The original ACOR maintains reasonable performance in lower-dimensional problems but shows clear limitations under 20D conditions. From Case 5 onwards, its success rate drops drastically to 0.23, 0.00, 0.00, and 0.00, highlighting that its elite-solution-guided sampling strategy is prone to falling into “infeasibility traps” under high-dimensional constraints and lacks mechanisms for path correction and diversity enhancement.
In contrast, ACOSRAR and QMSR-ACOR exhibit strong stability and adaptability even as dimensionality increases. In particular, in the four most constrained scenarios (Case 5–Case 8), both algorithms maintain high success rates. Among them, QMSR-ACOR continues to demonstrate a significant advantage in path cost due to its Q-learning-based constructor selection mechanism and adaptive waypoint repair strategy, achieving both robustness and high quality of the solution under complex constraints.
Table 3. Comparison results on 20D cases. For each case, the first row reports the average path cost (mean ± standard deviation) and the second row reports the success rate.

Case | MACOR [14] | SPSO [27] | EEFO [34] | HLOA [35] | ACOR [13] | ACOSRAR [16] | QMSR-ACOR
---|---|---|---|---|---|---|---
1 | 2922 ± 78 * | 2511 ± 35 | 2790 ± 260 * | 2556 ± 117 | 2482 ± 211 | 1942 ± 155 | 2887 ± 90 |
| 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
2 | 5483 ± 311 * | 4868 ± 524 * | 5489 ± 353 * | 6671 ± 416 * | 5915 ± 122 * | 5250 ± 124 | 4804 ± 745 |
| 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
3 | 7198 ± 866 * | 6019 ± 541 * | 7815 ± 748 * | 9015 ± 541 * | 9944 ± 873 * | 5627 ± 545 * | 4746 ± 610 |
| 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
4 | 10,518 ± 947 * | 10,124 ± 1454 * | 13,594 ± 2132 * | 15,597 ± 2654 * | 18,310 ± 2248 * | 10,758 ± 800 * | 8138 ± 854 |
| 1.00 | 0.77 | 0.87 | 0.77 | 0.60 | 1.00 | 1.00 |
5 | 22,795 ± 4562 * | 23,261 ± 3422 * | 25,323 ± 4100 * | 29,554 ± 3337 * | 33,710 ± 2849 * | 17,045 ± 2113 * | 12,125 ± 1022 |
| 0.57 | 0.13 | 0.17 | 0.06 | 0.23 | 0.93 | 0.93 |
6 | 28,583 ± 3798 * | 36,487 ± 3091 * | 50,010 ± 4587 * | - | - | 25,714 ± 5571 * | 19,590 ± 1636 |
| 0.23 | 0.10 | 0.10 | 0.00 | 0.00 | 0.80 | 0.87 |
7 | 29,749 ± 6523 * | - | - | - | - | 50,397 ± 4440 * | 22,848 ± 2472 |
| 0.06 | 0.00 | 0.00 | 0.00 | 0.00 | 0.53 | 0.63 |
8 | - | - | - | - | - | 59,430 ± 5388 * | 29,880 ± 2261 |
| 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.53 | 0.53 |
5.2.3. Comparison of Different Algorithms on 30D Cases
As shown in Table 4, in the 30D high-dimensional path planning problem, MACOR, SPSO, EEFO, HLOA, and ACOR all maintained a 100% success rate in Case 1 and Case 2, but performance degradation became apparent from Case 3, where the success rates of HLOA and ACOR began to decline. From Case 4 onwards, HLOA and ACOR failed to generate feasible paths, with their success rates dropping to zero, and the remaining algorithms also deteriorated sharply, indicating that these methods struggle to remain stable in complex high-dimensional scenarios and revealing weak adaptability under such conditions.
In contrast, ACOSRAR performs relatively stably from Case 1 to Case 5, with its success rate consistently remaining above 0.5. Although its success rate gradually decreased in Cases 6–8, it still outperformed most other algorithms, indicating a certain degree of robustness. However, its average path cost in the high-complexity scenarios (Cases 6–8) remained relatively high, suggesting room for improvement in search precision.
The most notable performance was achieved by QMSR-ACOR. Guided dynamically by its Q-learning-based constructor selection mechanism, it intelligently chooses the most suitable construction strategy during path generation. As a result, QMSR-ACOR consistently maintained a high success rate in all test cases (the lowest being 0.33), showing a strong feasibility-seeking capability even under complex, high-dimensional constraints. Furthermore, its average cost in Cases 3–8 was significantly lower than that of the other algorithms, demonstrating superior global search capability and effective control of path quality. QMSR-ACOR therefore exhibits outstanding stability and efficiency in high-dimensional path planning problems, significantly outperforming both the traditional and the improved algorithms, and proves to be an effective solution for path planning in large-scale, complex environments.
Table 4. Comparison results on 30D cases. For each case, the first row reports the average path cost (mean ± standard deviation) and the second row reports the success rate.

Case | MACOR [14] | SPSO [27] | EEFO [34] | HLOA [35] | ACOR [13] | ACOSRAR [16] | QMSR-ACOR
---|---|---|---|---|---|---|---
1 | 4001 ± 155 * | 4716 ± 415 * | 5201 ± 316 * | 5038 ± 279 * | 4998 ± 407 * | 2388 ± 299 | 4499 ± 372 * |
| 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
2 | 7186 ± 344 | 8296 ± 397 * | 9914 ± 518 * | 8727 ± 534 * | 9252 ± 3619 * | 8192 ± 537 | 6813 ± 470 |
| 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
3 | 13,857 ± 2156 * | 9944 ± 990 * | 11,209 ± 955 * | 13,550 ± 2210 * | 15,380 ± 2437 * | 9362 ± 838 * | 8824 ± 525 |
| 1.00 | 1.00 | 1.00 | 0.87 | 0.80 | 1.00 | 1.00 |
4 | 22,656 ± 6100 * | 22,470 ± 1901 * | 21,594 ± 1219 * | - | - | 16,398 ± 1215 * | 15,346 ± 572 |
| 0.80 | 0.63 | 0.50 | 0.00 | 0.00 | 0.93 | 1.00 |
5 | 33,925 ± 5793 * | - | - | - | - | 22,967 ± 1138 * | 19,567 ± 893 |
| 0.57 | 0.00 | 0.00 | 0.00 | 0.00 | 0.57 | 0.90 |
6 | - | - | - | - | - | 30,988 ± 1825 * | 24,580 ± 1292 |
| 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.30 | 0.40 |
7 | - | - | - | - | - | 53,001 ± 4107 * | 27,111 ± 2965 |
| 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 | 0.33 |
8 | - | - | - | - | - | 59,999 ± 5065 * | 34,304 ± 3389 |
| 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.20 | 0.33 |
In general, the results confirm that QMSR-ACOR provides robust and reliable convergence across varying complexities, thanks to its Q-learning-based strategy selection and elite waypoint repair mechanism.
5.2.4. Computational Time
To evaluate the computational complexity of the algorithms, we measured the time each algorithm required to solve the 10D UAV path planning problems. Specifically, we recorded the total execution time for each algorithm using 10,000 fitness evaluations and averaged the results over 30 independent runs. The results are summarized in
Table 5.
From Table 5, several observations can be made. First, as the case number increases, the execution time of all algorithms rises, mainly because more UAVs and obstacles lead to larger problem scales and higher computational burdens. Second, SPSO, EEFO, and HLOA generally exhibit lower execution times than the other algorithms, benefiting from their relatively simple update mechanisms. Third, the execution times of MACOR, ACOR, ACOSRAR, and the proposed QMSR-ACOR are higher, since constructing new solutions requires additional sampling and ranking operations. Among them, QMSR-ACOR achieves computational times comparable to MACOR and ACOSRAR, indicating that the incorporation of the proposed strategies does not lead to a significant increase in computational cost. Finally, although SPSO runs faster in most cases, it sacrifices solution quality in complex UAV path planning scenarios, whereas QMSR-ACOR strikes a better balance between execution time and solution quality, making it a more competitive choice in practice.
5.3. The Impacts of Various Strategies
This study designed eight strategies by combining the four constructor strategies with the two walk strategies. The combinations HMGSC + Gaussian Mutation, ADGSC + Gaussian Mutation, ADGMC + Gaussian Mutation, EGMC + Gaussian Mutation, HMGSC + Lévy Flight, ADGSC + Lévy Flight, ADGMC + Lévy Flight, and EGMC + Lévy Flight correspond to Strategies 1–8, respectively. The results are summarized in Table 6, and the convergence curves of the different strategies are illustrated in Figure A3.
As shown in Table 6 and Figure A3, the type of constructor significantly affects algorithm performance. Among the eight strategies, those using ADGSC (Strategies 2 and 6) and EGMC (Strategies 4 and 8) consistently achieved lower path costs, especially in the complex scenarios (Cases 4–8), indicating strong search ability and better adaptability. In contrast, HMGSC (Strategies 1 and 5) and ADGMC (Strategy 7) were unable to find feasible routes in the highly constrained cases (Cases 5–8), showing weaker robustness.
Regarding walk strategies, the Gaussian mutation (Strategies 1–4) generally performed better than the Lévy flight (Strategies 5–8). This is particularly evident when comparing the same constructor under different walk strategies, such as Strategy 2 vs. 6 and Strategy 4 vs. 8. Although Lévy flight offers strong global search ability, applying it from the early stages introduces high variability, which may disrupt promising individuals and reduce convergence stability. Thus, Lévy flight is more effective when applied in later search stages to help escape local optima, rather than throughout the entire process. In particular, Strategies 5, 7, and 8 sometimes escaped local optima, but their overall path quality and stability were inferior to Strategies 1–4. This suggests that the Gaussian mutation provides better guidance and reliability in most scenarios.
In general, the combination of ADGSC or EGMC with Gaussian mutation (Strategies 2 and 4) delivered the best performance in terms of both convergence speed and solution quality. Furthermore, QMSR-ACOR achieved the lowest path cost in most cases, maintaining high robustness even under complex conditions (e.g., Cases 6–8). Its Q-learning mechanism enables dynamic selection of constructor–walk combinations based on current search states, balancing exploration and exploitation. This adaptability helps prevent premature convergence and improves overall path planning performance.
6. Conclusions
This paper proposes a reinforcement-learning-driven multi-strategy continuous ant colony optimization, namely QMSR-ACOR, to address the complex problem of multi-UAV path planning in 3D continuous environments. The proposed method incorporates a Q-learning-based strategy selector that adaptively chooses among eight path generation strategies to improve environmental adaptability. In addition, an elite waypoint repair mechanism is introduced to improve obstacle avoidance efficiency and reduce the generation of infeasible paths. Experimental results demonstrate that QMSR-ACOR not only achieves superior performance in path quality and computational efficiency, but also exhibits strong generalization across different scenarios compared to existing state-of-the-art algorithms.
The main contributions of this work are threefold: (1) modeling the multi-UAV path planning problem in 3D continuous space with realistic collaborative constraints; (2) integrating reinforcement learning into the strategy selection process to improve adaptability and robustness; and (3) enhancing search efficiency via an elite waypoint repair mechanism that guides UAVs away from invalid regions.
Although the proposed method shows promising results, there are still several directions worth exploring in future work. First, the current model assumes static obstacles and prior knowledge of the environment. In future research, incorporating online learning and real-time perception to handle partially known or dynamic environments will be critical. Second, communication and coordination strategies among UAVs can be further optimized to enhance collaborative efficiency under stricter constraints. Third, we plan to extend our study beyond simulations by conducting small-scale hardware experiments with multiple UAVs, which will provide further validation of the proposed method in real-world scenarios. Overall, QMSR-ACOR provides a viable and extensible solution to multi-UAV path planning in challenging environments and lays the foundation for further exploration into intelligent swarm-based navigation systems.