1. Introduction
With the increasing frequency of natural disasters, the security and stability of human society face severe challenges [1]. In recent years, post-disaster relief has become an important research field [2]. In some severely affected regions, high-risk factors such as structural collapse and leakage of hazardous substances make it difficult for rescue workers to enter the disaster area safely [3]. The development of robot technology provides a new solution for post-disaster rescue. Rescue robots can replace or assist humans in search, positioning, material transportation, and other tasks in dangerous environments [4]. However, post-disaster rescue tasks are usually numerous and widely distributed, and a single robot can hardly meet actual needs. Multiple robots, in contrast, can cooperate to complete complex rescue tasks efficiently [5]. Therefore, how to effectively design rescue sequences for multiple robots in post-disaster rescue is an urgent problem to be solved.
Currently, multi-robot path planning (MRPP) research mainly focuses on the shortest path or the minimum energy consumption and has made significant progress [6,7]. Many researchers model these problems as variants of the VRP to optimize robot paths. For example, Cui et al. [8] introduced a mathematical model for the rescue vehicle routing problem with time windows and limited survival time, which aims to minimize the total distance traveled by helicopters while considering time windows, survivor life strength, and the capacities of transport and medical helicopters. Lin et al. [9] designed a grouping method for disaster environments that minimizes task failures and rescue time based on task locations and time constraints. Geng et al. [10] proposed a modified PSO algorithm to maximize the number of successfully rescued survivors by assigning tasks to robots under uncertain survival times, ensuring that each robot operates within time constraints and optimizes task efficiency. Huang et al. [11] proposed a model that minimizes both path cost and time cost so that tasks are completed in the shortest time while each robot carries enough supplies for its assigned tasks. Wang et al. [12] developed an improved IG algorithm to maximize the number of rescued survivors by considering life-strength constraints and the time robots take to reach each survivor. Although existing research often considers optimization objectives such as path length, time, and the number of robots, most studies overlook the critical factor of task utility. Task utility refers to the rescue effect of the timely delivery of emergency supplies, and it decays exponentially over time. If emergency supplies are delivered promptly and assistance is provided, the task utility is extremely high; if delivery is delayed, even intact supplies lose utility value. Therefore, path planning should not only minimize distance and time but also maximize task utility to enhance rescue effectiveness. Additionally, since rescue robots have limited battery capacity, path planning must account for battery usage to ensure that robots can complete their tasks. Based on the traditional VRP model, this study therefore establishes a new problem model that considers task timeliness and the robots' battery constraint. The optimization objective combines path length, task utility, number of robots, and rescue time, and efficient path planning is achieved by minimizing this objective function.
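The time-decaying task utility described above can be sketched as a short function. The exponential form follows the text, but the `decay_rate` parameter and the specific functional shape are illustrative assumptions, since the paper only states that utility decays exponentially once delivery is late.

```python
import math

def task_utility(u0, arrival_time, ready_time, decay_rate=0.1):
    """Exponentially decaying task utility (decay_rate is a hypothetical
    parameter): full value u0 for on-time delivery, decaying with delay."""
    delay = max(0.0, arrival_time - ready_time)
    return u0 * math.exp(-decay_rate * delay)

# On-time delivery keeps the full utility; a late delivery loses value
# even though the supplies themselves are intact.
on_time = task_utility(79, arrival_time=5, ready_time=6)
late = task_utility(79, arrival_time=16, ready_time=6)
```

A robot arriving before the deadline retains the full utility, while a ten-unit delay shrinks it, which is exactly the effect the objective function is meant to penalize.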
The MRPP problem is NP-hard, as demonstrated by reductions from well-established NP-hard problems such as the vehicle routing problem (VRP) and the path planning problem [13,14,15,16]. This makes it challenging to obtain the optimal solution with exact algorithms within a specified time. Heuristics and metaheuristics are among the most efficient approaches for finding optimal or near-optimal solutions to NP-hard problems [17]. As a straightforward and efficient metaheuristic, the iterative greedy (IG) algorithm offers the benefits of a simple structure, minimal parameter requirements, and embeddability, and it has been extensively used to solve combinatorial optimization problems [18,19,20,21]. The multi-robot rescue path planning (MRRPP) problem and combinatorial optimization problems have similar solution structures: both require identifying the optimal sequence among a large number of possible configurations. Therefore, this study explores the applicability of the IG algorithm to the MRRPP problem. However, given the features of the MRRPP problem, the traditional IG algorithm is limited by its single-solution greedy strategy and is susceptible to becoming trapped in local optima on large-scale problems. This work therefore proposes an improved population-based iterative greedy algorithm (QPIG) for solving the MRRPP problem. Specifically, a history-based exploration strategy, a collective intelligence-driven strategy, and an elite-oriented optimization strategy are designed to enhance solution diversity. Additionally, six neighborhood operators based on critical and non-critical paths are developed for further exploration, and Q-learning is employed to select these operators, thereby enhancing the algorithm's effectiveness in exploring the search space. The primary contributions of this study are as follows:
- (1) A novel mathematical model for the MRRPP problem is proposed, emphasizing the integration of time-decaying task utility to optimize rescue performance.
- (2) A competition-oriented destruction-reconstruction mechanism is designed to improve the QPIG's global search ability.
- (3) A Q-learning-based local search strategy is developed to increase the exploitation ability of the algorithm.
- (4) A historical information-based constructive strategy is investigated to accelerate the algorithm's convergence speed.
The remainder of this study is structured as follows. Section 2 reviews the related literature. Section 3 explains the formulation and description of the problem. Section 4 provides a detailed explanation of the proposed QPIG algorithm. Section 5 describes parameter calibration, effectiveness analysis, and comparative experiments. Finally, Section 6 summarizes this study and presents possible directions for further research.
2. Literature Review
In recent years, several algorithms have been applied by researchers to rescue path planning problems. For example, Xu et al. [22] proposed an obstacle-constraint and task-equal-division-based k-means clustering algorithm, along with a glow-worm swarm optimization algorithm with chaotic initialization (GSOCI), which significantly improves accuracy and convergence speed for the emergency rescue planning problem. Michael et al. [23] applied a variant of the ACO algorithm to determine the optimal search path under time and resource constraints, aiming to maximize the probability of locating survivors. Additionally, Wen et al. [24] developed an unmanned aerial vehicle (UAV) route planning method that incorporated a probabilistic model to estimate potential target locations and two search algorithms to maximize detection probability in maritime search and rescue. Meanwhile, Alrayes et al. [25] used an enhanced drone route planning scheme that integrates quasi-oppositional-based learning with the traditional search-and-rescue optimization algorithm to optimize secure communication flight paths. Liu et al. [26] implemented a multi-objective path planning framework, using hierarchical target filtering and sorting strategies to improve efficiency in complex disaster scenarios. Although the aforementioned algorithms have achieved notable results in rescue path planning, there remains room for improvement in balancing solution quality and computational efficiency, especially in large-scale environments.
Given its efficiency and ability to generate near-optimal solutions, the IG algorithm has been extensively adopted for combinatorial optimization problems since its introduction. In the past several years, many variants of IG have been developed, among which the most notable is the population-based iterative greedy (PIG) algorithm. For instance, Bouamama et al. [27] were the first to establish the PIG algorithm for the minimum weight vertex cover problem. Miao et al. [28] used a PIG algorithm enhanced with Q-learning for the scheduling of multiple weeding robots, minimizing completion time while lowering total operational costs. In addition, Zou et al. [29] implemented a mixed-integer linear optimization model and a PIG algorithm to tackle the scheduling of automated guided vehicles with unloading safety detection in a matrix workshop. Wang et al. [30] designed a PIG algorithm for the cascaded flowshop joint scheduling problem, where the search alternates between two sequential phases to effectively reduce total tardiness. Moreover, Zhao et al. [31] presented a PIG algorithm incorporating heuristic initialization, local search, and a selection mechanism, achieving superior performance on extensive instances. Wang et al. [32] developed two rapid evaluation approaches and a modified idle time insertion method within the PIG algorithm, resulting in notable improvements in both efficiency and solution robustness. Therefore, considering its verified efficiency in solving various combinatorial optimization problems, the PIG algorithm is employed as the core framework to balance search intensification and diversification.
Reinforcement learning (RL) has attracted growing interest recently, prompting researchers to investigate its potential for tackling optimization problems. To improve solution quality in multi-objective optimization, Li et al. [33] designed a Q-learning-based dimension detection search method for identifying promising non-dominated solutions in resource-constrained flexible flowshop scheduling with robotic transportation. Zhao et al. [34] introduced a Q-learning-based approach for determining the weighting coefficients, aiming to achieve smaller objective values in the distributed no-idle permutation flowshop scheduling problem. Ren et al. [35] used an ensemble artificial bee colony algorithm enhanced with Q-learning to solve the bi-objective disassembly line scheduling problem, optimizing both disassembly time and smoothing index while considering task interference. In addition, Luo et al. [36] developed a Q-learning memetic algorithm (QLMA) that considers product priorities and factory heterogeneity to minimize energy consumption and total tardiness. To address the multi-UAV cooperative plant protection task allocation problem, Chen et al. [37] investigated a learning-based memetic algorithm (L-MA) that utilizes Q-learning to assist operators in determining the optimal number of UAVs for each field. Moreover, Fang et al. [38] suggested an improved ant colony optimization algorithm (IAC-IQL), integrated with enhanced Q-learning, for global path planning of search and rescue robots based on Bézier curves, achieving higher efficiency and smoother paths. Zhan et al. [39] developed a reinforcement learning-based genetic algorithm (GA-RL), which leverages Q-learning to guide offspring selection and population renewal in UAV maritime search and rescue involving multiple rescue centers. Inspired by the successful integration of Q-learning in scheduling and path planning problems, this study employs Q-learning to intelligently steer operator selection, allowing the algorithm to learn from experience and improve solution quality.
Based on the algorithms described above, we chose nine representative algorithms and detail their advantages and limitations in Table 1. Through comparison, we found that swarm intelligence algorithms such as GSOCI, IACO, DABC, QLMA, IAC-IQL, and GA-RL have more parameters, complex population structures, and higher computational complexity. In the MRRPP problem, usually only a single optimal path solution is needed. Compared with swarm intelligence algorithms, the improved IG algorithm can explore a solution more thoroughly and effectively optimize its quality. Based on these advantages of the IG algorithm and the current state of research, this work introduces further improvements on the basis of the IG algorithm.
The traditional IG algorithm performs excellently on small-scale problems. As the number of rescue points increases, however, the advantage of its iterative improvement strategy gradually diminishes, and the algorithm may focus excessively on certain areas and become stuck in local optima. This work proposes a competition-oriented destruction-reconstruction mechanism to improve global search capability. Additionally, the traditional IG algorithm uses only insertion operators, resulting in incomplete exploration of the solution space. Therefore, this work designs six neighborhood operators based on the critical path and non-critical paths, expanding the search range and enabling a more comprehensive exploration of the solution space.
3. Problem Description and Formulation
3.1. Problem Description
In post-disaster rescue scenarios, there are multiple rescue points, and each rescue point is assumed to correspond to one task. Robots are mainly responsible for material distribution from the rescue center to the rescue points, usually the delivery of emergency supplies. Tasks are time sensitive, and their utility gradually declines over time. Task utility refers to the effect or value of the timely delivery of emergency supplies. For example, suppose the task is to deliver medication to patients. The “value” of the medication is not its monetary worth as an emergency supply, but the therapeutic effect it can provide. If the medication is delivered promptly and provides treatment, its task utility is very high, as it can significantly improve the patient's condition and increase the likelihood of a successful recovery. If delivery is delayed, even if the medicine is not physically damaged, its task utility decreases because it fails to provide timely assistance, impacting rescue effectiveness. This study formulates the rescue problem as a variant of the VRP with timeliness and battery constraints.
The problem is formulated on a directed connected graph G = (N, A), where N = {0} ∪ T is the set of vertices, 0 denotes the rescue center, T is the set of rescue points, and A = {(i, j) | i, j ∈ N, i ≠ j} is the arc set, weighted by the distances between points. All rescue tasks are performed by k robots, and the robot set is R = {1, 2, …, k}. Each rescue point must be visited by at least one robot to ensure that all tasks are completed. Additionally, without loss of generality, the following assumptions are made:
- (1) All robots are homogeneous.
- (2) There is only one rescue center, and robots are required to leave the rescue center and return after completing their tasks.
- (3) All robots operate normally without any downtime or malfunctions.
- (4) The maximum load of all robots is the same.
- (5) Robots move at a known constant speed.
3.2. Problem Formulation
In this section, a mathematical model is established as follows:
Decision variables:
| xijk | If the robot Rk goes from the rescue point Ti to the rescue point Tj, then xijk = 1; otherwise, xijk = 0. |
| yik | If the robot Rk goes to the rescue point Ti to perform a task, then yik = 1; otherwise, yik = 0. |
The objective function (1) consists of four parts: path length, number of robots, rescue time, and task utility. Constraints (2) and (3) indicate that each path taken by the robot starts and ends at the rescue center. Constraint (4) represents that the robot can only access each rescue point once. Constraint (5) ensures that the service time of each rescue point is within its time window. Constraint (6) guarantees that the robot’s material transport does not surpass its maximum carrying capacity. Constraint (7) represents the battery consumption of the robot during task execution, which depends on the travel distance and the battery usage per unit distance. Constraint (8) indicates a battery constraint, ensuring that the combined battery usage of the robot and its backup does not exceed the robot’s total battery capacity.
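As a rough illustration of objective function (1), the four parts can be combined into a single weighted sum. The weights `w` are placeholders, since the paper's weighting scheme is not reproduced in this excerpt, and task utility enters with a negative sign because it is maximized while the other three terms are minimized.

```python
def rescue_cost(total_distance, num_robots, rescue_time, total_utility,
                w=(1.0, 1.0, 1.0, 1.0)):
    """Hypothetical weighted form of objective (1): path length, robot
    count, and rescue time are minimized; task utility is maximized,
    so it is subtracted."""
    w1, w2, w3, w4 = w
    return (w1 * total_distance + w2 * num_robots
            + w3 * rescue_time - w4 * total_utility)

cost = rescue_cost(total_distance=100.0, num_robots=3,
                   rescue_time=50.0, total_utility=80.0)
```

A schedule that delivers supplies earlier raises `total_utility` and thus lowers the rescue cost, even if the traveled distance is slightly longer.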
4. The Proposed QPIG Algorithm
The proposed QPIG is structured around four components: the population initialization, destruction and reconstruction phase, local search phase, and acceptance criteria phase. By introducing population-based optimization, multiple solution individuals are processed in parallel. The population-based IG framework not only preserves the advantage of IG in exploring a single solution in depth but also maintains high diversity in the solution space, thereby enhancing global search capability.
In the initialization phase, two problem-specific heuristic schemes (i.e., a utility-based initialization strategy and a rescue cost-based initialization strategy) are designed to establish an initial population composed of high-quality individuals. In the destruction and reconstruction phase, a competition-oriented destruction-reconstruction mechanism generates candidate solutions and selects the best individual among them for local search. This mechanism combines a history-based exploration strategy, a collective intelligence-driven strategy, and an elite-oriented optimization strategy to improve the search process. Within the competition-oriented destruction-reconstruction mechanism, individuals in the population undergo selection, which continuously filters out the best solutions and further improves search efficiency; the competition prompts the algorithm to retain the best individuals, thereby accelerating the global search. In the local search phase, a Q-learning-based local search strategy selects among six neighborhood operators to improve solution quality. To overcome the limitations of traditional fixed selection strategies on complex problems, Q-learning dynamically adjusts the search strategy by selecting operators according to the current solution's state and historical experience. This effectively balances global and local optimization, improving search efficiency and enhancing solution quality. The proposed QPIG framework is presented in Algorithm 1. Lines 10 to 19 of Algorithm 1 describe the acceptance criteria phase, while lines 20 to 24 implement a historical information-based constructive strategy, which periodically clears the stored historical solution set after a predefined number of iterations.
Specifically, the historical information-based constructive strategy includes the historical path allocation operator and path refinement operator to accelerate the convergence speed of the algorithm. The algorithm continues executing the above loop until the stopping criterion is met and then outputs the optimal path.
Figure 1 provides a flowchart of the QPIG.
| Algorithm 1: Framework of the proposed QPIG |
| Input: population size: ps, destruction-reconstruction size: ld, simulated annealing acceptance probability: p, predefined constant: c |
| Output: the best solution πbest |
| 1 | (π, POP) = Initialization; |
| 2 | i = 0, RSeq = ϕ, Histsol = ϕ; |
| 3 | πbest = π; |
| 4 | while the stopping criterion is not satisfied do |
| 5 | POP’ = Competition-oriented destruction-reconstruction mechanism(POP, π, ld); |
| 6 | POP’’ = Q-learning-based local search strategy(POP’); |
| 7 | idx_best = argmin_j f(POP’’j); |
| 8 | π1 = POP’’idx_best; |
| 9 | RSeq = RSeq ∪ π1; |
| 10 | if f(π1) < f(π) then |
| 11 | π = π1; |
| 12 | if f(π1) < f(πbest) then |
| 13 | πbest = π1; |
| 14 | end |
| 15 | else |
| 16 | if rand(0,1) < p then |
| 17 | π = π1; |
| 18 | end |
| 19 | end |
| 20 | if mod(i, c) == 0 then |
| 21 | πhistory = Historical information-based constructive strategy(RSeq); |
| 22 | Add πhistory to Histsol; |
| 23 | RSeq = ϕ; |
| 24 | end |
| 25 | i = i + 1; |
| 26 | end |
| 27 | return πbest; |
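The control flow of Algorithm 1 can be mirrored in a compact Python skeleton. The component functions are passed in as stand-ins for the paper's operators, and the SA criterion is reduced here to a fixed acceptance probability for brevity; this is a sketch of the loop structure, not the full algorithm.

```python
import random

def qpig(evaluate, init_pop, destroy_reconstruct, local_search,
         max_iters=100, accept_prob=0.05, history_period=10):
    """Skeleton of the QPIG loop (Algorithm 1). evaluate, init_pop,
    destroy_reconstruct, and local_search are stand-ins for the paper's
    components; acceptance uses a fixed probability instead of SA."""
    pop = init_pop()
    pi = min(pop, key=evaluate)              # incumbent solution
    best = pi
    rseq, hist = [], []                      # RSeq and Histsol
    for i in range(max_iters):
        pop = destroy_reconstruct(pop, pi)   # competition-oriented D&R
        pop = [local_search(s) for s in pop] # Q-learning local search
        cand = min(pop, key=evaluate)        # best individual of POP''
        rseq.append(cand)
        if evaluate(cand) < evaluate(pi):
            pi = cand
            if evaluate(cand) < evaluate(best):
                best = cand
        elif random.random() < accept_prob:
            pi = cand                        # SA-style diversification
        if (i + 1) % history_period == 0:    # periodic history construction
            hist.append(min(rseq, key=evaluate))
            rseq = []
    return best
```

With toy components (solutions as tuples shrinking toward zero), the loop converges to the best solution found, matching lines 4 to 26 of Algorithm 1.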
4.1. Solution Representation
Figure 2 shows an example of the MRRPP problem, where 0 represents the rescue center and {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} denote the rescue points. R1, R2, and R3 represent the robots. Taking the path of robot R1 as an example, it starts from the rescue center {0} and visits the rescue points {5, 3, 4, 1} in sequence. The t5(6,9) next to rescue point 5 indicates that the earliest service start time of rescue point 5 is 6 and the latest is 9. The demand of rescue point 5 is dr5 = 10, meaning 10 units of rescue supplies are required. Each time a robot visits a rescue point, its load is reduced; if the remaining load is insufficient to meet the demand at a rescue point, the robot returns to the rescue center. The initial utility value at rescue point 5 is U = 79, indicating that the robot's timely delivery of rescue supplies is expected to contribute a value of 79 at this rescue point. As shown in the upper right corner of Figure 2, each robot has an initial load capacity of 150 and a total battery capacity of 110. In this study, a two-dimensional array represents each solution: the first dimension corresponds to the robot number, and the second dimension lists the rescue points allocated to that robot. Figure 3 illustrates an example of the encoding method.
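A minimal sketch of the two-dimensional encoding follows. Robot R1's route matches the Figure 2 example; the routes of R2 and R3 are hypothetical, since the excerpt does not list them.

```python
# solution[k] holds the ordered rescue points of robot k.
solution = [
    [5, 3, 4, 1],   # R1: 0 -> 5 -> 3 -> 4 -> 1 -> 0 (Figure 2)
    [6, 8, 2],      # R2 (hypothetical route)
    [9, 10, 7],     # R3 (hypothetical route)
]

def visited_points(sol):
    """Flatten the encoding; every rescue point should appear exactly once."""
    return sorted(p for route in sol for p in route)
```

Flattening the array confirms that each of the ten rescue points is assigned to exactly one robot, as the problem requires.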
4.2. Population Initialization
An effective initial solution significantly speeds up the convergence process and improves the solution’s quality. In this study, two problem-specific heuristic schemes are designed. The first heuristic scheme is the utility-based initialization strategy, which sorts tasks according to their utility values. Firstly, tasks with low utility values are assigned to each robot as the starting point. Then, the optimal task insertion is iteratively selected for each path to maximize the overall task utility and improve rescue efficiency while satisfying constraint conditions. Algorithm 2 describes pseudocode for the utility-based initialization strategy.
| Algorithm 2: Utility-based initialization strategy |
| Input: parameters |
| Output: the solution: π |
| 1 | N’ ← N; |
| 2 | Calculate f0j, j ∈ {1, 2, …, n}, n = sizeof(N’); |
| 3 | Sort tasks in N’ by ascending order of f0j, and select the top m tasks; |
| 4 | Delete the above m tasks from N’; |
| 5 | k = 1, i ← πk,1; |
| 6 | while N’ ≠ ϕ do |
| 7 | Calculate fij; |
| 8 | fmin = min{fi1, fi2, …, fiN}; |
| 9 | Insert the task x of fmin into πk; |
| 10 | if the constraint is satisfied then |
| 11 | i ← x, delete task x from N’; |
| 12 | else |
| 13 | k = k + 1, i ← πk,1; |
| 14 | end |
| 15 | end |
| 16 | return π; |
The second heuristic scheme is the rescue cost-based initialization strategy. The rescue cost-based initialization strategy has the same structure as the utility-based initialization strategy, except that the utility value is replaced by the ranking of rescue cost. This strategy assigns tasks with low rescue cost to each robot as the starting point, selects the optimal task insertion for each path in turn, and constructs a multi-robot initial solution under the premise of satisfying the constraint conditions.
For a population-based metaheuristic such as QPIG, a high-quality and diverse population is essential for evolution. Therefore, the two proposed heuristics (the utility-based and rescue cost-based initialization strategies) generate two individuals to ensure population quality, while the remaining ps − 2 individuals are generated randomly to guarantee population diversity.
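Both initialization strategies share the greedy skeleton of Algorithm 2 and differ only in the sorting key (utility value versus rescue cost). A generic sketch follows, with `key`, `step_cost`, and `feasible` as problem-specific stand-ins not defined in the paper's pseudocode.

```python
def greedy_init(tasks, num_robots, key, step_cost, feasible):
    """Greedy construction shared by both initialization strategies:
    seed each robot with a low-key task (utility or rescue cost), then
    repeatedly extend the current route with the cheapest next task.
    key, step_cost, and feasible are problem-specific stand-ins."""
    pending = sorted(tasks, key=key)
    routes = [[pending.pop(0)] for _ in range(num_robots)]
    k = 0
    while pending and k < num_robots:
        last = routes[k][-1]
        best = min(pending, key=lambda t: step_cost(last, t))
        if feasible(routes[k], best):
            routes[k].append(best)
            pending.remove(best)
        else:
            k += 1   # constraint violated: open the next robot's route
    return routes, pending   # pending is empty when all tasks fit
```

With integer tasks, distance as the step cost, and a route-length cap standing in for the load and battery constraints, the tasks are split greedily across the robots.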
4.3. Destruction and Reconstruction Phase
The IG algorithm perturbs the current solution during the destruction and reconstruction phase, improving the diversity of the search process. In the destruction phase, ld tasks are chosen randomly from the path sequence, yielding two subsequences: πD, which includes the removed tasks, and πR, which contains the remaining tasks. The parameter ld represents the destruction size of the path sequence; its value is analyzed and set in Section 5.3. In the reconstruction phase, the tasks of πD extracted in the destruction stage are reinserted into the sequence πR. First, a task δ is randomly drawn from πD. After attempting to insert δ into every position of πR, the position that minimizes the rescue cost is selected for insertion. This process continues until all removed tasks have been reinserted into the path sequence.
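The destruction-reconstruction step just described can be sketched as follows; `insert_cost` is a stand-in for evaluating the rescue cost of one tentative insertion, which the paper computes from the full problem model.

```python
import random

def destruct_reconstruct(seq, ld, insert_cost):
    """Destruction: remove ld random tasks (pi_D). Reconstruction:
    greedily re-insert each removed task at the minimum-cost position
    in the remainder (pi_R). insert_cost(partial, pos, task) is a
    problem-specific stand-in."""
    removed = random.sample(list(seq), ld)            # pi_D
    remaining = [t for t in seq if t not in removed]  # pi_R
    for task in removed:
        pos = min(range(len(remaining) + 1),
                  key=lambda p: insert_cost(remaining, p, task))
        remaining.insert(pos, task)
    return remaining
```

Whatever the cost function, the result is always a permutation of the original task sequence, so feasibility bookkeeping only has to track constraint checks, not lost tasks.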
To enhance the diversity and quality of solutions, this study proposes a competition-oriented destruction-reconstruction mechanism comprising three strategies. The history-based exploration strategy applies destruction and reconstruction to an individual randomly selected from the historical solutions, enhancing solution diversity. The collective intelligence-driven strategy applies destruction and reconstruction to all individuals in the population, retaining the one with the best rescue cost and mining global information to increase search depth. The elite-oriented optimization strategy focuses on the current best individual and finely optimizes its neighborhood to strengthen local search. The mechanism then evaluates the candidate solutions generated by the three strategies according to their rescue cost and selects the best-performing one. Algorithm 3 provides the pseudocode for the competition-oriented destruction-reconstruction mechanism.
| Algorithm 3: Competition-oriented destruction-reconstruction mechanism |
| Input: population: POP, population size: ps, destruction-reconstruction size: ld, the set of historical solutions: Histsol |
| Output: new population: POP’ |
| 1 | Candsol ← ϕ, Union ← ϕ; |
| 2 | # History-based exploration strategy (the symbol # indicates a remark) |
| 3 | πrh ← randomly select an individual from Histsol; |
| 4 | πD, πR ← Destruction(πrh, ld); |
| 5 | π1 ← Reconstruction(πD, πR); |
| 6 | Add π1 to Candsol; |
| 7 | # Collective intelligence-driven strategy |
| 8 | for each individual πi ∈ POP do |
| 9 | πD, πR ← Destruction(πi, ld); |
| 10 | π’i ← Reconstruction(πD, πR); |
| 11 | end |
| 12 | POP1 ← update POP; |
| 13 | idxbest1 = argmin_j f(POP1j); |
| 14 | π2 ← POP1idxbest1; |
| 15 | Add π2 to Candsol; |
| 16 | # Elite-oriented optimization strategy |
| 17 | idxbest2 = argmin_j f(POPj); |
| 18 | πeo ← POPidxbest2; |
| 19 | πD, πR ← Destruction(πeo, ld); |
| 20 | π3 ← Reconstruction(πD, πR); |
| 21 | Add π3 to Candsol; |
| 22 | Union ← POP ∪ Candsol; |
| 23 | POP’ ← Select the top ps individuals by rescue cost from Union; |
| 24 | return POP’; |
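The competition among the three strategies condenses to a few lines of Python; here `dr` stands in for one destruction-reconstruction pass and `evaluate` for the rescue cost, so the sketch shows only the selection logic of Algorithm 3.

```python
import random

def competition_dr(pop, histsol, dr, evaluate, ps):
    """Competition-oriented D&R sketch: three strategies each contribute
    a candidate; the union with the population is truncated to the top
    ps individuals by rescue cost. dr and evaluate are stand-ins."""
    cands = []
    if histsol:                                   # history-based exploration
        cands.append(dr(random.choice(histsol)))
    pop1 = [dr(s) for s in pop]                   # collective intelligence
    cands.append(min(pop1, key=evaluate))
    cands.append(dr(min(pop, key=evaluate)))      # elite-oriented optimization
    union = pop + cands
    return sorted(union, key=evaluate)[:ps]
```

With integers as dummy solutions and `dr` simply decrementing them, the survivors are the lowest-cost members of the combined pool, illustrating how the competition keeps only the best ps individuals.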
4.4. Q-Learning-Based Local Search Strategy
Q-learning is an RL algorithm that allows agents to identify the most effective actions through feedback obtained from reward mechanisms. It mainly consists of agents, an environment, states, actions, and rewards. An agent takes an action in a given state; the environment then transitions to a new state based on the action and the current state and rewards the agent. Through repeated loops, the agent's experience improves. The Q-table is an important component of Q-learning, used to guide individuals in selecting actions across iterations, as shown in Table 2. Rows represent states, and columns represent actions. Each state-action pair corresponds to a Q-value, which reflects the learning process of the agent. The Q-value update is given by the following formula:

Q(St, at) = Q(St, at) + α[R + γ·max Q(St+1, at+1) − Q(St, at)]

where Q(St, at) denotes the Q-value for executing action at in state St, α is the learning rate, R represents the reward received after taking the action, γ is the discount rate, and max Q(St+1, at+1) corresponds to the maximum expected Q-value in the Q-table for state St+1.
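The update above has a direct tabular implementation, here with the Q-table stored as a dict of dicts (state → {action: Q-value}); the default α and γ are illustrative, since the paper calibrates its parameters separately.

```python
def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (R + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[s_next].values()) if Q[s_next] else 0.0
    Q[s][a] += alpha * (reward + gamma * best_next - Q[s][a])
    return Q[s][a]
```

For example, starting from Q(0,0) = 0 with reward 1 and a best next-state value of 1, the updated entry is 0.1 × (1 + 0.9 × 1) = 0.19.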
- (1) State
The solutions form the set of states in Q-learning, with each individual representing a particular state. Specifically, state St represents the current solution of the QPIG algorithm at time step t. The state St is observable and distinguishable, meaning that the agent can accurately identify and distinguish different states based on the characteristics of the current solution. State transitions satisfy the Markov property, meaning that future states depend only on the current state and the selected action and are independent of historical states.
- (2) Action

The action indicates the neighborhood structure of the solution in the Q-learning environment model. Currently, no universal neighborhood-based improvement strategy exists that is applicable to all problems. In QPIG, the set of neighborhood structures is designed with six problem-specific operators. Here, the critical path refers to the path sequence with the highest rescue cost. These neighborhood operators are explained as follows, and their graphical representation is provided in Figure 4.
- (a) Critical path swap operator (LS_op1): Identify the critical path, and then randomly choose a task from it as the critical task Tc. Exchange Tc with each non-critical task Tn until a promising path sequence is found, as shown in Figure 4a.
- (b) Critical path and non-critical path swap operator (LS_op2): Randomly choose a critical task Tc from the critical path, and randomly choose a non-critical path from the solution. Swap Tc with each non-critical task Tn in that non-critical path to generate a new path sequence, as depicted in Figure 4b.
- (c) Critical path insertion operator (LS_op3): Select a task Tc at random from the critical path. Insert Tc into the positions before and after every non-critical task Tn to generate a new path sequence, as illustrated in Figure 4c.
- (d) Critical path and non-critical path insertion operator (LS_op4): Choose a task Tc at random from the critical path, and randomly choose a non-critical path from the solution. Insert Tc into the positions before and after every non-critical task Tn in that non-critical path to generate a promising path sequence, as presented in Figure 4d.
- (e) Path reverse operator (LS_op5): Select a continuous subsequence in the critical path or a non-critical path, and reverse it to generate a promising path sequence, as depicted in Figure 4e.
- (f) Path right shift operator (LS_op6): Within the critical path or a non-critical path, move the last element of a selected subsequence to the first position of the subsequence and shift the other elements back by one to generate a new path sequence, as shown in Figure 4f.
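Three of these operators reduce to simple list manipulations; a sketch follows, with the indices `ci`, `ni`, `i`, `j` chosen by the caller (e.g. at random over the critical and non-critical paths).

```python
def swap_tasks(critical, noncritical, ci, ni):
    """Swap-style move (LS_op1/LS_op2): exchange the critical task at
    index ci with the non-critical task at index ni (paths are copied)."""
    c, n = list(critical), list(noncritical)
    c[ci], n[ni] = n[ni], c[ci]
    return c, n

def reverse_segment(path, i, j):
    """Path reverse operator (LS_op5): reverse the subsequence path[i..j]."""
    p = list(path)
    p[i:j + 1] = p[i:j + 1][::-1]
    return p

def right_shift(path, i, j):
    """Path right shift operator (LS_op6): the last element of path[i..j]
    moves to the front of the segment; the rest shift right by one."""
    p = list(path)
    seg = p[i:j + 1]
    p[i:j + 1] = [seg[-1]] + seg[:-1]
    return p
```

Each operator returns a new path, so candidate moves can be evaluated against the rescue cost before replacing the current solution.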
- (3) Action selection

The ε-greedy strategy is employed to explore new actions during training, thereby improving the global optimality of learning. In this study, a larger ε value is set in the initial stage to encourage the algorithm to explore a wide range of local operators. As iterations continue, ε decreases linearly, causing the strategy to lean increasingly on existing knowledge. ε is computed as follows:

ε = ε0 − (ε0 − εmin) · g/G

where ε0 is the initial value, εmin is the minimum value of ε, g indicates the current iteration count, and G denotes the total allowed iterations.
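The linear decay and the ε-greedy choice can be sketched as follows; the default ε0 and εmin are hypothetical values, since the paper calibrates them experimentally.

```python
import random

def epsilon(g, G, eps0=0.9, eps_min=0.1):
    """Linearly decaying exploration rate from eps0 down to eps_min
    over g = 0..G iterations (eps0 and eps_min are hypothetical)."""
    return eps0 - (eps0 - eps_min) * g / G

def select_action(q_row, g, G):
    """epsilon-greedy choice over one Q-table row {operator: Q-value}:
    explore a random operator with probability epsilon, else exploit."""
    if random.random() < epsilon(g, G):
        return random.choice(list(q_row))   # explore
    return max(q_row, key=q_row.get)        # exploit
```

Early iterations thus try operators almost uniformly, while late iterations mostly pick the operator with the highest learned Q-value.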
- (4) Reward
After executing an action, an individual receives a reward R. In QPIG, the reward R is calculated as follows:

R = f − f′

where f′ denotes the total rescue cost of the new solution, and f represents the total rescue cost of the current solution.
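Assuming the common cost-difference reward form R = f − f′ and a standard tabular Q-learning update (the α and γ defaults below are illustrative, not the paper's settings):

```python
def reward(f_cur, f_new):
    # Assumed shape: positive when the new solution lowers the total
    # rescue cost, negative when it raises it.
    return f_cur - f_new

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[s_next])
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
    return Q[s][a]
```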
4.5. Acceptance Criteria
The acceptance criterion based on simulated annealing (SA) is employed to enhance the diversity of solutions [40]. Its core strategy is to accept a newly generated solution whose rescue cost exceeds that of the current solution with a given probability p, rather than rejecting it outright. The acceptance probability p is calculated as follows:

p = exp(−(f(πnew) − f(π))/T)

where π represents the current solution and πnew denotes the newly generated solution. The temperature T is set as follows:

T = σ · (Σi Σk tik)/(m · n)

where σ is the constant temperature value, tik is the service time of robot k at rescue point i, m denotes the number of robots, and n refers to the number of rescue points.
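The criterion can be sketched as follows; the Metropolis form of p and the mean-service-time temperature are reconstructions from the symbol definitions above, so treat both as assumptions rather than the paper's exact formulas:

```python
import math
import random

def temperature(service_times, sigma):
    """Assumed form: sigma scaled by the mean service time t_ik over
    m robots (rows) and n rescue points (columns)."""
    m, n = len(service_times), len(service_times[0])
    total = sum(sum(row) for row in service_times)
    return sigma * total / (m * n)

def accept(f_cur, f_new, T):
    """Always keep an improving solution; otherwise accept the worse one
    with probability exp(-(f_new - f_cur) / T)."""
    if f_new <= f_cur:
        return True
    return random.random() < math.exp(-(f_new - f_cur) / T)
```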
4.6. Historical Information-Based Constructive Strategy
The construction strategy based on historical information aims to generate a better path by borrowing high-quality parts from multiple historical solutions. Its core idea is to identify the parts that perform well by utilizing path data from historical solutions and optimize them accordingly. By integrating the advantages of different solutions, this strategy aims to reduce the search space, thereby improving the algorithm’s convergence efficiency. The pseudocode of the historical information-based constructive strategy is shown in Algorithm 4. The historical path allocation operator is described in Algorithm 5. Algorithm 6 provides pseudocode for the path refinement operator.
| Algorithm 4: Historical information-based constructive strategy |
| Input: the set of current solutions: RSeq |
| Output: historical solution: πhistory |
| 1 | rmax = 0; |
| 2 | for each π ∈ RSeq do |
| 3 | cn = count(unique robots in π); |
| 4 | if cn > rmax then |
| 5 | rmax = cn; |
| 6 | end |
| 7 | end |
| 8 | RobMat = CreateRobotPathMatrix(rmax); |
| 9 | PEbest = Historical path allocation operator(rmax, RobMat); |
| 10 | πhistory = Path refinement operator(PEbest, RSeq); |
| 11 | return πhistory; |
| Algorithm 5: Historical path allocation operator |
| Input: maximum number of robots: rmax, path matrix: RobMat |
| Output: the optimal path for each robot: PEbest |
| 1 | nsol = size(RobMat, 2); |
| 2 | for i = 1:rmax do |
| 3 | Pbest = ∅; fbest = inf; |
| 4 | for j = 1:nsol do |
| 5 | Pnew = RobMat(i, j); |
| 6 | if f(Pnew) < fbest then |
| 7 | Pbest = Pnew; fbest = f(Pnew); |
| 8 | end |
| 9 | end |
| 10 | PEbest(i) = Pbest; |
| 11 | end |
| 12 | return PEbest; |
| Algorithm 6: Path refinement operator |
| Input: the optimal path for each robot: PEbest, the set of current solutions: RSeq |
| Output: historical solution: πhistory |
| 1 | πopt = CombinePath(PEbest); |
| 2 | πopt1 = RemoveDuplicates(πopt); |
| 3 | πref = RSeq(randi([1, length(RSeq)])); |
| 4 | MN = FindMissingPoints(πref, πopt1); |
| 5 | for each point z in MN do |
| 6 | bestpos = FindOptimalInsertion(πopt1, z); |
| 7 | πopt1 = InsertPointsToPath(πopt1, z, bestpos); |
| 8 | end |
| 9 | πhistory = πopt1; |
| 10 | return πhistory; |
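Algorithms 4-6 can be read together as the following sketch (the dict-of-paths encoding, the `cost` helper, and the deterministic choice of the first solution as the reference are simplifications; the pseudocode draws the reference solution πref at random from RSeq):

```python
def historical_construct(solutions, cost):
    """Combine the cheapest per-robot path across historical solutions
    (Algorithms 4 and 5), remove duplicate rescue points, then greedily
    insert points missing w.r.t. a reference solution (Algorithm 6).
    Each solution maps robot index -> list of rescue points; cost(path)
    returns the rescue cost of a single path (assumed helper)."""
    r_max = max(len(sol) for sol in solutions)
    combined, seen = [], set()
    for r in range(r_max):
        candidates = [sol[r] for sol in solutions if r in sol]
        best = min(candidates, key=cost)          # cheapest path for robot r
        for p in best:                            # merge, dropping duplicates
            if p not in seen:
                seen.add(p)
                combined.append(p)
    ref = solutions[0]                            # reference solution (random in the paper)
    ref_points = {p for path in ref.values() for p in path}
    for z in sorted(ref_points - seen):           # reinsert missing points
        best_pos = min(range(len(combined) + 1),
                       key=lambda i: cost(combined[:i] + [z] + combined[i:]))
        combined.insert(best_pos, z)
    return combined
```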
4.7. Computational Time Complexity of QPIG
In the initialization phase, two problem-specific heuristic strategies are designed: the utility-based initialization strategy and the rescue cost-based initialization strategy. The time complexity of the utility-based initialization strategy is O(n²), and that of the rescue cost-based initialization strategy is likewise O(n²), where n represents the number of rescue points. In the destruction and reconstruction phase, a competition-oriented destruction-reconstruction mechanism is developed to improve the diversity of the search process. Specifically, the time complexity of the history-based exploration strategy is O(ld), where ld represents the destruction-reconstruction size. The collective intelligence-driven and elite-oriented optimization strategies each have a time complexity of O(ps·ld), where ps denotes the population size. Additionally, operations such as merging populations and selecting the top individuals introduce complexities of O(ps) and O(ps log ps), respectively. Therefore, the overall time complexity of this mechanism is O(ps·ld + ps log ps). In the local search phase, the six neighborhood operators, i.e., the critical path swap, critical path and non-critical path swap, critical path insertion, critical path and non-critical path insertion, path reverse, and path right shift operators, each have a time complexity of O(n). Additionally, operations such as Q-value updates, reward calculations, and state management are also performed. Therefore, the overall time complexity of the Q-learning-based local search strategy is O(n·G), where G is the number of iterations in this strategy. The time complexity of the historical information-based constructive strategy is O(n²). In summary, the time complexity of the QPIG algorithm is O(n²).
6. Conclusions and Future Work
This study presents the QPIG algorithm to solve the MRRPP problem. To obtain a high-quality initial population, two problem-specific heuristic schemes are designed. A competition-oriented destruction-reconstruction mechanism, comprising the history-based exploration strategy, the collective intelligence-driven strategy, and the elite-oriented optimization strategy, is applied to enhance global exploration ability. A Q-learning-based local search strategy is implemented to select among six operators and improve the algorithm’s exploitation ability. Additionally, a historical information-based constructive strategy is investigated to accelerate the algorithm’s convergence. Finally, experiments on 56 instances are conducted with the QPIG algorithm and the compared algorithms, and the results confirm the effectiveness of QPIG in tackling the MRRPP problem. In addition, analysis of the convergence curves shows that QPIG reaches results close to the optimal solution within 10 to 20 s, with a gap from the optimal value usually no more than 5%. This indicates that the method not only converges rapidly but is also stable and efficient, making it suitable for real-time applications with strict time requirements. Therefore, even as the problem scale or map resolution grows, QPIG maintains stable computational behavior, converges within a reasonable time, and provides near-optimal solutions, indicating good scalability.
The limitations of this work include the following: (1) The QPIG algorithm does not adequately account for dynamic obstacles in rescue scenarios. (2) The proposed QPIG lacks consideration of the priority of task points, which may lead to neglecting the priority processing of emergency task points. (3) This study does not consider robot malfunctions or damage, which could disrupt the vital search process.
Future research will investigate the following directions: (1) The influence of environmental conditions, particularly the challenges introduced by moving obstacles, will be taken into account. (2) As mentioned in [
43], we will address the rescheduling problem when robots encounter damage, ensuring that tasks are reassigned and operations continue efficiently. (3) We will focus on incorporating priority approaches to handle urgent rescue tasks in path planning.