4.1. Dueling DQN-ALNS Algorithm Framework
To solve the scheduling problem of multi-base-station collaborative multi-UAV return guidance communication tasks in disaster environments, we propose a two-stage collaborative optimization framework (DRL-ALNS) based on deep reinforcement learning and adaptive large neighborhood search. Using the Dueling Deep Q-Network as the reinforcement learning component, we design an improved joint algorithm (Dueling DQN-ALNS) that combines the Dueling Deep Q-Network with an adaptive large neighborhood search algorithm tailored to this guidance task scheduling problem. The framework combines offline pre-training with online real-time optimization, leveraging the strategy-learning capability of deep reinforcement learning and the local search strength of adaptive large neighborhood search (ALNS). In the offline phase, the Dueling DQN is trained on historical scenario data covering multiple attributes of UAVs and communication base stations, such as initial positions, speeds, communication ranges, and time windows. From these data, the network learns optimal destruction–repair strategy combinations for different environments, and the corresponding experience tuples are stored in the experience replay buffer. By separating the value stream from the advantage stream, the Dueling DQN estimates Q-values more accurately, thereby improving the accuracy of destruction–repair strategy selection.
Specifically, the Dueling DQN network includes a feature extraction layer, value stream, advantage stream, and output layer. The training process aims to minimize the mean squared error between predicted Q-values and actual calculated Q-values. In the online phase, the pre-trained Dueling DQN parameter model is loaded, and network weights are frozen to ensure the stability of learned parameters. At this point, the state encoder perceives the environmental state of the current scheduling solution in real time, including base station load characteristics and the global optimization state, forming a high-dimensional state vector of the scheduling solution as input to the DQN network. The Dueling DQN Agent generates adaptive destruction and repair strategy combinations based on the current state, which are used to guide the destruction and repair operations of the ALNS algorithm. Within each appropriate rolling optimization window, the Dueling DQN-ALNS algorithm framework transfers the strategy knowledge learned offline to the online decision engine, thereby generating a globally optimal guidance communication scheduling solution that satisfies energy constraints and avoids time window conflicts. The structure of the Dueling DQN-ALNS algorithm framework is shown in
Figure 3 below.
The specific algorithms are detailed in Algorithms 1 and 2:
Algorithm 1 Offline training
Algorithm 2 Online optimization
In these two algorithms, the ALNS algorithm is used during the offline training phase to generate diverse experience samples, helping the Dueling DQN network learn the effects of different strategy combinations; during the online optimization phase, it is used to adjust and optimize scheduling solutions in real time, ensuring the real-time performance and adaptability of the algorithm. By combining ALNS and Dueling DQN, the algorithm can effectively handle complex task scheduling for UAV return guidance communication in rescue environments, generating high-quality communication scheduling solutions.
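To make the division of labor between the two phases concrete, the following Python sketch outlines the offline training loop and the online optimization loop. It is a minimal illustration under assumed interfaces: the DuelingDQNAgent-style methods (select_action, learn, replay_buffer, freeze), the ALNS helper with destroy/repair/accept operations, and the encode_state/evaluate functions are hypothetical stand-ins for the components detailed later in this section.

```python
# Minimal sketch of the two-phase Dueling DQN-ALNS loop described above.
# All classes and helper functions are hypothetical placeholders.

def offline_training(agent, alns, scenarios, episodes_per_scenario=50):
    """Offline phase: ALNS generates diverse experience samples for the agent."""
    for scenario in scenarios:
        solution = alns.build_initial_solution(scenario)
        for _ in range(episodes_per_scenario):
            state = encode_state(solution)
            action = agent.select_action(state)             # destruction-repair pair
            destroyed = alns.destroy(solution, action.destroy_op)
            candidate = alns.repair(destroyed, action.repair_op)
            reward = agent.reward(solution, candidate)
            next_state = encode_state(candidate)
            agent.replay_buffer.push(state, action, reward, next_state)
            agent.learn()                                    # minimize the Huber loss
            if alns.accept(candidate, solution):             # SA-RRT criterion
                solution = candidate

def online_optimization(agent, alns, scenario, budget):
    """Online phase: the pre-trained, frozen agent guides real-time ALNS search."""
    agent.freeze()                                           # weights are not updated
    solution = alns.build_initial_solution(scenario)
    best = solution
    for _ in range(budget):
        state = encode_state(solution)
        action = agent.select_action(state, greedy=True)
        candidate = alns.repair(alns.destroy(solution, action.destroy_op),
                                action.repair_op)
        if alns.accept(candidate, solution):
            solution = candidate
        if evaluate(candidate) > evaluate(best):
            best = candidate
    return best
```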
4.2. Improved ALNS Main Search Algorithm
The adaptive large neighborhood search (ALNS) algorithm is a heuristic method that enhances neighborhood search by incorporating measures for evaluating the effectiveness of destruction and repair operators. This enables the algorithm to adaptively select effective operators to destroy and repair solutions, thereby increasing the probability of obtaining better solutions. Building upon LNS, ALNS allows the use of multiple destroy and repair methods within the same search process to construct the neighborhood of the current solution. ALNS assigns a weight to each destroy and repair method, controlling their usage frequency during the search. Throughout the search process, ALNS dynamically adjusts the weights of the various destroy and repair methods to build better neighborhoods and obtain improved solutions. This paper designs an improved ALNS algorithm for UAV swarm return guidance communication scheduling, implemented through a four-phase optimization framework: First, construct a high-quality initial solution based on time urgency, task weights, and potential utility. Second, employ an intelligent strategy selection mechanism to dynamically match the optimal destruction–repair strategy combination for the current solution, precisely removing critical tasks during the targeted destruction phase and reconstructing scheduling solutions during the intelligent repair phase. Then, balance exploration and exploitation through hybrid solution acceptance criteria. Finally, dynamically adjust search weights based on the performance of the selected strategies. This algorithm innovatively integrates multi-factor utility models with complex constraint handling mechanisms, enabling collaborative optimization of communication utility, energy costs, failure risks, and network congestion while significantly enhancing solution efficiency, scheduling solution quality, and real-time decision-making capabilities.
4.2.1. Encoding Method
To better reflect local and global characteristics of scheduling solutions, the current communication task scheduling solution is converted into a state vector $s_t$. Let base station $j$ be described by its task count $n_j$, energy consumption utilization rate $e_j$, and time window utilization rate $w_j$, and let the global features be the current objective value $f$, the iteration progress ratio $p$, and the current temperature $T$. The state vector at the current stage is
$$s_t = \bigl[\,n_1, e_1, w_1, \ldots, n_M, e_M, w_M, f, p, T\,\bigr],$$
where $M$ is the number of mobile base stations, providing comprehensive state information for the deep reinforcement learning model. The overall state space $\mathcal{S}$ is the set of all such vectors. The state vector thus contains both base station features and global scheduling solution features.
The information represented by the constructed coding structure is shown in
Figure 4 below:
Here, $n_j$ denotes the number of tasks at base station $j$, $e_j$ represents its energy consumption utilization rate, $w_j$ indicates its time window utilization rate, $f$ is the current objective function value, $p$ is the iteration progress ratio, and $T$ is the current temperature parameter. By designing a state vector that incorporates both base station features and global features, comprehensive state information is provided for the Dueling DQN model. Base station features reflect the current load and resource usage of each base station, while global features provide macroscopic information about the entire scheduling solution. This encoding method not only preserves the core elements of the problem but also provides rich contextual information for the Dueling DQN model, enabling a more accurate assessment of the current solution state and thereby facilitating better strategy selection.
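As an illustration of this encoding, the snippet below assembles the state vector from per-station and global features. The Solution and station attributes (tasks, energy_used, energy_capacity, window_time_used, window_time_total, objective_value) are hypothetical names used only for this sketch.

```python
import numpy as np

# Illustrative encoding of the state vector described above; the Solution and
# BaseStation structures are hypothetical stand-ins for the paper's data model.

def encode_state(solution, iteration, max_iterations, temperature):
    """Concatenate per-base-station features with global solution features."""
    station_features = []
    for station in solution.base_stations:
        n_j = len(station.tasks)                                     # task count
        e_j = station.energy_used / station.energy_capacity          # energy utilization
        w_j = station.window_time_used / station.window_time_total   # time window utilization
        station_features.extend([n_j, e_j, w_j])
    global_features = [
        solution.objective_value,          # current objective value f
        iteration / max_iterations,        # iteration progress ratio p
        temperature,                       # current SA temperature T
    ]
    return np.asarray(station_features + global_features, dtype=np.float32)
```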
4.2.2. Initial Solution Construction
The quality of the initial solution in the ALNS algorithm has a direct impact on subsequent search processes and optimal results. A good initial solution can significantly accelerate the search optimization process. Given that UAV return guidance communication tasks in disaster environments possess multiple attributes, the initial solution construction adopts a greedy heuristic method. This method comprehensively considers factors such as time windows, task weights, and potential utility to assign appropriate communication base stations to each UAV, constructing a feasible and high-quality initial scheduling solution.
For each UAV $i$, a comprehensive score $S_i$ is calculated, reflecting its urgency and importance:
$$S_i = \alpha_1 T_i + \alpha_2 W_i + \alpha_3 U_i,$$
where $\alpha_1$, $\alpha_2$, and $\alpha_3$ are weighting coefficients. The time factor coefficient $\alpha_1$ reflects the importance of time urgency in the comprehensive score, the weight factor coefficient $\alpha_2$ reflects the importance of task weight, and the utility factor coefficient $\alpha_3$ reflects the importance of potential utility. $T_i$ represents the time urgency of UAV $i$, computed from its time window span $\Delta t_i$ (a tighter window yields higher urgency); $W_i$ represents the task weight; and $U_i$ represents the potential utility. By appropriately setting the weighting coefficients, the initial solution construction can better reflect the urgency and task priorities of the problem, providing a good starting point for subsequent optimization.
Sort UAVs in descending order based on their scores, prioritizing tasks of UAVs with higher scores. For each UAV in the sorted list, assign it to the base station that can provide the maximum utility while satisfying constraints, including time windows and energy limitations. After assigning each task, update the task list and energy consumption status of the corresponding base station, ensuring no time overlap between tasks, to obtain the final initial solution.
By comprehensively considering time windows, task weights, and potential utility, the initial solution’s construction can generate a high-quality starting point. This method considers not only the urgency of tasks (time windows) but also their importance and utility, ensuring the rationality of the initial solution under multi-factor tradeoffs. Compared to traditional random initialization or simple heuristic initialization, this multi-factor comprehensive evaluation-based initialization method can significantly improve the algorithm’s convergence speed and solution quality.
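A compact sketch of this greedy construction is given below. The score follows the weighted-sum form above; the alpha values, the utility() and is_feasible() helpers, and the UAV/station attributes are illustrative assumptions rather than the paper's exact implementation.

```python
import math

# Sketch of the greedy initial-solution construction under the scoring model
# above; weights, helpers, and attribute names are assumptions.

def build_initial_solution(uavs, stations, alpha=(0.4, 0.3, 0.3)):
    """Assign each UAV to the feasible base station with the highest utility."""
    a1, a2, a3 = alpha

    def score(uav):
        urgency = 1.0 / max(uav.window_end - uav.window_start, 1e-6)
        return a1 * urgency + a2 * uav.weight + a3 * uav.potential_utility

    assignment = {}
    for uav in sorted(uavs, key=score, reverse=True):          # high score first
        best_station, best_utility = None, -math.inf
        for station in stations:
            if station.is_feasible(uav):                       # time window + energy checks
                u = station.utility(uav)
                if u > best_utility:
                    best_station, best_utility = station, u
        if best_station is not None:
            best_station.assign(uav)                           # updates load and energy state
            assignment[uav.id] = best_station.id
    return assignment
```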
4.2.3. Destruction Strategies
In the adaptive large neighborhood search (ALNS) algorithm, the design of destruction strategies is crucial. Their core function is to selectively remove some tasks from the current solution, breaking out of local optima and creating opportunities to reoptimize task allocation during the repair phase. A reasonable destruction strategy perturbs the current solution in a controlled way without compromising feasibility, so that task reinsertion during the repair phase can improve solution quality. This cycle of destruction and reconstruction helps the algorithm explore the solution space widely and avoid entrapment in local optima. The five destruction strategies designed in this paper perturb the current solution from different perspectives, comprehensively exploring the solution space and enhancing the algorithm's global search capability and solution quality.
- A. Base-Station-Based Random Removal
This strategy randomly selects a subset of base stations and removes a portion of the tasks assigned to them. Let $B$ be the set of mobile communication base stations and $U$ be the set of UAVs. Given a removal ratio $\rho \in (0, 1)$, a random subset $B' \subseteq B$ is selected, with $|B'| = \lceil \rho |B| \rceil$. Let $V_b$ be the set of tasks at base station $b \in B'$. From each selected base station, $\lceil \rho |V_b| \rceil$ tasks are randomly removed. The set of tasks after the destruction is
$$V' = V \setminus V_{\mathrm{removed}},$$
where $V$ is the complete task set and $V_{\mathrm{removed}}$ collects the randomly removed tasks.
- B. Task-Based Random Removal
This strategy removes a random selection of tasks from the entire set, irrespective of their associated base stations. Let $V$ be the complete task set. Given a removal ratio $\rho$, the number of tasks to remove is $k = \lceil \rho |V| \rceil$. The $k$ tasks are then randomly selected and removed. The resulting task set is
$$V' = V \setminus V_{\mathrm{removed}}, \qquad |V_{\mathrm{removed}}| = k.$$
- C. Low-Benefit Removal
This strategy prioritizes removing the tasks contributing least to the objective function, freeing resources for potentially higher-benefit tasks. Let $V$ be the complete task set, and let each task $v \in V$ have a benefit value $g_v$, given by its contribution to the objective function. Given a removal ratio $\rho$, all tasks are sorted in ascending order of $g_v$, and the number of tasks to remove is $k = \lceil \rho |V| \rceil$. The $k$ tasks with the smallest $g_v$ values are removed. The resulting task set is
$$V' = V \setminus \operatorname*{argmin}_{k}\{g_v\},$$
where $\operatorname*{argmin}_{k}$ selects the $k$ tasks with the smallest $g_v$ values.
- D. Time-Critical Removal
This strategy prioritizes removing tasks with tight time windows, facilitating the reallocation of time resources and resolving scheduling conflicts. Let $V$ be the complete task set. The time criticality $c_v$ of task $v$ is defined as its time window length, $c_v = l_v - e_v$, where $[e_v, l_v]$ is the communication time window of task $v$. Given a removal ratio $\rho$, the number of tasks to remove is $k = \lceil \rho |V| \rceil$. The $k$ tasks with the smallest $c_v$ values (i.e., the tightest time windows) are removed. The resulting task set is
$$V' = V \setminus \operatorname*{argmin}_{k}\{c_v\},$$
where $\operatorname*{argmin}_{k}$ selects the $k$ tasks with the smallest $c_v$ values.
- E. High-Weight Removal
This strategy prioritizes removing high-weight tasks to trigger their rescheduling, aiming to discover better resource allocations. Let $V$ be the complete task set, and let $w_v$ denote the weight of task $v$ as defined in the scheduling model. Given a removal ratio $\rho$, the number of tasks to remove is $k = \lceil \rho |V| \rceil$. The $k$ tasks with the largest $w_v$ values are removed. The resulting task set is
$$V' = V \setminus \operatorname*{argmax}_{k}\{w_v\},$$
where $\operatorname*{argmax}_{k}$ selects the $k$ tasks with the largest $w_v$ values.
These five destruction strategies remove tasks from the current solution in distinct ways, creating modified solutions that serve as starting points for the repair phase to rebuild and optimize task assignments. The diversity and specificity of these strategies enable the enhanced ALNS algorithm to efficiently explore the solution space, improve solution quality, and strengthen algorithmic robustness. By intelligently selecting and combining these destruction strategies, the ALNS algorithm becomes highly effective for addressing the UAV return communication mission scheduling problem in disaster relief environments.
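For concreteness, the following sketch implements two of the five operators (task-based random removal and low-benefit removal); the Solution interface (all_tasks, unassign) and the per-task benefit attribute are assumed placeholders.

```python
import math
import random

# Illustrative implementations of two destruction operators described above.

def task_based_random_removal(solution, ratio=0.2, rng=random):
    """Remove a random fraction of tasks regardless of their base station."""
    tasks = solution.all_tasks()
    k = math.ceil(ratio * len(tasks))
    removed = rng.sample(tasks, k)
    for task in removed:
        solution.unassign(task)
    return removed

def low_benefit_removal(solution, ratio=0.2):
    """Remove the tasks that contribute least to the objective function."""
    tasks = sorted(solution.all_tasks(), key=lambda t: t.benefit)  # ascending g_v
    k = math.ceil(ratio * len(tasks))
    removed = tasks[:k]
    for task in removed:
        solution.unassign(task)
    return removed
```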
4.2.4. Repair Strategies
- A. Earliest Time Window Repair
This strategy assigns removed tasks to feasible base stations starting with the earliest available time window. Let $B$ be the set of mobile base stations and $U$ be the UAV swarm. The complete task set is $V$, with removed tasks $V_{\mathrm{removed}}$. For each task $v \in V_{\mathrm{removed}}$, its earliest feasible start time at base station $b$ is calculated as
$$t_{v,b} = \max\Bigl(e_v,\ \max_{v' \in V_b} t^{\mathrm{end}}_{v'}\Bigr),$$
where $V_b$ is the current task set at base station $b$, $e_v$ is the opening time of the task's communication window, and $t^{\mathrm{end}}_{v'}$ is the scheduled end time of task $v'$. Tasks are then reinserted into base stations in ascending order of $t_{v,b}$.
- B. Random Repair
This strategy randomly reinserts removed tasks by first selecting a task $v \in V_{\mathrm{removed}}$ uniformly at random and then assigning it to a randomly chosen base station $b \in B$. The stochastic nature of this approach promotes solution space exploration during the repair phase.
- C. Maximum-Benefit Repair
Tasks are reinserted to maximize benefit: for each removed task $v$, the benefit $g_{v,b}$ of assigning it to base station $b$ is calculated, and the task is assigned to the base station that yields the highest value. This greedy selection prioritizes high-impact task allocations.
- D. Weight-Priority Repair
This approach prioritizes high-weight tasks: the weight $w_v$ of each removed task is computed, and tasks are assigned, in order of decreasing weight, to the optimal base station $b^{*}$. The strategy focuses resources on tasks with elevated operational significance.
- E. Greedy Repair
Net value optimization is achieved through the evaluation metric $\phi_{v,b} = g_{v,b} - c_{v,b}$, representing benefit minus cost. Each removed task $v$ is assigned to the base station $b$ maximizing $\phi_{v,b}$, ensuring locally optimal resource utilization during reinsertion.
These five repair strategies implement distinct mathematical methodologies for task reallocation, collectively enhancing the algorithm’s adaptability and flexibility. Their synergistic operation enables effective optimization of UAV return communication scheduling in disaster relief scenarios through systematic reconstruction of solutions.
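The sketch below illustrates the greedy (net-value) repair operator; the benefit(), cost(), is_feasible(), and assign() helpers are assumptions standing in for the benefit and cost terms defined in the model.

```python
import math

# Sketch of the greedy (net-value) repair operator under assumed helpers.

def greedy_repair(solution, removed_tasks):
    """Reinsert each removed task at the base station with the highest net value."""
    for task in removed_tasks:
        best_station, best_net = None, -math.inf
        for station in solution.base_stations:
            if not station.is_feasible(task):          # time window + energy checks
                continue
            net = station.benefit(task) - station.cost(task)
            if net > best_net:
                best_station, best_net = station, net
        if best_station is not None:
            best_station.assign(task)
    return solution
```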
4.2.5. SA and RRT Acceptance Criteria
The ALNS algorithm requires a critical decision mechanism: whether to accept newly generated solutions to replace the current solution. This decision significantly impacts the algorithm’s exploration (searching broadly for new feasible solutions) and exploitation (intensively improving solutions in promising regions). An effective acceptance criterion balances these capabilities, preventing premature convergence to local optima while guiding the algorithm toward global optima. Traditional criteria (e.g., solely objective-based improvement) show limitations in complex scheduling environments. Thus, we designed a hybrid criterion integrating simulated annealing (SA) and Rapid Restoration Threshold (RRT).
The SA criterion, inspired by metallurgical annealing, probabilistically accepts inferior solutions. We implement geometric cooling, $T_{k+1} = \alpha T_k$, where $\alpha \in (0, 1)$ is a fixed cooling factor. With $\Delta f = f(s) - f(s')$ denoting the objective value difference between the current solution $s$ and the new solution $s'$ (the objective $f$ is to be maximized), the acceptance probability is calculated as
$$P_{\mathrm{SA}} = \begin{cases} 1, & \Delta f \le 0, \\ \exp\!\left(-\dfrac{\Delta f}{T_k}\right), & \Delta f > 0, \end{cases}$$
where $T_k$ represents the temperature parameter that decreases iteratively. Solutions with $\Delta f \le 0$ (superior or equal quality) are always accepted, and solutions with $\Delta f > 0$ (inferior quality) are accepted with probability $\exp(-\Delta f / T_k)$, which decreases as the temperature cools.
The RRT criterion prevents premature convergence by establishing a dynamic acceptance threshold that permits solutions with only marginal quality deterioration, balancing exploration and exploitation. The dynamic threshold is defined as follows:
$$f(s') \ge (1 - \delta)\, f(s_{\mathrm{best}}),$$
where $\delta$ controls the threshold strictness ($0 < \delta < 1$) and $s_{\mathrm{best}}$ is the best solution found so far. Solutions satisfying this threshold are candidate solutions. The combined SA-RRT acceptance condition accepts a new solution $s'$ if it passes either test:
$$\bigl(r < P_{\mathrm{SA}}\bigr) \ \vee\ \bigl(f(s') \ge (1 - \delta)\, f(s_{\mathrm{best}})\bigr),$$
where $r \sim U(0, 1)$ is a uniform random number. The SA and RRT criteria complement each other effectively in the hybrid mechanism. The SA criterion's stochastic nature allows the algorithm to explore new regions of the solution space, helping it escape local optima and ensuring a diverse search. Meanwhile, the RRT criterion's conservative approach, with its dynamic acceptance threshold, prevents the algorithm from deviating too far from promising solutions, thus maintaining a degree of exploitation to refine and improve upon better-than-average solutions. This balance between exploration and exploitation makes the combined criterion particularly effective for complex scheduling tasks such as UAV return communication task scheduling.
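A possible implementation of the hybrid acceptance test is sketched below, assuming a maximization objective and the disjunctive combination of the two criteria described above; the delta and alpha defaults are placeholders, not the paper's tuned values.

```python
import math
import random

# Hedged sketch of the hybrid SA-RRT acceptance test; parameter values are
# illustrative placeholders.

def accept(f_new, f_current, f_best, temperature, delta=0.05, rng=random):
    """Return True if the candidate solution should replace the current one."""
    # SA test: always accept improvements, otherwise accept with a probability
    # that decays as the temperature cools.
    diff = f_current - f_new                     # <= 0 means the candidate is better
    sa_ok = diff <= 0 or rng.random() < math.exp(-diff / max(temperature, 1e-9))
    # RRT test: accept candidates within a dynamic threshold of the best solution.
    rrt_ok = f_new >= (1.0 - delta) * f_best
    return sa_ok or rrt_ok

def cool(temperature, alpha=0.995):
    """Geometric cooling schedule: T <- alpha * T after each iteration."""
    return alpha * temperature
```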
4.2.6. Strategy Weight Update
The strategy weight update mechanism is a core innovation in Dueling DQN-ALNS, serving as the foundation for action selection in Dueling Deep Q-Networks. By dynamically adjusting destruction and repair strategy weights based on historical performance, this mechanism enables automatic selection of optimal strategy combinations. This adaptive capability significantly enhances search efficiency and solution quality in complex optimization environments.
The hybrid Dueling DQN-ALNS framework enables dynamic weight updates for strategy pairs. Each combined strategy $(d, r)$ consists of a destruction strategy $d$ and a repair strategy $r$, forming 25 unique combinations. The cooperative evaluation principle assesses strategy pairs jointly while updating their weights independently. The composite score is calculated as
$$\pi_{(d,r)} = \omega_1\, \Delta f - \omega_2\, \hat{E}_{\mathrm{viol}} + \omega_3\, \mathbb{1}_{\mathrm{tw}}.$$
Here, $\Delta f$ denotes the objective function improvement, $\hat{E}_{\mathrm{viol}}$ quantifies the (normalized) total energy constraint violation, and $\mathbb{1}_{\mathrm{tw}}$ indicates time window compliance. The weight coefficients are $\omega_1$, $\omega_2$, and $\omega_3$.
The weight coefficients $\omega_1$, $\omega_2$, and $\omega_3$ were chosen based on extensive experimental validation. These values reflect the relative importance of solution quality, energy constraint adherence, and time window compliance in our scheduling problem. The larger coefficient $\omega_1$ emphasizes improvement of the objective function value, which directly impacts the overall efficiency of UAV return communication task scheduling; $\omega_2$ ensures that energy constraints are closely monitored to prevent battery depletion in UAVs, while $\omega_3$ reinforces the critical nature of time window compliance to maintain schedule feasibility. This multi-objective scoring approach jointly evaluates solution quality, energy constraints, and temporal feasibility, with the energy violation term normalized to mitigate scale bias; the binary time window term reinforces time sensitivity.
- B. Scoring Mechanism for Strategies
The destruction strategy weights are updated as follows:
$$w_d \leftarrow (1 - \eta_d)\, w_d + \eta_d\, \bar{\pi}_d,$$
where $\bar{\pi}_d$ is the average composite score obtained by destruction strategy $d$ over a sliding window of recent iterations. Here, $w_d$ is the weight of strategy $d$, $\eta_d$ is the learning rate, and $\bar{\pi}_d$ is the average reward over a sliding window of size 100. The repair strategy weights $w_r$ are updated symmetrically.
- C. Dynamic Learning Rate Adjustment
While repair strategy updates mirror destruction updates, they use a separate sliding window $H_r$. To adapt to different task phases, the learning rates $\eta_d$ and $\eta_r$ are dynamically adjusted as functions of the normalized mission time $\tau \in [0, 1]$. A constraint-sensitive modifier further adjusts the rates during energy constraint violations.
- D. Strategy Diversity and Urgency Response
Diversity maintenance prevents strategy space collapse. When the spread of the destruction strategy weights narrows below a preset threshold, the following occurs:
Top-performing strategies: weights increased by 10% ($w_d \leftarrow 1.1\, w_d$);
Others: weights reduced by 5% ($w_d \leftarrow 0.95\, w_d$).
Symmetric rules apply for the repair strategies.
Time window urgency response prioritizes time-critical tasks by boosting faster strategies as communication windows close: the weights of strategy pairs are multiplicatively increased in proportion to the time urgency $u_d$ of destruction strategy $d$ and the execution speed $v_r$ of repair strategy $r$, scaled by an urgency coefficient.
This decoupled weight update mechanism enables independent optimization of destruction/repair strategies while maintaining dynamic balance, significantly enhancing robustness and adaptability for disaster relief UAV scheduling.
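The decoupled update can be sketched as follows. The exponential-smoothing form and the window size of 100 follow the description above, while the composite-score coefficients default to neutral placeholders because the paper's tuned values are not reproduced here; all identifiers are illustrative.

```python
from collections import deque

# Sketch of the decoupled strategy weight update for the 5 x 5 strategy pairs.

class StrategyWeights:
    def __init__(self, n_destroy=5, n_repair=5, lr=0.1, window=100,
                 omega=(1.0, 1.0, 1.0)):     # neutral placeholder coefficients
        self.w_destroy = [1.0] * n_destroy
        self.w_repair = [1.0] * n_repair
        self.lr_destroy = self.lr_repair = lr
        self.omega = omega
        self.hist_d = [deque(maxlen=window) for _ in range(n_destroy)]
        self.hist_r = [deque(maxlen=window) for _ in range(n_repair)]

    def composite_score(self, delta_f, energy_violation, tw_compliant):
        """Combine objective improvement, energy violation, and time window compliance."""
        w1, w2, w3 = self.omega
        return w1 * delta_f - w2 * energy_violation + w3 * (1.0 if tw_compliant else 0.0)

    def update(self, d, r, score):
        """Update destruction and repair weights independently from sliding-window averages."""
        self.hist_d[d].append(score)
        self.hist_r[r].append(score)
        avg_d = sum(self.hist_d[d]) / len(self.hist_d[d])
        avg_r = sum(self.hist_r[r]) / len(self.hist_r[r])
        self.w_destroy[d] = (1 - self.lr_destroy) * self.w_destroy[d] + self.lr_destroy * avg_d
        self.w_repair[r] = (1 - self.lr_repair) * self.w_repair[r] + self.lr_repair * avg_r
```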
4.3. Deep Reinforcement Learning
Reinforcement learning (RL) has gained significant attention in both academic research and industrial applications. While traditional RL algorithms demonstrate strong performance on certain complex problems, they often struggle to cope with the scale and uncertainty of specific practical scenarios. Meanwhile, deep learning has achieved notable success across domains due to its powerful feature extraction and pattern recognition capabilities. RL offers unique advantages for task scheduling through its dynamic policy adjustments based on state–action relationships. However, the high-dimensional and uncertain state spaces in UAV task scheduling pose challenges for traditional RL methods such as Q-learning, which rely on tabular Q-value storage. This approach becomes computationally inefficient and memory-intensive for large state spaces, leading to poor sample efficiency during training.
Deep reinforcement learning (DRL) addresses these limitations by integrating deep neural networks. DRL replaces tabular Q-functions with function approximators and incorporates experience replay, effectively mitigating sample sparsity issues. Combining RL's policy optimization strengths with deep learning's feature extraction capabilities, DRL excels at decision-making and scheduling in complex, high-dimensional environments. Among DRL algorithms, the Deep Q-Network (DQN) and its variants, such as Dueling DQN, are widely adopted. The key innovation of Dueling DQN is its decomposition of the Q-value function into separate value and advantage streams, which estimate the state value and action advantages, respectively. This architecture improves Q-value estimation accuracy. The network takes environmental states as inputs and outputs Q-value estimates for each action, providing robustness in continuous and uncertain state spaces. The Dueling DQN architecture used in this study is shown in
Figure 5.
4.3.1. State Design
To effectively integrate Dueling DQN with ALNS, the state representation is aligned with the ALNS encoding of Section 4.2.1. The state is the vector $s_t = [n_1, e_1, w_1, \ldots, n_M, e_M, w_M, f, p, T]$, comprising base station features and global mission characteristics.
4.3.2. Loss Function
We employ a modified Huber loss function as the optimization objective, designed for the complexities of UAV return communication tasks and the training stability requirements of disaster relief scenarios. This loss combines the benefits of the mean squared error (MSE) and mean absolute error (MAE): it behaves quadratically for small errors to stabilize gradients and linearly for large errors to prevent gradient explosion. The DQN loss function is defined as follows:
$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell_{\delta}\bigl(y_i - Q(s_i, a_i; \theta)\bigr), \qquad
\ell_{\delta}(x) = \begin{cases} \tfrac{1}{2} x^2, & |x| \le \delta, \\ \delta\bigl(|x| - \tfrac{1}{2}\delta\bigr), & |x| > \delta, \end{cases}$$
where $y_i$ denotes the target Q-value, calculated as
$$y_i = r_i + \gamma \max_{a'} Q\bigl(s_{i+1}, a'; \theta^{-}\bigr).$$
Here, $\theta$ represents the parameters of the online Q-network (including the feature extraction layers and the value and advantage stream weights); $N$ denotes the training batch size; $r_i$ is the immediate reward after taking action $a_i$ in state $s_i$; $\gamma$ is the discount factor balancing immediate and future rewards; $\theta^{-}$ denotes the parameters of the target Q-network, which stabilizes training; $\max_{a'} Q(s_{i+1}, a'; \theta^{-})$ estimates the maximum future Q-value; $Q(s_i, a_i; \theta)$ is the current network's Q-value prediction for the state–action pair $(s_i, a_i)$, representing the expected cumulative reward; and $\delta$ is a threshold parameter controlling the quadratic–linear transition.
When the predicted and target values are close, use the MSE term for stable gradient updates. For large differences, switch to the linear term to avoid gradient explosion. This enhances the loss function’s robustness and improves training stability and convergence speed.
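In practice, this modified Huber loss corresponds closely to the built-in Huber loss available in common deep learning frameworks; the PyTorch sketch below is one possible realization, with delta as the quadratic–linear threshold (the value shown is a placeholder).

```python
import torch

# One possible realization of the Huber loss above; delta is the
# quadratic-linear threshold parameter (value here is a placeholder).
huber = torch.nn.HuberLoss(delta=1.0)

def dqn_loss(q_pred, q_target):
    """Huber loss between predicted Q-values and detached target Q-values."""
    return huber(q_pred, q_target.detach())
```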
In Dueling DQN, the Q-value is decomposed into a value function $V(s)$ and an advantage function $A(s, a)$, combined as
$$Q(s, a; \theta) = V(s; \theta) + \Bigl(A(s, a; \theta) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta)\Bigr).$$
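A minimal PyTorch sketch of this dueling architecture is shown below; the hidden layer sizes are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch of a dueling Q-network with the value/advantage decomposition above.

class DuelingQNetwork(nn.Module):
    def __init__(self, state_dim, n_actions=25, hidden=128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value_stream = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                          nn.Linear(hidden, 1))
        self.advantage_stream = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                              nn.Linear(hidden, n_actions))

    def forward(self, state):
        x = self.feature(state)
        v = self.value_stream(x)                       # V(s)
        a = self.advantage_stream(x)                   # A(s, a)
        return v + a - a.mean(dim=-1, keepdim=True)    # Q(s, a)
```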
4.3.3. Action Space Design
The action space $\mathcal{A}$ consists of all possible combinations of destruction and repair strategies. Specifically, the destruction strategies include time-critical removal and low-benefit removal, among others, while the repair strategies include maximum-benefit repair and weight-priority repair, among others. Each action corresponds to selecting one destruction strategy (such as base-station-based random removal or high-weight removal) and one repair strategy (such as earliest time window repair or greedy repair). Each such combination forms an action $a = (d, r)$, creating a discrete decision-making action space of 25 strategy combinations, where $|\mathcal{A}| = 25$.
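The 25-combination action space can be represented as a simple index mapping, as sketched below; the strategy label strings are illustrative, not the paper's exact identifiers.

```python
from itertools import product

# Enumerate the 25 destruction-repair combinations as a discrete action space.
DESTROY_OPS = ["station_random", "task_random", "low_benefit", "time_critical", "high_weight"]
REPAIR_OPS = ["earliest_window", "random", "max_benefit", "weight_priority", "greedy"]

ACTIONS = list(product(range(len(DESTROY_OPS)), range(len(REPAIR_OPS))))  # |A| = 25

def decode_action(index):
    """Map a discrete action index (0-24) back to a (destroy, repair) pair."""
    d, r = ACTIONS[index]
    return DESTROY_OPS[d], REPAIR_OPS[r]
```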
4.3.4. Reward Function Design
The reward function is designed to balance exploration and exploitation, encouraging the agent to select high-quality strategies. It offers tiered rewards according to the quality of the newly generated solution: a high reward when the new solution's objective value exceeds the historical best, a medium reward when it surpasses the current solution, and a low reward when it does not exceed the current solution but is still acceptable. This tiered reward mechanism effectively guides the agent to balance exploration and exploitation.
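A sketch of this tiered reward is given below; the numeric reward levels are placeholders, since the paper's exact values are not reproduced here.

```python
# Tiered reward sketch following the description above; the reward magnitudes
# are illustrative placeholders.

def compute_reward(f_new, f_current, f_best, accepted):
    if f_new > f_best:        # surpasses the historical best
        return 10.0
    if f_new > f_current:     # improves on the current solution
        return 5.0
    if accepted:              # inferior but accepted by the SA-RRT criterion
        return 1.0
    return 0.0
```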
4.3.5. Action Selection Policy
The agent uses an $\varepsilon$-greedy strategy to balance exploration and exploitation. The action selection rule is
$$a_t = \begin{cases} \text{a random action from } \mathcal{A}, & \text{with probability } \varepsilon, \\ \arg\max_{a} Q(s_t, a; \theta), & \text{with probability } 1 - \varepsilon. \end{cases}$$
Here, $\pi(a \mid s_t)$ is the probability of choosing action $a$ in state $s_t$, $\varepsilon$ (the exploration rate) starts at 1.0 and decays to 0.01 over the iterations, and $Q(s_t, a; \theta)$ represents the expected cumulative reward for choosing action $a$ in state $s_t$, as determined by the Q-network parameters $\theta$. This strategy ensures thorough exploration early in learning and focuses on exploitation later, effectively balancing exploration and exploitation.
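The selection rule can be implemented as follows; the decay schedule mirrors the stated range (1.0 down to 0.01), but the multiplicative decay rate is an assumption.

```python
import random
import torch

# epsilon-greedy action selection over the 25 strategy combinations.

def select_action(q_network, state, epsilon, n_actions=25, rng=random):
    if rng.random() < epsilon:
        return rng.randrange(n_actions)                 # explore
    with torch.no_grad():
        q_values = q_network(torch.as_tensor(state).float().unsqueeze(0))
        return int(q_values.argmax(dim=-1).item())      # exploit

def decay_epsilon(epsilon, decay=0.995, eps_min=0.01):
    """Decay the exploration rate from 1.0 toward 0.01; the rate is an assumption."""
    return max(eps_min, epsilon * decay)
```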
4.3.6. DQN Agent Optimization Process
The DQN Agent continuously learns and optimizes strategies through interaction with the environment, utilizing a Dueling DQN network to improve the accuracy of Q-value estimation. Initially, the Dueling DQN network and its target network are randomly initialized. During interaction with the environment, the DQN Agent gathers experiences, consisting of states, actions, rewards, and next states, and stores them in a replay buffer. Periodically, it samples minibatches of experiences from the buffer to compute the loss function and updates the network weights via backpropagation. Every fixed number of steps, the weights of the online network are copied to the target network to maintain stability. Q-values are produced by the Dueling DQN network, which outputs a Q-value estimate for each action. The specific process is shown in
Figure 6.
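The following sketch summarizes one learning step of this process (minibatch sampling, Huber loss against the frozen target network, and periodic target synchronization); the batch size, discount factor, and the assumption that the replay buffer stores (state, action, reward, next_state) tuples are illustrative choices, not the paper's exact settings.

```python
import random
import numpy as np
import torch
import torch.nn.functional as F

# Sketch of one DQN learning step and the target-network synchronization.

def learn_step(online_net, target_net, optimizer, replay_buffer,
               batch_size=64, gamma=0.99):
    """Update the online network from a random minibatch of stored experiences."""
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)
    states, actions, rewards, next_states = zip(*batch)
    states = torch.as_tensor(np.asarray(states), dtype=torch.float32)
    next_states = torch.as_tensor(np.asarray(next_states), dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)

    # Predicted Q-values of the taken actions and bootstrapped targets.
    q_pred = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * q_next

    loss = F.huber_loss(q_pred, q_target)      # the modified Huber objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target(online_net, target_net):
    """Copy the online network weights to the target network at fixed intervals."""
    target_net.load_state_dict(online_net.state_dict())
```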