Article

An Adaptive Scheduling Algorithm Integrating Hierarchical Reinforcement Learning and Semi-Markov Decision Processes

1 Key Laboratory of Grain Information Processing and Control, Ministry of Education, Henan University of Technology, Zhengzhou 450001, China
2 Key Laboratory of Intelligent Perception and Decision Making of Grain Storage Information, Henan University of Technology, Zhengzhou 450001, China
3 School of Information Science and Engineering, Henan University of Technology, Zhengzhou 450001, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(13), 6570; https://doi.org/10.3390/app16136570
Submission received: 1 June 2026 / Revised: 24 June 2026 / Accepted: 26 June 2026 / Published: 1 July 2026

Abstract

Coordinating multiple unmanned aerial vehicle (UAV) systems under strict energy and temporal constraints remains a complex scheduling problem. Existing reinforcement learning methods typically rely on fixed-time-step modeling, which struggles to accommodate flight actions of varying durations and often leads to temporal mismatches between task planning and physical execution. To address this limitation, we propose an Adaptive Hierarchical Semi-Markov Decision Process (AH-SMDP) framework. This architecture decouples task allocation from execution by modeling variable-length actions via an SMDP. An event-driven synchronization mechanism is introduced to align the swarm’s decision-making rhythm with actual task completion times. Additionally, a state-aware reward formulation and a dynamic action space pruning strategy are designed to help UAVs balance energy efficiency with deadline compliance. Simulation results in multi-constraint environments demonstrate that the AH-SMDP framework effectively improves scheduling performance compared to standard MAPPO and PPO algorithms. Under the evaluated experimental settings, the proposed method yields improvements of approximately 30% in average task completion rate, 40% in energy reduction, and 60% in convergence stability. Ablation studies further suggest that this integrated framework offers a viable and effective approach for multi-UAV scheduling.

1. Introduction

Systems comprising multiple unmanned aerial vehicles (UAVs) are increasingly utilized in applications such as smart logistics, emergency response, and environmental monitoring. While single UAVs are limited in operational scope, cooperative multi-UAV swarms can expand coverage and enhance execution efficiency [1,2]. However, in practical deployments, these systems must often manage complex task sequences under limited battery capacities and strict temporal constraints. This operational reality necessitates a trade-off between the collective performance of the swarm and the energy endurance of individual drones. Consequently, developing effective cooperative decision-making strategies under such multiple constraints remains a significant challenge in multi-agent system research [3].
In this study, the “scheduling” problem is defined as the dynamic allocation of spatially distributed macro-tasks to individual UAVs, coupled with the execution of micro-actions (e.g., variable-duration flight maneuvers and charging behaviors). The primary purpose of this scheduling is to maximize the swarm’s global mission completion rate within specified temporal constraints, while autonomously minimizing total energy consumption and avoiding individual battery depletion.
Despite these defined objectives, achieving effective scheduling in dynamic environments highlights several limitations in existing methodologies. To address these challenges, this study focuses on the following three specific research problems (RPs):
1. RP1 (Temporal Modeling Mismatch): How can a modeling framework be established to integrate high-level discrete task allocation with low-level, variable-duration physical flight execution to mitigate temporal misalignment?
2. RP2 (Asynchronous Swarm Execution): How can the decision-making process be synchronized across the swarm to maintain cooperation when individual UAVs complete their assigned actions at varying times?
3. RP3 (Dynamic Constraint Adaptation): How can the scheduling policy balance competing objectives (e.g., task timeliness versus energy conservation) given dynamic drone states and operational boundaries?
From a modeling perspective, tasks and flight actions in multi-UAV scheduling inherently require variable execution times. As a result, traditional fixed-time-step models often struggle to accurately represent the physical execution process. Relying on a single timescale can lead to a temporal mismatch between high-level planning and low-level control. A viable approach to address this limitation is to decouple the decision-making process into a hierarchical structure: an upper layer for macro-task allocation and a lower layer for micro-action execution, thereby establishing a dual-timescale framework. During mission execution, individual UAV states—such as remaining battery capacity, task progress, and potential delays—evolve dynamically and affect subsequent decisions. The interactions among these variable-length actions further increase environmental non-stationarity. Additionally, competing objectives, including task completion, energy conservation, and temporal constraints, complicate the application of standard unified optimization methods in these scenarios [4,5].
Current research addresses these challenges primarily through heuristic optimization and multi-agent reinforcement learning (MARL). While heuristic methods can yield near-optimal solutions in static or relatively stable environments, their computational overhead often limits their applicability in real-time operations. MARL approaches, conversely, offer enhanced adaptability through continuous learning. However, standard MARL frameworks generally rely on Markov Decision Processes (MDPs) with fixed time-steps. This formulation can struggle to accurately represent actions of varying durations. Consequently, this approach may lead to a temporal gap between high-level planning and physical execution, potentially introducing instability during multi-agent cooperation. Furthermore, many existing methods employ static reward functions. These fixed mechanisms typically lack the flexibility to accommodate dynamic state changes, which can restrict performance in tightly constrained environments.
To address these limitations, this paper investigates the multi-UAV scheduling problem using temporal modeling and adaptive decision-making. By incorporating a Semi-Markov Decision Process (SMDP), variable-duration actions are modeled and integrated into a hierarchical architecture. This structure decouples high-level task allocation from low-level flight execution. Instead of traditional fixed-interval updates, the framework utilizes an event-driven synchronization mechanism. A new global decision step is triggered when UAVs complete their assigned operations, thereby alleviating the temporal mismatch caused by asynchronous execution. Furthermore, a state-aware regulation mechanism is designed that combines dynamic reward weighting with action-space pruning. This approach enables the system to dynamically balance energy consumption and temporal constraints. Together, these components constitute the Adaptive Hierarchical Semi-Markov Decision Process (AH-SMDP) framework for swarm scheduling in tightly constrained environments.
The primary contributions of this paper are summarized as follows:
1. Dual-Timescale Modeling for Variable-Duration Actions: A hierarchical SMDP framework is developed to mitigate the limitations of fixed time-step models. By decoupling task allocation from physical execution, this approach facilitates cooperative optimization across two distinct timescales, aligning more closely with actual flight dynamics.
2. Event-Driven Temporal Alignment: An event-triggered synchronization mechanism is designed to accommodate the asynchronous execution of multiple UAVs. By utilizing task completion as the trigger for global updates, this mechanism coordinates the decision-making steps of the swarm. This approach helps mitigate the environmental non-stationarity typically introduced by uncoordinated actions, thereby promoting stable cooperation.
3. State-Aware Adaptive Regulation: To manage the trade-off between temporal constraints and energy consumption, an adaptive regulation mechanism is introduced. By combining dynamic reward weighting with an action space pruning strategy, the policy adjusts its optimization focus across different mission phases. This mechanism improves overall scheduling efficiency while operating within defined safety boundaries.

2. Related Work

Existing research on multi-UAV cooperative scheduling can be broadly categorized into heuristic optimization, multi-agent reinforcement learning (MARL), and hierarchical decision-making with temporal modeling. This section reviews these methodologies, with a particular focus on their applicability and limitations in dynamic, multi-constraint environments.

2.1. Traditional Resource Scheduling and Heuristic Optimization

Early efforts in multi-UAV task allocation frequently relied on swarm intelligence and heuristic search techniques. To address temporal constraints, Yan et al. [6] introduced an enhanced particle swarm optimization (PSO) algorithm that incorporates a golden-sine strategy to balance exploration and exploitation. While this approach can be effective for smaller configurations, the computational overhead of such heuristic methods often scales poorly with the number of agents, limiting their applicability for real-time scheduling in large-scale deployments.
Distributed auction mechanisms have been proposed to enhance system scalability. For instance, Gao et al. [7] utilized a conditional probability-based method to resolve assignment conflicts via a distributed consensus protocol. Building on auction frameworks, Ye et al. [8] adapted the consensus-based bundle algorithm (CBBA-TCC) to manage temporal coupling constraints and mitigate deadlocks in heterogeneous clusters. Despite these structural improvements, auction-based and heuristic planners typically assume static or quasi-static conditions. In highly uncertain environments where execution times fluctuate or disturbances occur, the requirement for continuous re-planning can introduce substantial communication and computational overhead. Consequently, reliance on static assumptions or predefined planning mechanisms can restrict the adaptability of these methods in highly dynamic, tightly constrained scenarios.
Beyond heuristic search and distributed auctions, resource scheduling has been extensively studied through analytical allocation schemes. For communication networks, Tu [9] proposed a non-learning-based scheme for efficient resource utilization in multi-flow wireless multicasting transmissions. By theoretically analyzing flow scheduling and channel aggregation policies, this method aims to maximize concurrent transmissions based on dynamic channel states. Recent studies have advanced UAV resource management from multiple perspectives. For instance, Zeng and Zhang [10] employed continuous trajectory optimization to reduce energy consumption during UAV communications. Additionally, Tu [11] proposed resource-efficient seamless transitions to sustain performance in dynamic multi-hop UAV multicasting. Generally, these classical analytical approaches rely on rigorous mathematical programming (such as convex optimization, integer linear programming, or Lyapunov optimization) to derive optimal schedules under the assumption of known or quasi-static environmental parameters. While they primarily address communication routing and continuous kinematic planning, the proposed AH-SMDP framework extends optimization to the discrete-continuous hybrid scheduling of multi-task swarm operations.
However, applying these conventional analytical paradigms directly to multi-UAV cooperative operations can introduce computational bottlenecks, thereby motivating the adoption of reinforcement learning (RL). Traditional mathematical models typically rely on explicit state-transition models and may struggle to accommodate highly dynamic, non-stationary environmental feedback in real time. Specifically, in scenarios involving variable-duration flight maneuvers and unpredictable energy degradation, recalculating exact analytical solutions at every time step can impose a prohibitive computational burden. RL offers a viable paradigm to address this intractability. By shifting the computational overhead of solving complex constraints to an offline training phase, RL enables UAVs to execute rapid, adaptive scheduling decisions online, significantly reducing the reliance on static environmental assumptions and predefined mathematical bounds.

2.2. Multi-Agent Reinforcement Learning Cooperative Methods

In recent years, multi-agent reinforcement learning (MARL) utilizing the centralized training with decentralized execution (CTDE) paradigm has emerged as a widely adopted approach for dynamic scheduling. To address the credit assignment challenge in cooperative settings, Rashid et al. [12] developed the QMIX framework, which enforces global monotonicity via non-linear value decomposition. Extending these concepts to UAV operations, Yang et al. [13] formulated the cooperative problem as a Dec-POMDP and applied an adapted MAPPO algorithm to navigate partially observable edge-computing environments.
While MARL offers considerable adaptability, standard formulations typically rely on Markov Decision Processes (MDPs) structured around fixed time-steps. This conventional modeling assumes uniform action durations, which can struggle to accommodate the varying temporal requirements of actual flight maneuvers—ranging from rapid obstacle avoidance to extended cruising. When applied to complex task sequences, such temporal simplifications can lead to a misalignment between high-level planning and physical execution. This discrepancy can exacerbate environmental non-stationarity during asynchronous multi-agent interactions, potentially hindering policy convergence and overall cooperative stability.

2.3. Hierarchical Decision-Making and Semi-Markov Temporal Modeling

Temporal abstraction offers a pathway to address the limitations of fixed-step modeling. Sutton et al. [14] established the foundational options framework to formally represent variable-duration actions. Extending this concept to UAV navigation, Hurst et al. [15] utilized the Semi-Markov Decision Process (SMDP) to link state transitions directly to physical time costs, effectively mitigating timescale distortion. Adopting a structural approach, Liu et al. [16] developed an option-based hierarchical deep reinforcement learning (OHDRL) strategy that decouples micro-level flight control from high-level scheduling, thereby improving system energy management. Despite these modeling advancements, coordinating asynchronous multi-agent execution remains challenging. Existing hierarchical frameworks often lack an event-triggered alignment mechanism, complicating state synchronization across a UAV swarm. Furthermore, these methods typically rely on static reward formulations. This rigid design can limit the system’s ability to dynamically adapt when agents encounter critical operational boundaries, such as depleted battery capacity or strict temporal constraints.

2.4. Gap Analysis

In summary, existing studies on multi-UAV cooperative scheduling generally exhibit the following limitations:
1. Timescale Integration: Many current models lack a unified framework to concurrently process macro-level task assignment and micro-level flight execution across different timescales.
2. Execution Misalignment: The absence of synchronization mechanisms for variable-duration actions can lead to temporal mismatches between high-level planning and physical execution.
3. Rigid Reward Structures: Reliance on static reward formulations restricts the system’s ability to dynamically balance temporal constraints and energy consumption in fluctuating environments.
To address these limitations, this study proposes the Adaptive Hierarchical Semi-Markov Decision Process (AH-SMDP). By integrating an event-triggered synchronization mechanism and a state-aware reward strategy, this framework is designed to improve multi-UAV scheduling performance under defined energy and temporal constraints.

3. Proposed Method

3.1. Methodology Overview

To achieve the primary goal of maximizing the global mission completion rate while autonomously minimizing energy consumption under defined operational and temporal constraints, we propose the Adaptive Hierarchical Semi-Markov Decision Process (AH-SMDP) framework. Prior to presenting the mathematical formulations, this subsection provides a high-level overview of how the designed components interconnect to form a unified optimization process.
As illustrated in Figure 1, the AH-SMDP framework adopts a “macro-decision and micro-execution” architecture. It is composed of four core components designed to address the research problems (RPs) outlined in the Introduction:
1. Macro-Level Task Allocation (Addressing RP1): Depicted in the upper section of Figure 1, this layer acts as the central coordinator. It utilizes a QMIX network to process the global swarm state and assign macro-tasks to individual UAVs at discrete decision steps.
2. Micro-Level Variable-Duration Execution (Addressing RP1): Corresponding to the lower execution block in Figure 1, this layer has individual UAVs execute local flight actions upon receiving their task assignments. By modeling this underlying execution as an SMDP, the framework represents the variable-time costs associated with physical flight maneuvers (illustrated by $\tau_1$, $\tau_2$, $\tau_3$).
3. Event-Driven Temporal Alignment Mechanism (Addressing RP2): Illustrated by the dashed feedback loop on the right side of Figure 1, this component serves as the temporal bridge between the macro and micro layers. Rather than updating at fixed intervals, it utilizes task completion events as triggers. It pauses the macro-level global update until the UAV with the longest execution time completes its assigned action ($\Delta T_K = \max(t_i)$), thereby coordinating the swarm’s decision steps and mitigating asynchronous mismatches.
4. State-Aware Adaptive Regulation & Action Pruning (Addressing RP3): Conceptually embedded within the micro-level action execution shown at the bottom of Figure 1, this mechanism operates to maintain operational safety boundaries. It monitors each UAV’s remaining battery capacity and temporal constraints, adaptively adjusting reward weights and masking high-risk flight paths to reduce the likelihood of individual battery depletion during execution.
Initially, the macro-level allocates tasks via a top-down mapping operator. UAVs subsequently execute these assignments at the micro-level over variable durations, operating within the safety parameters maintained by the adaptive regulation mechanism. Upon reaching the event-driven temporal boundary, the actual time and energy costs incurred at the micro-level are aggregated into a cumulative return via a bottom-up feedback operator. This return is then used to update the global macro policy. Together, these interconnected components facilitate a coherent transition from high-level planning to low-level execution.

3.2. Hierarchical SMDP-Based Dual-Timescale Framework Modeling

At the macro-level, the system utilizes the global state $S_K$ (including UAV positions, remaining battery capacities, task specifications, and temporal constraints) to generate a joint action $A_K$ via a QMIX network, facilitating multi-UAV task assignment. This layer operates in discrete decision steps, primarily handling global coordination and resource allocation.
At the micro-level, each UAV receives its allocation and independently executes the corresponding task within its local environment. Unlike traditional approaches based on fixed time-steps, the execution process is formulated as an SMDP to accommodate variable-duration actions. This representation aligns more closely with the temporal dynamics of physical flight and control processes.
To manage hierarchical interactions, the framework incorporates a bidirectional information exchange mechanism. The macro-level translates global task assignments into local objectives for individual UAVs via a downward mapping operator ($M_{\mathrm{down}}$). Concurrently, the cumulative rewards generated during micro-level execution are aggregated and fed back to the macro-level via an upward feedback operator ($M_{\mathrm{up}}$) to support subsequent policy updates.
Furthermore, the framework operates under the centralized training with decentralized execution (CTDE) paradigm: the global state is accessible during the training phase, whereas individual UAVs rely on local observations for independent decision-making during the execution phase. It is assumed that the system can access critical status updates, such as remaining battery capacity and task progression, from onboard sensors.

3.2.1. Variable Definitions

To support reproducibility and state-space verification, Table 1 summarizes the formal definitions of the variables across different system layers. Specifically, $\tau_{i,K}$ denotes the action duration, representing the number of physical time-steps elapsed during execution under discrete-time modeling. Consequently, the actual physical time cost is calculated as $\tau_{i,K} \cdot \Delta t$, where $\Delta t$ represents the single-step control period. The composite action $u_{i,t} = \{\theta_{i,t}, v_{i,t}, m_{i,t}\}$ incorporates continuous control variables (e.g., heading angle $\theta$ and flight velocity $v$) alongside discrete mode variables (e.g., hovering or charging status $m$).
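As a minimal sketch of these definitions, the composite action and the physical time cost might be represented as follows. The class names, field names, mode values, and the numeric inputs are illustrative assumptions and are not taken from Table 1.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    """Discrete mode variable m (member names are illustrative assumptions)."""
    FLY = 0
    HOVER = 1
    CHARGE = 2


@dataclass(frozen=True)
class CompositeAction:
    """Composite micro-action u = {theta, v, m}."""
    heading_rad: float  # continuous heading angle theta
    speed_mps: float    # continuous flight velocity v
    mode: Mode          # discrete mode variable m


def physical_time_cost(tau_steps: int, dt_seconds: float) -> float:
    """Physical duration tau * dt of an action lasting tau control steps."""
    if tau_steps < 0 or dt_seconds <= 0:
        raise ValueError("tau_steps must be >= 0 and dt_seconds must be > 0")
    return tau_steps * dt_seconds


# Hypothetical values: an action lasting 12 control steps of 0.5 s each.
u = CompositeAction(heading_rad=0.5, speed_mps=8.0, mode=Mode.FLY)
cost = physical_time_cost(tau_steps=12, dt_seconds=0.5)  # 6.0 seconds
```

Keeping the duration as an integer step count and converting to seconds only when needed mirrors the discrete-time modeling described above.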

3.2.2. Mathematical Definition of Cross-Layer Interaction Mechanisms

To manage information exchange across hierarchical layers, this framework defines top-down and bottom-up mapping operators.
Top-Down Mapping Operator ($M_{\mathrm{down}}$): The macro-level action $A_K$ is mapped to local target features for the underlying SMDP via an objective mapping function. The initial state $s_{i,0}$ of micro-agent $i$ is formulated as follows:
$$s_{i,0} = \mathrm{proj}_i(S_K) \oplus \Phi(a_{i,K})$$
where $\mathrm{proj}_i(S_K)$ denotes the localized information extracted from the global state for agent $i$, and $\Phi(a_{i,K})$ represents the intention feature vector. This operator translates macro-level task assignments into micro-level control objectives, guiding the low-level policy during local execution.
Bottom-Up Feedback Operator ($M_{\mathrm{up}}$): The actual execution time $\tau_{i,K}$ and energy consumption incurred at the lower level are aggregated to compute the cumulative return for the macro-level. The cumulative return $R_{i,K}$ acquired by UAV $i$ during the execution of macro-action $a_{i,K}$ is calculated as follows:
$$R_{i,K} = \sum_{t=0}^{\tau_{i,K}-1} \gamma^{t}\, r_{\mathrm{lower}}(s_{i,t}, u_{i,t}) + \gamma^{\tau_{i,K}}\, P(s_{i,\tau_{i,K}})$$
where $r_{\mathrm{lower}}$ is a state-aware multi-objective immediate feedback function that evaluates factors such as micro-level energy consumption, temporal constraints, and safety boundary penalties (detailed in Section 3.5). This return serves as an individual value input for computing the global value function $Q_{\mathrm{tot}}$, which is modeled by the QMIX network (detailed in Section 3.3). Consequently, this operator quantifies micro-level execution outcomes into macro-level returns, establishing a closed-loop temporal optimization process.
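The bottom-up aggregation can be illustrated with a short sketch that accumulates discounted step rewards and then adds the discounted terminal term. The function name and the numeric inputs are hypothetical. The per-step rewards stand in for $r_{\mathrm{lower}}$ and the terminal value stands in for $P(s_{i,\tau_{i,K}})$.

```python
from typing import Sequence


def cumulative_return(
    step_rewards: Sequence[float],
    terminal_value: float,
    gamma: float,
) -> float:
    """Discounted return of one macro-action lasting len(step_rewards) steps."""
    total = 0.0
    discount = 1.0  # gamma ** t, updated incrementally
    for reward in step_rewards:
        total += discount * reward
        discount *= gamma
    # After the loop, discount equals gamma ** tau, which weights the terminal term.
    return total + discount * terminal_value


# Hypothetical episode: three steps of unit cost, then a terminal reward of 10.
R = cumulative_return([-1.0, -1.0, -1.0], terminal_value=10.0, gamma=0.5)
# -1 - 0.5 - 0.25 + 0.125 * 10 = -0.5
```

Because the terminal term is weighted by $\gamma^{\tau_{i,K}}$, a longer execution shrinks the contribution of the same terminal reward.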

3.2.3. Event-Driven Temporal Alignment Mechanism

Because the actual physical duration $\tau_{i,K}$ of actions at the micro-execution layer varies across UAVs, we introduce an event-triggered mechanism to facilitate multi-agent state synchronization [17]. The macro-level step size $\Delta T_K$ is dynamically anchored to the UAV requiring the longest execution time in the current decision round, defined as $\Delta T_K = \max_i(\tau_{i,K})$. An update of the global state $S_{K+1}$ is triggered upon the completion of this most time-consuming action. This synchronization mechanism helps align the swarm at a unified temporal boundary, thereby mitigating the environmental non-stationarity and state inconsistencies typically introduced by asynchronous execution.
Under these definitions, the macro-level state transition can be formalized as a joint transition probability distribution $P(S_{K+1}, \Delta T_K \mid S_K, A_K)$. Jointly determined by the underlying physical environmental dynamics and the micro-level control policy, this distribution characterizes the state evolution of macro-decisions given variable-duration actions. Because this transition depends explicitly on the action duration in addition to the current state and action, it satisfies the mathematical definition of a Semi-Markov Decision Process (the proof is provided in Appendix A).
For UAVs that complete their tasks prior to $\Delta T_K$, the lower level triggers a hovering or standby command to maintain operational positioning. The energy and time costs incurred during this standby state are evaluated via the immediate reward function $r_{\mathrm{lower}}$ and are subsequently incorporated as penalty terms within the global policy’s optimization process. If the execution time of a single UAV exceeds the maximum tolerance threshold $T_{\max}$ (i.e., $t - t_K > T_{\max}$), the system interrupts the current round and initiates task reallocation. Thus, the proposed modeling framework establishes a coherent cross-layer optimization process across both temporal and structural dimensions. Building upon this foundation, the subsequent section details the multi-agent value decomposition network at the macro-level.
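The alignment rule of this subsection can be sketched as a small helper that anchors the macro step to the slowest UAV, computes how long each early finisher stands by, and flags durations beyond the tolerance threshold. The names `MacroStep` and `align_macro_step` are hypothetical. Capping an overrunning UAV at the threshold is an assumption about how the interruption is realized, and the reallocation itself is not modeled here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MacroStep:
    delta_t: int                   # macro step size Delta T_K, in control steps
    standby_steps: Dict[int, int]  # hover/standby steps per UAV id
    timed_out: List[int]           # UAV ids whose duration exceeded t_max


def align_macro_step(durations: Dict[int, int], t_max: int) -> MacroStep:
    """Anchor the macro step to the slowest UAV of the current round."""
    if not durations:
        raise ValueError("durations must contain at least one UAV")
    # Assumption: a UAV that exceeds t_max is interrupted at t_max.
    capped = {uav: min(tau, t_max) for uav, tau in durations.items()}
    delta_t = max(capped.values())
    # Early finishers hover until the shared temporal boundary is reached.
    standby = {uav: delta_t - tau for uav, tau in capped.items()}
    # "Exceeds" is read as a strict comparison, following the text above.
    timed_out = sorted(uav for uav, tau in durations.items() if tau > t_max)
    return MacroStep(delta_t, standby, timed_out)


# Hypothetical rounds: one regular, one with an overrunning UAV.
normal = align_macro_step({0: 4, 1: 9, 2: 6}, t_max=20)
overrun = align_macro_step({0: 4, 1: 25}, t_max=20)
```

In the regular round the macro step is 9 control steps and UAVs 0 and 2 stand by for 5 and 3 steps. In the second round UAV 1 is flagged for reallocation.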

3.3. Macro-Level: Global Cooperative Decision-Making Based on Monotonic Value Decomposition

Following the formulation of the hierarchical framework, the macro-allocation layer determines the optimal task assignment policy within the joint action space by leveraging the cumulative returns $R_{i,K}$ provided by the micro-execution layer. Given that the dimensionality of the joint action space scales exponentially with the number of UAVs, we employ the QMIX value decomposition architecture, which operates under the centralized training with decentralized execution (CTDE) paradigm [18]. The topological structure of this architecture is illustrated in Figure 2.

3.3.1. Local Utility and Global Value Decomposition

During the decentralized execution phase, each UAV $i$ maintains an independent local value network. Because local observations provide only partial information at each time step and cannot fully capture task execution progress or historical state dependencies, this paper integrates a Gated Recurrent Unit (GRU) to temporally encode the sequence of local observations. This process approximates the hidden state $h_{i,K}$, thereby enhancing the stability of local value estimation in partially observable environments. Based on $h_{i,K}$, the network outputs the individual utility value $Q_i(h_{i,K}, a_{i,K})$ for a given macro-instruction $a_{i,K}$.
To achieve global cooperation during the offline training phase, the system integrates these local utility values through a non-linear mixing network to approximate the global joint value function $Q_{\mathrm{tot}}(S_K, A_K)$. To ensure that the combination of greedy actions from individual agents during decentralized execution is equivalent to the global optimal action, the mixing network must satisfy the structural monotonicity constraint, defined as:
$$\frac{\partial Q_{\mathrm{tot}}}{\partial Q_i} \geq 0, \quad \forall i \in \{1, 2, \dots, N\}$$
This constraint theoretically ensures that the global optimal action can be derived by maximizing the local utility functions of all agents independently. To implement this condition, the weights of the mixing network are generated by a hypernetwork conditioned on the global state $S_K$. Furthermore, non-negative constraints (such as an absolute value or softplus transformation) are applied to the mixing network weights to maintain the monotonicity requirement.
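A small, dependency-free sketch can make the monotonicity construction concrete. It assumes a two-layer mixer with an ELU hidden activation, linear hypernetworks, and randomly initialized parameters. These choices, the class name, and the layer sizes are illustrative assumptions and do not reproduce the paper's network or training code.

```python
import math
import random
from typing import List, Sequence


def elu(x: float) -> float:
    """Monotonically increasing activation."""
    return x if x > 0 else math.exp(x) - 1.0


def linear(state: Sequence[float], rows: List[List[float]]) -> List[float]:
    """Bias-free linear map: one output per weight row."""
    return [sum(w * s for w, s in zip(row, state)) for row in rows]


class MonotonicMixer:
    """State-conditioned mixer whose mixing weights are forced non-negative."""

    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 4, seed: int = 0):
        rng = random.Random(seed)

        def mat(n_rows: int) -> List[List[float]]:
            return [[rng.gauss(0.0, 1.0) for _ in range(state_dim)] for _ in range(n_rows)]

        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = mat(n_agents * embed_dim)  # hypernetwork for layer-1 weights
        self.hyper_b1 = mat(embed_dim)             # hypernetwork for layer-1 biases
        self.hyper_w2 = mat(embed_dim)             # hypernetwork for layer-2 weights
        self.hyper_b2 = mat(1)                     # hypernetwork for the output bias

    def q_tot(self, agent_qs: Sequence[float], state: Sequence[float]) -> float:
        if len(agent_qs) != self.n_agents:
            raise ValueError("expected one utility value per agent")
        # Absolute value keeps every mixing weight non-negative; biases are unconstrained.
        w1 = [abs(w) for w in linear(state, self.hyper_w1)]
        b1 = linear(state, self.hyper_b1)
        w2 = [abs(w) for w in linear(state, self.hyper_w2)]
        b2 = linear(state, self.hyper_b2)[0]
        hidden = []
        for j in range(self.embed_dim):
            pre = b1[j] + sum(
                agent_qs[i] * w1[i * self.embed_dim + j] for i in range(self.n_agents)
            )
            hidden.append(elu(pre))
        return b2 + sum(h * w for h, w in zip(hidden, w2))


mixer = MonotonicMixer(n_agents=3, state_dim=5)
state = [0.3, -1.2, 0.7, 0.0, 2.1]
base = mixer.q_tot([1.0, -0.5, 2.0], state)
raised = mixer.q_tot([1.0, 0.5, 2.0], state)  # only agent 1's utility increased
```

Because every mixing weight is non-negative and the activation is increasing, `raised` cannot fall below `base`, which is the property the constraint requires.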

3.3.2. Value Network Update Incorporating Temporal Characteristics

Unlike traditional fixed time-step models, macro-state evolution in this framework explicitly depends on the duration of micro-actions. In the centralized training phase, the system utilizes an experience replay buffer $\mathcal{D}$ to optimize the network parameters $\theta$ by minimizing a temporal difference (TD) loss function $L(\theta)$ that incorporates SMDP characteristics, expressed as
$$L(\theta) = \mathbb{E}_{\mathcal{D}}\left[ \left( y_K - Q_{\mathrm{tot}}(S_K, A_K; \theta) \right)^2 \right]$$
where the target value $y_K$ extends the SMDP Bellman equation within the value decomposition framework, calculated as follows:
$$y_K = \sum_{i=1}^{N} R_{i,K} + \gamma^{\Delta T_K} \max_{A_{K+1}} Q_{\mathrm{tot}}(S_{K+1}, A_{K+1}; \theta)$$
Here, $\sum_{i=1}^{N} R_{i,K}$ represents the global joint return for the current macro-step. This aggregation method assumes that individual returns are approximately decomposable at the macro level, meaning the global cooperative reward can be represented as a weighted combination of the execution costs and benefits across all UAVs. In this task-scheduling context, because task completion rate, energy consumption, and timeout penalties can be independently evaluated at the individual level and linearly superimposed at the system level, this assumption provides a practical engineering approximation.
Additionally, $\gamma^{\Delta T_K}$ denotes the dynamic discount factor for variable-duration actions. This formulation applies a heavier discount penalty to decisions that require extended execution times, encouraging the policy to balance task rewards with temporal efficiency. This discount mechanism not only affects the value estimation but also backpropagates to the policy network via the TD error, thereby explicitly incorporating a temporal constraint into the parameter optimization process. Together, these mechanisms help mitigate modeling deviations arising from temporal misalignments between macro-level planning and micro-level execution. Building upon this macro-level cooperative architecture, the subsequent section details the adaptive reward mechanism designed for the micro-execution layer.
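To illustrate how the duration-dependent discount changes the TD target, the following sketch evaluates the target for a short and a long macro-step with otherwise identical inputs. The function names and all numeric values are hypothetical. The bootstrap value stands in for the maximized next-step $Q_{\mathrm{tot}}$.

```python
from typing import Sequence


def smdp_td_target(
    individual_returns: Sequence[float],
    delta_t: int,
    next_q_tot_max: float,
    gamma: float,
) -> float:
    """Joint return plus the bootstrap value discounted by gamma ** delta_t."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    joint_return = sum(individual_returns)  # sum over UAVs of R_{i,K}
    return joint_return + (gamma ** delta_t) * next_q_tot_max


def td_loss(target: float, q_tot: float) -> float:
    """Squared TD error for a single sampled transition."""
    return (target - q_tot) ** 2


returns = [1.0, 2.0, -0.5]  # hypothetical per-UAV returns, summing to 2.5
short_step = smdp_td_target(returns, delta_t=1, next_q_tot_max=8.0, gamma=0.5)
long_step = smdp_td_target(returns, delta_t=3, next_q_tot_max=8.0, gamma=0.5)
# short_step = 2.5 + 0.5 * 8 = 6.5, long_step = 2.5 + 0.125 * 8 = 3.5
```

With the same joint return and the same bootstrap value, the longer macro-step receives the smaller target, which is the temporal pressure described above.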

3.4. Micro-Level: SMDP Execution Modeling Incorporating Temporal Characteristics

Upon receiving the macro-instruction $a_{i,K}$ and conditional objectives via the intention projection operator $M_{\mathrm{down}}$, the micro-execution layer performs action optimization within its local physical space. Because micro-actions differ widely in duration (e.g., rapid obstacle avoidance versus extended cross-regional flight), forcibly discretizing them into fixed time steps distorts the time scale, and standard fixed-step MDPs can underestimate physical time costs. To mitigate this distortion, this paper formulates the underlying single-agent control process as a local Semi-Markov Decision Process (SMDP) [19,20].
For any given UAV $i$, its local control process is formalized as a 5-tuple $\langle S_i, U_i, P_i, R_i, F_i \rangle$. Here, $S_i$ and $U_i$ denote the local state and action spaces defined in Section 3.2.1. The term $P_i(s_{i,t+\tau} \mid s_{i,t}, u_{i,t})$ represents the environmental-state transition probability, capturing the cumulative evolution of the environment induced by physical dynamics during the execution of the action. Concurrently, a temporal distribution $F_i(\tau \mid s_{i,t}, u_{i,t})$ is introduced to describe the probability of action $u_{i,t}$ lasting for a duration of $\tau$ given state $s_{i,t}$. This distribution is determined by the physical environmental dynamics and is empirically estimated through sampled trajectories in a model-free manner during algorithm implementation.
Under this SMDP framework, the optimization objective for the micro-agent is to maximize the expected local cumulative discounted return. Because the action $u_{i,t}$ is executed continuously for $\tau$ steps, the local action-value function $Q_{\mathrm{lower}}(s_{i,t}, u_{i,t})$ is extended via the Bellman optimality equation over a variable time domain as follows:
$$Q_{\mathrm{lower}}(s_{i,t}, u_{i,t}) = \mathbb{E}_{\tau, s'}\left[ \sum_{k=0}^{\tau-1} \gamma^{k}\, r_{\mathrm{lower}}(s_{i,t+k}, u_{i,t}) + \gamma^{\tau} \max_{u'} Q_{\mathrm{lower}}(s_{i,t+\tau}, u') \right]$$
In this equation, the summation term evaluates the continuous accumulation of immediate rewards during the execution of action u i , t . Furthermore, γ τ acts as a dynamic time-discount factor, encouraging the policy to favor time-efficient behaviors during the optimization process.
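The bootstrapped target implied by this equation can be sketched for a single sampled transition. The snippet below is a minimal illustration, not the authors' implementation: the function name `smdp_td_target`, the reward values, and the discount `gamma=0.9` are all assumptions chosen for the example, and the expectation over $\tau$ and $s'$ is replaced by one sample.

```python
from typing import Sequence


def smdp_td_target(rewards: Sequence[float],
                   next_q_values: Sequence[float],
                   gamma: float = 0.99) -> float:
    """One-sample SMDP target for an action that ran for tau = len(rewards) steps."""
    tau = len(rewards)
    # Discounted sum of the per-step rewards collected while the action executes.
    accumulated = sum((gamma ** k) * r for k, r in enumerate(rewards))
    # Bootstrap from the best next action, discounted by gamma ** tau.
    # An empty list is treated as a terminal state (assumption of this sketch).
    best_next = max(next_q_values) if next_q_values else 0.0
    return accumulated + (gamma ** tau) * best_next


# Same per-step reward and same successor values, different durations (illustrative numbers).
short_target = smdp_td_target([-0.1, -0.1], [1.0, 0.5], gamma=0.9)
long_target = smdp_td_target([-0.1] * 6, [1.0, 0.5], gamma=0.9)
print(short_target, long_target)
```

With identical successor values, the six-step action receives a smaller target than the two-step action, which is the time-efficiency pressure described above.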
The local SMDP process triggers a state transition and truncation when any of the following conditions are met: (1) the macro-task objective is achieved; (2) safety boundary modes, such as hovering or emergency charging, are triggered; or (3) the individual execution time reaches the tolerance threshold $T_{\max}$. Upon truncation, the cumulative results of this interval are transformed into the execution cost $R_{i,K}$, which is evaluated by the macro-layer via the bottom-up operator $M_{\text{up}}$. During implementation, the micro-level policy is approximated using a value-based deep reinforcement learning method, enabling experience reuse through parameter sharing. Consequently, micro-level outcomes are integrated into the macro-layer via cumulative returns, facilitating cross-layer consistency in temporal modeling and decision optimization.
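The three truncation conditions can be illustrated with a small rollout loop. This is a sketch under stated assumptions: the step interface `(next_state, reward, goal_reached, safety_triggered)`, the helper `toy_step`, and the order in which the conditions are checked are hypothetical and are not specified in the text.

```python
from typing import Callable, Tuple

# Hypothetical step interface: (next_state, reward, goal_reached, safety_triggered).
StepFn = Callable[[float], Tuple[float, float, bool, bool]]


def run_macro_interval(step_fn: StepFn, state: float,
                       gamma: float = 0.99, t_max: int = 50) -> Tuple[float, int, str]:
    """Run micro-steps until one truncation condition fires; return (return, tau, reason)."""
    discounted_return, tau, reason = 0.0, 0, "timeout"
    while tau < t_max:  # condition (3): execution time reaches the tolerance threshold
        state, reward, goal_reached, safety_triggered = step_fn(state)
        discounted_return += (gamma ** tau) * reward
        tau += 1
        if goal_reached:  # condition (1): macro-task objective achieved
            reason = "goal"
            break
        if safety_triggered:  # condition (2): safety boundary mode triggered
            reason = "safety"
            break
    return discounted_return, tau, reason


def toy_step(distance: float) -> Tuple[float, float, bool, bool]:
    """Toy dynamics: move one unit closer per step at a fixed cost of 0.1."""
    remaining = distance - 1.0
    return remaining, -0.1, remaining <= 0.0, False


interval_return, duration, stop_reason = run_macro_interval(toy_step, 3.0, gamma=1.0)
print(interval_return, duration, stop_reason)
```

The returned accumulated value plays the role of the interval result that is passed upward as $R_{i,K}$; how the bottom-up operator transforms it is not modeled here.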

3.5. Micro-Level: State-Aware Adaptive Penalty and Safety Bounds

Following the formulation of the micro-level SMDP value equation, the convergence direction of the underlying policy $\pi_{\text{lower}}$ is primarily guided by the reward function. When confronting dynamic physical boundaries, such as battery depletion and strict temporal constraints, conventional static-weight reward mechanisms can struggle to maintain operational safety under extreme conditions. Consequently, this study develops a state-aware multi-objective adaptive penalty function [21] and introduces an action-masking mechanism based on state-space constraints to serve as an operational safety measure.

3.5.1. Distributed Multi-Objective Adaptive Penalty Function

To mitigate non-convergent exploration potentially induced by positive reward loops, this approach explicitly separates single-step execution costs from terminal task rewards. The foundational incentive for macro-tasks, denoted as R b a s e , along with critical boundary penalties, is encapsulated within the terminal state return P ( s i , τ i , K ) defined in Section 3.2.2. Conversely, during the micro-execution period at time step t, the immediate feedback signal r lower ( s i , t , u i , t ) evaluates the continuous costs incurred by physical actions, formulated as
$$r_{\text{lower}}(s_{i,t}, u_{i,t}) = \omega_e(s_{i,t}) \cdot c_{\text{energy}}(u_{i,t}) + \omega_t(s_{i,t}) \cdot c_{\text{time}}$$
where $c_{\text{energy}}(\cdot)$ and $c_{\text{time}}$ denote the estimated energy consumption rate and the base time cost within a single control period, respectively. Both terms are normalized to a range of $[0, 1]$ via a maximum-value mapping to reduce the impact of numerical scale discrepancies on gradient updates. The time cost term represents the immediate temporal expenditure of local actions, whereas the discount factor $\gamma^{\tau}$ adjusts the present value of long-term returns. Operating across local and global timescales, respectively, these two elements are designed to mitigate modeling redundancy. To facilitate a dynamic trade-off between competing objectives across varying operational phases, the energy efficiency weight $\omega_e$ and the temporal weight $\omega_t$ are structured as continuously differentiable state-mapping functions. Given the non-linear discharge characteristics of lithium batteries, the energy efficiency weight is formulated as an exponential decay function of the remaining battery ratio $E_{\text{ratio}} \in (0, 1]$, expressed as follows:
$$\omega_e(s_{i,t}) = \omega_{e,\min} + (\omega_{e,\max} - \omega_{e,\min}) \exp(-\kappa_e \cdot E_{\text{ratio}})$$
This formulation approximates the non-linear performance degradation of the battery in low-power states while maintaining gradient continuity. As the battery capacity decreases, $\omega_e(s_{i,t})$ increases, thereby encouraging the policy to prioritize energy-conserving behaviors. To manage strict temporal constraints, this approach utilizes a relative temporal consumption rate, defined as $\rho_{\text{time}} = t_{\text{elapsed}} / T_{\text{deadline}}$, and employs a Sigmoid function to structure the penalty curve. This relationship is formulated as follows:
$$\omega_t(s_{i,t}) = \omega_{t,\min} + \frac{\omega_{t,\max} - \omega_{t,\min}}{1 + \exp(-\kappa_t (\rho_{\text{time}} - \rho_{\text{threshold}}))}$$
These mapping parameters are determined via grid search. The dynamic evolution of these multi-objective adaptive weights is illustrated in Figure 3.
This joint mechanism allows the system to accommodate exploratory behaviors aimed at discovering improved cooperative strategies during the initial mission phases, provided that battery capacity and temporal margins are sufficient. Conversely, as the system approaches critical battery thresholds or temporal limits, the corresponding penalty weights increase rapidly according to their respective exponential and sigmoid formulations. This dynamic response encourages the UAV policy to transition toward conservative operational modes, prioritizing energy conservation and timely execution.
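The behavior of the two weights can be sketched numerically. In the snippet below, $\kappa_e = 5.0$ and $\rho_{\text{threshold}} = 0.6$ follow the default settings reported in the sensitivity analysis; the weight bounds (`w_min=0.1`, `w_max=1.0`) and `kappa_t=10.0` are placeholder assumptions, since the grid-searched values are not listed here. The weighted sum is returned exactly as the expression is written; whether it enters the learner as a cost or as a negated reward is left to the caller.

```python
import math


def omega_e(e_ratio: float, w_min: float = 0.1, w_max: float = 1.0,
            kappa_e: float = 5.0) -> float:
    """Energy weight: exponential decay in the remaining battery ratio."""
    return w_min + (w_max - w_min) * math.exp(-kappa_e * e_ratio)


def omega_t(rho_time: float, w_min: float = 0.1, w_max: float = 1.0,
            kappa_t: float = 10.0, rho_threshold: float = 0.6) -> float:
    """Temporal weight: sigmoid in the relative temporal consumption rate."""
    return w_min + (w_max - w_min) / (1.0 + math.exp(-kappa_t * (rho_time - rho_threshold)))


def weighted_step_cost(e_ratio: float, rho_time: float,
                       c_energy: float, c_time: float) -> float:
    """State-weighted sum of the normalized energy and time terms."""
    return omega_e(e_ratio) * c_energy + omega_t(rho_time) * c_time


# Early, mid, and late mission phases (illustrative states, not measured data).
for e_ratio, rho_time in [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9)]:
    print(e_ratio, rho_time, round(weighted_step_cost(e_ratio, rho_time, 0.5, 0.5), 4))
```

Under these assumed parameters, both weights stay near their lower bounds while battery and time margins are ample, and both rise as the margins shrink, matching the qualitative description above.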

3.5.2. Dynamic Action Space Pruning Under Safety Constraints

To address inefficient exploration caused by gradient delays near critical boundaries (e.g., battery depletion), we introduce a dynamic action masking operator [22] alongside the adaptive reward formulation. By explicitly nullifying the probabilities of unsafe actions, this operator restricts the search space to a safe domain, thereby accelerating convergence under strict constraints. Given the original action space $U_i$, a binary mask vector $M(s_{i,t}) \in \{0, 1\}^{|U_i|}$ is generated based on local states and safety thresholds (Figure 4, Algorithm 1). The network’s output distribution $\pi$ is then renormalized as follows:
$$\pi'(u \mid s) = \frac{\pi(u \mid s) \cdot M(u, s)}{\sum_{u'} \pi(u' \mid s) \cdot M(u', s)}$$
Applying this mask directly to the output ensures that updates to the local value function $Q_{\text{lower}}$ are strictly confined to the feasible subspace. Consequently, the dynamic contraction of this safety space is structured into three progressive tiers:
Algorithm 1 Execution Decision Logic based on State-Aware Action Masking
Require: Local state $s_{i,t}$ of UAV $i$ at time $t$ (including remaining battery ratio $E_{\text{ratio}}$ and relative deadline consumption rate $\rho_{\text{time}}$), adaptive energy penalty weight $\omega_e(s_{i,t})$, predefined battery safety thresholds $E_{\text{critical}}$ and $E_{\text{low}}$.
Ensure: Micro-execution action $u_{i,t}$.
    // Phase 1: Safety Shield Masking (Survival First)
    if $E_{\text{ratio}} \le E_{\text{critical}}$ then
        $u_{i,t} \leftarrow$ Recharge        // Action mask M collapses, only recharge permitted
    else if $E_{\text{ratio}} \le E_{\text{low}}$ and $\omega_e(s_{i,t}) > \omega_{e,\text{threshold}}$ then
        $u_{i,t} \leftarrow$ Recharge        // Action space shrinks under high penalty
    // Phase 2: Deadline Avoidance Masking (Time Priority)
    else if $\rho_{\text{time}} > 0.9$ then
        $u_{i,t} \leftarrow$ Fast_Path       // Facing severe timeout risk, masking high-duration actions
    // Phase 3: Standard SMDP Policy Optimization under Normal Conditions
    else
        $u_{i,t} \leftarrow \arg\max_{u \in U_i} Q_{\text{lower}}(s_{i,t}, u)$        // Optimize within pruned space
    end if
    return $u_{i,t}$
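The branch structure of Algorithm 1 can be written as a short Python function. This is a sketch under assumptions: the threshold values `e_critical=0.15`, `e_low=0.30`, and `omega_e_threshold=0.5` are hypothetical placeholders, the action names are illustrative labels, and the Q-values are passed in as a plain dictionary rather than produced by a network. Only the 0.9 deadline limit is taken from the algorithm itself.

```python
from typing import Dict


def select_action(e_ratio: float, rho_time: float, omega_e_value: float,
                  q_values: Dict[str, float],
                  e_critical: float = 0.15, e_low: float = 0.30,
                  omega_e_threshold: float = 0.5, rho_limit: float = 0.9) -> str:
    """Return the micro-action chosen by the three-phase masking logic."""
    # Phase 1: safety shield, battery at or below the critical level.
    if e_ratio <= e_critical:
        return "Recharge"
    # Phase 1 (soft tier): low battery combined with a high energy penalty weight.
    if e_ratio <= e_low and omega_e_value > omega_e_threshold:
        return "Recharge"
    # Phase 2: deadline avoidance, most of the time budget already consumed.
    if rho_time > rho_limit:
        return "Fast_Path"
    # Phase 3: greedy choice over the remaining (unmasked) actions.
    return max(q_values, key=q_values.get)


q = {"Recharge": 0.1, "Fast_Path": 0.4, "Explore": 0.7}
print(select_action(0.10, 0.20, 0.9, q))
print(select_action(0.80, 0.95, 0.2, q))
print(select_action(0.80, 0.20, 0.2, q))
```

Because the battery checks come first, an agent that is both low on energy and close to its deadline recharges rather than taking the fast path, which reflects the "survival first" ordering.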
Building upon this mechanism, the dynamic safety-constrained action masking process operates across three progressive stages:
Emergency Safety Constraint: If the remaining battery ratio $E_{\text{ratio}} \le E_{\text{critical}}$, the system identifies a critical state of energy depletion. The action space is strictly restricted to charging behaviors, prioritizing the operational survival of the UAV platform.
Temporal Boundary Constraint: If the temporal constraint indicator $\rho_{\text{time}} > 0.9$, the system initiates a temporal warning state. By masking highly time-consuming actions (e.g., extended exploration), the policy is directed to favor rapid execution paths.
Optimization within the Feasible Region: When neither of the specified physical or temporal boundaries is triggered, the UAV performs standard optimization within the unmasked action subspace. This allows the system to maximize task returns while operating within established safety margins.
These operational stages are triggered by continuous state variables, facilitating a gradual policy adjustment during execution and mitigating decision oscillations near operational boundaries. Furthermore, to prevent the complete masking of the action space, the mechanism is configured to retain at least one viable action under any given state, thereby preserving decision continuity. By restricting high-risk exploratory behaviors via explicit state-space constraints, this approach can effectively accelerate the convergence of the swarm policy in highly constrained environments.
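The mask renormalization and the "at least one viable action" requirement can be sketched together. The function name and the uniform fallback used when every permitted action has zero probability are assumptions of this example; the text does not specify how that corner case is handled.

```python
from typing import List, Sequence


def renormalize_with_mask(probs: Sequence[float], mask: Sequence[int]) -> List[float]:
    """Zero out masked actions and rescale the remaining probabilities to sum to 1."""
    if len(probs) != len(mask):
        raise ValueError("probs and mask must have the same length")
    if not any(mask):
        # The mechanism is described as always retaining at least one action.
        raise ValueError("mask must keep at least one action")
    masked = [p * m for p, m in zip(probs, mask)]
    total = sum(masked)
    if total == 0.0:
        # Assumed fallback: uniform distribution over the permitted actions.
        allowed = sum(mask)
        return [m / allowed for m in mask]
    return [x / total for x in masked]


# Three actions, the second one masked as unsafe (illustrative numbers).
safe_policy = renormalize_with_mask([0.5, 0.3, 0.2], [1, 0, 1])
print(safe_policy)
```

The masked action receives exactly zero probability, and the relative preference between the permitted actions is unchanged.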

3.6. Complexity and Theoretical Consistency Analysis

Following the formulation of the micro-level adaptive penalty and dynamic action masking mechanisms, this section examines their theoretical compatibility with the macro-level QMIX value decomposition architecture. Furthermore, it analyzes the computational and communication overhead of the AH-SMDP framework.

3.6.1. IGM Theoretical Consistency Under Constrained Action Spaces

To facilitate the convergence stability of the multi-agent cooperative policy, the AH-SMDP framework should satisfy the Individual–Global–Max (IGM) condition, implying that the optimality of the global joint action aligns with the composition of individual optimal actions. Incorporating the dynamic weights $\omega_e$, $\omega_t$, and the mask vector $M(s_{i,t})$, we establish the following assumption: the local reward function $r_{\text{lower}}$ is separable with respect to action dependencies, formulated as
$$r_{\text{lower}}(s, u) = \omega_e \cdot c_{\text{energy}}(u) + \omega_t \cdot c_{\text{time}}(u)$$
where the weighting coefficients $\omega_e$ and $\omega_t$ depend solely on the state $s$ and remain independent of the action $u$. The action mask $M(s)$ is determined by environmental physical boundaries and temporal constraints, remaining independent of value network parameter updates. At any decision state, the restricted feasible action set maintains $|U_i| \ge 1$.
Theorem 1 (IGM Consistency under Constrained Action Spaces). 
Under the specified assumptions, the value mapping generated by the micro-level adaptive mechanism preserves monotonicity. Consequently, the IGM condition remains applicable for the optimal control problem defined over the constrained joint action space $U^N$, indicating that this constrained formulation shares an equivalent optimality structure with the original problem within the feasible domain.
Proof. 
First, we establish the monotonic mapping of the local value function. For any fixed state $s_{i,t}$, the weights $\omega_e$ and $\omega_t$ remain constant according to the established assumptions. Consequently, there exists a monotonically increasing transformation $f(\cdot)$ such that $Q'_{\text{lower}}(s, u) = f(Q_{\text{lower}}(s, u))$. This transformation arises from applying state-dependent proportional scaling and translation operations to the original reward function, while maintaining a monotonic mapping for each state. Given that the transformation $f$ is independent of the specific action $u$, this mapping preserves the ordering of actions, formulated as
$$\arg\max_{u \in U_i} Q'_{\text{lower}}(s, u) = \arg\max_{u \in U_i} Q_{\text{lower}}(s, u)$$
Second, we analyze the consistency of the subspace contraction. The macro-level mixing network satisfies $\partial Q_{\text{tot}} / \partial Q_i \ge 0$ via non-negative weight constraints. Because this monotonicity holds over any subset of the original action space $U^N$, when the optimization domain is projected from $U^N$ to the feasible subspace $\prod_{i=1}^{N} U_{\text{safe},i}$, the global optimal joint action can still be derived by combining the local optimal actions. This decomposition is expressed as
$$\arg\max_{\mathbf{u} \in U^N} Q_{\text{tot}}(\mathbf{s}, \mathbf{u}) = \begin{pmatrix} \arg\max_{u_1 \in U_1} Q_1(s_1, u_1) \\ \vdots \\ \arg\max_{u_N \in U_N} Q_N(s_N, u_N) \end{pmatrix}$$
Finally, although the masking mechanism limits the available action set and indirectly influences the sampling distribution of the policy, it does not directly participate in the parameter gradient computation. Therefore, it facilitates stable policy convergence within the feasible domain manifold without disrupting the structural monotonicity of the framework. This concludes the proof. □
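The decomposition argument can be checked numerically on a toy case. The sketch below makes a simplifying assumption: the mixing network is replaced by a fixed linear mix $Q_{\text{tot}} = \sum_i w_i Q_i + b$ with strictly positive weights, which is only one instance of a monotonic mixer. The Q-values, weights, and feasible sets are illustrative, and unique per-agent maxima are assumed so that ties do not arise.

```python
from itertools import product
from typing import List, Sequence, Tuple


def joint_argmax(q_locals: Sequence[Sequence[float]],
                 feasible: Sequence[Sequence[int]],
                 weights: Sequence[float], bias: float = 0.0) -> Tuple[int, ...]:
    """Brute-force argmax of Q_tot = sum_i w_i * Q_i + bias over the feasible product space."""
    best_joint, best_value = None, float("-inf")
    for joint in product(*feasible):
        value = bias + sum(w * q[u] for w, q, u in zip(weights, q_locals, joint))
        if value > best_value:
            best_joint, best_value = joint, value
    return best_joint


def decentralized_argmax(q_locals: Sequence[Sequence[float]],
                         feasible: Sequence[Sequence[int]]) -> Tuple[int, ...]:
    """Each agent independently picks its best action within its own feasible set."""
    return tuple(max(actions, key=lambda u, q=q: q[u])
                 for q, actions in zip(q_locals, feasible))


# Two agents, three actions each; one action per agent is masked out.
q_locals: List[List[float]] = [[1.0, 3.0, 2.0], [0.5, 0.1, 0.9]]
feasible: List[List[int]] = [[0, 2], [0, 1]]
weights = [0.7, 1.3]  # non-negative mixing weights (monotonicity)

print(joint_argmax(q_locals, feasible, weights))
print(decentralized_argmax(q_locals, feasible))
```

Both calls return the same joint action, including for agents whose unconstrained best action was masked, which is the property the proof relies on.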

3.6.2. Computational and Communication Complexity Analysis

1.
Time Complexity
During the decentralized execution phase, each UAV performs local inference at every physical time step $t$. Let $d_s$ denote the local observation dimension, $d_h$ the GRU hidden layer dimension, and $|U_i|$ the output action space size. The time complexity of a single decision step is dominated by matrix operations, approximated as $O(d_s d_h + d_h^2 + |U_i| d_h)$. Assuming parallel computation across the swarm, the total computational workload scales linearly with respect to the number of agents, yielding an overall complexity of $O(N(d_s d_h + d_h^2 + |U_i| d_h))$. This linear scaling facilitates real-time control in highly dynamic environments.
2.
Communication Complexity
Compared to centralized approaches or methods relying on complex graph structures, where communication overhead typically ranges from $O(N)$ to $O(N^2)$, the AH-SMDP framework avoids continuous state broadcasting among UAVs during the micro-execution phase. The swarm limits state synchronization and instruction dispatch to $O(N)$ operations, occurring only at the event-triggered temporal alignment boundaries (i.e., at the $\Delta T_K$ nodes). This structure restricts communication requirements to linear complexity at discrete intervals, thereby improving system applicability in communication-constrained scenarios.
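The two complexity expressions can be turned into rough counts to see how they scale. The dimensions used below (`d_s=32`, `d_h=64`, 8 actions) and the assumption of two messages per UAV per alignment event (one state upload, one instruction download) are hypothetical; only the functional forms come from the analysis above.

```python
def per_step_ops(d_s: int, d_h: int, n_actions: int) -> int:
    """Dominant multiply-accumulate count for one local decision step."""
    return d_s * d_h + d_h * d_h + n_actions * d_h


def swarm_ops(n_agents: int, d_s: int, d_h: int, n_actions: int) -> int:
    """Total workload across the swarm: linear in the number of agents."""
    return n_agents * per_step_ops(d_s, d_h, n_actions)


def sync_messages(n_agents: int, n_sync_events: int, per_agent: int = 2) -> int:
    """Messages exchanged at alignment boundaries (per_agent=2 is an assumption)."""
    return per_agent * n_agents * n_sync_events


for n in (10, 30, 50):
    print(n, swarm_ops(n, 32, 64, 8), sync_messages(n, 10))
```

Scaling the swarm from 10 to 50 agents multiplies both counts by five, consistent with the linear trend claimed for computation and for event-triggered communication.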

4. Experiments and Results

To evaluate the effectiveness of the AH-SMDP framework in multi-UAV cooperative scheduling and to examine the alignment between empirical performance and theoretical derivations, this section presents a series of simulation experiments. Evaluations were conducted across five independent random seeds to account for environmental stochasticity and to assess the algorithm’s generalization capabilities. The results are reported as mean values with standard deviation bands, illustrating convergence stability and statistical consistency. The empirical findings indicate that the proposed framework maintains consistent performance across the evaluated randomized scenarios.
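Reporting a metric as a mean with a standard deviation band across seeds can be sketched as follows. The per-seed values are placeholders for illustration only and are not results from this study; the use of the sample standard deviation (denominator $n - 1$) is also an assumption, since the estimator is not stated.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_seeds(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of one metric across random seeds."""
    return mean(values), stdev(values)


# Placeholder completion rates from five hypothetical seeds.
placeholder_rates = [0.90, 0.92, 0.88, 0.91, 0.89]
avg, spread = summarize_seeds(placeholder_rates)
print(f"{avg:.3f} +/- {spread:.3f}")
```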

4.1. Experimental Setup and Baselines

The simulation environment models a multi-UAV cooperative scheduling task within an urban setting, as depicted in Figure 5. Tests were conducted across three representative scenarios, recording four primary evaluation metrics: makespan, task completion rate, energy consumption, and failure rate.
1.
Baseline Scenario: Characterized by sufficient resources and the absence of strict temporal constraints. This scenario primarily evaluates the fundamental scheduling performance of the algorithms under ideal conditions.
2.
Temporally Constrained Scenario: Imposes strict task deadlines to assess the adaptability of the algorithms under critical temporal limits.
3.
Multi-Constraint Scenario: Incorporates coupled operational constraints, such as battery degradation and charging station congestion, to examine the capacity of the algorithms to maintain operational safety.

4.1.1. Baseline Algorithms and Mechanisms

The baselines selected for this study encompass representative paradigms, including centralized, decentralized, temporal-aware, and heuristic optimization methods. We compare the AH-SMDP framework against MAPPO [23], which employs the Centralized Training with Decentralized Execution (CTDE) paradigm to train a centralized value function based on global states while updating decentralized policy networks at fixed time steps. This comparison assesses how the proposed framework mitigates the temporal mismatches typically observed in conventional MARL. To represent standard independent learning, PPO is utilized as a baseline [24,25]. In PPO, each agent optimizes its policy using only local observations without global state sharing, serving to evaluate the impact of macro-level coordination in the hierarchical design. Furthermore, an SMDP variant is included as a structural baseline [26]. By removing hierarchical cooperation and retaining only variable-duration action modeling, this variant isolates the performance gains attributable to the dual-timescale hierarchical strategy. Finally, a Genetic Algorithm (GA) is introduced as a heuristic reference. As a population-based search method for static sequences, GA provides a benchmark for illustrating the limitations of static planning in dynamic environments, compared to the proposed adaptive approach. Detailed parameter configurations are provided in Appendix B.

4.1.2. Evaluation Metrics and Justifications

To evaluate the scheduling policies against the research problems defined in Section 2, we explicitly quantify the scheduling performance of the proposed framework using the following metrics:
  • Makespan ($M$): This metric evaluates the temporal efficiency of the swarm, defined as the maximum physical completion time among all $N$ UAVs:
    $$M = \max_{i \in \{1, \dots, N\}} \{ t_i^{\text{end}} \}$$
    where $t_i^{\text{end}}$ represents the time step at which UAV $i$ completes its final assigned task. A lower makespan indicates higher efficiency in event-driven temporal alignment.
  • Task Completion Rate ($CR$): This metric assesses the framework’s ability to satisfy strict temporal constraints, defined as the ratio of successfully completed tasks to the total number of generated tasks:
    $$CR = \frac{N_{\text{completed}}}{N_{\text{total}}} \times 100\%$$
    where $N_{\text{completed}}$ is the number of tasks finished before their respective deadlines.
  • Total Energy Consumption ($E_{\text{total}}$): This metric aggregates the physical energy expended by the swarm over the entire mission duration:
    $$E_{\text{total}} = \sum_{i=1}^{N} \sum_{t=0}^{T_{\max}} e_i(t)$$
    where $e_i(t)$ denotes the instantaneous energy consumed by UAV $i$ at time step $t$, encompassing both flight maneuvers and hover/standby states.
  • Failure Rate ($FR$): The failure rate quantifies the frequency of task terminations caused by boundary violations:
    $$FR = \frac{N_{\text{failed}}}{N_{\text{total}}} \times 100\%$$
    where $N_{\text{failed}}$ represents the number of tasks aborted due to operational constraint violations, such as battery depletion or critical deadline breaches.
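The four metrics translate directly into code. The following is a minimal sketch: the function names and the toy inputs are illustrative assumptions, and the energy log is assumed to be stored as one list of per-step consumptions per UAV.

```python
from typing import Sequence


def makespan(end_times: Sequence[float]) -> float:
    """Latest completion time among all UAVs."""
    return max(end_times)


def completion_rate(n_completed: int, n_total: int) -> float:
    """Percentage of tasks finished before their deadlines."""
    return 100.0 * n_completed / n_total


def total_energy(energy_log: Sequence[Sequence[float]]) -> float:
    """Sum of per-step energy over all UAVs and all time steps."""
    return sum(sum(per_uav) for per_uav in energy_log)


def failure_rate(n_failed: int, n_total: int) -> float:
    """Percentage of tasks aborted because of constraint violations."""
    return 100.0 * n_failed / n_total


# Toy inputs for three UAVs and 50 tasks (not experimental data).
print(makespan([12.0, 17.5, 9.0]))
print(completion_rate(45, 50))
print(total_energy([[1.0, 2.0], [0.5, 0.5], [1.5, 0.5]]))
print(failure_rate(1, 50))
```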

4.2. Experimental Results and Analysis

The empirical results are illustrated in Figure 6 and Figure 7. Under highly constrained scenarios, the AH-SMDP framework maintains consistent performance, demonstrating a strategic trade-off that prioritizes operational safety.
In the Baseline Scenario, MAPPO achieved a makespan of 13.5 min, whereas the AH-SMDP framework recorded 17.8 min. This difference reflects the strategic trade-off inherent in safety-aware optimization: scheduling efficiency is partially compromised to prioritize operational safety. Because the adaptive weights $\omega_e$ and $\omega_t$ maintain a foundational penalty even when resources are sufficient, the policy favors slightly longer, safer navigation paths to avert potential constraint violations. This conservative approach establishes the foundation for maintaining high task completion rates in more constrained environments.
Under the Temporally Constrained Scenario, the task completion rates for MAPPO and PPO decreased to 70.2% and 60.5%, respectively. In contrast, AH-SMDP maintained a completion rate of 98.7% (±0.6%). This performance advantage is primarily attributable to the combined effects of the temporal alignment mechanism (Section 3.2.3) and the constraint-aware action masking (Section 3.5.2). By structurally addressing the decision-execution misalignment caused by variable-duration actions, the framework effectively handles strict temporal limits.
In the Multi-Constraint Scenario, the AH-SMDP framework recorded zero task failures across all evaluated test runs. This operational safety is largely driven by the dynamic action masking mechanism (Section 3.5.2): as the remaining battery capacity approaches the critical threshold $E_{\text{critical}}$, the mask vector $M(s_{i,t})$ restricts the selection of unsafe execution paths. However, an observed limitation is that frequent triggering of the masking mechanism by coupled physical constraints can induce highly conservative behavior, potentially leading to a degradation in global scheduling efficiency under severe edge-case conditions.

4.3. Scalability Analysis in Large-Scale Swarms

To assess the scalability of the framework, the Multi-Constraint Scenario was expanded by scaling the number of UAVs (N) from 10 to 50. Task density was increased proportionally to maintain a consistent environmental load. The comparative results concerning task completion rate, total energy consumption, and average computation time are presented in Figure 8.
As the joint action space expands with the swarm size, baseline methods exhibit varying degrees of performance degradation. As illustrated in Figure 8A, at $N = 50$, independent learning (PPO) and heuristic methods (GA) fail to navigate the coupled constraints, with task completion rates dropping below 5%. While MAPPO achieves the highest completion rates at smaller scales ($N \le 30$), its performance declines to 84.0% at $N = 50$. More importantly, it incurs a high total energy consumption of 2228 units, suggesting inefficient exploration and severe local congestion within the expanded state space (Figure 8B).
In contrast, the AH-SMDP framework structures the expanded state space through its macro-micro decoupling architecture. Although its completion rate at smaller scales is marginally lower than that of MAPPO due to its conservative energy-saving policy, it achieves a comparable completion rate (84.0%) at $N = 50$ while reducing total energy consumption to 1590 units, approximately 29% below MAPPO's 2228 units. Furthermore, as shown in Figure 8C, while independent PPO exhibits the lowest inference time due to its lack of global coordination, its inability to satisfy task constraints renders it ineffective. The dynamic action masking mechanism enables the average inference time of AH-SMDP to remain in the sub-millisecond range (∼0.48 ms), aligning with the theoretical linear complexity trend ($O(N)$). Conversely, the heuristic approach (GA) exhibits an exponential increase in computational delay. These findings suggest that the AH-SMDP framework possesses robust scalability and effective energy management capabilities for large-scale UAV deployments.
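The reported energy saving at the largest swarm size can be verified from the two totals quoted above (2228 units for MAPPO and 1590 units for AH-SMDP). The helper name is an illustrative choice.

```python
def relative_reduction(baseline: float, value: float) -> float:
    """Percentage by which `value` falls below `baseline`."""
    return 100.0 * (baseline - value) / baseline


energy_saving = relative_reduction(2228.0, 1590.0)
print(round(energy_saving, 1))
```

The result is about 28.6%, which rounds to the approximately 29% stated in the text.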

4.4. Ablation Study of Core Mechanisms

To isolate the contributions of individual modules within the AH-SMDP framework and analyze their interactions, this section presents a comparative evaluation using three model variants. The evaluated variants include: H-SMDP (omitting the adaptive reward mechanism), A-SMDP (excluding the hierarchical architecture), and AH-MDP (replacing the event-driven temporal alignment with a traditional fixed-step decision process). The empirical results and performance comparisons for these configurations are detailed in Figure 9 and Table 2.
First, this study examines the impact of temporal processing on system stability. By omitting the event-triggered mechanism and reverting the framework to a traditional fixed-step formulation (AH-MDP), the overall scheduling performance experiences a notable decline. Because the physical execution times of different UAV actions vary significantly, forcibly discretizing these actions into fixed intervals introduces temporal asynchrony among the agents. This misalignment results in a decision-execution mismatch. Quantitatively, the training fluctuation variance of AH-MDP increases to 94.5 (a more than six-fold increase compared to the full model), and the total energy consumption rises to 280 units. These results indicate that synchronizing the decision rhythm via event-driven temporal alignment facilitates the reduction of multi-agent coordination errors and preserves system stability.
Regarding architectural design, the A-SMDP variant omits the upper-level macro-allocation network, requiring UAVs to optimize actions directly within the full global state space. Without macro-level coordination, the agents exhibit an increased tendency for resource contention and operational conflicts, as illustrated in the radar chart in Figure 9. Quantitatively, the makespan extends to 42.8 min (a 140.4% increase), and the lack of coordinated operations reduces the task completion rate to 65%. This comparison demonstrates the structural advantages of the hierarchical architecture: decomposing the global scheduling problem into manageable local tasks effectively mitigates congestion and resource contention during multi-UAV cooperation.
At the micro-execution level, the H-SMDP variant omits the state-aware adaptive reward mechanism. Consequently, the system struggles to adapt to constrained conditions, such as energy depletion. Empirical data indicate that the total energy consumption of this variant exceeds that of the full model by approximately 85% (reaching 185 units), alongside a notable reduction in the task completion rate. Theoretically, the dynamic action masking and adaptive reward mechanisms operate synergistically to establish a structural safety bound. Action masking serves as a foundational constraint by strictly restricting hazardous selections (e.g., preventing long-distance exploration under critical battery levels). Concurrently, within the unmasked feasible domain, the adaptive reward formulation encourages the UAVs to dynamically balance execution speed and energy conservation. These integrated components are critical for facilitating the reliable operation of the UAV swarm across diverse physical constraints.
Building upon the performance advantages established in the preceding sections, further examination of the framework’s sensitivity to hyperparameter variations and scale expansion is warranted.

4.5. Sensitivity and Scalability Analysis

The stability and scalability of the AH-SMDP framework were evaluated through a sensitivity analysis of the energy efficiency decay coefficient ($\kappa_e$) and the temporal-constraint inflection threshold ($\rho_{\text{threshold}}$). Figure 10 illustrates the impact of these parameters on the task completion rate and total energy consumption for swarm scales of $N = 10$ and $N = 50$.
The analysis of $\kappa_e$ indicates a general trend where higher values correlate with more conservative energy policies, aligning with the exponential decay mechanism established in Equation (8). For smaller swarms ($N = 10$), this conservatism consistently reduces both total energy consumption and the task completion rate. For larger swarms ($N = 50$), the system maintains the completion rate more steadily across moderate values ($\kappa_e \le 8$). However, a localized fluctuation in energy consumption is observed at $\kappa_e = 8$ for $N = 50$. This increase can be attributed to intensified local congestion and uncoordinated detours when a large number of agents simultaneously adopt highly conservative flight strategies. The empirical data suggest that the default setting ($\kappa_e = 5.0$) achieves a practical trade-off between energy conservation and task completion across the evaluated scales.
In parallel, evaluating the temporal constraint inflection threshold ($\rho_{\text{threshold}}$) indicates that the Sigmoid mapping in Equation (9) regulates the decision rhythm. When $\rho_{\text{threshold}}$ exceeds the default value of 0.6, the swarm exhibits a delayed response to approaching temporal limits. While this delay causes a steady decrease in completion rates for $N = 10$, the larger swarm ($N = 50$) effectively sacrifices energy efficiency to maintain a relatively stable completion rate, triggering a sharp increase in total energy consumption. This phenomenon highlights an underlying scalability dynamic: in high-density environments, delayed urgency responses exacerbate cascading conflicts and redundant collision-avoidance maneuvers, thereby increasing the collective energy cost. Consequently, these results indicate that maintaining $\rho_{\text{threshold}}$ at 0.6 helps mitigate the impact of scale variance, supporting the framework’s applicability in constrained scheduling environments.

4.6. Training Stability and Convergence Analysis

Figure 11 and Figure 12 illustrate the learning behaviors of the evaluated algorithms. The empirical results indicate that standard reinforcement learning methods exhibit limitations when processing actions of variable duration. For instance, although MAPPO demonstrates rapid improvement during the initial training phase, its reward curve experiences significant fluctuations in the middle and late stages. This instability primarily arises because traditional fixed-step models struggle to accurately capture the actual physical time consumed by different actions. Consequently, this discrepancy introduces temporal misalignment during policy evaluation, complicating the accurate attribution of rewards to effective actions. This observation aligns with the credit assignment challenges of conventional deep reinforcement learning in variable-duration trajectory optimization, as noted in recent literature [27].
In contrast, the AH-SMDP framework demonstrates steady performance improvements during the middle and late stages of training. Because the hierarchical architecture requires initial interaction overhead to align the macro and micro policies, the early convergence process is relatively gradual. However, after approximately 200 episodes, the system enters a stable convergence phase. This stability is primarily attributable to the event-driven temporal alignment mechanism. By standardizing the decision rhythm of each agent, this mechanism mitigates the temporal inconsistencies caused by asynchronous execution, thereby facilitating the stability of global cooperation.
Observing the standard deviation bands in Figure 12, the performance fluctuations of the AH-SMDP framework remain well-constrained. This reflects the structural role of the dynamic action-masking mechanism during the exploration phase. By restricting high-risk selections that could lead to task failure, this mechanism directs policy updates exclusively within the unmasked feasible region. This empirical observation aligns with the theoretical derivations regarding convergence stability presented in Section 3.6.1. Concurrently, mitigating invalid exploration refines the optimization trajectory, preventing the performance degradation typically associated with frequently violating constraint boundaries.
From an overall convergence perspective, the AH-SMDP framework prioritizes long-term systemic stability over the absolute efficiency of individual task executions. Although its makespan slightly trails that of MAPPO in the Baseline Scenario, its training process exhibits notably reduced variance. The empirical data indicate that the adaptive reward mechanism encourages the policy to select conservative, reliable trajectories as it approaches physical limits. This strategic trade-off mitigates the performance degradation commonly observed in baseline algorithms upon violating safety constraints, demonstrating that the proposed method maintains consistent performance in highly constrained environments.

5. Conclusions and Future Work

To address the cooperative scheduling challenges of large-scale UAV swarms operating under coupled spatio-temporal constraints, this study proposes the AH-SMDP framework. The performance of this framework has been evaluated through theoretical analysis and empirical simulations across multiple scenarios. The primary findings are summarized as follows:
1.
Impact of the Hierarchical Architecture and Temporal Alignment: The proposed “macro-allocation, micro-execution” hierarchical architecture decouples task assignment from underlying execution, thereby reducing the dimensionality of the joint action space. Building upon this structure, the integration of semi-Markov modeling with event-driven temporal alignment synchronizes variable-duration physical actions with discrete decision steps. This alignment mitigates the environmental non-stationarity typically induced by asynchronous multi-agent interactions.
2.
Impact of Constraint Awareness and Resource Management: Empirical data indicate a complementary relationship between the state-aware adaptive reward and the dynamic action masking mechanisms. Through physical boundary restrictions, action masking restricts high-risk explorations under critical operating conditions. Concurrently, within the unmasked feasible domain, the adaptive reward function encourages the policy to maintain a dynamic balance between task timeliness and energy conservation. In multi-constraint scenarios, this combined strategy facilitates operational safety while demonstrating improved energy efficiency compared to baseline algorithms.
3.
Alignment of Theoretical Compatibility and Performance Trade-offs: The convergence trajectories observed in the simulation environments align with the theoretical derivations presented in Section 3.6.1. This supports the premise that policy optimization conducted within a restricted action subspace preserves the monotonicity structure of the value decomposition network. Furthermore, although the integration of multi-objective safety constraints incurs a partial efficiency trade-off in the Baseline Scenario, this strategic approach improves the system’s capacity to maintain operational safety under highly constrained scheduling conditions, illustrating a necessary trade-off to enhance overall swarm adaptability.
While the AH-SMDP framework demonstrates performance stability in highly constrained environments, several limitations warrant further investigation. Future work will incorporate practical operational factors, including communication delays, packet losses, and perception noise. By integrating graph-based methodologies, subsequent research aims to address challenges in decentralized cooperation and credit assignment under partially observable conditions. Furthermore, to mitigate the current reliance on manually configured hyperparameter thresholds, future studies will explore meta-learning-based adaptive regulation mechanisms. This approach would enable the framework to dynamically adjust constraint intensities based on varying task distributions, thereby facilitating a more resilient trade-off between operational safety and global scheduling efficiency.

Author Contributions

Conceptualization, F.W. and B.D.; methodology, F.W. and B.D.; software, B.D. and Z.G.; validation, B.D., F.T., and W.M.; formal analysis, B.D.; investigation, B.D., Z.G., and W.M.; resources, F.W.; data curation, B.D.; writing—original draft preparation, B.D.; writing—review and editing, F.W., F.T., Z.G., and W.M.; visualization, B.D.; supervision, F.W. and F.T.; project administration, F.W.; funding acquisition, F.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Research and Development Program of Henan Province, grant number 251111210400.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Proof of Semi-Markov Consistency for Transition Probabilities Under the Event-Triggered Mechanism

Theorem A1 (Semi-Markov Consistency). 
Within the dual-layer architecture of the AH-SMDP framework, defining the macro-step time interval as $\Delta T_K = \max_{i \in \mathcal{N}} \tau_{i,K}$ ensures that the upper-level joint transition probability distribution $P(S_{K+1}, \Delta T_K \mid S_K, A_K)$ depends exclusively on the current global state $S_K$ and the joint action $A_K$. Facilitated by this event-driven temporal alignment, the transition dynamics remain independent of historical state trajectories, thereby satisfying the fundamental memoryless property of a Semi-Markov Decision Process (SMDP).
Proof. 
Assuming the underlying micro-physical environment operates as a standard Markov Decision Process (MDP), it is characterized by the state transition probability $P_{\mathrm{env}}(s_{t+1} \mid s_t, u_t)$. Upon the issuance of the global command $A_K = \{a_{1,K}, \ldots, a_{N,K}\}$ by the macro-allocation layer, each agent within the swarm initiates its micro-execution policy $\pi_i^{\mathrm{lower}}(u_{i,t} \mid s_{i,t}, \Phi(a_{i,K}))$, which is conditioned on the mapped local target $\Phi(a_{i,K})$.
Considering the system’s state evolution from macro-step $K$ (corresponding to physical time $t_K$) to the subsequent step $K+1$ (corresponding to physical time $t_{K+1}$), the event-driven temporal alignment detailed in Section 3.2.3 ensures that global state resampling occurs strictly after all agents have concluded their respective actions. Consequently, the duration of this macro-step is formulated as
$$\Delta T_K = t_{K+1} - t_K = \max_{i \in \{1, \ldots, N\}} \tau_{i,K}$$
At macro-step $K+1$, the joint state $S_{K+1}$ of the swarm is defined as the collection of individual micro-states at time $t_{K+1}$, specifically:
$$S_{K+1} = \{ s_{1,\,t_K + \Delta T_K},\; s_{2,\,t_K + \Delta T_K},\; \ldots,\; s_{N,\,t_K + \Delta T_K} \}$$
Because the micro-execution policy $\pi^{\mathrm{lower}}$ remains stationary during execution and the mapped local target $\Phi(a_{i,K})$ is held constant over the interval $[t_K, t_{K+1}]$, the probability distribution of the micro-evolution trajectories is fully determined given the initial joint state $S_K$ and macro-action $A_K$. Hence, the upper-level joint transition probability can be expanded as follows:
$$P(S_{K+1}, \Delta T_K \mid S_0, A_0, S_1, A_1, \ldots, S_K, A_K)$$
Across the micro-time steps, the expanded probability distribution can be represented as the summation of the products of single-step transition probabilities over all feasible trajectories:
$$P(S_{K+1}, \Delta T_K \mid S_0, A_0, S_1, A_1, \ldots, S_K, A_K) = \sum_{\text{all feasible paths}} \; \prod_{t = t_K}^{t_K + \Delta T_K - 1} P_{\mathrm{env}}\big(s_{t+1} \mid s_t, \pi^{\mathrm{lower}}(s_t, A_K)\big)$$
Because the underlying environmental transition $P_{\mathrm{env}}$ satisfies the Markov property (i.e., $s_{t+1}$ depends solely on $s_t$ and the current action), and the temporal duration $\Delta T_K$ functions as a stopping time strictly determined by the state evolution within the interval $[t_K, t_{K+1}]$, any historical information conveyed by the macro-trajectory sequence $(S_0, A_0, \ldots, S_{K-1}, A_{K-1})$ is entirely encapsulated within the current global state $S_K$.
Consequently, the upper-level transition probability simplifies to
$$P(S_{K+1}, \Delta T_K \mid S_K, A_K)$$
This derivation establishes that, despite individual variations and the asynchronous nature of micro-execution times $\tau_{i,K}$ across the UAV swarm, the integration of event-driven temporal alignment $\Delta T_K = \max_i \tau_{i,K}$ systematically structures the state transition process. Therefore, the global scheduling process formulated by the upper-level QMIX network strictly satisfies the mathematical definition of a Semi-Markov Decision Process (SMDP). This completes the proof. □
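As a small illustration of the event-driven alignment used in the proof, the sketch below computes the macro-step duration as the maximum of the per-agent execution times and applies it as the discount exponent in a bootstrapped value target. The target form `r + gamma**ΔT * max Q` is the standard SMDP Q-learning target, shown here as an assumption; the source does not state the exact update used by the upper-level network, and all names and numbers are hypothetical.

```python
def macro_step_duration(durations):
    """Event-driven macro-step length: Delta T_K = max_i tau_{i,K}."""
    if not durations:
        raise ValueError("need at least one agent duration")
    return max(durations)

def idle_times(durations):
    """Time each agent waits for the slowest agent before the next decision."""
    delta_t = macro_step_duration(durations)
    return [delta_t - tau for tau in durations]

def smdp_target(reward, gamma, delta_t, next_q_max):
    """Standard SMDP bootstrapped target with duration-dependent discounting.

    `reward` is assumed to be the reward already accumulated over the macro-step.
    """
    return reward + (gamma ** delta_t) * next_q_max

# Three UAVs finish their assigned actions after 3, 7 and 5 micro-steps.
taus = [3, 7, 5]
delta_t = macro_step_duration(taus)
print(delta_t, idle_times(taus))
print(smdp_target(reward=1.0, gamma=0.9, delta_t=delta_t, next_q_max=10.0))
```

The global state is resampled only once every agent has finished, so the next decision depends on the state at that single synchronized instant rather than on which agent finished first.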

Appendix B

Table A1. Hyperparameter settings for the adaptive reward mechanism.

| Parameter Category | Symbol | Value | Description |
|---|---|---|---|
| Energy Penalty | $w_{e,\min}$ | 1.0 | Base energy penalty weight under sufficient battery capacity. |
| | $w_{e,\max}$ | 10.0 | Peak energy penalty weight at critical battery levels. |
| | $\kappa_e$ | 5.0 | Energy sensitivity decay coefficient. |
| Deadline Urgency | $w_{t,\min}$ | 1.0 | Baseline time penalty weight during early task phases. |
| | $w_{t,\max}$ | 5.0 | Peak time penalty weight near task deadlines. |
| | $\kappa_t$ | 10.0 | Deadline sensitivity scaling coefficient. |
| | $\rho_{\mathrm{threshold}}$ | 0.6 | Inflection threshold for deadline urgency mapping. |
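To make the role of these hyperparameters concrete, the following sketch maps battery level and deadline progress to penalty weights. The functional forms are assumptions chosen only to match the descriptions in Table A1 (an exponential decay governed by $\kappa_e$, and a logistic curve with inflection at $\rho_{\mathrm{threshold}}$ and steepness $\kappa_t$); the exact mappings used by the framework are defined in the main text and may differ.

```python
import math

# Values from Table A1.
W_E_MIN, W_E_MAX, KAPPA_E = 1.0, 10.0, 5.0
W_T_MIN, W_T_MAX, KAPPA_T, RHO_THRESHOLD = 1.0, 5.0, 10.0, 0.6

def _clamp01(x):
    return min(max(x, 0.0), 1.0)

def energy_weight(battery_fraction):
    """Assumed form: weight decays exponentially from W_E_MAX (empty battery)
    toward W_E_MIN as the remaining battery fraction grows."""
    b = _clamp01(battery_fraction)
    return W_E_MIN + (W_E_MAX - W_E_MIN) * math.exp(-KAPPA_E * b)

def deadline_weight(elapsed_ratio):
    """Assumed form: logistic rise from W_T_MIN to W_T_MAX, with the
    inflection point at RHO_THRESHOLD of the time budget consumed."""
    rho = _clamp01(elapsed_ratio)
    s = 1.0 / (1.0 + math.exp(-KAPPA_T * (rho - RHO_THRESHOLD)))
    return W_T_MIN + (W_T_MAX - W_T_MIN) * s

for b in (0.0, 0.2, 1.0):
    print("battery", b, "energy weight", round(energy_weight(b), 3))
for rho in (0.3, 0.6, 0.9):
    print("elapsed", rho, "deadline weight", round(deadline_weight(rho), 3))
```

Under these assumed forms, the energy penalty dominates only when the battery is nearly depleted, and the deadline penalty rises sharply once about 60% of the time budget has elapsed.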

References

  1. Nguyen, A.C.; Pamuklu, T.; Syed, A.; Kennedy, W.S.; Erol-Kantarci, M. Reinforcement learning-based deadline and battery-aware offloading in smart farm IoT-UAV networks. In ICC 2022-IEEE International Conference on Communications; IEEE: New York, NY, USA, 2022; pp. 189–194. [Google Scholar] [CrossRef]
  2. Zhang, M.; Yan, C.; Dai, W.; Xiang, X.; Low, K.H. Tactical conflict resolution in urban airspace for unmanned aerial vehicles operations using attention-based deep reinforcement learning. Green Energy Intell. Transp. 2023, 2, 100107. [Google Scholar] [CrossRef]
  3. Kong, X.; Zhou, Y.; Li, Z.; Wang, S. Multi-UAV simultaneous target assignment and path planning based on deep reinforcement learning in dynamic multiple obstacles environments. Front. Neurorobot. 2024, 17, 1302898. [Google Scholar] [CrossRef] [PubMed]
  4. Liu, R.; Shin, H.S.; Tsourdos, A. Edge-enhanced attentions for drone delivery in presence of winds and recharging stations. J. Aerosp. Inf. Syst. 2023, 20, 216–228. [Google Scholar] [CrossRef]
  5. Lee, S.; Lim, S.; Chae, S.H.; Jung, B.C.; Park, C.Y.; Lee, H. Optimal frequency reuse and power control in multi-UAV wireless networks: Hierarchical multi-agent reinforcement learning perspective. IEEE Access 2022, 10, 39555–39565. [Google Scholar] [CrossRef]
  6. Meng, F.; Yan, K. Multi-UAV task allocation based on improved particle swarm optimization. In 2024 4th International Symposium on Computer Technology and Information Science (ISCTIS); IEEE: New York, NY, USA, 2024; pp. 768–773. [Google Scholar] [CrossRef]
  7. Gao, X.; Wang, L.; Yu, X.; Su, X.; Ding, Y.; Lu, C.; Peng, H.; Wang, X. Conditional probability based multi-objective cooperative task assignment for heterogeneous UAVs. Eng. Appl. Artif. Intell. 2023, 123, 106404. [Google Scholar] [CrossRef]
  8. Ye, F.; Chen, J.; Sun, Q.; Tian, Y.; Jiang, T. Decentralized task allocation for heterogeneous multi-UAV system with task coupling constraints. J. Supercomput. 2021, 77, 111–132. [Google Scholar] [CrossRef]
  9. Tu, W. Efficient resource utilization for multi-flow wireless multicasting transmissions. IEEE J. Sel. Areas Commun. 2012, 30, 1246–1258. [Google Scholar] [CrossRef]
  10. Zeng, Y.; Zhang, R. Energy-efficient UAV communication with trajectory optimization. IEEE Trans. Wirel. Commun. 2017, 16, 3747–3760. [Google Scholar] [CrossRef]
  11. Tu, W. Resource-efficient seamless transitions for high-performance multi-hop UAV multicasting. Comput. Netw. 2022, 213, 109051. [Google Scholar] [CrossRef]
  12. Du, Y.; Qi, N.; Li, X.; Xiao, M.; Boulogeorgos, A.A.A.; Tsiftsis, T.A.; Wu, Q. Distributed multi-UAV trajectory planning for downlink transmission: A GNN-enhanced DRL approach. IEEE Wirel. Commun. Lett. 2024, 13, 3578–3582. [Google Scholar] [CrossRef]
  13. Yang, L.; Zheng, J.; Zhang, B. An MARL-based Task Scheduling Algorithm for Cooperative Computation in Multi-UAV-Assisted MEC Systems. In 2023 International Conference on Future Communications and Networks (FCN); IEEE: New York, NY, USA, 2023; pp. 1–6. [Google Scholar] [CrossRef]
  14. Sutton, R.S.; Precup, D.; Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artif. Intell. 1999, 112, 181–211. [Google Scholar] [CrossRef]
  15. Hurst, W.; Mostofi, Y. Optimal dynamic trajectories for UAVs in mobility-enabled relay systems. In 2023 62nd IEEE Conference on Decision and Control (CDC); IEEE: New York, NY, USA, 2023; pp. 7451–7456. [Google Scholar] [CrossRef]
  16. Diallo, E.M.; Chai, R.; Adam, A.B.; Liang, C.; Chen, Q. OHDRL-Based Energy Consumption Optimization for Joint Content Fetching and Trajectory Design of UAVs. In 2024 IEEE 29th Asia Pacific Conference on Communications (APCC); IEEE: New York, NY, USA, 2024; pp. 32–38. [Google Scholar] [CrossRef]
  17. Hu, G.; Zhu, Y.; Zhao, D.; Zhao, M.; Hao, J. Event-triggered communication network with limited-bandwidth constraint for multi-agent reinforcement learning. IEEE Trans. Neural Netw. Learn. Syst. 2021, 34, 3966–3978. [Google Scholar] [CrossRef] [PubMed]
  18. Rashid, T.; Samvelyan, M.; De Witt, C.S.; Farquhar, G.; Foerster, J.; Whiteson, S. Monotonic value function factorisation for deep multi-agent reinforcement learning. J. Mach. Learn. Res. 2020, 21, 1–51. [Google Scholar] [CrossRef]
  19. Liu, D.; Dou, L.; Zhang, R.; Zhang, X.; Zong, Q. Multi-agent reinforcement learning-based coordinated dynamic task allocation for heterogenous UAVs. IEEE Trans. Veh. Technol. 2022, 72, 4372–4383. [Google Scholar] [CrossRef]
  20. Anthony, S.M.; Kumar, T.P. Three-dimensional mobility management of unmanned aerial vehicles in flying ad-hoc networks. IEEE Access 2024, 12, 190102–190119. [Google Scholar] [CrossRef]
  21. Seerangan, K.; Nandagopal, M.; Govindaraju, T.; Manogaran, N.; Balusamy, B.; Selvarajan, S. A novel energy-efficiency framework for UAV-assisted networks using adaptive deep reinforcement learning. Sci. Rep. 2024, 14, 22188. [Google Scholar] [CrossRef] [PubMed]
  22. Rizvi, D.; Boyle, D. Multi-agent reinforcement learning with action masking for UAV-enabled mobile communications. IEEE Trans. Mach. Learn. Commun. Netw. 2024, 3, 117–132. [Google Scholar] [CrossRef]
  23. Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of PPO in cooperative multi-agent games. Adv. Neural Inf. Process. Syst. 2022, 35, 24611–24624. [Google Scholar] [CrossRef]
  24. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
  25. Singh, P.; Hazarika, B.; Singh, K.; Pan, C.; Huang, W.J.; Li, C.P. DRL-based federated learning for efficient vehicular caching management. IEEE Internet Things J. 2024, 11, 34156–34171. [Google Scholar] [CrossRef]
  26. De Alba, A.; Flores, A.; García Maya, B.; Abaunza, H. Optimizing UAV Task Allocation with Enhanced Battery Efficiency Using Semi-Markov Decision Processes. J. Intell. Robot. Syst. 2025, 111, 86. [Google Scholar] [CrossRef]
  27. Zhang, B.; Yang, K. Multi-UAV searching trajectory optimization algorithm based on deep reinforcement learning. In 2023 IEEE 23rd International Conference on Communication Technology (ICCT); IEEE: New York, NY, USA, 2023; pp. 640–644. [Google Scholar] [CrossRef]
Figure 1. Overall architecture of AH-SMDP with dual-scale temporal interaction. The blue boxes represent the neural network modules, the orange arrows denote the data flow between levels, and the dashed lines indicate the event-triggered synchronization mechanism across macro and micro levels.
Figure 2. QMIX multi-agent cooperative value decomposition structure with monotonicity constraint. The blue arrows denote the data flow, colorful circles within the mixing network represent weight parameters, and the orange dashed lines indicate the application of the monotonic constraint to the network parameters.
Figure 3. Dynamic evolution mechanism of multi-objective adaptive reward weights under different environmental states.
Figure 4. Dynamic action space pruning and decision flow based on multi-level safety boundaries.
Figure 5. Topology example of the multi-UAV cooperative simulation scenario in a multi-resource environment.
Figure 6. Comprehensive performance comparison between AH-SMDP and baselines across typical scenarios.
Figure 7. Comprehensive comparison of time, completion rate, energy consumption, and robustness among different algorithms in complex scenarios. The green dashed lines indicate the performance baseline achieved by the proposed AH-SMDP framework.
Figure 8. Scalability performance of AH-SMDP and baselines under varying UAV swarm sizes ($N \in \{10, 20, 30, 50\}$). (A) Task completion rate. (B) Total energy consumption (PPO and GA are excluded due to high failure rates). (C) Average computation time in log scale.
Figure 9. Comparison of stability, scalability, and conflict handling capabilities among different models in the ablation study.
Figure 10. Sensitivity analysis of the AH-SMDP framework regarding hyperparameters $\kappa_e$ and $\rho_{\mathrm{threshold}}$ across different swarm scales ($N = 10$ and $N = 50$). (a) Performance variations with respect to $\kappa_e$. (b) Performance variations with respect to $\rho_{\mathrm{threshold}}$. Solid lines represent mean values, and shaded areas represent the variance.
Figure 11. Comparison of convergence behavior and training stability among algorithms in multi-UAV task allocation.
Figure 12. Stability advantage analysis of AH-SMDP during reward and makespan convergence with variance bands.
Table 1. Detailed definition of the hierarchical state and action spaces in AH-SMDP.

| Layer | Variable Type | Symbol | Dimension | Physical Meaning and Description |
|---|---|---|---|---|
| Macro-Allocation Layer | Global State | $S_K$ | $\mathbb{R}^{N \times 3}$ | Includes UAV coordinates $P_{UAV}$, energy levels $E_{UAV}$, task coordinates $L_{Task}$, and deadlines $D_{Task}$. |
| | Joint Action | $A_K$ | $\mathbb{Z}^{N \times 1}$ | Discrete task assignment indices, $a_{i,K} \in \{1, 2, \ldots, M\}$, representing the ID of the task or charging station assigned to UAV $i$. |
| Micro-Execution Layer | Local State | $s_{i,t}$ | $\mathbb{R}^{1 \times 3}$ | Includes relative distance $\Delta p_{i,t}$, real-time energy level $e_{i,t}$, local wind disturbance $v_{wind,i,t}$, and task urgency $t_{urgency}$. |
| | Local Action | $u_{i,t}$ | $[-\pi, \pi]$ | Includes continuous heading angle $\theta_{i,t}$, flight velocity $v_{i,t}$, and discrete suspend/charging mode bit $m_{charge}$. |
Table 2. Quantitative analysis of the ablation study for AH-SMDP and its variants in the extreme multi-resource scenario.

| Model Variant | Removed Module Description | Makespan (min) | Makespan Degradation Ratio | Completion Rate (%) | Total Energy Consumption (Units) | Convergence Variance ($\sigma^2$) | Variance Deterioration Ratio |
|---|---|---|---|---|---|---|---|
| AH-SMDP (Ours) | Complete dual-layer adaptive model | 17.8 | - | 100 | 100 | 12.4 | - |
| H-SMDP (w/o A) | Removed adaptive reward mechanism | 24.5 | +37.6% | 85 | 185 | 18.2 | +46.7% |
| A-SMDP (w/o H) | Removed macro-level QMIX hierarchy | 42.8 | +140.4% | 65 | 240 | 45.6 | +267.7% |
| AH-MDP (w/o SMDP) | Degraded to standard discrete MDP | 38.5 | +116.2% | 70 | 280 | 94.5 | +662.1% |

Note: The data presented above are derived from the post-convergence means across five independent random-seed experiments. The degradation ratios are calculated using the performance of AH-SMDP as the baseline.
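The degradation ratios in Table 2 can be reproduced from the tabulated means with a short calculation, shown here as a sketch. It assumes the ratio is the relative change with respect to the AH-SMDP baseline, as the table note indicates; the dictionary layout and function name are illustrative.

```python
# Tabulated means from Table 2: (makespan in min, convergence variance).
BASELINE = (17.8, 12.4)  # AH-SMDP (Ours)
VARIANTS = {
    "H-SMDP (w/o A)": (24.5, 18.2),
    "A-SMDP (w/o H)": (42.8, 45.6),
    "AH-MDP (w/o SMDP)": (38.5, 94.5),
}

def degradation_pct(value, baseline):
    """Relative change versus the baseline, in percent."""
    return (value - baseline) / baseline * 100.0

for name, (makespan, variance) in VARIANTS.items():
    print(
        f"{name}: makespan {degradation_pct(makespan, BASELINE[0]):+.2f}%, "
        f"variance {degradation_pct(variance, BASELINE[1]):+.2f}%"
    )
```

The recomputed values agree with the reported ratios to within 0.1 percentage point; the small residual differences are presumably because the reported ratios were derived from unrounded means.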
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wang, F.; Ding, B.; Tian, F.; Guo, Z.; Ma, W. An Adaptive Scheduling Algorithm Integrating Hierarchical Reinforcement Learning and Semi-Markov Decision Processes. Appl. Sci. 2026, 16, 6570. https://doi.org/10.3390/app16136570
