An Adaptive Scheduling Algorithm Integrating Hierarchical Reinforcement Learning and Semi-Markov Decision Processes
Abstract
1. Introduction
1. RP1 (Temporal Modeling Mismatch): How can a modeling framework be established to integrate high-level discrete task allocation with low-level, variable-duration physical flight execution to mitigate temporal misalignment?
2. RP2 (Asynchronous Swarm Execution): How can the decision-making process be synchronized across the swarm to maintain cooperation when individual UAVs complete their assigned actions at varying times?
3. RP3 (Dynamic Constraint Adaptation): How can the scheduling policy balance competing objectives (e.g., task timeliness versus energy conservation) given dynamic drone states and operational boundaries?
1. Dual-Timescale Modeling for Variable-Duration Actions: A hierarchical SMDP framework is developed to mitigate the limitations of fixed time-step models. By decoupling task allocation from physical execution, this approach facilitates cooperative optimization across two distinct timescales, aligning more closely with actual flight dynamics.
2. Event-Driven Temporal Alignment: An event-triggered synchronization mechanism is designed to accommodate the asynchronous execution of multiple UAVs. By using task completion as the trigger for global updates, this mechanism coordinates the decision-making steps of the swarm. This approach helps mitigate the environmental non-stationarity typically introduced by uncoordinated actions, thereby promoting stable cooperation.
3. State-Aware Adaptive Regulation: To manage the trade-off between temporal constraints and energy consumption, an adaptive regulation mechanism is introduced. By combining dynamic reward weighting with an action space pruning strategy, the policy adjusts its optimization focus across different mission phases. This mechanism improves overall scheduling efficiency while operating within defined safety boundaries.
2. Related Work
2.1. Traditional Resource Scheduling and Heuristic Optimization
2.2. Multi-Agent Reinforcement Learning Cooperative Methods
2.3. Hierarchical Decision-Making and Semi-Markov Temporal Modeling
2.4. Gap Analysis
1. Timescale Integration: Many current models lack a unified framework to concurrently process macro-level task assignment and micro-level flight execution across different timescales.
2. Execution Misalignment: The absence of synchronization mechanisms for variable-duration actions can lead to temporal mismatches between high-level planning and physical execution.
3. Rigid Reward Structures: Reliance on static reward formulations restricts the system’s ability to dynamically balance temporal constraints and energy consumption in fluctuating environments.
3. Proposed Method
3.1. Methodology Overview
1. Macro-Level Task Allocation (Addressing RP1): Depicted in the upper section of Figure 1, this layer acts as the central coordinator. It uses a QMIX network to process the global swarm state and assigns macro-tasks to individual UAVs at discrete decision steps.
2. Micro-Level Variable-Duration Execution (Addressing RP1): Corresponding to the lower execution block in Figure 1, individual UAVs execute local flight actions upon receiving task assignments. By modeling this underlying execution as an SMDP, the framework represents the variable time costs associated with physical flight maneuvers.
3. Event-Driven Temporal Alignment Mechanism (Addressing RP2): Illustrated by the dashed feedback loop on the right side of Figure 1, this component serves as the temporal bridge between the macro and micro layers. Rather than updating at fixed intervals, it uses task completion events as triggers. It pauses the macro-level global update until the UAV with the longest execution time completes its assigned action, thereby coordinating the swarm’s decision steps and mitigating asynchronous mismatches.
4. State-Aware Adaptive Regulation & Action Pruning (Addressing RP3): Conceptually embedded within the micro-level action execution shown at the bottom of Figure 1, this mechanism maintains operational safety boundaries. It monitors each UAV’s remaining battery capacity and temporal constraints, adaptively adjusting reward weights and masking high-risk flight paths to significantly reduce the likelihood of individual battery depletion during execution.
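The event-driven alignment in item 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `MicroResult` container, the `aligned_macro_step` function, the summed team reward, and the value of `gamma` are all assumptions introduced here for illustration. The sketch only shows the core idea that a macro step lasts as long as the slowest UAV's action, and that an SMDP-style discount is raised to that variable duration.

```python
from dataclasses import dataclass


@dataclass
class MicroResult:
    """Outcome of one UAV executing its assigned macro-task (hypothetical container)."""
    uav_id: int
    duration: float  # physical time the action took
    reward: float    # reward accumulated during execution


def aligned_macro_step(results, gamma=0.99):
    """Event-driven alignment: the macro step ends when the slowest UAV finishes.

    Returns (tau, team_reward, discount), where tau is the macro-step duration,
    team_reward is the sum of per-UAV rewards (an assumed aggregation), and
    discount is gamma ** tau, the SMDP-style discount for the next macro state.
    """
    if not results:
        raise ValueError("at least one UAV result is required")
    tau = max(r.duration for r in results)  # wait for the longest execution
    team_reward = sum(r.reward for r in results)
    return tau, team_reward, gamma ** tau


results = [
    MicroResult(uav_id=0, duration=3.0, reward=1.0),
    MicroResult(uav_id=1, duration=5.0, reward=0.5),
    MicroResult(uav_id=2, duration=2.0, reward=2.0),
]
tau, team_reward, discount = aligned_macro_step(results, gamma=0.9)
```

In this example the macro update is triggered after 5.0 time units, the duration of the slowest UAV, rather than at a fixed interval.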
3.2. Hierarchical SMDP-Based Dual-Timescale Framework Modeling
3.2.1. Variable Definitions
3.2.2. Mathematical Definition of Cross-Layer Interaction Mechanisms
3.2.3. Event-Driven Temporal Alignment Mechanism
3.3. Macro-Level: Global Cooperative Decision-Making Based on Monotonic Value Decomposition
3.3.1. Local Utility and Global Value Decomposition
3.3.2. Value Network Update Incorporating Temporal Characteristics
3.4. Micro-Level: SMDP Execution Modeling Incorporating Temporal Characteristics
3.5. Micro-Level: State-Aware Adaptive Penalty and Safety Bounds
3.5.1. Distributed Multi-Objective Adaptive Penalty Function
3.5.2. Dynamic Action Space Pruning Under Safety Constraints
Algorithm 1. Execution Decision Logic Based on State-Aware Action Masking
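Since the body of Algorithm 1 is not reproduced here, the following Python sketch shows one plausible form of state-aware action masking. It should not be read as the authors' algorithm. The function name `masked_greedy_action`, the `reserve` safety margin, and the fallback of returning `None` when no action is feasible are assumptions made for illustration.

```python
import numpy as np


def masked_greedy_action(q_values, energy_cost, battery, reserve=0.1):
    """Pick the highest-value action among those the battery can afford.

    q_values    : estimated value of each candidate action
    energy_cost : energy each action would consume (same units as battery)
    battery     : remaining energy of this UAV
    reserve     : energy that must remain after the action (assumed margin)
    """
    q_values = np.asarray(q_values, dtype=float)
    energy_cost = np.asarray(energy_cost, dtype=float)
    feasible = energy_cost <= battery - reserve  # boolean safety mask
    if not feasible.any():
        return None  # caller falls back to a safe default, e.g. recharge
    masked_q = np.where(feasible, q_values, -np.inf)  # infeasible actions can never win
    return int(np.argmax(masked_q))


action = masked_greedy_action(
    q_values=[1.0, 3.0, 2.0],
    energy_cost=[0.2, 0.9, 0.4],
    battery=0.6,
)
```

Here the action with the highest raw value (index 1) is pruned because it would exhaust the battery, so the greedy choice falls to index 2.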
3.6. Complexity and Theoretical Consistency Analysis
3.6.1. IGM Theoretical Consistency Under Constrained Action Spaces
3.6.2. Computational and Communication Complexity Analysis
1. Time Complexity: During the decentralized execution phase, each UAV performs local inference at every physical time step t. Let $d_o$ denote the local observation dimension, $d_h$ the GRU hidden layer dimension, and $|A|$ the output action space size. The time complexity of a single decision step is dominated by matrix operations, approximated as $O(d_o d_h + d_h^2 + d_h |A|)$. Assuming parallel computation across the swarm, the total computational workload scales linearly with the number of agents N, yielding an overall complexity of $O(N (d_o d_h + d_h^2 + d_h |A|))$. This linear scaling facilitates real-time control in highly dynamic environments.
2. Communication Complexity: Compared to centralized approaches or methods relying on complex graph structures, which typically incur higher communication overhead, the AH-SMDP framework avoids continuous state broadcasting among UAVs during the micro-execution phase. The swarm limits state synchronization and instruction dispatch to $O(N)$ operations, occurring only at the event-triggered temporal alignment boundaries (i.e., at the macro-level decision nodes). This structure restricts communication requirements to linear complexity at discrete intervals, thereby improving system applicability in communication-constrained scenarios.
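The communication argument can be made concrete with a small counting sketch in Python. Both functions are hypothetical: the event-triggered count assumes one state upload and one instruction download per UAV per alignment event, and the comparison case assumes an all-to-all state broadcast at every physical time step, which is only one possible baseline and is not taken from the source.

```python
def event_triggered_messages(num_uavs, num_macro_steps):
    """One state upload plus one instruction download per UAV per alignment event."""
    return 2 * num_uavs * num_macro_steps


def per_step_broadcast_messages(num_uavs, num_time_steps):
    """Assumed baseline: every UAV sends its state to every other UAV at each step."""
    return num_uavs * (num_uavs - 1) * num_time_steps


# Hypothetical mission: 10 UAVs, 200 physical time steps, 20 alignment events.
sparse = event_triggered_messages(num_uavs=10, num_macro_steps=20)
dense = per_step_broadcast_messages(num_uavs=10, num_time_steps=200)
```

Under these assumptions the event-triggered scheme exchanges 400 messages against 18000 for the per-step broadcast, and its count grows linearly in the number of UAVs.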
4. Experiments and Results
4.1. Experimental Setup and Baselines
1. Baseline Scenario: Characterized by sufficient resources and the absence of strict temporal constraints. This scenario primarily evaluates the fundamental scheduling performance of the algorithms under ideal conditions.
2. Temporally Constrained Scenario: Imposes strict task deadlines to assess the adaptability of the algorithms under critical temporal limits.
3. Multi-Constraint Scenario: Incorporates coupled operational constraints, such as battery degradation and charging station congestion, to examine the capacity of the algorithms to maintain operational safety.
4.1.1. Baseline Algorithms and Mechanisms
4.1.2. Evaluation Metrics and Justifications
- Makespan ($T_{\text{makespan}}$): This metric evaluates the temporal efficiency of the swarm, defined as the maximum physical completion time among all N UAVs: $$T_{\text{makespan}} = \max_{1 \le i \le N} T_i,$$ where $T_i$ represents the time step at which UAV i completes its final assigned task. A lower makespan indicates higher efficiency in event-driven temporal alignment.
- Task Completion Rate ($R_{\text{comp}}$): This metric assesses the framework’s ability to satisfy strict temporal constraints, defined as the ratio of successfully completed tasks to the total number of generated tasks: $$R_{\text{comp}} = \frac{N_{\text{succ}}}{N_{\text{total}}},$$ where $N_{\text{succ}}$ is the number of tasks finished before their respective deadlines and $N_{\text{total}}$ is the total number of generated tasks.
- Total Energy Consumption ($E_{\text{total}}$): This metric aggregates the physical energy expended by the swarm over the entire mission duration: $$E_{\text{total}} = \sum_{i=1}^{N} \sum_{t} e_i(t),$$ where $e_i(t)$ denotes the instantaneous energy consumed by UAV i at time step t, encompassing both flight maneuvers and hover/standby states.
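- Failure Rate ($R_{\text{fail}}$): The failure rate quantifies the frequency of task terminations caused by boundary violations: $$R_{\text{fail}} = \frac{N_{\text{fail}}}{N_{\text{total}}},$$ where $N_{\text{fail}}$ represents the number of tasks aborted due to operational constraint violations, such as battery depletion or critical deadline breaches, and $N_{\text{total}}$ is the total number of generated tasks.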
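The four metrics can be computed from mission logs roughly as in the Python sketch below. The function names and the log layout (`energy_log[i][t]` holding the energy used by UAV i at step t) are assumptions for illustration. Normalizing the failure rate by the total number of generated tasks is also an assumption, since only the numerator is described above.

```python
def makespan(completion_times):
    """Latest completion time across all UAVs."""
    return max(completion_times)


def completion_rate(num_on_time, num_total):
    """Share of generated tasks finished before their deadlines."""
    return num_on_time / num_total


def total_energy(energy_log):
    """Sum of per-step energy over every UAV; energy_log[i][t] is UAV i at step t."""
    return sum(sum(per_uav) for per_uav in energy_log)


def failure_rate(num_failed, num_total):
    """Share of generated tasks aborted by constraint violations (assumed denominator)."""
    return num_failed / num_total


# Toy values, not experimental data.
m = makespan([12, 17, 15])
c = completion_rate(num_on_time=17, num_total=20)
e = total_energy([[1.0, 2.0], [0.5, 0.5]])
f = failure_rate(num_failed=1, num_total=20)
```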
4.2. Experimental Results and Analysis
4.3. Scalability Analysis in Large-Scale Swarms
4.4. Ablation Study of Core Mechanisms
4.5. Sensitivity and Scalability Analysis
4.6. Training Stability and Convergence Analysis
5. Conclusions and Future Work
1. Impact of the Hierarchical Architecture and Temporal Alignment: The proposed “macro-allocation, micro-execution” hierarchical architecture decouples task assignment from underlying execution, thereby reducing the dimensionality of the joint action space. Building upon this structure, the integration of semi-Markov modeling with event-driven temporal alignment structurally synchronizes variable-duration physical actions with discrete decision steps. This alignment mitigates the environmental non-stationarity typically induced by asynchronous multi-agent interactions.
2. Impact of Constraint Awareness and Resource Management: Empirical data indicate a complementary relationship between the state-aware adaptive reward and the dynamic action masking mechanisms. By enforcing physical boundary restrictions, action masking prevents high-risk exploration under critical operating conditions. Concurrently, within the unmasked feasible domain, the adaptive reward function encourages the policy to maintain a dynamic balance between task timeliness and energy conservation. In multi-constraint scenarios, this combined strategy supports operational safety while demonstrating improved energy efficiency compared to baseline algorithms.
3. Alignment of Theoretical Compatibility and Performance Trade-offs: The convergence trajectories observed in the simulation environments align with the theoretical derivations presented in Section 3.6.1. This supports the premise that policy optimization conducted within a restricted action subspace preserves the monotonicity structure of the value decomposition network. Furthermore, although the integration of multi-objective safety constraints incurs a partial efficiency cost in the Baseline Scenario, it improves the system’s capacity to maintain operational safety under highly constrained scheduling conditions, a trade-off accepted in exchange for greater overall swarm adaptability.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Proof of Semi-Markov Consistency for Transition Probabilities Under the Event-Triggered Mechanism
Appendix B
| Parameter Category | Symbol | Value | Description |
|---|---|---|---|
| Energy Penalty | | 1.0 | Base energy penalty weight under sufficient battery capacity. |
| Energy Penalty | | 10.0 | Peak energy penalty weight at critical battery levels. |
| Energy Penalty | | 5.0 | Energy sensitivity decay coefficient. |
| Deadline Urgency | | 1.0 | Baseline time penalty weight during early task phases. |
| Deadline Urgency | | 5.0 | Peak time penalty weight near task deadlines. |
| Deadline Urgency | | 10.0 | Deadline sensitivity scaling coefficient. |
| Deadline Urgency | | 0.6 | Inflection threshold for deadline urgency mapping. |
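To show how parameters of this kind might shape the adaptive penalty weights, the Python sketch below plugs the tabulated values into two assumed functional forms: an exponential ramp for the energy weight and a logistic (sigmoid) ramp for the deadline weight. These forms, the function names, and the normalized inputs `battery_frac` and `elapsed_frac` are illustrative assumptions. The exact mappings used by the authors are not reproduced in this text.

```python
import math


def energy_weight(battery_frac, w_base=1.0, w_peak=10.0, decay=5.0):
    """Assumed exponential ramp: equals w_peak at an empty battery and
    decays toward w_base as the remaining battery fraction grows."""
    return w_base + (w_peak - w_base) * math.exp(-decay * battery_frac)


def deadline_weight(elapsed_frac, w_base=1.0, w_peak=5.0, scale=10.0, inflection=0.6):
    """Assumed logistic ramp: stays near w_base early in the task window and
    approaches w_peak once the elapsed fraction passes the inflection point."""
    sigmoid = 1.0 / (1.0 + math.exp(-scale * (elapsed_frac - inflection)))
    return w_base + (w_peak - w_base) * sigmoid


w_energy_empty = energy_weight(0.0)    # battery depleted
w_energy_full = energy_weight(1.0)     # battery full
w_deadline_mid = deadline_weight(0.6)  # exactly at the inflection threshold
```

With these forms the energy weight is 10.0 at an empty battery and close to 1.0 at a full one, and the deadline weight sits at the midpoint 3.0 when 60% of the task window has elapsed.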












| Layer | Variable Type | Symbol | Dimension | Physical Meaning and Description |
|---|---|---|---|---|
| Macro-Allocation Layer | Global State | | | Includes UAV coordinates, energy levels, task coordinates, and deadlines. |
| Macro-Allocation Layer | Joint Action | | | Discrete task assignment indices representing the ID of the task or charging station assigned to UAV i. |
| Micro-Execution Layer | Local State | | | Includes relative distance, real-time energy level, local wind disturbance, and task urgency. |
| Micro-Execution Layer | Local Action | | | Includes continuous heading angle, flight velocity, and discrete suspend/charging mode bit. |
| Model Variant | Removed Module Description | Makespan (min) | Makespan Degradation Ratio | Completion Rate (%) | Total Energy Consumption (Units) | Convergence Variance | Variance Deterioration Ratio |
|---|---|---|---|---|---|---|---|
| AH-SMDP (Ours) | Complete dual-layer adaptive model | 17.8 | - | 100% | 100 | 12.4 | - |
| H-SMDP (w/o A) | Removed adaptive reward mechanism | 24.5 | +37.6% | 85% | 185 | 18.2 | +46.7% |
| A-SMDP (w/o H) | Removed macro-level QMIX hierarchy | 42.8 | +140.4% | 65% | 240 | 45.6 | +267.7% |
| AH-MDP (w/o SMDP) | Degraded to standard discrete MDP | 38.5 | +116.2% | 70% | 280 | 94.5 | +662.1% |
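The ratio columns appear to express each variant's change relative to the complete AH-SMDP model. Assuming that reading, with $x_{\text{variant}}$ and $x_{\text{full}}$ as notation introduced here for the variant's value and the full model's value, the relation and two worked instances from the table are:

$$
\text{Ratio} = \frac{x_{\text{variant}} - x_{\text{full}}}{x_{\text{full}}} \times 100\%,
\qquad
\frac{24.5 - 17.8}{17.8} \times 100\% \approx 37.6\%,
\qquad
\frac{45.6 - 12.4}{12.4} \times 100\% \approx 267.7\%.
$$

The first instance is the makespan degradation of H-SMDP (w/o A). The second is the variance deterioration of A-SMDP (w/o H).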
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Wang, F.; Ding, B.; Tian, F.; Guo, Z.; Ma, W. An Adaptive Scheduling Algorithm Integrating Hierarchical Reinforcement Learning and Semi-Markov Decision Processes. Appl. Sci. 2026, 16, 6570. https://doi.org/10.3390/app16136570

