D3PG-Light: A Lightweight and Stable Resource Scheduling Framework for UAV-Integrated Sensing, Communication, and Computation Systems
Abstract
1. Introduction
2. Related Work
2.1. Algorithm-Based Resource Scheduling
2.2. Machine-Learning-Based Resource Scheduling
3. System Model and Problem Formulation
3.1. System and Resource Description
3.2. Delay Model
3.2.1. Communication Delay
3.2.2. Computation Delay
3.2.3. Sensing Delay
3.3. Reward Function Design
- Total delay penalty: We penalize the total latency after log compression. The delay is divided by a normalization scale of 100 to keep delays comparable, and the resulting term is scaled by a dedicated delay weight.
- Sensing accuracy reward: Sensing performance improves when more sensing resources are allocated under good channel conditions. We therefore introduce an accuracy term that rewards the agent for allocating sensing resources when the channel is favorable. The term saturates as it approaches 1, so returns diminish at very high resource allocation or SNR. It enters the reward with a positive weight.
- Communication efficiency reward: To encourage efficient use of communication resources, we include an efficiency term defined as the logarithm of the fraction of incoming data successfully transmitted, that is, the ratio of the amount of data (bits) transmitted in slot t to the data arrival in that slot. This term is scaled by its own weight.
- Resource usage penalty: We apply a small penalty proportional to the total fraction of resources used, which discourages the agent from always pushing all resources to their maximum limits. The penalty coefficient is kept small.
- Energy consumption penalty: To promote sustainable operation, we incorporate an energy-aware penalty term into the reward structure. The total energy consumption of the UAV-ISCC platform in slot t is modeled as the sum of the hardware overhead from the communication, computation, and active sensing modules. The model depends on the normalized allocation fractions for communication bandwidth, computing, and sensing, and on the slot duration. The transmit power in slot t is determined by the maximum transmit power and a bandwidth–power coupling exponent, and the sensing contribution is governed by a sensing-module power coefficient. The computation term models the dynamic power consumption of the onboard CPU under a DVFS-based model, parameterized by the effective switched-capacitance coefficient and the peak CPU operating frequency. This holistic energy model captures the multi-dimensional hardware costs, encouraging the DRL agent to optimize resource allocation while avoiding excessive power depletion.
- Cascaded extreme penalty: To prevent the agent from entering danger zones where both latency and energy consumption spike beyond system tolerances, we introduce a threshold-based cascaded penalty. It is defined on the total latency in milliseconds and on the slot energy consumption, the latter scaled by the energy normalization coefficient used in the reward. This piecewise design provides strong corrective signals only when the system approaches unsafe operating regimes, thereby improving robustness while avoiding overly restrictive penalties in normal operating conditions.
- Action smoothing penalty: To suppress high-frequency mechanical oscillations and ensure stable transitions between scheduling decisions, a smoothing penalty is imposed on the variation of the resource allocation vector from one slot to the next. This term encourages the learned policy to maintain temporal continuity, which is essential for preserving the lifespan of UAV onboard actuators. The smoothing coefficient is chosen to be sufficiently small so that it regularizes extreme oscillations without dominating the primary optimization objective.
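
The reward components listed above can be combined as in the following minimal sketch. It is not the authors' implementation: the exact functional forms did not survive in this text, so the `log1p` compression, the squared-difference smoothing term, the two-stage structure of the cascaded penalty, and every weight and threshold (`RewardWeights`, `delay_limit_ms`, `energy_limit_j`) are assumptions chosen only for illustration. The accuracy and the slot energy are taken as precomputed inputs.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Illustrative placeholder weights (assumed, not the paper's values)."""
    delay: float = 1.0
    accuracy: float = 0.5
    efficiency: float = 0.3
    usage: float = 0.05
    energy: float = 0.1
    extreme: float = 2.0
    smooth: float = 0.01


def slot_reward(delay_ms, accuracy, tx_bits, arrival_bits, action, prev_action,
                energy_j, w=RewardWeights(), delay_limit_ms=50.0, energy_limit_j=6.0):
    """Composite per-slot reward; thresholds are hypothetical."""
    eps = 1e-9
    # Delay penalty: log compression with the normalization scale of 100.
    r_delay = -w.delay * math.log1p(delay_ms / 100.0)
    # Accuracy reward: the accuracy term is assumed to be already in [0, 1].
    r_acc = w.accuracy * min(accuracy, 1.0)
    # Efficiency reward: log of the fraction of arriving data transmitted.
    frac = min(1.0, tx_bits / max(arrival_bits, eps))
    r_eff = w.efficiency * math.log(max(frac, eps))
    # Usage penalty: proportional to the total fraction of resources used.
    r_usage = -w.usage * sum(action)
    # Energy penalty: proportional to the slot energy.
    r_energy = -w.energy * energy_j
    # Cascaded penalty (assumed form): the second stage fires only if the first did.
    r_extreme = 0.0
    if delay_ms > delay_limit_ms:
        r_extreme -= w.extreme
        if energy_j > energy_limit_j:
            r_extreme -= w.extreme
    # Smoothing penalty: squared change of the allocation vector between slots.
    r_smooth = -w.smooth * sum((a - b) ** 2 for a, b in zip(action, prev_action))
    return r_delay + r_acc + r_eff + r_usage + r_energy + r_extreme + r_smooth
```

With these placeholder settings, a slot with low delay and moderate energy scores higher than a slot that crosses both thresholds, and changing the allocation abruptly lowers the reward slightly.
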
3.4. MDP Formulation and Problem Description
4. Improved D3PG-Light Algorithm Design and Implementation
4.1. Design Goals and Overview
- Optimized neural network capacity: We adopt an appropriately sized network architecture and apply layer normalization at each layer output to enhance representation capability for high-dimensional state features while suppressing gradient explosion.
- Innovative Feature Fusion (IFF) module: Considering the heterogeneity of the state vector, which consists of channel-related features and queue-related features, we design specialized sub-networks to process each part separately and then fuse them at a higher level, enhancing the ability to jointly perceive different categories of information.
- Adaptive Gradient Stabilization (AGS) mechanisms: We employ a series of gradient stabilization strategies to ensure numerical stability during training and reduce the risk of gradient explosion or divergence.
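
As a rough illustration of the feature-fusion idea, the sketch below runs a forward pass in NumPy with one branch per feature group followed by a joint fusion layer. The layer widths, the ReLU activation, the order of normalization and activation, and the assumption that channel features occupy the first `channel_dim` entries of the state are all hypothetical; the class names `Branch` and `FeatureFusion` are introduced here only for the example.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each sample over its feature axis (no learned scale or shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


class Branch:
    """One dense layer with layer normalization at its output, then ReLU."""

    def __init__(self, rng, in_dim, out_dim):
        self.w = rng.normal(0.0, np.sqrt(2.0 / in_dim), size=(in_dim, out_dim))
        self.b = np.zeros(out_dim)

    def __call__(self, x):
        return np.maximum(layer_norm(x @ self.w + self.b), 0.0)


class FeatureFusion:
    """Process channel and queue features separately, then fuse them."""

    def __init__(self, channel_dim, queue_dim, hidden=32, fused=64, seed=0):
        rng = np.random.default_rng(seed)
        self.channel_dim = channel_dim
        self.channel_net = Branch(rng, channel_dim, hidden)
        self.queue_net = Branch(rng, queue_dim, hidden)
        self.fuse = Branch(rng, 2 * hidden, fused)

    def __call__(self, state):
        # Assumed layout: channel features first, queue features after them.
        channel = state[..., :self.channel_dim]
        queue = state[..., self.channel_dim:]
        joint = np.concatenate([self.channel_net(channel), self.queue_net(queue)], axis=-1)
        return self.fuse(joint)
```

Calling `FeatureFusion(4, 3)` on a batch of shape `(5, 7)` yields an embedding of shape `(5, 64)` that an actor or critic head could consume.
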
4.2. Neural Network Architecture Design
4.2.1. Innovative Feature Fusion (IFF) Module
4.2.2. Actor Network Design
4.2.3. Critic Network Design
4.2.4. LSTM Extension for Temporal Features
4.3. Adaptive Gradient Stabilization Mechanism
- Gradient norm clipping: When updating the actor or critic network parameters, we impose an upper threshold on the gradient norm. If the norm exceeds 1.0, we clip it to that maximum. This hard clipping prevents occasional gradient spikes from destabilizing the network’s convergence.
- Exploration noise scheduling: We combine different types of noise to improve exploration efficiency. D3PG-Light uses a two-stage noise decay strategy: in early training, we use temporally correlated Ornstein–Uhlenbeck (OU) noise for exploration; as training progresses, the OU noise is gradually reduced and we switch to Gaussian noise in later stages. Beta-distributed noise is a further option for more stable exploration near the action bounds. By dynamically adjusting the noise type and intensity across training stages, the agent can explore effectively while avoiding excessive oscillation.
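- Target network soft update: For both the actor and critic, we maintain a set of target network parameters that slowly track the learned network parameters. Specifically, after each update, we perform $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$, where $\theta$ and $\theta'$ denote the learned and target parameters, respectively, and $\tau$ is the soft-update coefficient; we typically use $\tau = 0.005$.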
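
The three stabilization mechanisms above can be sketched in NumPy as follows. This is an assumption-laden illustration rather than the authors' code: clipping is implemented as a global-norm rescale, and the noise schedule parameters (`switch_step`, `sigma0`, `sigma_min`, `decay`, `theta`) are hypothetical. Only the clipping threshold of 1.0 and the soft-update coefficient of 0.005 come from the text.

```python
import numpy as np


def clip_grad_norm(grads, max_norm=1.0):
    """Rescale all gradients so that their joint L2 norm is at most max_norm."""
    total = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total


def soft_update(target, online, tau=0.005):
    """In-place update: target <- tau * online + (1 - tau) * target."""
    for t, o in zip(target, online):
        t *= (1.0 - tau)
        t += tau * o


class TwoStageNoise:
    """OU noise early in training, plain Gaussian noise after switch_step."""

    def __init__(self, dim, switch_step=1000, sigma0=0.2, sigma_min=0.02,
                 decay=0.999, theta=0.15, seed=0):
        self.dim, self.switch_step = dim, switch_step
        self.sigma0, self.sigma_min = sigma0, sigma_min
        self.decay, self.theta = decay, theta
        self.rng = np.random.default_rng(seed)
        self.x = np.zeros(dim)  # OU state, mean-reverting toward zero

    def sample(self, step):
        # Noise scale decays geometrically down to a floor.
        sigma = max(self.sigma_min, self.sigma0 * self.decay ** step)
        if step < self.switch_step:
            self.x = self.x - self.theta * self.x + sigma * self.rng.standard_normal(self.dim)
            return self.x.copy()
        return sigma * self.rng.standard_normal(self.dim)
```

A gradient with norm 5 is scaled down to norm 1, while a gradient with norm 0.5 passes through unchanged, so the clip only intervenes on spikes.
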
4.4. Training Procedure and Implementation Details
| Algorithm 1. Training Procedure of D3PG-Light Framework |
| Input: UAV-ISCC environment $\varepsilon$; number of training episodes $E$; maximum steps per episode $T$; discount factor $\gamma$; soft-update coefficient $\tau$; batch size $N$; sequence length $L$; policy update delay $d$; actor and critic learning rates $\eta_{\mu}$ and $\eta_{Q}$; gradient clipping threshold $C_{\text{clip}}$; action bounds $[a_{\min}, a_{\max}]$. |
| Initialization: |
| 1: Initialize the actor network with the IFF module and LSTM structure. |
| 2: Initialize the critic networks. |
| 3: Initialize the target networks by copying the parameters of the corresponding online networks. |
| 4: Initialize the replay buffer $D$ with capacity $10^5$. |
| 5: Initialize the exploration noise scheduler $N_t$. |
| 6: Set the global training step counter $t_{\text{global}} \leftarrow 0$. |
| Training Loop: |
| 7: For episode $e = 1$ to $E$, perform the following: |
| 8: Reset the environment $\varepsilon$ and obtain the initial state. |
| 9: Initialize the LSTM hidden states for the actor and critic networks. |
| 10: For step $t = 1$ to $T$, perform the following: |
| 11: Increment the global step counter: $t_{\text{global}} \leftarrow t_{\text{global}} + 1$. |
| 12: Embed the current state using the IFF module. |
| 13: Generate a deterministic action using the actor network. |
| 14: Sample exploration noise $n_t$ from the noise scheduler $N_t$. |
| 15: Apply exploration and clip the action to $[a_{\min}, a_{\max}]$. |
| 16: Execute action $a_t$ in the environment. |
| 17: Observe the reward and the next state. |
| 18: Store the transition in $D$. |
| 19: If $D$ holds enough transitions for sampling, then |
| 20: Randomly sample $N$ state–action sequences of length $L$ from $D$. |
| 21: For each sampled sequence: |
| 22: Embed the next state using IFF. |
| 23: Compute the target action using the target actor. |
| 24: Compute the target Q-value with clipped Double Q-learning (Equation (17)). |
| 25: Update the critic networks by minimizing the Huber loss (Equation (16)). |
| 26: Update critic parameters using gradient descent with gradient clipping. |
| 27: If the policy update delay $d$ has elapsed, then |
| 28: Update the actor network by maximizing the expected Q-value (Equation (18)). |
| 29: Update actor parameters with gradient clipping. |
| 30: Soft-update the target networks (Equation (15)). |
| 31: End if. |
| 32: End if. |
| 33: Update the current state to the next state. |
| 34: If the episode has terminated, then break. |
| 35: End for. |
| 36: End for. |
| Output: |
| Trained actor policy $\mu(S \mid \theta^{\mu})$. |
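
Steps 24 and 25 of Algorithm 1 can be illustrated with the short NumPy sketch below. It assumes the usual form of clipped Double Q-learning, in which the target uses the smaller of the two target-critic estimates, together with the standard Huber loss. The termination mask `done` and the Huber threshold `delta` are assumptions, since Equations (16) and (17) are not reproduced in this text; the discount factor of 0.99 is the value listed in the parameter table.

```python
import numpy as np


def td_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Clipped double-Q target: bootstrap from the minimum of the two target critics."""
    return reward + gamma * (1.0 - done) * np.minimum(q1_next, q2_next)


def huber_loss(pred, target, delta=1.0):
    """Quadratic for small errors, linear beyond delta, averaged over the batch."""
    err = np.abs(pred - target)
    quad = np.minimum(err, delta)
    return float(np.mean(0.5 * quad ** 2 + delta * (err - quad)))
```

Taking the minimum of the two critics counteracts Q-value overestimation, and the linear tail of the Huber loss limits the influence of occasional large temporal-difference errors on the critic update.
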
5. Experiments and Results Analysis
5.1. Experiment Setup and Environment Description
5.2. Performance Analysis
5.3. Ablation Study
5.4. Sensitivity Analysis and Reward Mechanism Validation
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest








| Parameter | Symbol | Value |
|---|---|---|
| Total System Bandwidth | | |
| Minimum Bandwidth Floor | | |
| Sensing Resource Coupling Factor | | 0.15 |
| Sensing SNR Gain Factor | | 0.8 |
| Noise Power Spectral Density | | |
| Max Computing Capacity | | |
| Task Arrival Rate (Data) | | |
| Task Arrival Rate (Comp) | | |
| Time Slot Duration | | |
| Discount Factor | γ | 0.99 |
| Actor Learning Rate | ημ | |
| Critic Learning Rate | ηQ | |
| Replay Buffer Size | | 100,000 |
| Update Rate | τ | 0.005 |

| Algorithm | Model Parameters | Storage Size (MB) |
|---|---|---|
| D3PG-Light | 48,008 | 0.4578 |
| D3PG | 178,184 | 1.6993 |
| DDPG | 281,096 | 2.6807 |
| TD3 | 421,898 | 4.0235 |

| Weight Setting | P95 Delay (ms) | Accuracy | Energy (J) |
|---|---|---|---|
| Baseline | 24.54 | 0.8506 | 4.084 |
| | 28.46 | 0.8404 | 4.628 |
| | 21.14 | 0.8498 | 4.764 |
| | 26.22 | 0.8198 | 4.134 |
| | 28.81 | 0.8664 | 4.666 |
| | 21.14 | 0.8503 | 4.478 |
| | 29.41 | 0.7497 | 3.913 |
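
To put the model-size table above in relative terms, the following short calculation derives the parameter reduction of D3PG-Light against each baseline. It uses only the parameter counts reported in that table; the dictionary name `params` is introduced for the example.

```python
# Parameter counts copied from the model-size table.
params = {"D3PG-Light": 48_008, "D3PG": 178_184, "DDPG": 281_096, "TD3": 421_898}

light = params["D3PG-Light"]
# Percentage of parameters saved by D3PG-Light relative to each baseline.
reduction = {
    name: 100.0 * (1.0 - light / count)
    for name, count in params.items()
    if name != "D3PG-Light"
}

for name, pct in reduction.items():
    print(f"D3PG-Light has {pct:.1f}% fewer parameters than {name}")
```

The calculation gives reductions of roughly 73.1% against D3PG, 82.9% against DDPG, and 88.6% against TD3.
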
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Cheng, Q.; Wu, W.; Zhou, Y. D3PG-Light: A Lightweight and Stable Resource Scheduling Framework for UAV-Integrated Sensing, Communication, and Computation Systems. Sensors 2026, 26, 1829. https://doi.org/10.3390/s26061829

