Hierarchical Role-Based Multi-Agent Reinforcement Learning for UHF Radiation Source Localization with Heterogeneous UAV Swarms
Highlights
- The proposed heterogeneous multi-UAV deep reinforcement learning (HMUDRL) algorithm achieves high-precision localization and wide-area monitoring of an Ultra High Frequency (UHF) radiation source through a hierarchical architecture within heterogeneous unmanned aerial vehicle (UAV) swarms, improving localization success rate and reducing localization error.
- We demonstrate that the HMUDRL algorithm substantially reduces communication overhead and enhances system efficiency through an intra-cluster information aggregation mechanism.
- The proposed algorithm provides a practical technical solution to address key challenges faced by UAVs in UHF radiation source localization, including payload constraints, limited endurance, and adaptability to dynamic environments, thereby advancing the development of intelligent spectrum sensing technologies.
- The algorithm can be extended to electromagnetic spectrum monitoring scenarios that involve a broader range of frequency bands and more diverse types of radiation sources.
Abstract
1. Introduction
- We propose a role-based heterogeneous architecture that explicitly distinguishes cluster-head (CH) UAVs, which are responsible for network coordination, from cluster-monitoring (CM) UAVs, which are dedicated to signal sensing, thereby enabling task-specific policy learning.
- We adopt a two-level hierarchical decision-making structure in which CHs aggregate intra-cluster information. This design reduces inter-agent communication by over 80% compared to homogeneous baselines while maintaining strong collaborative performance.
- We fuse angle-of-arrival (AOA) and received signal strength (RSS) measurements from multiple CMs to jointly optimize triangulation geometry and refine position estimates. Experimental results show a 96.1% localization success rate and an 87.3% reduction in RMSE compared to state-of-the-art MARL baselines, demonstrating the effectiveness and practicality of our method for spectrum monitoring tasks.
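The AOA/RSS fusion in the third contribution can be illustrated with a minimal weighted least-squares bearing-intersection sketch. This is a hypothetical helper, not the paper's actual estimator: the function name `triangulate_aoa`, the square-root RSS weighting, and all parameters are our assumptions for illustration.

```python
import numpy as np

def triangulate_aoa(positions, bearings_deg, rss_dbm=None):
    """Weighted least-squares intersection of bearing lines.

    Each sensing UAV at positions[i] reports a bearing (degrees, math
    convention) toward the source. Optional RSS values (dBm) weight
    stronger measurements, which typically come from closer UAVs,
    more heavily. Illustrative sketch only.
    """
    positions = np.asarray(positions, dtype=float)
    theta = np.deg2rad(np.asarray(bearings_deg, dtype=float))
    # A bearing line through p_i with direction (cos t, sin t) has normal
    # n_i = (sin t, -cos t), so every point x on it satisfies n_i . x = n_i . p_i.
    A = np.stack([np.sin(theta), -np.cos(theta)], axis=1)
    b = np.einsum("ij,ij->i", A, positions)
    if rss_dbm is not None:
        # Convert dBm to a linear-scale weight (assumed heuristic).
        w = np.sqrt(10.0 ** (np.asarray(rss_dbm, dtype=float) / 10.0))
        A, b = A * w[:, None], b * w
    est, *_ = np.linalg.lstsq(A, b, rcond=None)
    return est
```

With three UAVs at (0, 0), (10, 0), and (0, 10) reporting noise-free bearings toward a source at (5, 5), the solver recovers (5, 5) exactly; with noisy bearings it returns the weighted least-squares intersection point.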
2. Related Work
2.1. Clustering Strategies
2.2. Heuristic Algorithms
2.3. Reinforcement Learning Algorithms
2.4. Research Gaps and Our Approach
- We establish a large-scale, distributed cooperative sensing and localization framework for heterogeneous UAV swarms to enhance collaborative localization efficiency.
- The HMUDRL algorithm integrates a branched network architecture to improve computational and communication efficiency, while jointly leveraging received signal power differences and AOA measurements to enhance localization accuracy.
- HMUDRL employs a multi-faceted reward mechanism that effectively balances monitoring coverage, localization accuracy, and operational efficiency.
3. System Model
3.1. CH–CM Link Establishment and Maintenance Model
3.2. UAV Kinematic Model
3.3. Spectrum Sensing Model
3.4. Communication Model
4. Problem Formulation and Analysis
5. Distributed Heterogeneous Multi-Agent Deep Reinforcement Learning Algorithm for UAV Clustering
5.1. HMUDRL Framework
5.2. HMPPO Design
5.2.1. Observation Set
5.2.2. Action Set
5.2.3. Reward
5.3. DDMPPO Design
5.3.1. Observation Set
5.3.2. Action Set
5.3.3. Reward
5.4. Training Process
| Algorithm 1. Training Process of the HMUDRL Algorithm | |
|---|---|
| 1: | Initialize policy networks for CH agents and for CM agents, value networks for CH agents and for CM agents, for CH agents and for CM agents |
| 2: | for iteration do |
| 3: | &emsp;for episode do |
| 4: | &emsp;&emsp;for each CH agent do |
| 5: | &emsp;&emsp;&emsp;Observe state |
| 6: | &emsp;&emsp;&emsp;Select action |
| 7: | &emsp;&emsp;end for |
| 8: | &emsp;&emsp;for each CM agent do |
| 9: | &emsp;&emsp;&emsp;Observe state |
| 10: | &emsp;&emsp;&emsp;Select action |
| 11: | &emsp;&emsp;end for |
| 12: | &emsp;&emsp;Execute actions in environment |
| 13: | &emsp;&emsp;Observe rewards and next states |
| 14: | &emsp;&emsp;Store transition and in and |
| 15: | &emsp;end for |
| 16: | &emsp;Compute advantages and using Equations (46) and (48) |
| 17: | &emsp;Compute target values and |
| 18: | &emsp;for do |
| 19: | &emsp;&emsp;for each CH agent do |
| 20: | &emsp;&emsp;&emsp;Compute policy losses using Equation (50) |
| 21: | &emsp;&emsp;&emsp;Compute entropy bonus using Equation (53) |
| 22: | &emsp;&emsp;&emsp;Compute value loss using Equation (56) |
| 23: | &emsp;&emsp;&emsp;Update and |
| 24: | &emsp;&emsp;end for |
| 25: | &emsp;&emsp;for each CM agent do |
| 26: | &emsp;&emsp;&emsp;Compute policy losses using Equation (51) |
| 27: | &emsp;&emsp;&emsp;Compute entropy bonus using Equation (54) |
| 28: | &emsp;&emsp;&emsp;Compute value loss using Equation (57) |
| 29: | &emsp;&emsp;&emsp;Update and |
| 30: | &emsp;&emsp;end for |
| 31: | &emsp;&emsp;Compute total loss using Equation (59) |
| 32: | &emsp;end for |
| 33: | &emsp;Update for all CH agents |
| 34: | &emsp;Update for all CM agents |
| 35: | end for |
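Algorithm 1 references the standard GAE recursion (Equations (46) and (48)) and the PPO clipped surrogate objective (Equations (50) and (51)). A generic sketch of those two standard components is shown below, using the hyperparameters from Section 6.1 (γ = 0.99, λ = 0.95, clip = 0.1); the paper's role-specific equations may add weighting and entropy terms not reproduced here.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Standard GAE recursion; `values` has length T+1 (bootstrap value last)."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        last = delta + gamma * lam * last                       # discounted sum of deltas
        adv[t] = last
    return adv

def ppo_clip_loss(ratio, adv, eps=0.1):
    """PPO clipped surrogate; ratio = pi_new(a|s) / pi_old(a|s)."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Negated so that gradient descent maximizes the surrogate objective.
    return -np.mean(np.minimum(unclipped, clipped))
```

In the HMUDRL loop, the CH and CM agents would each run this computation on their own buffers with their own value networks, which is what allows the two roles to learn distinct policies.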
6. Simulation Results
6.1. Experimental Setup and Parameters
6.2. Experimental Results and Analysis
- HMUDRL w/o Hierarchical Control (HC): disables the hierarchical control mechanism, allowing agents to act solely based on local observations without high-level coordination;
- HMUDRL w/o Dynamic Clustering (DC): replaces dynamic clustering with static, pre-defined clusters, thereby removing the adaptability to source mobility;
- HMUDRL w/o AOA: excludes AOA measurements from agent observations, relying only on RSS.
6.3. Comparative Analysis of Model Complexity
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| HMUDRL | Heterogeneous Multi-UAV Deep Reinforcement Learning |
| UHF | Ultra High Frequency |
| UAV | Unmanned Aerial Vehicle |
| CH | Cluster-Head UAVs |
| CM | Cluster-Monitoring UAVs |
| RMSE | Root Mean Square Error |
| RF | Radio Frequency |
| MARL | Multi-Agent Reinforcement Learning |
| AOA | Angle of Arrival |
| RSS | Received Signal Strength |
| DBSCAN | Density-Based Spatial Clustering of Applications with Noise |
| SNR | Signal-to-Noise Ratio |
| SOM | Self-Organizing Map |
| PSO | Particle Swarm Optimization |
| HCBGSO | Hybrid Colliding Bodies Galaxy Swarm Optimization |
| GA | Genetic Algorithms |
| DE | Differential Evolution |
| TDOA | Time Difference of Arrival |
| HPSO | Hybrid Particle Swarm Optimization |
| RL | Reinforcement Learning |
| DQN | Deep Q-Networks |
| DDPG | Deep Deterministic Policy Gradient |
| PPO | Proximal Policy Optimization |
| TS-DRL | Token-Specific Deep Reinforcement Learning |
| MADDPG | Multi-Agent Deep Deterministic Policy Gradient |
| MASAC | Multi-Agent Soft Actor-Critic |
| COMA | Counterfactual Multi-Agent |
| MAA2C | Multi-Agent Advantage Actor-Critic |
| PDF | Probability Density Function |
| CRLB | Cramér–Rao Lower Bound |
| BER | Bit Error Rate |
| DRSS | Dynamic Range-Sensitive Source Localization |
| MAPPO | Multi-Agent PPO |
| QMIX | Q-Mixing |
| HMPPO | Hierarchical Multi-Agent PPO |
| DDMPPO | Decentralized Distributed Multi-Agent PPO |
| GAE | Generalized Advantage Estimation |
| TD | Temporal Difference |
| IQR | Interquartile Range |
| MLP | Multilayer Perceptron |
| CTDE | Centralized Training Decentralized Execution |
| HC | Hierarchical Control |
| DC | Dynamic Clustering |
| CDF | Cumulative Distribution Function |
| MAE | Mean Absolute Error |
References
- Kalatzis, D.; Ploussi, A.; Spyratou, E.; Panagiotakopoulos, T.; Efstathopoulos, E.P.; Kiouvrekis, Y. Explainable AI for Spectral Analysis of Electromagnetic Fields. IEEE Access 2025, 13, 113407–113427. [Google Scholar] [CrossRef]
- Al Mahmud, K.; Kurum, M. SDR-Based S-Band Radiometer for UAS Platforms with Spectrum Monitoring and Dynamic Allocation. In Proceedings of the 2025 United States National Committee of URSI National Radio Science Meeting (USNC-URSI NRSM), Boulder, CO, USA, 7–10 January 2025; pp. 244–245. [Google Scholar]
- Chen, Y.; Zhu, Q.; Wang, J.; Jia, Z.; Wang, X.; Lin, Z.; Huang, Y.; Wu, Q.; Briso-Rodríguez, C. UAV-Aided Efficient Informative Path Planning for Autonomous 3D Spectrum Mapping. IEEE Trans. Cogn. Commun. Netw. 2025, 12, 1664–1677. [Google Scholar] [CrossRef]
- Wang, Y.; An, J.; Shao, M.; Wu, J.; Zhou, D.; Yao, X.; Zhang, X.; Cao, W.; Jiang, C.; Zhu, Y. A Comprehensive Review of Proximal Spectral Sensing Devices and Diagnostic Equipment for Field Crop Growth Monitoring. Precis. Agric. 2025, 26, 54. [Google Scholar] [CrossRef]
- Testi, E.; Giorgetti, A. Wireless Network Analytics for the New Era of Spectrum Patrolling and Monitoring. IEEE Wirel. Commun. 2024, 31, 230–236. [Google Scholar] [CrossRef]
- Fang, Z.; Savkin, A.V. Strategies for Optimized UAV Surveillance in Various Tasks and Scenarios: A Review. Drones 2024, 8, 193. [Google Scholar] [CrossRef]
- Javed, S.; Hassan, A.; Ahmad, R.; Ahmed, W.; Ahmed, R.; Saadat, A.; Guizani, M. State-of-the-Art and Future Research Challenges in UAV Swarms. IEEE Internet Things J. 2024, 11, 19023–19045. [Google Scholar] [CrossRef]
- Chen, J.; Zhang, Z.; Fan, D.; Hou, C.; Zhang, Y.; Hou, T.; Zou, X.; Zhao, J. Distributed Decision Making for Electromagnetic Radiation Source Localization Using Multi-Agent Deep Reinforcement Learning. Drones 2025, 9, 216. [Google Scholar] [CrossRef]
- Gao, R.; Yan, G.; Niu, R.; Chang, W.; Yan, T.; Tang, C. A Novel Spectrum Sensing Method for Multiple Unknown Signal Sources Using Frequency Domain Energy Detection and DBSCAN. IEEE Access 2025, 13, 76811–76837. [Google Scholar] [CrossRef]
- Radhi, A.A.; Abdullah, H.N.; Akkar, H.A.R. Denoised Jarque-Bera Features-Based K-Means Algorithm for Intelligent Cooperative Spectrum Sensing. Digit. Signal Process. 2022, 129, 103659. [Google Scholar] [CrossRef]
- Fouda, H.S.; Farghaly, S.I.; Dawood, H.S. Weighted Joint LRTs for Cooperative Spectrum Sensing Using K-Means Clustering. Phys. Commun. 2024, 67, 102528. [Google Scholar] [CrossRef]
- Tao, B.; Wu, J.; Dou, X.; Wang, J.; Xu, Y. Memorial K-Means Clustering for Cooperative Spectrum Sensing in Cognitive Wireless Sensor Networks at Low SNR Regimes. Sens. Rev. 2025, 45, 443–452. [Google Scholar] [CrossRef]
- Konink-Donner, E.; Ruen, A.; Jha, R. Clustering RF Signals with the Growing Self-Organizing Map for Dynamic Spectrum Access. In Proceedings of the NAECON 2023—IEEE National Aerospace and Electronics Conference, Dayton, OH, USA, 28–31 August 2023; pp. 249–253. [Google Scholar]
- Zhang, W.; Zhang, W. An Efficient UAV Localization Technique Based on Particle Swarm Optimization. IEEE Trans. Veh. Technol. 2022, 71, 9544–9557. [Google Scholar] [CrossRef]
- Wang, K.; Kooistra, L.; Pan, R.; Wang, W.; Valente, J. UAV-based Simultaneous Localization and Mapping in Outdoor Environments: A Systematic Scoping Review. J. Field Robot. 2024, 41, 1617–1642. [Google Scholar] [CrossRef]
- Dixit, A.; Devi, M.N.N.; Gazi, F.; Hussain, M.M. OAL-HMT: Optimized AAV Localization Using Hybrid Metaheuristic Techniques. IEEE J. Indoor Seamless Position. Navig. 2025, 3, 142–151. [Google Scholar] [CrossRef]
- Bandari, S.; Nirmala Devi, L. A Multi-Objective Approach for Optimal Target Coverage UAV Placement: Hybrid Heuristic Formulation. J. Control Decis. 2025, 12, 551–567. [Google Scholar] [CrossRef]
- Chen, F.; Li, H.; Lin, Z.; Zhu, Q.; Zhong, W.; Chen, X.; Zhou, J.; Li, H. Optimized Genetic Algorithm-Based Multi-UAV Cooperative TDOA Localization for Complex Multipath Scenarios. In Proceedings of the 2024 IEEE 24th International Conference on Communication Technology (ICCT), Chengdu, China, 18–20 October 2024; pp. 1283–1287. [Google Scholar]
- Arafat, M.Y.; Moh, S. Localization and Clustering Based on Swarm Intelligence in UAV Networks for Emergency Communications. IEEE Internet Things J. 2019, 6, 8958–8976. [Google Scholar] [CrossRef]
- Ebrahimi, D.; Sharafeddine, S.; Ho, P.-H.; Assi, C. Autonomous UAV Trajectory for Localizing Ground Objects: A Reinforcement Learning Approach. IEEE Trans. Mob. Comput. 2021, 20, 1312–1324. [Google Scholar] [CrossRef]
- Shurrab, M.; Mizouni, R.; Singh, S.; Otrok, H. Reinforcement Learning Framework for UAV-Based Target Localization Applications. Internet Things 2023, 23, 100867. [Google Scholar] [CrossRef]
- Guan, Q.; Cao, H.; Tan, J.; Jia, L.; Yan, D.; Chen, B. Token-Specific Deep Reinforcement Learning for Energy-Efficient Capacitated Electric Vehicle Routing Problems. Appl. Energy 2025, 396, 126314. [Google Scholar] [CrossRef]
- Hou, Y.; Zhao, J.; Zhang, R.; Cheng, X.; Yang, L. UAV Swarm Cooperative Target Search: A Multi-Agent Reinforcement Learning Approach. IEEE Trans. Intell. Veh. 2024, 9, 568–578. [Google Scholar] [CrossRef]
- Qin, Y.; Zhang, Z.; Li, X.; Huangfu, W.; Zhang, H. Deep Reinforcement Learning Based Resource Allocation and Trajectory Planning in Integrated Sensing and Communications UAV Network. IEEE Trans. Wirel. Commun. 2023, 22, 8158–8169. [Google Scholar] [CrossRef]
- Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S. Counterfactual Multi-Agent Policy Gradients. arXiv 2024, arXiv:1705.08926v3. [Google Scholar] [CrossRef]
- Wang, Q.; Xu, W.; Chen, H.-H. A Heterogeneous-Agent Deep Reinforcement Learning Approach for Dynamic Spectrum Access in Cognitive Wireless Networks. IEEE Trans. Cogn. Commun. Netw. 2025, 12, 2221–2235. [Google Scholar] [CrossRef]
- Ekechi, C.C.; Elfouly, T.; Alouani, A.; Khattab, T. A Survey on UAV Control with Multi-Agent Reinforcement Learning. Drones 2025, 9, 484. [Google Scholar] [CrossRef]
- Wang, M.; Chen, P.; Cao, Z.; Chen, Y. Reinforcement Learning-Based UAVs Resource Allocation for Integrated Sensing and Communication (ISAC) System. Electronics 2022, 11, 441. [Google Scholar] [CrossRef]
- Jiang, K.; Tian, K.; Feng, H.; Zhao, Y.; Wang, D.; Gao, J.; Cao, S.; Zhang, X.; Li, Y.; Yuan, J.; et al. Distributed UAV Swarm Augmented Wideband Spectrum Sensing Using Nyquist Folding Receiver. IEEE Trans. Wirel. Commun. 2024, 23, 14171–14184. [Google Scholar] [CrossRef]
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
- Liao, X.; Wang, Y.; Han, Y.; Li, Y.; Lin, C.; Zhu, X. Heterogeneous Multi-Agent Deep Reinforcement Learning for Cluster-Based Spectrum Sharing in UAV Swarms. Drones 2025, 9, 377. [Google Scholar] [CrossRef]
- Vo, V.N.; Nguyen, L.-M.-D.; Tran, H.; Dang, V.-H.; Niyato, D.; Cuong, D.N.; Luong, N.C.; So-In, C. Outage Probability Minimization in Secure NOMA Cognitive Radio Systems With UAV Relay: A Machine Learning Approach. IEEE Trans. Cogn. Commun. Netw. 2023, 9, 435–451. [Google Scholar] [CrossRef]
- Mustafa Abro, G.E.; Abdallah, A.M. Graph Attention Networks For Anomalous Drone Detection: RSSI-Based Approach with Real-World Validation. Expert Syst. Appl. 2025, 273, 126913. [Google Scholar] [CrossRef]
- Chen, L.; You, C.; Wang, Y.; Li, X. Variable-Speed UAV Path Optimization Based on the CRLB Criterion for Passive Target Localization. Sensors 2025, 25, 5297. [Google Scholar] [CrossRef]
- Vinh Hien, D.; Le Hoang Anh, N. Nonlinear Scalarizations in Set Optimization with Variable Ordering Structures and Applications. J. Appl. Math. Comput. 2025, 71, 1609–1630. [Google Scholar] [CrossRef]
- Liang, J.; Ban, X.; Yu, K.; Qu, B.; Qiao, K.; Yue, C.; Chen, K.; Tan, K.C. A Survey on Evolutionary Constrained Multiobjective Optimization. IEEE Trans. Evol. Comput. 2023, 27, 201–221. [Google Scholar] [CrossRef]
- Yu, K.; Yang, Z.; Liang, J.; Qiao, K.; Qu, B.; Suganthan, P.N. An Individual Adaptive Evolution and Regional Collaboration Based Evolutionary Algorithm for Large-Scale Constrained Multiobjective Optimization Problems. Swarm Evol. Comput. 2025, 95, 101925. [Google Scholar] [CrossRef]
- Li, K.; Lai, G.; Yao, X. Interactive Evolutionary Multiobjective Optimization via Learning to Rank. IEEE Trans. Evol. Comput. 2023, 27, 749–763. [Google Scholar] [CrossRef]
- Apaza, R.D.; Han, R.; Li, H.; Knoblock, E.J. Intelligent Spectrum and Airspace Resource Management for Urban Air Mobility Using Deep Reinforcement Learning. IEEE Access 2024, 12, 164750–164766. [Google Scholar] [CrossRef]
- Xing, N.; Zong, Q.; Dou, L.; Tian, B.; Wang, Q. A Game Theoretic Approach for Mobility Prediction Clustering in Unmanned Aerial Vehicle Networks. IEEE Trans. Veh. Technol. 2019, 68, 9963–9973. [Google Scholar] [CrossRef]
| Algorithm | Supports Heterogeneous Agents | Supports Hierarchical Architecture | Communication Mechanism | Scalability |
|---|---|---|---|---|
| MADDPG [23] | Limited support | Yes, with modifications | Centralized critic; all-to-all message passing during training | Poor (critic input scales quadratically) |
| MASAC [24] | Limited support | Theoretically feasible but inefficient | Centralized training with shared critics; dense inter-agent messaging | Moderate (high parameter overhead) |
| COMA [25] | Limited support | Yes, with modifications | Centralized critic with counterfactual baseline; full joint-action enumeration | Poor (exponential in number of agents) |
| MAA2C [26] | Limited support | Yes, with modifications | Fully decentralized; no explicit coordination | Moderate (weak collaboration) |
| MAPPO [8] | Yes, highly flexible | Yes, with modifications | Global state for centralized critic | Moderate (global critic causes memory bottleneck) |
| HMUDRL (Ours) | Yes | Yes | Cluster-based aggregation; CH broadcasts fused info | Good (complexity decoupled from swarm size) |
| System Parameter | Numerical Setting | Description |
|---|---|---|
| | 3 | number of CHs |
| | 14 | number of CMs |
| episodes | 1000 | number of training episodes |
| | 100 | time steps per episode |
| | 0.0001 | actor learning rate |
| | 0.001 | critic learning rate |
| | 0.99 | discount factor |
| | 0.95 | GAE lambda |
| | 0.1 | policy update clipping parameter |
| | 0.6 | weight of CH reward in joint reward |
| | 0.4 | weight of CM reward in joint reward |
| | 0.6 | weight of localization confidence reward |
| | 0.2 | weight of coverage quality reward |
| | 0.2 | weight of exploration diversity reward |
| | 0.5 | weight of RSS gain reward |
| | 0.2 | weight of distance-based reward to CH |
| | 0.3 | weight of valid AOA reward |
| | −20 | reference RSS value at 1 m |
| | 2.0 | path loss exponent |
| | 2.0 | RSS measurement standard deviation (dB) |
| | 5.0 | AOA measurement noise standard deviation (°) |
| | −70.0 | minimum RSS threshold for AOA estimation (dBm) |
| | 20 | localization success threshold (m) |
| | 1 | safe distance (m) |
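The RSS-related entries in the parameter table are consistent with the standard log-distance path-loss model. The sketch below assumes the −20 reference value is in dBm at 1 m (the unit is not stated in the table); the function name and its defaults are ours.

```python
import numpy as np

def rss_dbm(d_m, rss0=-20.0, n_pl=2.0):
    """Log-distance path loss: RSS(d) = RSS0 - 10 * n * log10(d / 1 m).

    rss0: reference RSS at 1 m (unit assumed dBm); n_pl: path loss
    exponent (2.0 = free space, matching the table).
    """
    d_m = np.maximum(np.asarray(d_m, dtype=float), 1e-3)  # avoid log10(0)
    return rss0 - 10.0 * n_pl * np.log10(d_m)
```

Under these settings, the −70 dBm threshold for valid AOA estimation is crossed near d = 10^((−20 − (−70)) / 20) ≈ 316 m, which bounds the range at which CMs can contribute bearing measurements before the 2 dB measurement noise is added.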
| Algorithm Name | RMSE Mean (m) | RMSE Median (m) | RMSE Max (m) | RMSE Min (m) | Localization Success Rate (%) | Steps to First Detection |
|---|---|---|---|---|---|---|
| HMUDRL | 39.34 | 22.83 | 678.11 | 0.89 | 96.1 | 6 |
| MADDPG [23] | 620.43 | 666.15 | 1125.30 | 12.06 | 93.7 | 5 |
| MASAC [24] | 395.88 | 297.76 | 1774.58 | 13.15 | 94.6 | 8 |
| COMA [25] | 135.02 | 30.07 | 890.99 | 0.36 | 96.3 | 6 |
| MAA2C [26] | 82.70 | 28.89 | 992.92 | 0.76 | 94.6 | 8 |
Exchange frequency (times/episode) and communication savings ratio for different numbers of clusters:

| Cluster Configuration | Heterogeneous (CH + CM) | Homogeneous (k-NN, k = 5) | Homogeneous (Fully Connected) | Savings vs. k-NN, k = 5 (%) | Savings vs. Fully Connected (%) |
|---|---|---|---|---|---|
| 1 CH + 8 CM | 900 | 4500 | 7200 | 80.0 | 87.5 |
| 3 CH + 14 CM | 1500 | 8500 | 27,200 | 82.4 | 94.5 |
| 5 CH + 20 CM | 2100 | 12,500 | 60,000 | 83.2 | 96.5 |
| 7 CH + 30 CM | 3100 | 18,500 | 133,200 | 83.2 | 97.7 |
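The counts in the communication table are reproduced exactly by a simple per-step message model: each CM sends one uplink to its CH and one fused broadcast is issued per time step, each k-NN agent messages its k nearest neighbors, and the fully connected swarm is all-to-all. This model is our inference from the reported numbers, not something stated explicitly in the text.

```python
def message_counts(n_ch, n_cm, steps=100, k=5):
    """Per-episode message counts under three communication topologies.

    Inferred model (our assumption): heterogeneous = one CM->CH uplink
    per CM per step plus one fused broadcast per step; k-NN and fully
    connected are per-agent neighbor counts times the step count.
    """
    n = n_ch + n_cm
    hetero = (n_cm + 1) * steps       # CM uplinks + fused broadcast
    knn = n * k * steps               # each agent -> k nearest neighbors
    full = n * (n - 1) * steps        # all-to-all exchange
    savings = (1 - hetero / knn, 1 - hetero / full)
    return hetero, knn, full, savings
```

For the 3 CH + 14 CM configuration this gives 1500, 8500, and 27,200 exchanges per episode with savings of 82.4% and 94.5%, matching the table row for row; the savings ratio rises with swarm size because the heterogeneous count grows linearly while the fully connected count grows quadratically.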
| Method | Median Error (m) | 90th-Percentile Error (m) | MAE (m) |
|---|---|---|---|
| HMUDRL | 22.83 | 66.86 | 43.54 |
| HMUDRL w/o HC | 32.11 | 97.54 | 67.95 |
| HMUDRL w/o DC | 22.75 | 66.96 | 50.12 |
| HMUDRL w/o AOA | 574.91 | 833.92 | 559.53 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Sun, Y.; Zhang, X.; Wang, M.; Yang, Y.; Xia, T.; Zhu, X.; Cui, T. Hierarchical Role-Based Multi-Agent Reinforcement Learning for UHF Radiation Source Localization with Heterogeneous UAV Swarms. Drones 2026, 10, 54. https://doi.org/10.3390/drones10010054