Article

Hierarchical Role-Based Multi-Agent Reinforcement Learning for UHF Radiation Source Localization with Heterogeneous UAV Swarms

1 Graduate School, National University of Defense Technology, Wuhan 430035, China
2 Staff Department, Naval Aviation University, Yantai 264001, China
3 Department of Electromagnetic Spectrum Management and Cybersecurity, Information Support Force Engineering University, Wuhan 430035, China
* Authors to whom correspondence should be addressed.
Drones 2026, 10(1), 54; https://doi.org/10.3390/drones10010054
Submission received: 19 November 2025 / Revised: 5 January 2026 / Accepted: 9 January 2026 / Published: 12 January 2026
(This article belongs to the Special Issue Cooperative Perception, Planning, and Control of Heterogeneous UAVs)

Highlights

What are the main findings?
  • The proposed heterogeneous multi-UAV deep reinforcement learning (HMUDRL) algorithm achieves high-precision localization and wide-area monitoring of an Ultra High Frequency (UHF) radiation source through a hierarchical architecture within heterogeneous unmanned aerial vehicle (UAV) swarms, improving localization success rate and reducing localization error.
  • We demonstrate that the HMUDRL algorithm substantially reduces communication overhead and enhances system efficiency through an intra-cluster information aggregation mechanism.
What are the implications of the main finding?
  • The proposed algorithm provides a practical technical solution to address key challenges faced by UAVs in UHF radiation source localization, including payload constraints, limited endurance, and adaptability to dynamic environments, thereby advancing the development of intelligent spectrum sensing technologies.
  • The application of this algorithm enables electromagnetic spectrum monitoring scenarios involving a broader range of frequency bands and diverse types of radiation sources.

Abstract

With the continuous proliferation of radio frequency devices, electromagnetic environments in various regions are becoming increasingly complex. Effective monitoring of the electromagnetic environment and identification of interference sources have thus become critical tasks for maintaining order in the electromagnetic spectrum. In recent years, rapid advances in UAV technology have spurred exploration of UAV-based electromagnetic spectrum monitoring as a novel approach. However, the limited payload capacity and endurance of UAVs constrain their monitoring capabilities. To address these challenges, we propose HMUDRL, a distributed heterogeneous multi-agent deep reinforcement learning algorithm. By leveraging cooperative operation between cluster-head UAVs (CH) and cluster-monitoring UAVs (CM) within a heterogeneous UAV swarm, HMUDRL enables high-precision detection and wide-area localization of UHF radiation sources. Furthermore, we integrate a minimum-gap localization algorithm that exploits the spatial distribution of multiple CM to accurately pinpoint anomalous radiation sources. Simulation results validate the effectiveness of HMUDRL: in the later stages of training, the success rate of localizing target radiation sources converges to 96.1%, representing an average improvement of 1.8% over baseline algorithms; localization accuracy, measured by root mean square error (RMSE), is enhanced by approximately 87.3% compared to baselines; and communication overhead is reduced by more than 80% relative to homogeneous architectures. These results demonstrate that HMUDRL effectively addresses the challenges of data transmission control and sensing-localization performance faced by UAVs in UHF spectrum monitoring.

1. Introduction

With the continuous proliferation of radio frequency (RF) devices, electromagnetic environments in various regions have become increasingly complex. Effectively monitoring these environments and identifying interference sources has thus become a critical task for maintaining electromagnetic spectrum order [1,2]. The UHF band is widely used in aviation navigation, maritime communications, railway dispatching, and emergency broadcasting due to its favorable propagation characteristics, yet it is also highly vulnerable to illegal transmissions and malicious interference. Conventional ground-based monitoring systems often struggle to promptly localize interference sources in complex terrains such as urban or mountainous areas due to line-of-sight blockages and limited coverage [3,4].
UAV-based spectrum monitoring has emerged as a powerful alternative due to its flexible deployment and extensive spatial coverage [5]. However, existing multi-agent reinforcement learning (MARL) methods for UAV-based source localization typically adopt homogeneous agent designs or centralized critic architectures [6,7,8], which face several practical bottlenecks in real-world monitoring scenarios. Most current studies focus on small-scale monitoring setups with homogeneous UAV swarms; as the number of UAVs and monitoring targets increases, information exchange via fully connected or dense k-NN communication topologies incurs prohibitively high communication overhead, making system scalability challenging. Moreover, constrained by limited onboard power and payload capacity, a single UAV platform struggles to simultaneously perform spectrum sensing, data processing, and relaying tasks, thereby restricting its operational capabilities in practice. Additionally, few approaches jointly coordinate motion planning with high-precision geometric estimation, leading to suboptimal localization accuracy.
To address these challenges, this paper proposes HMUDRL, a hierarchical heterogeneous MARL framework specifically designed for UHF radiation source localization. Compared with prior works, our approach introduces fundamental innovations in three aspects:
  • We propose a role-based heterogeneous architecture that explicitly distinguishes CH, which are responsible for network coordination, from CM, which are dedicated to signal sensing, thereby enabling task-specific policy learning.
  • We adopt a two-level hierarchical decision-making structure in which CHs aggregate intra-cluster information. This design reduces inter-agent communication by over 80% compared to homogeneous baselines while maintaining strong collaborative performance.
  • We fuse angle-of-arrival (AOA) and received signal strength (RSS) measurements from multiple CMs to jointly optimize triangulation geometry and refine position estimates. Experimental results show a 96.1% localization success rate and an 87.3% reduction in RMSE compared to state-of-the-art MARL baselines, demonstrating the effectiveness and practicality of our method for spectrum monitoring tasks.
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the system model. Section 4 formulates and analyzes the problem. Section 5 details the proposed HMUDRL algorithm. Simulation results are provided in Section 6. Finally, Section 7 concludes the paper.

2. Related Work

Current approaches to electromagnetic spectrum monitoring using heterogeneous UAVs primarily rely on clustering strategies combined with heuristic and reinforcement learning algorithms for sensing and detection. This section reviews the key advancements in these methods and identifies existing research gaps.

2.1. Clustering Strategies

A single-node UAV exhibits limitations in terms of coverage, timeliness, and accuracy in electromagnetic spectrum monitoring. With the continuous improvement of communication network capacity, the number of UAVs operating simultaneously within a network can be significantly increased. Clustering strategies effectively mitigate channel congestion during monitoring data transmission among multiple UAVs, ensuring high-quality communication links between CH and CM, thereby facilitating rapid and reliable transmission of localization data. Furthermore, clustering enables CM within each subgroup to achieve favorable spatial geometric distributions, which enhances collaborative coverage over larger monitoring areas and supports sustained tracking of mobile or sporadic radiation sources. By leveraging heterogeneous CH to aggregate sensing data from CM, redundant data transmission is reduced, improving overall system efficiency. Gao et al. [9] integrated frequency-domain energy detection, frequency-domain noise estimation, and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) to develop a spectrum sensing method capable of accurately detecting multiple unknown signals. Various clustering strategies have been shown to enhance network performance, scalability, and robustness in UAV swarms. Conventional distance-based clustering algorithms, such as K-means, have demonstrated effectiveness in addressing cooperative spectrum sensing under low signal-to-noise ratio (SNR) conditions [10,11,12]. Elliott et al. [13] employed a Growing Self-Organizing Map (SOM) algorithm to adaptively mine spectrum data, enabling more accurate awareness of dynamic spectrum situations.

2.2. Heuristic Algorithms

Traditional heuristic algorithms have demonstrated notable effectiveness in addressing monitoring and localization problems in specific environments [14,15]. Heuristic algorithms are optimization methods based on empirical rules or simulations of natural phenomena. Their core principle is to rapidly obtain acceptable solutions through approximate-optimal search strategies when a complete problem model is unavailable or computational resources are limited. Compared with exact algorithms (e.g., exhaustive search or dynamic programming), heuristics generally exhibit lower time complexity and greater robustness, making them particularly suitable for high-dimensional, nonlinear, and multi-constrained complex optimization tasks, such as UAV path planning, spectrum sensing scheduling, and radiation source localization. Awadhesh et al. [16] integrated Particle Swarm Optimization (PSO) with fuzzy logic to develop a bio-inspired metaheuristic hybrid model that dynamically adjusts UAV flight trajectories based on real-time RSS and AOA, thereby accelerating localization convergence while reducing energy consumption. Spandana et al. [17] proposed a Hybrid Colliding Bodies Galaxy Swarm Optimization (HCBGSO) algorithm that simulates celestial gravitational and collision mechanisms to dynamically reposition UAVs in three-dimensional space, significantly improving coverage efficiency and localization stability for sporadic radiation sources. Other studies have employed Genetic Algorithms (GA) or Differential Evolution (DE) to post-process Time Difference of Arrival (TDOA) or AOA measurements, enhancing localization accuracy. For instance, Chen et al. [18] introduced a GA-TDOA algorithm that searches the solution space of nonlinear equations via genetic operations to estimate optimal radiation source coordinates, effectively mitigating localization drift caused by multipath interference. Moreover, heuristic algorithms naturally support multi-objective optimization frameworks. 
For example, the Hybrid Particle Swarm Optimization (HPSO) algorithm [19] simultaneously optimizes clustering structure, flight paths, and sensing schedules in UAV swarms, significantly extending system endurance while maintaining high localization success rates. Despite their strong performance in specific scenarios, heuristic algorithms suffer from several limitations: their efficacy is highly sensitive to parameter tuning and initial solution quality, and they generally lack theoretical convergence guarantees. More critically, conventional heuristic approaches are typically designed for "one-shot" problem solving and struggle to adapt to dynamically changing electromagnetic environments, such as those involving mobile emitters or sudden interference, and cannot effectively leverage historical experience for knowledge transfer.

2.3. Reinforcement Learning Algorithms

In recent years, Reinforcement Learning (RL) has emerged as a powerful tool for addressing autonomous control and cooperative optimization problems in complex dynamic environments, owing to its remarkable performance in sequential decision-making tasks. Electromagnetic spectrum monitoring and radiation source localization involve highly uncertain environments, characterized by fluctuating signal strength, mobile interference sources, and terrain occlusion, where traditional model-based or rule-based approaches often fall short. In contrast, RL enables agents to learn optimal policies through continuous interaction with the environment by maximizing cumulative rewards, without requiring an accurate prior model of the environment. Single-agent RL algorithms, such as Deep Q-Networks (DQN), Deep Deterministic Policy Gradient (DDPG), and Proximal Policy Optimization (PPO), have been applied to path planning and spectrum sensing scheduling for individual UAVs. Ebrahimi et al. [20] employed Q-learning to improve localization accuracy for multiple targets while minimizing time and path length. Shurrab et al. [21] developed a DQN-based UAV target localization framework, leveraging DQN as a scalable extension of Q-learning that replaces the conventional Q-table with a neural network to handle large or continuous state spaces, thereby enhancing model generalization and adaptability in dynamic environments. Recently, Guan et al. [22] proposed a token-specific deep reinforcement learning (TS-DRL) framework for the capacitated electric vehicle routing problem, achieving significant energy savings by dynamically adapting routing policies to node-specific characteristics. As monitoring tasks evolve toward larger coverage, higher precision, and greater dynamism, single-agent frameworks are increasingly inadequate for multi-UAV cooperation. Consequently, Multi-Agent Reinforcement Learning (MARL) has become a focal point of research.
Its key advantage lies in enabling complex tasks beyond the capability of a single agent through distributed decision-making and coordinated interaction among multiple agents—particularly well-suited for cooperative monitoring and localization using heterogeneous UAV swarms. Hou et al. [23] adopted the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm, which employs a centralized Critic network that utilizes the states and actions of all agents to evaluate policy value, while each agent maintains its own decentralized Actor network to generate actions. This architecture mitigates the non-stationarity caused by multi-agent interactions while preserving decentralized execution—aligning well with the requirement for collaborative decision-making and independent operation in UAV swarm monitoring. Qin et al. [24] proposed the Multi-Agent Soft Actor-Critic (MASAC) algorithm based on soft policy gradient theory. Building upon MADDPG, MASAC incorporates entropy regularization and utilizes multiple Critic networks to reduce value estimation bias. By maximizing policy entropy, the algorithm encourages exploration of potentially optimal actions, thereby enhancing robustness and generalization. In [25], the Counterfactual Multi-Agent Policy Gradients (COMA) algorithm was employed, which introduces a counterfactual baseline to address the credit assignment problem in multi-agent settings. This mechanism effectively differentiates the contribution of each agent to the collective task outcome. Wang et al. [26] applied the Multi-Agent Advantage Actor-Critic (MAA2C) algorithm, which combines independent Actor networks for each agent with a shared Critic network. This design integrates distributed policy learning with centralized value estimation. By leveraging advantage function estimation to reduce variance in value assessment and supporting parallel training, MAA2C achieves high training efficiency, making it suitable for rapid deployment in large-scale UAV swarms.

2.4. Research Gaps and Our Approach

Compared with traditional clustering and heuristic algorithms, data-driven approaches such as reinforcement learning offer stronger generalization and online adaptation capabilities [27]. However, existing MARL methods still face critical challenges when applied to electromagnetic source localization tasks using heterogeneous UAV swarms. To address these limitations, we propose the HMUDRL framework for radiation source localization. Unlike the homogeneous multi-agent setup in [8], which assumes identical agent capabilities and employs centralized training, HMUDRL explicitly incorporates role-based heterogeneous modeling by introducing dedicated CH and CM agents, each equipped with distinct observation spaces, action spaces, and reward structures. The centralized critic network in [23] suffers from input dimensionality that grows linearly with swarm size, leading to poor scalability. In contrast, HMUDRL adopts a decentralized two-level PPO architecture that eliminates dependence on global state information, thereby supporting large-scale deployment. Similarly, while the approach in [24] requires each agent to maintain multiple policy networks, resulting in high computational overhead, our method assigns only one lightweight PPO network per agent type, significantly improving efficiency. Furthermore, unlike [28], which relies on a single UAV or a centrally coordinated swarm for radio source localization, HMUDRL enables fully distributed decision-making without access to global information, a crucial advantage in communication-constrained environments. The COMA algorithm used in [25] requires complete action logs from all agents to compute counterfactual baselines, incurring substantial communication costs. By comparison, HMUDRL’s cluster-level information aggregation mechanism is designed to reduce inter-agent messaging by more than 80%. 
Moreover, the shared critic network in [26] enforces a uniform value function across all agents, which fails to accommodate diverse objectives in heterogeneous teams and often biases optimization toward a single goal. Our differentiated policy learning mechanism effectively avoids this issue. Recent work such as [29] focuses on wideband spectrum sensing but does not address localization policy learning. Meanwhile, the survey in [27] identifies hierarchical coordination and communication efficiency as open challenges in MARL-based UAV control, both of which are directly addressed by HMUDRL’s hierarchical architecture and role-specific reward mechanisms. Built upon the flexible PPO framework [8,30] and enhanced with a tailored two-level policy structure, HMUDRL uniquely balances localization accuracy, communication efficiency, and scalability in dynamic UHF monitoring scenarios.
The limitations summarized and systematized in Table 1 expose a critical gap in current MARL research. Although existing methods may theoretically accommodate heterogeneity or hierarchical structures through ad hoc modifications, none natively support role-aware and scalable coordination under the stringent communication, computational, and functional constraints inherent in UAV-based spectrum monitoring. In contrast, HMUDRL is explicitly designed around the operational realities of heterogeneous UAV swarms: CH and CM agents exhibit distinct sensing capabilities, action spaces, and mission objectives. To align with this role-based decomposition, our framework co-designs the architecture, communication protocol, and learning mechanisms, thereby eliminating the need for global state sharing or fully connected message passing. Information within each cluster is aggregated locally at the CH, and only compact fused estimates are broadcast, significantly reducing bandwidth consumption while preserving collaborative accuracy. Furthermore, by decoupling the policy networks for CH and CM, HMUDRL enables independent optimization of distinct reward signals, effectively overcoming the convergence bias commonly observed in shared-critic approaches. This native integration of heterogeneity, hierarchical structure, and communication efficiency not only resolves the scalability bottlenecks present in prior methods but also establishes a new paradigm for multi-agent reinforcement learning in physically constrained, multi-role autonomous systems.
  • We establish a large-scale, distributed cooperative sensing and localization framework for heterogeneous UAV swarms to enhance collaborative localization efficiency.
  • The HMUDRL algorithm integrates a branched network architecture to improve computational and communication efficiency, while jointly leveraging received signal power differences and AOA measurements to enhance localization accuracy.
  • HMUDRL employs a multi-faceted reward mechanism that effectively balances monitoring coverage, localization accuracy, and operational efficiency.

3. System Model

Here, we consider establishing a heterogeneous UAV swarm to perform spectrum monitoring tasks, aiming to overcome the inherent limitations of individual UAVs, including constraints on payload weight and type, limited onboard power supply duration, and the difficulty of simultaneously achieving long operational endurance, wide monitoring coverage, and high-quality communication performance. As shown in Figure 1, to enhance monitoring efficiency, the UAV swarm is organized into multiple subgroups, comprising an information processing station, several CH, and a large number of CM. The information processing station maintains stable communication links with all CH and is responsible for overall swarm coordination and integrated data processing. CH possess strong data processing and transmission capabilities; they manage their respective CM within each subgroup, aggregate monitoring data, and relay it to the central base station. In contrast, CM are equipped with advanced sensing capabilities; through cooperative operation among multiple CM, a large monitoring coverage area is achieved, and collected data are transmitted back to their associated CH. However, due to the collective motion of the UAV swarm, the relative positions of CM may change dynamically, posing significant challenges to cluster stability and formation. Therefore, during the monitoring process, it is desirable that CM can autonomously form clusters, while CH can adaptively select their associated CM and determine optimal positioning strategies in real time.

3.1. CH–CM Link Establishment and Maintenance Model

Within the monitoring task area, we consider $M$ CH and $N$ CM, all moving independently within the region. Each CM is associated with exactly one CH. To reduce the risk of collisions in regions with high CM density, the number of CM per cluster is limited to $N_U$. A CH is considered undersubscribed if the number of CM currently associated with it is less than $N_U$. CH and their associated CM exchange Hello messages periodically over a dedicated control channel at a fixed interval $T_B$ [31]. Each Hello message contains the sender's identifier, association status, carrier frequency, and signal power information. When a CM is already associated with a CH, it responds with a Hello message to maintain the link. Conversely, when a CM is unaffiliated with any cluster, it broadcasts a Hello message toward nearby CH to request association and actively join one of them. Only undersubscribed CH are permitted to accept new association requests.
Based on the received Hello messages, CH and CM can establish and update their association lists. If the $i$th CH and the $j$th CM appear in each other's association lists and successfully exchange Hello messages within a predefined time window of $\eta T_B$ ($\eta \ge 1$) periods, a stable association link is considered established between them. Conversely, if no Hello message is received from the counterpart within $\eta$ consecutive periods, the association link is deemed broken. A larger $\eta$ value enhances robustness against transient link interruptions caused by occasional Hello packet loss, thereby improving cluster stability. However, an excessively large $\eta$ may delay the network's response to topology changes induced by UAV mobility. The optimal choice of $\eta$ depends on specific factors such as the number of UAVs involved in the mission and the network configuration requirements. Furthermore, by leveraging the carrier frequency and received signal power information contained in the Hello messages, the relative position between a linked CH–CM pair can be estimated using Doppler shift and signal power variation. This positional estimate enables dynamic adjustment of the CH's location to maintain cluster structural stability.
As illustrated in Figure 2, assume that the shaded area centered at a CH represents its broadcast coverage region with radius $R$. When the distance $d_1$ between CM1 and the CH is less than $R$, CM1 can successfully receive the broadcast Hello message from the CH and establish a link. Conversely, if the distance $d_2$ between CM2 and the CH exceeds $R$, CM2 cannot receive the broadcast message and thus cannot form a link with the CH.
Assuming that CHs and CMs are independently and uniformly distributed within the task region $A$ of area $S$, with $N_h$ CH and $N_m$ CM, the spatial accessibility probability that a randomly located CM lies within the coverage disk of a given CH is:
$$P_{cover} = \frac{\pi R^2}{S},$$
If each CH can associate with at most $N_{\max}$ CM, then the average load per CH is $\bar{n}_c = N_m / N_h$; when $\bar{n}_c < N_{\max}$, most CH are in an undersubscribed state. Under random association, by the Poisson approximation:
$$P_{unsat} = P_{Poisson}\left(n < N_{\max}\right) = \sum_{k=0}^{N_{\max}-1} \frac{\bar{n}_c^{\,k}\, e^{-\bar{n}_c}}{k!},$$
Let $P_{unsat}$ denote the probability that a CH is undersubscribed with respect to its associated CM. Suppose the success probability of receiving a Hello message in each period is $P_{hello}$, which depends on channel quality, multipath effects, and path loss, among other factors. If a stable link is established only when Hello messages are successfully received for $T$ consecutive periods, then the probability of establishing a stable link is given by $P_{stable} = P_{hello}^{\,T}$, where $P_{hello}$ is a function of distance. According to the free-space path loss model:
$$P_{hello}(d) = \begin{cases} 1, & d \le R_0, \\ \left(R_0 / d\right)^{n}, & d > R_0, \end{cases}$$
where $R_0$ is the reliable communication radius (slightly less than $R$), and $n$ is the path loss exponent. For a CM to successfully join a CH, there must exist at least one CH within its communication range that is both undersubscribed and maintains a stable link. The CM will select either the nearest undersubscribed CH or the one with the strongest signal. Thus, the probability of successful link establishment is given by:
$$P_{link} = 1 - \prod_{i=1}^{N_h} \left(1 - P_{cover,i}\, P_{unsat,i}\, P_{stable,i}\right),$$
Within the task region, the probability that a single CM successfully establishes a stable association link with a CH is given by:
$$P_{link} = \frac{\pi R^2}{S} \sum_{k=0}^{N_{\max}-1} \frac{\bar{n}^{\,k}\, e^{-\bar{n}}}{k!}\, P_0^{\,T},$$
where $P_0$ is the probability of successfully receiving a Hello message within one period in the coverage area, and $T$ is the number of consecutive successful periods required to establish a stable link.
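As a numerical sanity check, the factors in the link-establishment analysis above can be composed directly. The sketch below (with illustrative parameter values that are not taken from the paper) evaluates the distance-dependent Hello success model and the single-CM stable-link probability:

```python
import math

def p_hello(d, r0, n):
    """Per-period Hello reception probability: certain within the reliable
    radius r0, decaying as (r0/d)^n beyond it."""
    return 1.0 if d <= r0 else (r0 / d) ** n

def p_link(R, S, n_m, n_h, n_max, p0, T):
    """Single-CM stable-association probability: coverage probability times
    Poisson undersubscription probability times T consecutive Hello successes."""
    p_cover = math.pi * R ** 2 / S            # coverage disk over task area
    n_bar = n_m / n_h                         # average CM load per CH
    p_unsat = sum(n_bar ** k * math.exp(-n_bar) / math.factorial(k)
                  for k in range(n_max))      # k = 0 .. n_max - 1
    p_stable = p0 ** T                        # T consecutive Hello successes
    return p_cover * p_unsat * p_stable
```

For instance, with $R = 100$ m over a 1 km × 1 km area, 20 CM, 4 CH, a per-cluster cap of 8, and a 95% per-period Hello success rate over 3 periods, the resulting link probability is dominated by the small coverage fraction (about 3.1%).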

3.2. UAV Kinematic Model

This section provides a detailed description of the UAV kinematic model, focusing on variations in both speed and heading to facilitate electromagnetic spectrum source localization. We first define the position update equation and then specify the sets of speed and heading used to update the UAV’s position. For analytical tractability, UAV motion is modeled in a two-dimensional plane, and the UAV’s position–velocity state is given by:
$$\mathbf{Q}_i(t) = \left[x_i(t),\, y_i(t),\, v_{x_i}(t),\, v_{y_i}(t)\right],$$
where $\mathbf{Q}_i(t)$ represents the position and velocity state of UAV $i$ at time $t$, in which $x_i(t)$ and $y_i(t)$ denote the coordinates of UAV $i$ at time $t$, and $v_{x_i}(t)$ and $v_{y_i}(t)$ represent the velocity components along the $x$- and $y$-axes, respectively.
The acceleration $\dot{\mathbf{v}}_i(t)$ of the UAV is given by:
$$\dot{\mathbf{v}}_i(t) = \left[\Delta v_{x_i}(t),\, \Delta v_{y_i}(t)\right]^{T},$$
where $\Delta v_{x_i}(t)$ and $\Delta v_{y_i}(t)$ denote the changes in the velocity components of UAV $i$ at time $t$. In practical systems, these changes are controlled by the flight control system and updated at a fixed time step $\Delta t$. Based on Equation (7), the UAV state update is given by:
$$\mathbf{Q}_i(t+1) = \left[x_i(t+1),\, y_i(t+1),\, v_{x_i}(t+1),\, v_{y_i}(t+1)\right] = \left[x_i(t) + v_{x_i}(t+1)\,\Delta t,\; y_i(t) + v_{y_i}(t+1)\,\Delta t,\; v_{x_i}(t) + \Delta v_{x_i}(t),\; v_{y_i}(t) + \Delta v_{y_i}(t)\right],$$
Each UAV is subject to three physical constraints, among which the velocity constraint is defined as follows:
$$v_{\min} \le \left\| \mathbf{v}(t) \right\| \le v_{\max},$$
where $\left\| \mathbf{v}(t) \right\| = \sqrt{v_x(t)^2 + v_y(t)^2}$ denotes the magnitude of the UAV's velocity, and $v_{\min}$ and $v_{\max}$ represent the minimum (typically zero) and maximum flight speeds of the UAV, respectively.
The acceleration constraint is defined as follows:
$$\left| \Delta v_x(t) \right| \le a_{\max}\,\Delta t, \qquad \left| \Delta v_y(t) \right| \le a_{\max}\,\Delta t,$$
where $a_{\max}$ denotes the maximum acceleration of the UAV.
Additionally, the position constraint can be defined as:
$$x_{\min} \le x(t) \le x_{\max}, \qquad y_{\min} \le y(t) \le y_{\max}.$$
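The state update and the three constraints above can be combined into a single discrete-time step. The following sketch is illustrative only (the variable names are ours, not the paper's): it clips the commanded velocity change to the acceleration limit, rescales the speed into $[v_{\min}, v_{\max}]$, and clamps the position to the task region:

```python
import math

def step(state, dvx, dvy, dt, v_min, v_max, a_max, bounds):
    """One UAV kinematic update: apply the acceleration constraint, then the
    speed-magnitude constraint, then the position constraint."""
    x, y, vx, vy = state
    # acceleration constraint: |dv| <= a_max * dt on each axis
    dvx = max(-a_max * dt, min(a_max * dt, dvx))
    dvy = max(-a_max * dt, min(a_max * dt, dvy))
    vx, vy = vx + dvx, vy + dvy
    # speed constraint: rescale the velocity magnitude into [v_min, v_max]
    speed = math.hypot(vx, vy)
    if speed > 0.0:
        clipped = max(v_min, min(v_max, speed))
        vx, vy = vx * clipped / speed, vy * clipped / speed
    # position update with the new velocity, clamped to the task region
    x_min, x_max, y_min, y_max = bounds
    x = max(x_min, min(x_max, x + vx * dt))
    y = max(y_min, min(y_max, y + vy * dt))
    return (x, y, vx, vy)
```

For example, a commanded velocity change of 10 m/s with $a_{\max} = 3$ m/s² and $\Delta t = 1$ s is clipped to 3 m/s before the position is advanced.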

3.3. Spectrum Sensing Model

In the monitoring task region, spectrum sensing is performed by CM to detect UHF radiation sources. The small-scale fading channel between the source and a CM follows the Nakagami-m distribution [32]. Therefore, the probability density function (PDF) of the small-scale fading amplitude $|h|$ can be expressed as:
$$f_{|h|}(r) = \frac{2 m^{m}}{\Gamma(m)\,\Omega^{m}}\, r^{2m-1} \exp\!\left(-\frac{m}{\Omega}\, r^{2}\right), \quad r \ge 0,$$
where $r$ denotes the small-scale fading amplitude $|h|$, $\Omega$ represents the average power of the fading signal, i.e., $\Omega = \mathbb{E}\left[|h|^2\right]$, with $\mathbb{E}[\cdot]$ denoting expectation, $\Gamma(m)$ is the Gamma function, satisfying $\Gamma(m) = (m-1)!$ for integer $m$, and $m$ is the shape parameter that quantifies the severity of small-scale fading.
For the UHF radiation source, when sensed by a CM, the path loss from the source $s$ to the CM can be modeled using a log-distance path loss model, given by:
$$L_{cm,s}(d_{cm,s}) = L_0 + 10\, n \log_{10}\!\left(\frac{d_{cm,s}}{d_0}\right) + X_\sigma,$$
where $d_{cm,s}$ is the Euclidean distance between the radiation source $s$ and the CM, $L_0$ is the path loss (in dB) at the reference distance $d_0 = 1$ m, calculated as $L_0 = 20 \log_{10}\left(4\pi d_0 / \lambda\right)$, $\lambda = c/f$ is the wavelength, with $c$ being the speed of light and $f$ the operating frequency, $n$ is the path loss exponent, and $X_\sigma$ is a zero-mean Gaussian random variable with standard deviation $\sigma$.
When multiple CM simultaneously monitor the radiation source $s$, the received signal power at the $i$th CM can be determined based on its precise location and the measured signal strength differences, enabling estimation of the source's coordinates $(x_s, y_s)$ and the path loss exponent $n$. Specifically, the received signal power at the $i$th CM is given by:
$$P_{r,i} = P_{t,s} + G_t + G_{r,i} - L_{i,s}(d_{i,s}),$$
where $P_{t,s}$ is the transmit power of the radiation source, $G_t$ is the transmitter antenna gain, $G_{r,i}$ is the receiver antenna gain of the $i$th CM, and $L_{i,s}(d_{i,s})$ denotes the path loss from the source to the $i$th CM. Assuming the first CM is selected as a reference, the difference in signal strength between the $i$th CM and the reference CM is defined as:
$$\Delta P_i = P_{r,i} - P_{r,1},$$
Substituting Equation (14), we obtain the sensing model expression:
$$\Delta P_i = \left(G_{r,i} - G_{r,1}\right) - 10\, n \log_{10}\!\left(\frac{d_i}{d_1}\right) + X_{\sigma,i} - X_{\sigma,1},$$
If all CMs have identical receiving antenna gains, i.e., $G_{r,i} = G_{r,1}$, then:
$$\Delta P_i = -10\, n \log_{10}\!\left(\frac{d_i}{d_1}\right) + X_{\sigma,i} - X_{\sigma,1},$$
For localization of the radiation source s, at least three or more CM are required. The distance from each CM to the source can be estimated using the measured received power P r , i t [33], given by:
d i t = 10 P r , i t P 1 10 n ,
where d_i(t) represents the distance from the i-th CM to the radiation source at time t, estimated from the logarithmic relationship between P_{r,i}(t) and distance. However, due to the limited sensing range of each CM, measurements are only valid under the condition:
d_i(t) \le d_{\max},
where d_{\max} denotes the maximum monitoring coverage radius of a CM. Under the constraints of Equations (18) and (19), the set of effective measurement data from CM at time t is defined as:
\mathcal{CM}(t) = \left\{ i \mid P_{r,i}(t) > P_{\min} \right\},
where P_{\min} is the minimum detectable signal power threshold. For triangulation-based localization, at least three spatially distributed CM are required. Thus, accurate localization of the radiation source s is feasible only when |\mathcal{CM}(t)| \ge 3, i.e., when three or more CM simultaneously acquire valid signal measurements P_{r,i}(t).
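The measurement-validity rule and the RSS distance inversion above can be sketched as follows. The helper names (`valid_cm_set`, `estimate_distance`) are illustrative, and the sign convention assumes P_1 is the received power at the reference distance.

```python
def valid_cm_set(p_r, p_min):
    """Indices of CMs whose received power exceeds the detection threshold."""
    return [i for i, p in enumerate(p_r) if p > p_min]

def can_localize(p_r, p_min):
    """Triangulation requires at least three CMs with valid measurements."""
    return len(valid_cm_set(p_r, p_min)) >= 3

def estimate_distance(p_r_i, p_ref, n=2.0):
    """Invert the log-distance model: d_i = 10**((P_ref - P_r_i) / (10*n))."""
    return 10 ** ((p_ref - p_r_i) / (10 * n))
```

For example, with powers in dBm and a threshold of -90 dBm, a cluster reporting [-70, -95, -60, -80] has three valid CMs and can attempt a fix.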
The CM estimates the AOA \phi_{i,s} of the signal from the radiation source s via its onboard sensing system, and combines this with its own position information to construct a nonlinear observation equation for estimating the source location x_s = (x_s, y_s). In this work, the Cramér–Rao Lower Bound (CRLB) [34] is introduced as a theoretical performance benchmark. It describes the fundamental limit of estimation accuracy: for any unbiased estimator of the source location x_s, the trace of the covariance matrix of the estimation error cannot be lower than the CRLB. A smaller CRLB value indicates a more favorable distribution of CM for high-precision localization.

3.4. Communication Model

The path loss between a CH and a CM during communication transmission can be expressed as:
L_{h,m}(d_{h,m}) = L_0 + 10 n \log_{10}\!\left( d_{h,m} / d_0 \right) + X_\sigma,
where d_{h,m} denotes the distance between the CH and the CM. The SNR received by the CM from the CH is given by:
\mathrm{SNR}_{h,m} = \frac{P_{t,h} \, |g_{h,m}|^2}{N_0 B},
where P t , h is the transmit power of the CH, g h , m is the channel gain, N 0 is the noise power spectral density, and B is the communication bandwidth. According to Shannon’s theorem, the maximum achievable data rate between the CH and CM is:
R_{h,m} = B \log_2\!\left( 1 + \mathrm{SNR}_{h,m} \right),
and the total throughput of the CH is defined as:
R_h^{total} = \sum_{m \in C_h} R_{h,m}.
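A minimal sketch of the SNR, per-link Shannon rate, and CH throughput computations. Function names and the SI-unit inputs (watts, Hz, W/Hz) are assumptions made for illustration.

```python
import math

def link_rate(p_t_w, g, n0, b):
    """Shannon rate R = B*log2(1 + SNR), with SNR = P_t*|g|^2 / (N0*B).

    p_t_w: transmit power (W), g: complex channel gain,
    n0: noise power spectral density (W/Hz), b: bandwidth (Hz).
    """
    snr = p_t_w * abs(g) ** 2 / (n0 * b)
    return b * math.log2(1 + snr)

def ch_throughput(links):
    """Total CH throughput: sum of the member link rates.

    links: iterable of (p_t_w, g, n0, b) tuples, one per associated CM.
    """
    return sum(link_rate(*lk) for lk in links)
```

With SNR = 3 the spectral efficiency is log2(4) = 2 bit/s/Hz, so a 10 MHz link carries 20 Mbit/s.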
Assume that the entire heterogeneous UAV swarm consists of H CH and M CM, which are divided into K clusters. Each cluster contains one CH and multiple CM. Let G k denote the set of all UAVs in the k th cluster, where k 1 , 2 , , K :
G_k = \left\{ CH_k, CM_{k,1}, CM_{k,2}, \ldots, CM_{k,M_k} \right\},
where M_k represents the number of CM in the k-th cluster. The communication delay within cluster k, denoted T_{h,m,k}, can be expressed as:
T_{h,m,k} = T_{prop,k} + T_{proc,k} + T_{queue,k} + \left( 1 + \Psi_{h,m,k} \right) T_{trans,k},
where T_{prop,k} is the propagation delay, determined by the distance between the CH and its associated CM; T_{proc,k} is the processing delay, including encoding and decoding time; T_{queue,k} is the queuing delay, dependent on network load and queue length; \Psi_{h,m,k} is the average retransmission count; and T_{trans,k} is the single-transmission delay, determined by packet size and bandwidth. Letting P_{success}(\gamma) denote the probability of successful packet delivery under a given SNR condition \gamma, which can be derived from metrics such as the bit error rate (BER), we have:
\Psi_{h,m,k} = \frac{1}{P_{success}(\gamma)} - 1,
The link state between a CH and a CM is denoted by \sigma_{h,m}, defined as:
\sigma_{h,m} = \begin{cases} 1, & \text{CM linked with CH}, \\ 0, & \text{otherwise}. \end{cases}
The overall communication delay T t o t a l is defined as the weighted average of intra-cluster communication delays across all clusters:
T_{total} = \frac{\sum_{k=1}^{K} \sum_{m \in G_k} \sigma_{h,m} \, w_{h,m,k} \, T_{h,m,k}}{\sum_{k=1}^{K} \sum_{m \in G_k} \sigma_{h,m} \, w_{h,m,k}},
where w h , m , k is the communication weight between the CH and CM in the k th cluster, reflecting the relative importance or traffic load of the link.
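The retransmission count, per-link delay, and weighted-average total delay above can be sketched as follows; the helper names are illustrative.

```python
def expected_retx(p_success):
    """Average retransmission count: Psi = 1/p_success - 1."""
    return 1.0 / p_success - 1.0

def link_delay(t_prop, t_proc, t_queue, t_trans, p_success):
    """Per-link delay: T = T_prop + T_proc + T_queue + (1 + Psi) * T_trans."""
    return t_prop + t_proc + t_queue + (1.0 + expected_retx(p_success)) * t_trans

def total_delay(links):
    """Weighted average of the delays of active links.

    links: iterable of (sigma, w, T) tuples, sigma in {0, 1};
    inactive links (sigma = 0) drop out of both sums.
    """
    num = sum(s * w * t for s, w, t in links)
    den = sum(s * w for s, w, _ in links)
    return num / den if den else 0.0
```

A 50% delivery probability means one expected retransmission, doubling the effective transmission time of a packet.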

4. Problem Formulation and Analysis

In this work, the objective is to achieve distributed localization of UHF radiation sources by leveraging a heterogeneous UAV swarm, where coordinated collaboration among diverse UAV types enables efficient and high-precision localization. Specifically, the CH maintain stable communication links with the central control base station and manage their associated CM within each cluster. The CM are equipped with high-sensitivity sensing devices and are responsible for detecting and localizing UHF radiation sources within their assigned task regions.
To address the constraints on individual UAVs in terms of endurance, payload, and performance, we adopt a cooperative strategy based on the Dynamic Range-Sensitive Source Localization (DRSS) method [35], which jointly estimates the source location x_s = (x_s, y_s) and the path loss exponent n by solving a nonlinear system of equations that incorporates both signal strength and AOA measurements. Using a weighted least squares approach, the estimation is performed whenever at least three CM simultaneously detect the source. By combining Equations (6) and (9)–(11), along with the sensing and communication constraints, the goal is to maximize the total duration \sum_{t=1}^{T} S(t) during which the radiation source s is continuously localized. This leads to the following optimization problem for adjusting UAV motion and information transmission:
\max_{\{Q_i(t)\}_{t=1}^{T}, \, \forall i} \; \sum_{t=1}^{T} S(t),
S(t) = \begin{cases} 1, & \text{if } |\mathcal{CM}(t)| \ge 3, \\ 0, & \text{otherwise}. \end{cases}
The objective of this optimization is to find, under kinematic and communication constraints, the optimal trajectory Q_i(t) for each UAV over the time horizon T, such that the cumulative duration \sum_t S(t), i.e., the number of time steps at which at least three CM are actively detecting the source, is maximized. This ensures continuous three-point localization of the radiation source.
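Under these definitions, the objective value achieved by a recorded trajectory can be sketched as follows; the names `localized` and `cumulative_localized_time` are illustrative.

```python
def localized(step_powers, p_min):
    """S(t) = 1 when at least three CMs hold valid measurements at step t."""
    return 1 if sum(1 for p in step_powers if p > p_min) >= 3 else 0

def cumulative_localized_time(trajectory, p_min):
    """Objective value: number of steps with three-point coverage.

    trajectory: list of per-step lists of received powers (one per CM).
    """
    return sum(localized(step, p_min) for step in trajectory)
```

In the example below the middle step has only two valid measurements, so it contributes nothing to the objective.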
Simultaneously, under dynamic constraints, we aim to determine suitable clustering strategies and CH positions to maximize the overall network throughput while minimizing the end-to-end communication delay, thereby reducing the source localization latency. Based on Equations (22) and (29), the optimization problem can be further formulated as:
\max \; \sum_{h} \sum_{m \in G_k} \sigma_{h,m} \, B \log_2\!\left( 1 + \mathrm{SNR}_{h,m} \right),
\min T_{total} = \min \frac{\sum_{k=1}^{K} \sum_{m \in G_k} w_{h,m,k} \, T_{h,m,k}}{\sum_{k=1}^{K} \sum_{m \in G_k} w_{h,m,k}}.
Equations (30), (32) and (33) represent multi-objective non-linear optimization problems. While traditional methods such as nonlinear programming [35], genetic algorithms [36], particle swarm optimization [37], and interactive evolutionary algorithms [38] have been widely used for static or slowly varying scenarios and offer good performance in fixed environments, they suffer from poor adaptability and learning capability when faced with dynamic and complex environments. These methods require reinitialization and full search upon environmental changes, making them inefficient for real-time decision-making. To overcome these limitations, we propose HMUDRL, which enables autonomous decision-making by both CH and CM. By integrating dynamic environment perception, adaptive clustering, and joint optimization of mobility and communication, HMUDRL achieves robust and efficient distributed localization of UHF radiation source in highly dynamic scenarios.

5. Distributed Heterogeneous Multi-Agent Deep Reinforcement Learning Algorithm for UAV Clustering

In this section, we first introduce the HMUDRL framework and its key components, followed by a discussion of its training procedure. While existing hierarchical and cooperative multi-agent reinforcement learning approaches, such as MAPPO [8] and QMIX [39], adopt the centralized-training-with-decentralized-execution paradigm and have achieved notable progress in various domains, they are not specifically designed for physically constrained tasks like electromagnetic source localization, where sensing and communication are tightly coupled. Our proposed HMUDRL framework is not a straightforward adaptation of existing MARL paradigms; rather, it represents a domain-informed architectural innovation tailored to UHF source-seeking missions. By co-designing agent roles, communication topology, and reward structure, HMUDRL effectively addresses the core challenges inherent in decentralized, heterogeneous, and perception-driven localization scenarios.

5.1. HMUDRL Framework

As illustrated in Figure 3, the proposed HMUDRL is composed of two types of agents: one serving CH and the other serving CM. These agents share the same environment but possess distinct observation spaces, action spaces, and reward functions tailored to their respective roles. The intelligent decision-making process is implemented via an actor-critic architecture: the actor network selects actions based on the current state, thereby executing the policy, while the critic network evaluates the anticipated future rewards, assessing the quality of the selected actions. HMUDRL adopts a distributed framework where each heterogeneous UAV agent operates independently. CH agents employ the Hierarchical Multi-Agent PPO (HMPPO) algorithm, leveraging their computational and communication capabilities to exchange global information with the central control center and share a common critic network. In contrast, CM agents utilize the Decentralized Distributed Multi-Agent PPO (DDMPPO) algorithm, focusing on cooperative spectrum sensing and localization tasks for UHF radiation source. This modular design enables effective coordination between heterogeneous agents while allowing them to interact dynamically with the environment. During actual operation, each CM only needs to communicate with its associated CH, eliminating the need for inter-agent communication among all UAVs. This reduces computational and communication overhead significantly. Moreover, the distributed architecture allows the two types of agents to be trained separately, enhancing the scalability and robustness of the HMUDRL framework.

5.2. HMPPO Design

In the UHF radiation source localization scenario, the objective is to collaboratively discover an optimal policy through distributed coordination among heterogeneous UAVs, enabling the CM to effectively locate radiation sources while minimizing communication overhead and latency. To achieve this goal, we propose HMPPO based on a decentralized control strategy for CH. The design of the CH control strategy focuses on maintaining stable cluster formation, balancing the workload among CH, ensuring high-quality communication links, low-latency transmission, and sufficient coverage.

5.2.1. Observation Set

As shown in Figure 2, each CH periodically broadcasts Hello messages for topology maintenance and establishes or updates associations with nearby CM. The observation set of the CH includes its own state as well as global contextual information. Specifically, the observation vector consists of eight key variables: the CH position Q_h, the displacement S_h relative to the centroid of all CM, the distance L_h from the estimated radiation source location s, the spatial dispersion S_m(t) of the associated CM, and the average inter-CH distance D_h. The full observation set is thus expressed as:
O_h = \left\{ Q_h, S_h, L_h, S_m(t), D_h \right\},

5.2.2. Action Set

The CH adjusts its movement action \Delta Q_h based on environmental observations, choosing among five options: stay still, or move north, south, east, or west, denoted as a_t \in \{0, 1, 2, 3, 4\}. Only one action is executed at each time step; thus, a_t is a discrete categorical random variable with probability distribution determined by parameters \pi = (\pi_0, \pi_1, \pi_2, \pi_3, \pi_4), satisfying:
P(a_t = i) = \pi_i, \quad \sum_{i=0}^{4} \pi_i = 1, \quad \pi_i \ge 0,
To map the state information in Equation (34) to specific actions, we design an Actor-Network that takes the observation set and the advantage value from the Critic-Network as input. The network outputs unnormalized scores (logits) for each action, z = f(O_h; \theta), where f is a neural network function with output dimension 5 and \theta denotes the learnable parameters. The action probabilities are then computed via the softmax function:
\pi_i = \frac{\exp(z_i)}{\sum_{j=0}^{4} \exp(z_j)},
where z i is the logit corresponding to action i . Finally, the action a t is selected stochastically according to the probability π i .
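A stdlib-only sketch of this softmax action selection; the max-subtraction for numerical stability is a standard implementation detail, not something stated in the paper.

```python
import math
import random

def action_probs(logits):
    """Softmax over the five CH actions (stay, north, south, east, west)."""
    m = max(logits)                          # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sample_action(logits, rng=random):
    """Draw a_t ~ Categorical(pi) by inverse-CDF sampling."""
    probs = action_probs(logits)
    u, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if u <= acc:
            return i
    return len(probs) - 1                    # guard against rounding
```

Equal logits yield a uniform policy (each action with probability 0.2); a dominant logit makes that action near-deterministic.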

5.2.3. Reward

Based on the observable state variables of the CH, including AOA estimates reported by CM, RSS measurements, and relative geometric relationships between CH and CM, we design a reward function to guide the CH in learning optimal clustering and mobility policies. At each time step t , the scalar reward r t h for a CH is defined as a composite of three components: localization confidence reward r l o c h , coverage quality reward r c o v h , and exploration diversity reward r d i r h .
The localization confidence reward r l o c h is designed to reflect the accuracy and reliability of source localization based on AOA measurements from multiple CM. It is determined jointly by angular diversity and the number of valid AOA observations, and is expressed as:
\kappa_t = \begin{cases} 0, & N_t < 2, \\ \dfrac{N_t}{N_m} \left( 1 - \dfrac{2}{N_t (N_t - 1)} \displaystyle\sum_{1 \le i < j \le N_t} \left| \cos\!\left( \phi_{m_i} - \phi_{m_j} \right) \right| \right), & N_t \ge 2, \end{cases}
where N t denotes the number of CM that successfully report valid AOA values at time t , ϕ m i is the AOA measured by the i th CM, and N m is the total number of CM in the cluster.
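A sketch of this angular-diversity term, assuming it equals (N_t/N_m)(1 − mean pairwise |cos(φ_i − φ_j)|) for N_t ≥ 2; note that 2/(N_t(N_t − 1)) times the pairwise sum is exactly the mean over pairs. The function name `kappa` is illustrative.

```python
import math
from itertools import combinations

def kappa(aoas, n_m):
    """Localization-confidence term: 0 with fewer than two valid AOAs,
    else (N_t / N_m) * (1 - mean pairwise |cos(phi_i - phi_j)|).

    aoas: valid AOA measurements (radians); n_m: cluster CM count.
    """
    n_t = len(aoas)
    if n_t < 2:
        return 0.0
    pairs = list(combinations(aoas, 2))
    mean_cos = sum(abs(math.cos(a - b)) for a, b in pairs) / len(pairs)
    return (n_t / n_m) * (1.0 - mean_cos)
```

Two CMs observing the source at right angles give the best two-CM geometry, while identical bearings give zero confidence.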
The coverage quality reward r c o v h encourages the CH to maintain its associated CM within effective communication and sensing ranges. It is defined as:
r_{cov}^h = \frac{1}{N_m} \sum_{i=1}^{N_m} \exp\!\left( -\frac{\left| \, \| Q_i^m - c_h \| - R_0 \, \right|}{\sigma_{cov}} \right),
where Q_i^m is the position of the i-th CM, c_h is the centroid of the CH, R_0 is the maximum communication radius, and \sigma_{cov} is a scaling parameter.
The exploration diversity reward r d i r h promotes decentralized movement of CH to avoid clustering and enhance spatial coverage. It is formulated as:
r_{dir}^h = \tanh\!\left( \frac{1}{N_h} \sum_{j=1}^{N_h} \frac{\| Q_{j,t}^h - Q_{j,t-1}^h \|}{v_{scale}} \right),
where N_h is the total number of CH, Q_{j,t}^h is the position of the j-th CH at time t, and v_{scale} is a normalized velocity scale factor. The total reward function for the CH is then given by:
r t h = ω l o c h r l o c h + ω c o v h r c o v h + ω d i r h r d i r h .
where ω l o c h , ω c o v h and ω d i r h are weighting coefficients that balance the contributions of localization confidence, coverage quality, and exploration diversity, respectively.

5.3. DDMPPO Design

Considering the limited payload and operational duration of individual CM platforms, we aim to highlight their monitoring capabilities by optimizing information processing and networking transmission on a single platform, enabling lightweight and compact design for CM. By deploying a large number of such UAVs in a distributed manner, we ensure stable communication links with their associated CH while achieving real-time, persistent, and high-accuracy localization of UHF radiation source across the target region.

5.3.1. Observation Set

The observation set for each CM primarily includes its own position Q_m, the distance d_{h,m} to its associated CH, the RSS P_r, the RSS variation \Delta P_r, and the AOA \phi_m of the incoming signal from the radiation source. Thus, the observation vector is defined as:
O_m = \left\{ Q_m, d_{h,m}, P_r, \Delta P_r, \phi_m \right\},

5.3.2. Action Set

The action of a CM is mainly defined as a movement action \Delta Q_m. The actual motion decision is determined based on the measured AOA \phi_m and the received signal power P_r. When at least three CM simultaneously detect the radiation source, i.e., |\mathcal{CM}(t)| \ge 3, the CM can collaboratively perform triangulation to achieve accurate localization of the source.

5.3.3. Reward

A scalar reward signal is generated based on the state and action of the CM to provide feedback for learning. The total reward is composed of multiple components: an RSS-increase reward r_{\Delta P_r}^m, encouraging movement toward regions with higher signal strength; a distance-maintenance reward r_d^m, rewarding actions that maintain a suitable distance from the CH; and a valid-AOA reward r_\phi^m, incentivizing accurate AOA measurements. The AOA-based reward is formulated as:
r_\phi^m = \begin{cases} 1 - \dfrac{\Delta \phi_{\min}}{\pi}, & \text{if } N \ge 3, \\ 0, & \text{otherwise}, \end{cases}
where Δ ϕ min = min i j | ϕ i ϕ j | denotes the minimum absolute angular difference between AOA measurements from different CM. A smaller Δ ϕ min indicates higher angular consistency among CM, leading to lower rewards, while Δ ϕ min approaching π / 2 yields the maximum reward, promoting diverse angular coverage. The total reward function for CM is then expressed as:
r t m = ω Δ P r m r Δ P r m + ω d m r d m + ω ϕ m r ϕ m .
where ω Δ P r m , ω d m and ω ϕ m are weighting coefficients that balance the contributions of signal strength improvement, distance maintenance, and angular diversity, respectively.

5.4. Training Process

In the proposed HMUDRL framework, both HMPPO (for CH) and DDMPPO (for CM) are implemented based on Algorithm 1, the PPO algorithm. The PPO method introduces a clipped objective function via a trust region approach [30] to constrain the magnitude of policy updates at each step, thereby enhancing training stability. The clipped surrogate objective is defined as:
L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min\!\left( r_t(\theta) A_t, \; \mathrm{clip}\!\left( r_t(\theta), 1 - \epsilon, 1 + \epsilon \right) A_t \right) \right],
where r_t(\theta) = \frac{\pi_{\theta_{new}}(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)} denotes the probability ratio between the new and old policies, \theta represents the policy parameters (i.e., neural network weights), A_t is the advantage function estimating the relative benefit of the current action, and \epsilon controls the step size of the policy update by clipping r_t(\theta) within the interval [1 - \epsilon, 1 + \epsilon]. The overall objective of the PPO algorithm [30] is formulated as:
L^{CLIP+VF+S}(\theta) = \mathbb{E}_t \left[ L^{CLIP}(\theta) - c_1 L^{VF}(\varphi) + c_2 S[\pi_\theta](s_t) \right],
where L^{VF}(\varphi) is the value loss function, defined as L^{VF}(\varphi) = \mathbb{E}_t \left[ \left( V_\varphi(s_t) - V_t^{target} \right)^2 \right], representing the squared error between the estimated value and the target value V_t^{target} = A_t + V_{\varphi,old}(s_t); S[\pi_\theta](s_t) is the entropy term, given by S[\pi_\theta](s_t) = -\mathbb{E}_{a_t \sim \pi_\theta(\cdot | s_t)} \left[ \log \pi_\theta(a_t | s_t) \right], which encourages exploration and prevents premature convergence to local optima; and c_1 and c_2 are hyperparameters balancing the contributions of the value loss and entropy regularization.
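The per-sample clipped surrogate can be sketched with plain floats; this is a scalar sketch of the inner term of the PPO objective, not the batched PyTorch implementation used in the paper.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min(r*A, clip(r, 1-eps, 1+eps)*A).

    ratio: pi_new(a|s) / pi_old(a|s); advantage: estimated A_t.
    """
    clipped = max(1.0 - eps, min(1.0 + eps, ratio)) * advantage
    return min(ratio * advantage, clipped)
```

With a positive advantage the ratio is capped at 1 + eps, and with a negative advantage it is floored at 1 − eps, so no single sample can push the policy too far.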
For the HMPPO algorithm, the primary goal is to optimize cluster formation and mobility strategies for CH while maintaining high-quality communication links. To achieve this, the advantage function is estimated using the Generalized Advantage Estimation (GAE) method [31], which computes the advantage as:
A_t^h = \sum_{l=0}^{T} (\gamma \lambda)^l \, \delta_{t+l}^h,
where γ is the discount factor, λ is the GAE parameter that balances bias and variance in the advantage estimate, and δ t h is the temporal difference (TD) error defined as:
\delta_t^h = r_t^h + \gamma V(s_{t+1}^h) - V(s_t^h),
where r_t^h is the reward received by the CH at time step t, computed as r_t^h = \sum_d \log \pi(a_t | s_t) \, \tilde{p}_d(s_t), with \tilde{p}_d(s_t) = e^{\mu o_d} / \sum_{d'} e^{\mu o_{d'}}, where o_d and o_{d'} are the observation scores and \mu is a scaling factor controlling the distribution of observation probabilities; V(s_t^h) and V(s_{t+1}^h) denote the critic network's value estimates for the current and next states, respectively.
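The GAE recursion above, which expresses the advantage as a discounted sum of TD errors, can be sketched as follows; the backward-accumulation form is an equivalent rewriting of the summation, and the function name is illustrative.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation.

    A_t = sum_l (gamma*lam)^l * delta_{t+l}, where
    delta_t = r_t + gamma*V(s_{t+1}) - V(s_t).
    values must have len(rewards) + 1 entries (bootstrap value appended).
    """
    deltas = [r + gamma * values[t + 1] - values[t]
              for t, r in enumerate(rewards)]
    advs, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advs[t] = running
    return advs
```

With gamma = lam = 1 the advantage at each step is simply the sum of the remaining TD errors, which makes the recursion easy to verify by hand.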
For the DDMPPO algorithm, its primary objective is to optimize the sensing performance, SNR, and energy consumption of CM. The advantage function for CM is expressed as:
A_t^m = \sum_{l=0}^{T} (\gamma \lambda)^l \, \delta_{t+l}^m,
where δ t m denotes the TD error computed for CM, defined as:
\delta_t^m = r_t^m + \gamma V(s_{t+1}^m) - V(s_t^m),
where r_t^m is the reward obtained by the CM at time step t, V(s_t^m) is the critic network's value estimate of the current state, and V(s_{t+1}^m) is the value estimate of the next state.
In this work, we employ a HMUDRL framework that integrates CH and CM into a unified distributed learning architecture. To reduce decision complexity and effectively implement distributed cooperative control, we design a dual-branch policy network structure: one branch outputs control actions (e.g., cluster formation and mobility decisions) for CH, while the other branch generates monitoring-related actions (e.g., sensing and energy management) for CM. This modular architecture enables independent optimization of control and sensing tasks, with distinct action spaces and objectives for each agent type. Based on Equation (44), we reformulate the total policy loss function for HMUDRL as the sum of two separate clipped surrogate objectives:
L_h^{CLIP}(\theta) = \mathbb{E}_t \left[ \min\!\left( r_t^h(\theta) A_t^h, \; \mathrm{clip}\!\left( r_t^h(\theta), 1 - \epsilon, 1 + \epsilon \right) A_t^h \right) \right],
L_m^{CLIP}(\theta) = \mathbb{E}_t \left[ \min\!\left( r_t^m(\theta) A_t^m, \; \mathrm{clip}\!\left( r_t^m(\theta), 1 - \epsilon, 1 + \epsilon \right) A_t^m \right) \right],
where L h C L I P ( θ ) and L m C L I P ( θ ) represent the clipped policy losses for CH and CM, respectively. The clipping parameter ϵ limits the magnitude of policy updates to prevent large, destabilizing changes during training. Each branch operates independently, ensuring stable and coordinated updates across the two agent types.
L t o t a l C L I P ( θ ) = L h C L I P ( θ ) + L m C L I P ( θ ) ,
To ensure effective learning in the heterogeneous system, we apply entropy regularization to both agent types, promoting exploration and preventing premature convergence. The entropy terms are defined as:
S[\pi_{\theta,h}](s_t) = -\mathbb{E}_{a_t \sim \pi_{\theta,h}(\cdot | s_t)} \left[ \log \pi_{\theta,h}(a_t | s_t) \right],
S[\pi_{\theta,m}](s_t) = -\mathbb{E}_{a_t \sim \pi_{\theta,m}(\cdot | s_t)} \left[ \log \pi_{\theta,m}(a_t | s_t) \right],
The total policy entropy is then combined as:
S [ π θ ] s t = S [ π θ , h ] s t + S [ π θ , m ] s t ,
By jointly optimizing both branches using Equation (55), we unify the exploration pressure across the system, avoiding imbalance where one branch may converge prematurely while the other remains under-explored. Furthermore, due to the distinct state spaces, action spaces, reward functions, and objectives of CH and CM, we design separate critic networks for each agent type and compute their respective value losses:
L_h^{VF}(\varphi_h) = \mathbb{E}_t \left[ \left( V_{\varphi_h}(s_t^h) - V_{t,h}^{target} \right)^2 \right],
L_m^{VF}(\varphi_m) = \mathbb{E}_t \left[ \left( V_{\varphi_m}(s_t^m) - V_{t,m}^{target} \right)^2 \right],
where V t , h target and V t , m target are the target values for the CH and CM critic networks, respectively. The total value loss is then given by:
L t o t a l V F ( φ ) = α h L h V F ( φ h ) + α m L m V F ( φ m ) ,
where α h and α m are weighting coefficients balancing the contributions of the two critic networks. Based on the overall objective function of the PPO algorithm in Equation (45), the complete objective function for the HMUDRL algorithm incorporates the policy loss, value loss, and entropy regularization terms, as shown in Equation (59):
L^{HMUDRL}(\theta, \varphi) = L_{total}^{CLIP}(\theta) - L_{total}^{VF}(\varphi) + \beta S[\pi_\theta](s_t),
where β is the weight coefficient for the entropy term. By appropriately tuning the weight coefficients, the trade-off among policy improvement, value estimation accuracy, and exploration can be effectively balanced, enabling efficient optimization of the heterogeneous UAV monitoring system.
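Assembling the total objective from its components can be sketched as follows; the scalar inputs stand in for the batched loss terms, and all names and default coefficients are illustrative.

```python
def hmudrl_loss(l_clip_h, l_clip_m, l_vf_h, l_vf_m,
                ent_h, ent_m, alpha_h=0.5, alpha_m=0.5, beta=0.01):
    """Total objective: summed clipped policy terms for both branches,
    minus the alpha-weighted critic losses, plus the entropy bonus.
    """
    l_clip = l_clip_h + l_clip_m          # Eq.-style branch sum
    l_vf = alpha_h * l_vf_h + alpha_m * l_vf_m
    entropy = ent_h + ent_m
    return l_clip - l_vf + beta * entropy
```

Tuning alpha_h, alpha_m, and beta trades off policy improvement, value accuracy, and exploration, as described above.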
Algorithm 1 Training Process of HMUDRL Algorithm
1: Initialize policy networks \pi_{\theta_h} for CH agents and \pi_{\theta_m} for CM agents, value networks V_{\varphi_h} for CH agents and V_{\varphi_m} for CM agents, \theta_{old}^h \leftarrow \theta^h, \theta_{old}^m \leftarrow \theta^m, \varphi_{old}^h \leftarrow \varphi^h, \varphi_{old}^m \leftarrow \varphi^m, and buffers D^h for CH agents and D^m for CM agents
2:for iteration i =   1 ,   2 ,   ,   I  do
3:      for episode t =   1 ,   2 ,   ,   T  do
4:            for each CH agent h  do
5:                  Observe state s t h
6:                  Select action a t h   ~   π θ , h o l d O t h
7:            end for
8:            for each CM agent m do
9:                  Observe state s t m
10:                  Select action a t m   ~   π θ , m o l d O t m
11:            end for
12:            Execute actions a t h ,   a t m in environment
13:            Observe rewards R t h ,   R t m and next states O t + 1 h ,   O t + 1 m
14:            Store transition ( O t h ,   a t h ,   R t h ,   O t + 1 h ) and ( O t m ,   a t m ,   R t m ,   O t + 1 m ) in D h and D m
15:      end for
16:      Compute advantages A t h and A t m using Equations (46) and (48)
17:      Compute target values V t , h target and   V t , m target
18:      for k =   1 ,   2 ,   ,   K  do
19:            for each CH agent h  do
20:                  Compute policy losses L h C L I P ( θ ) using Equation (50)
21:                  Compute entropy bonus S [ π θ , h ] s t using Equation (53)
22:                  Compute value loss L h V F ( φ h ) using Equation (56)
23:                  Update θ h and φ h
24:            end for
25:            for each CM agent  m  do
26:                  Compute policy losses L m C L I P ( θ ) using Equation (51)
27:                  Compute entropy bonus S [ π θ , m ] s t using Equation (54)
28:                  Compute value loss L m V F ( φ m ) using Equation (57)
29:                  Update θ m and φ m
30:            end for
31:            Compute total loss L H M U D R L ( θ , φ ) using Equation (59)
32:      end for
33: Update \theta_{old}^h \leftarrow \theta^h, \varphi_{old}^h \leftarrow \varphi^h for all CH agents
34: Update \theta_{old}^m \leftarrow \theta^m, \varphi_{old}^m \leftarrow \varphi^m for all CM agents
35:end for

6. Simulation Results

This section presents the simulation experiments of the proposed HMUDRL algorithm. First, we simulate the monitoring and sensing environment described in the paper. Subsequently, numerical simulations are conducted to evaluate the performance of the proposed method in terms of localization accuracy, reward value, number of steps to first detection, average interaction frequency, and other metrics. The results are compared and analyzed against baseline algorithms.

6.1. Experimental Setup and Parameters

The simulation is configured in a 1000 × 1000 m2 mission area, where CH and CM are deployed. The deployment altitude of CH ranges from 200 to 400 m, while that of CM ranges from 100 to 300 m. The speed of CH varies between 0 and 22 m/s, and that of CM ranges from 0 to 14 m/s. The transmission power of CH is set to 20 dBm, with a carrier frequency of 3.0 GHz and a bandwidth of 10 MHz. The Hello message broadcasting interval is set to 1 s [40]. The maximum control range of each CH is defined as 500 m. Each CM is equipped with a 2.4 GHz UHF radiation source detector, which consists of eight measurement units arranged at 45° intervals for omnidirectional sensing to ensure reliable signal reception. Considering the payload capacity and energy constraints of UAVs, and to verify the practical deployability of our approach on resource constrained UAV platforms, which typically lack dedicated GPU hardware, all reinforcement learning training and evaluation experiments were conducted using a lightweight training scheme. The proposed HMUDRL framework was implemented in Python 3.11.9 and PyTorch 2.4.1, without any GPU acceleration, and trained entirely on the CPU of a standard desktop computer equipped with an Intel Core i5-1130G7 processor and 16 GB RAM, thereby clearly demonstrating its computational efficiency. Each training episode lasts 10 s, discretized into 100 time steps (each step lasting 0.1 s). For convenience in simulation, the speeds of both CH and CM are uniformly set to 10 m/s (1 m per step). The weights of different rewards in the heterogeneous UAV monitoring system are determined through the training process, and the specific task scenario is adjusted to achieve a balance between control and sensing objectives. The simulation parameters are summarized in Table 2.
The weights of individual reward components, such as the coverage reward, localization accuracy reward, and communication cost penalty, were determined through a grid search over a reasonable range and further refined by incorporating UAV maneuverability constraints and UHF propagation characteristics. Starting from balanced initial weights, we iteratively adjusted them to prioritize successful localization while ensuring feasible trajectories and low communication overhead. HMUDRL exhibits robust performance under moderate variations in these weights.

6.2. Experimental Results and Analysis

This study establishes a multi-agent UAV monitoring platform, deploying three CH and fourteen CM. In each experiment, the UHF radiation source is randomly placed within the experimental area, and the CH control the CM to perform localization tasks. The system records key metrics throughout the process. As shown in Figure 4, the red solid line represents the trajectory of the CH during the localization process; the blue dots and dashed lines indicate the movement paths of the CM; the yellow star marks the actual location of the UHF radiation source; the green "×" denotes the initial position of the CM; and the concentric circles around the radiation source, through their color changes, represent the gradient of the radiation field strength. Figure 4a illustrates the localization trajectories under RSS-only conditions (i.e., without AOA information). After completing the full HMUDRL algorithm training, both the CH and CM effectively converge toward the radiation source. The CM perform spiral search motions around the source, ultimately achieving precise localization, with the final estimated source position closely matching the true radiation source location. Figure 4b shows the case where the CM are randomly deployed within the designated area but fail to locate the source due to the lack of dynamic coordination mechanisms; in this scenario, localization performance deteriorates significantly compared to when the HMUDRL algorithm is employed. Figure 4c presents the situation where the clustering optimization function of the CH is disabled. Without dynamic cluster adjustment, the CH remains fixed at a predefined position (e.g., uniformly distributed), which limits the ability of the CM using DDMPPO to explore optimally; as a result, the overall localization efficiency is compromised. Figure 4d demonstrates the performance of the HMUDRL algorithm under high RSS and AOA noise conditions.
Despite the more challenging environment, the system’s localization accuracy slightly decreases, yet it still achieves relatively accurate source positioning. This highlights the robustness and adaptability of the HMUDRL algorithm in complex and noisy monitoring environments.
Figure 5 compares the moving-average total rewards over 1000 training episodes between the proposed HMUDRL algorithm and four baseline methods (MADDPG, MASAC, COMA, and MAA2C). The x-axis represents the number of training episodes, while the y-axis shows the smoothed cumulative reward values, reflecting the stability and convergence performance of each algorithm's policy optimization throughout the prolonged learning process.
Overall, the HMUDRL algorithm (red curve) demonstrates a significant learning advantage in the early stages of training, with its reward value rapidly increasing and stabilizing after approximately the 150th episode, ultimately settling around 180. This indicates that the algorithm possesses rapid convergence capabilities and strong policy exploration efficiency. In contrast, although the MASAC algorithm (purple curve) also exhibits a fast initial growth rate, it suffers from significant reward fluctuations and noticeable oscillations in later stages, suggesting instability in its policy. The MAA2C algorithm (green curve) maintains a relatively stable growth trend throughout the training process, with reward values ranging between 70 and 90, demonstrating good robustness; however, its maximum reward level is notably lower than that of HMUDRL, indicating room for improvement in cooperative decision-making capability. The COMA algorithm (orange curve) experiences severe reward volatility, especially with several significant declines during the mid-training phase, suggesting that its counterfactual baseline mechanism struggles to effectively model individual contributions in complex heterogeneous environments, leading to unstable policy learning. The MADDPG algorithm (blue curve) performs the worst, with its reward values hovering around 25 without significant changes as training progresses, indicating that its centralized Critic network faces severe dimensionality issues when handling large-scale heterogeneous UAV swarms, resulting in significant value estimation errors and difficulties in policy updates. Notably, by introducing a hierarchical architecture fusion mechanism, the HMUDRL algorithm successfully achieves differentiated policy learning for two types of agents: CH and CM, thereby avoiding performance bottlenecks caused by unified policy spaces in traditional MARL approaches. 
Moreover, the hierarchical reward design further enhances the policy’s responsiveness to environmental observations, enabling continuous optimization of localization behaviors in dynamic electromagnetic environments. The HMUDRL algorithm not only leads significantly in final reward levels but also outperforms existing mainstream multi-agent reinforcement learning algorithms in terms of training stability and convergence speed, fully validating its effectiveness and superiority in the task of heterogeneous UAV collaborative localization.
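The reward smoothing used for Figure 5 can be sketched as a simple sliding-window mean over per-episode rewards. The window size of 50 episodes is an assumption for illustration; the paper does not state its averaging window:

```python
def moving_average(rewards, window=50):
    """Smooth per-episode rewards with a sliding-window mean, as is
    commonly done when plotting training curves such as Figure 5.
    The first window-1 points use a partial average so the smoothed
    curve has the same length as the raw reward sequence."""
    smoothed = []
    for i in range(len(rewards)):
        lo = max(0, i - window + 1)
        chunk = rewards[lo : i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

Applied to the raw episode rewards of each algorithm, this produces curves directly comparable to the five plotted in Figure 5.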
Table 3 presents a comparative analysis of the proposed HMUDRL algorithm against four baseline methods, in terms of root mean square error (RMSE), localization success rate, and the number of steps to first detection over 1000 training episodes. The results demonstrate that HMUDRL significantly outperforms all baselines in the UHF radiation source localization task. In terms of RMSE, HMUDRL achieves a mean error of 39.34 m, substantially lower than MADDPG (620.43 m), MASAC (395.88 m), and COMA (135.02 m), and less than half that of MAA2C (82.70 m). More importantly, HMUDRL exhibits the lowest median RMSE among all methods at 22.83 m, indicating consistent high-precision performance. Its maximum localization error is 678.11 m, notably lower than those of MASAC (1774.58 m) and MADDPG (1125.30 m), reflecting improved robustness under worst-case scenarios. Regarding efficiency, HMUDRL requires only 6 steps on average to first detect the target, outperforming all baselines except MADDPG (5 steps). Crucially, this rapid detection is achieved without compromising accuracy, as HMUDRL maintains a high localization success rate, underscoring its effective exploration capability in dynamic environments. These advantages stem from the proposed hierarchical architecture: CH employ the HMPPO algorithm to optimize cluster formation and mobility control, while CM utilize the DDMPPO algorithm to balance sensing performance and energy consumption. This dual-policy design, combined with a hierarchical reward fusion mechanism, mitigates the curse of dimensionality in policy updates. Furthermore, differentiated entropy regularization prevents premature convergence in either agent branch. In contrast, MADDPG suffers from dimensionality explosion in its centralized critic network, leading to biased value estimation, while MASAC's increased parameter complexity tends to induce policy oscillations.
MAA2C, which shares a single critic across heterogeneous agents, fails to account for their distinct objectives and observation spaces, thereby limiting localization accuracy. These limitations collectively highlight the superiority of HMUDRL in addressing the challenges of heterogeneous multi-UAV cooperative localization.
Figure 6 further reveals the characteristics of the data distribution. The RMSE distribution of the HMUDRL algorithm is notably more concentrated, with a narrow interquartile range (IQR), indicating high consistency and robustness in localization performance. In contrast, the boxplots for MADDPG and MASAC exhibit significantly elongated boxes accompanied by numerous outliers (e.g., MADDPG reaches a maximum RMSE of 1125 m), demonstrating substantial error fluctuations and poor stability. Although COMA and MAA2C display relatively compact distributions compared to MADDPG and MASAC, their boxplots are visibly wider than that of HMUDRL, suggesting degraded localization accuracy. Notably, COMA achieves a localization success rate (96.3%) comparable to HMUDRL (96.1%); however, its mean RMSE (135.02 m) is approximately 3.4 times higher than that of HMUDRL (39.34 m), revealing inadequate performance in complex or noisy environments despite frequent detection. Overall, the HMUDRL algorithm achieves a breakthrough in the integrated performance of accuracy, stability, and efficiency through architectural innovation and mechanism optimization.
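The summary statistics behind Table 3 and the boxplots of Figure 6 can be reproduced from raw per-episode errors with a short helper. The 20 m success threshold mirrors D_SS' in Table 2; the authors' exact success criterion and percentile convention are assumptions here:

```python
from statistics import mean, median

def percentile(data, p):
    """Linear-interpolation percentile on a sorted copy of the data
    (the same convention NumPy uses by default)."""
    s = sorted(data)
    k = (len(s) - 1) * p / 100.0
    f = int(k)
    c = min(f + 1, len(s) - 1)
    return s[f] + (k - f) * (s[c] - s[f])

def summarize_errors(errors_m, success_threshold_m=20.0):
    """Per-episode localization errors -> the kind of summary statistics
    reported in Table 3 and Figure 6 (mean/median/max error, IQR as the
    box height, and success rate against a distance threshold)."""
    return {
        "mean": mean(errors_m),
        "median": median(errors_m),
        "max": max(errors_m),
        "iqr": percentile(errors_m, 75) - percentile(errors_m, 25),
        "success_rate": sum(e <= success_threshold_m for e in errors_m) / len(errors_m),
    }
```

A narrow IQR with a low median, as HMUDRL exhibits, indicates that most episodes cluster tightly around a small error rather than averaging out large failures.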
Given that both the HMUDRL algorithm and the MASAC baseline converge within the first 200 episodes, as shown in Figure 5, this study uses 200 training episodes to compare the computational efficiency of the proposed HMUDRL framework against the four baseline algorithms (MADDPG, MASAC, COMA, and MAA2C). The primary metric is the duration per training episode (in seconds), which directly reflects training runtime and serves as an effective proxy for inference latency during online deployment. As Figure 7 shows, over the entire span of 200 training episodes, HMUDRL and the three baselines MASAC, COMA, and MAA2C exhibit similar per-episode durations, consistently remaining within a narrow range of approximately 0.7 to 1.3 s. In contrast, MADDPG averages around 8 to 9 s per episode, an order of magnitude greater.
Moreover, as shown in Table 3, HMUDRL achieves significantly higher localization accuracy and success rate than MASAC, COMA, and MAA2C. This superior performance, combined with comparable computational efficiency, makes HMUDRL stand out overall. From a practical deployment perspective, the low per-episode duration translates directly into low inference latency during online operation. Given that our experiments were conducted on a standard CPU without GPU acceleration, the observed latency of approximately 1 s for HMUDRL holds strong promise for real-time applications on resource-constrained UAV platforms. Such low latency ensures that the UAV swarm can respond rapidly to dynamic changes in the electromagnetic environment, enabling timely repositioning and accurate target localization.
Turning to the internal information exchange frequency within UAV swarms, different network structures exhibit significant variations in communication overhead. In the heterogeneous architecture, CM must upload AOA estimates and RSS information to their CH, which perform source localization estimation based on these data and issue control or clustering instructions to the CM. Simultaneously, CH share positions or estimation results with one another to achieve cross-cluster cooperation. In our experimental setup, we simplify by assuming that at each time step, each CM sends its state information (including AOA and RSS) once to its CH, and any pair of CH communicates once if they establish a connection. By contrast, in k-NN-type homogeneous structures, all drones are of the same type, and each drone exchanges information only with its 5 nearest neighbors due to communication range limitations. In fully connected homogeneous structures, every drone communicates directly with every other drone, forming a completely connected network.
Table 4 presents the average number of information exchanges per episode among UAV swarms under different architectures, along with the corresponding communication savings ratios. In the minimal configuration (1CH + 8CM), the heterogeneous architecture requires approximately 900 exchanges per episode, significantly less than the 4500 exchanges needed in the k-NN homogeneous structure and the 7200 exchanges in the fully connected homogeneous structure. This translates into communication savings of 80.0% and 87.5%, respectively. As the system scales up to the maximum configuration (7CH + 30CM), the number of exchanges in the heterogeneous architecture increases to 3100, whereas it skyrockets to 18,500 for k-NN and an overwhelming 133,200 for fully connected architectures. At this scale, the heterogeneous architecture’s communication savings ratio escalates to 83.2% compared to k-NN and 97.7% compared to fully connected structures. This trend underscores the significant advantage of heterogeneous architectures in reducing internal information exchange frequency within UAV swarms, especially when deploying large-scale UAV systems, leading to substantial reductions in communication overhead and improved overall system efficiency. Furthermore, these findings validate from a communication efficiency perspective the critical role of the proposed HMUDRL algorithm in optimizing UAV swarm collaborative localization. Specifically, in executing UHF radiation source localization tasks under complex electromagnetic environments, the HMUDRL algorithm achieves notable improvements in system-wide energy efficiency and scalability by implementing rational role division and sparse communication mechanisms, ensuring high positioning accuracy while drastically enhancing overall performance.
Figure 8 intuitively illustrates that, as the number of UAVs increases, the heterogeneous architecture requires significantly fewer information exchanges compared to homogeneous architectures. This disparity primarily stems from the more efficient communication strategy inherent in the heterogeneous design, wherein CH aggregate and broadcast information on behalf of their member CM, thereby substantially reducing redundant message transmissions across the swarm.
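Under the simplifications stated above, the homogeneous-baseline message counts in Table 4 can be reproduced with a few lines. T = 100 steps per episode comes from Table 2, messages are counted as directed sends (an assumption consistent with the table's figures), and the heterogeneous totals (900 and 3100) are read from Table 4 rather than derived here:

```python
def homogeneous_exchanges(n_uavs, steps=100, k=5):
    """Directed messages per episode for the two homogeneous baselines:
    k-NN (each UAV sends to its k nearest neighbors every step) and
    fully connected (each UAV sends to every other UAV every step)."""
    knn = n_uavs * k * steps
    full = n_uavs * (n_uavs - 1) * steps
    return knn, full

def savings_ratio(heterogeneous, baseline):
    """Fraction of messages saved by the heterogeneous architecture."""
    return 1.0 - heterogeneous / baseline
```

For the minimal configuration (9 UAVs) this yields 4500 and 7200 exchanges, and for the maximum configuration (37 UAVs) 18,500 and 133,200, matching Table 4; the corresponding savings ratios evaluate to 80.0%/87.5% and 83.2%/97.7%.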
To quantitatively assess the contribution of each core component in the proposed HMUDRL framework, we conduct a controlled ablation study by systematically removing one module at a time while keeping all other hyperparameters and training procedures unchanged. Specifically, we evaluate three variants:
  • HMUDRL w/o Hierarchical Control (HC): disables the hierarchical control mechanism, allowing agents to act solely based on local observations without high-level coordination;
  • HMUDRL w/o Dynamic Clustering (DC): replaces dynamic clustering with static, pre-defined clusters, thereby removing the adaptability to source mobility;
  • HMUDRL w/o AOA: excludes AOA measurements from agent observations, relying only on RSS.
Figure 9 presents the cumulative distribution functions (CDFs) of localization error for the full HMUDRL model and its ablated versions after 1000 episodes. As expected, the complete HMUDRL consistently outperforms all variants across the majority of the error spectrum, confirming the synergistic benefit of integrating hierarchical control, dynamic clustering, and RSS + AOA. Notably, the removal of AOA leads to the most significant performance degradation, particularly in the high-accuracy regime (error < 5 m), underscoring the critical role of directional information in fine-grained localization. Similarly, disabling dynamic clustering results in slower convergence and reduced robustness under source movement, as evidenced by the rightward shift of its CDF curve.
An interesting observation is that the variant w/o HC exhibits a slightly higher cumulative probability than the full model in the sub-2 m region. We attribute this to transient over-correction by the high-level controller during the final refinement phase, which can occasionally perturb otherwise accurate estimates. However, this local advantage does not translate into global superiority: beyond 10 m, the full HMUDRL demonstrates markedly better performance, achieving a 90th-percentile error of 66.86 m, compared to 97.54 m for w/o HC as shown in Table 5. This confirms that hierarchical coordination is essential for efficient global search and long-tail error suppression.
Quantitative metrics further validate these trends: the full HMUDRL achieves the lowest median error (22.83 m), 90th-percentile error (66.86 m), and MAE (43.54 m) among all configurations. These results collectively demonstrate that each proposed component (hierarchical control, dynamic clustering, and AOA integration) provides distinct and complementary gains, and that their joint design is key to the overall performance.
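The CDF-based metrics quoted in the ablation study (median, 90th-percentile error, MAE) follow from the standard empirical-CDF construction over per-episode errors. This is a minimal sketch, not the authors' evaluation code; the interpolation convention is an assumption:

```python
def error_cdf_metrics(errors_m):
    """Empirical CDF of localization error plus the percentile metrics
    used in the ablation comparison. Errors are non-negative, so the
    MAE reduces to their plain mean."""
    s = sorted(errors_m)
    n = len(s)

    def pct(p):
        # Linear-interpolation percentile (NumPy's default convention).
        k = (n - 1) * p / 100.0
        f = int(k)
        c = min(f + 1, n - 1)
        return s[f] + (k - f) * (s[c] - s[f])

    cdf = [(x, (i + 1) / n) for i, x in enumerate(s)]  # P(error <= x)
    return {"median": pct(50), "p90": pct(90), "mae": sum(s) / n, "cdf": cdf}
```

Plotting the returned (error, probability) pairs for each variant yields curves of the kind shown in Figure 9, where a left-shifted curve indicates better accuracy.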

6.3. Comparative Analysis of Model Complexity

In electromagnetic spectrum monitoring and localization tasks, the model complexity of MARL algorithms directly impacts training efficiency, communication overhead, and deployment feasibility. The proposed HMUDRL algorithm incorporates a hierarchical architecture along with RSS-variation modeling and AOA-based sensing mechanisms, resulting in a distinct complexity profile compared to classical MARL approaches such as MADDPG [23], MASAC [24], COMA [25], and MAA2C [26]. In terms of parameter scale, HMUDRL adopts a heterogeneous dual-network structure: CH agents employ a standard Actor-Critic network with an input dimension of 13 and a hidden-layer size of 128, while CM agents incorporate a gating fusion module based on a 128-dimensional multilayer perceptron (MLP), along with an action-prior module that softly maps AOA angles to actions. With an action space of size 5 and a CH state dimension of 13, the CH networks contain approximately P_CH = 2 × (13 × 128 + 128 × 128 + 128 × 5) ≈ 40,000 parameters. The CM state dimension is 11, which, including the gating logic, leads to about P_CM ≈ 45,000. For a system with N_CH = 3 and N_CM = 14, the total parameter count is approximately 750,000.
In contrast, MADDPG and MASAC adopt a centralized training, decentralized execution (CTDE) framework, in which each agent maintains an individual policy network while sharing a centralized Q-network. The input dimension of the centralized critic grows with the joint state-action space of all agents: for example, with N = 17 agents and a per-agent state dimension of d = 13 plus a 5-dimensional action, the critic input dimension reaches 17 × (13 + 5) = 306, per-layer parameter counts can exceed 10^5, and the total parameter count approaches the 10^6 scale. COMA reduces variance via counterfactual baselines but requires the critic to evaluate all possible joint actions, resulting in computational complexity of O(|A|^N). MAA2C is fully decentralized: each agent independently trains its own Actor and Critic, but the lack of inter-agent collaboration makes it ill-suited for cooperative tasks. From a computational-complexity perspective, HMUDRL decouples global coordination into CH-level macro-deployment and CM-level local sensing and motion planning. This hierarchical design avoids modeling high-dimensional joint action spaces, keeping model complexity at O(N), in contrast to MADDPG's O(N^2) and COMA's O(|A|^N). Thus, HMUDRL demonstrates superior scalability and practicality for large-scale electromagnetic spectrum monitoring tasks.
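The back-of-envelope parameter counts above can be checked directly. Modeling each CH policy head as a three-layer MLP (input → 128 → 128 → output), ignoring biases, and treating actor and critic as two identical heads (the source of the factor of 2) are assumptions that reproduce the quoted figures:

```python
def mlp_weights(in_dim, hidden, out_dim):
    """Weight count of a three-layer MLP (biases ignored, matching the
    rough estimate in the text): in->hidden, hidden->hidden, hidden->out."""
    return in_dim * hidden + hidden * hidden + hidden * out_dim

# CH agents: actor + critic assumed to share the 13 -> 128 -> 128 -> 5
# layout, hence the factor of 2 in the P_CH expression.
p_ch = 2 * mlp_weights(13, 128, 5)   # 37,376, i.e. roughly 40,000
p_cm = 45_000                        # stated estimate incl. gating module
total = 3 * p_ch + 14 * p_cm         # 742,128, i.e. roughly 750,000
```

The total sits within about 1% of the paper's stated 750,000, and, because each agent type carries a fixed-size network, it grows linearly with the number of agents rather than with the joint action space.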

7. Conclusions

This paper proposes HMUDRL, a distributed heterogeneous multi-agent deep reinforcement learning framework designed for UHF electromagnetic radiation source localization in UAV swarm-based spectrum monitoring missions. By introducing a hierarchical architecture, HMUDRL categorizes UAVs into two distinct types of agents, CH and CM, each assigned unique roles, observation spaces, and reward mechanisms. This design enables efficient collaboration, significantly reduces communication overhead, and supports scalable decision-making. Experimental results demonstrate that HMUDRL achieves a localization success rate of 96.1% over 1000 test episodes, with an average RMSE of only 39.34 m, substantially outperforming four baseline algorithms in both accuracy and robustness. Notably, HMUDRL reduces the average localization error relative to these baselines by approximately 87.3%, while its cluster-based information aggregation mechanism cuts inter-agent communication volume by more than 80%. These results validate the practical deployment potential of HMUDRL in real-world scenarios. For instance, the system can be used in urban environments to detect illegal broadcasts by rapidly localizing unauthorized UHF transmitters; it can assist regulatory authorities in identifying sources of harmful interference during spectrum enforcement; and in emergency responses such as earthquakes or wildfires, it can locate distress signals from trapped individuals carrying compact UHF beacons. The current implementation is limited to 2D simulations with idealized channel models and Gaussian sensor-noise assumptions, and it lacks hardware-in-the-loop validation; future work will address these gaps by incorporating 3D flight dynamics, realistic RF propagation environments, and real-time UAV testbed experiments.

Author Contributions

Conceptualization, Y.S. and X.Z. (Xueqing Zhang); methodology, Y.S. and M.W.; software, Y.S. and Y.Y.; validation, Y.S., X.Z. (Xueqing Zhang) and T.X.; formal analysis, X.Z. (Xuan Zhu) and T.C.; investigation, Y.S., M.W. and X.Z. (Xueqing Zhang); resources, Y.S., M.W., X.Z. (Xuan Zhu), Y.Y. and T.C.; data curation, M.W. and X.Z. (Xueqing Zhang); writing—original draft preparation, Y.S. and M.W.; writing—review and editing, Y.S., X.Z. (Xueqing Zhang), M.W., Y.Y. and T.X.; visualization, Y.S. and T.X.; supervision, Y.S. and X.Z. (Xuan Zhu); project administration, Y.S., M.W. and X.Z. (Xueqing Zhang); funding acquisition, M.W. and X.Z. (Xueqing Zhang). All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
HMUDRL: Heterogeneous Multi-UAV Deep Reinforcement Learning
UHF: Ultra High Frequency
UAV: Unmanned Aerial Vehicle
CH: Cluster-Head UAVs
CM: Cluster-Monitoring UAVs
RMSE: Root Mean Square Error
RF: Radio Frequency
MARL: Multi-Agent Reinforcement Learning
AOA: Angle of Arrival
RSS: Received Signal Strength
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
SNR: Signal-to-Noise Ratio
SOM: Self-Organizing Map
PSO: Particle Swarm Optimization
HCBGSO: Hybrid Colliding Bodies Galaxy Swarm Optimization
GA: Genetic Algorithms
DE: Differential Evolution
TDOA: Time Difference of Arrival
HPSO: Hybrid Particle Swarm Optimization
RL: Reinforcement Learning
DQN: Deep Q-Networks
DDPG: Deep Deterministic Policy Gradient
PPO: Proximal Policy Optimization
TS-DRL: Token-Specific Deep Reinforcement Learning
MADDPG: Multi-Agent Deep Deterministic Policy Gradient
MASAC: Multi-Agent Soft Actor-Critic
COMA: Counterfactual Multi-Agent
MAA2C: Multi-Agent Advantage Actor-Critic
PDF: Probability Density Function
CRLB: Cramér–Rao Lower Bound
BER: Bit Error Rate
DRSS: Dynamic Range-Sensitive Source Localization
MAPPO: Multi-Agent PPO
QMIX: Q-Mixing
HMPPO: Hierarchical Multi-Agent PPO
DDMPPO: Decentralized Distributed Multi-Agent PPO
GAE: Generalized Advantage Estimation
TD: Temporal Difference
IQR: Interquartile Range
MLP: Multilayer Perceptron
CTDE: Centralized Training Decentralized Execution
HC: Hierarchical Control
DC: Dynamic Clustering
CDF: Cumulative Distribution Function
MAE: Mean Absolute Error

References

1. Kalatzis, D.; Ploussi, A.; Spyratou, E.; Panagiotakopoulos, T.; Efstathopoulos, E.P.; Kiouvrekis, Y. Explainable AI for Spectral Analysis of Electromagnetic Fields. IEEE Access 2025, 13, 113407–113427.
2. Al Mahmud, K.; Kurum, M. SDR-Based S-Band Radiometer for UAS Platforms with Spectrum Monitoring and Dynamic Allocation. In Proceedings of the 2025 United States National Committee of URSI National Radio Science Meeting (USNC-URSI NRSM), Boulder, CO, USA, 7–10 January 2025; pp. 244–245.
3. Chen, Y.; Zhu, Q.; Wang, J.; Jia, Z.; Wang, X.; Lin, Z.; Huang, Y.; Wu, Q.; Briso-Rodríguez, C. UAV-Aided Efficient Informative Path Planning for Autonomous 3D Spectrum Mapping. IEEE Trans. Cogn. Commun. Netw. 2025, 12, 1664–1677.
4. Wang, Y.; An, J.; Shao, M.; Wu, J.; Zhou, D.; Yao, X.; Zhang, X.; Cao, W.; Jiang, C.; Zhu, Y. A Comprehensive Review of Proximal Spectral Sensing Devices and Diagnostic Equipment for Field Crop Growth Monitoring. Precis. Agric. 2025, 26, 54.
5. Testi, E.; Giorgetti, A. Wireless Network Analytics for the New Era of Spectrum Patrolling and Monitoring. IEEE Wirel. Commun. 2024, 31, 230–236.
6. Fang, Z.; Savkin, A.V. Strategies for Optimized UAV Surveillance in Various Tasks and Scenarios: A Review. Drones 2024, 8, 193.
7. Javed, S.; Hassan, A.; Ahmad, R.; Ahmed, W.; Ahmed, R.; Saadat, A.; Guizani, M. State-of-the-Art and Future Research Challenges in UAV Swarms. IEEE Internet Things J. 2024, 11, 19023–19045.
8. Chen, J.; Zhang, Z.; Fan, D.; Hou, C.; Zhang, Y.; Hou, T.; Zou, X.; Zhao, J. Distributed Decision Making for Electromagnetic Radiation Source Localization Using Multi-Agent Deep Reinforcement Learning. Drones 2025, 9, 216.
9. Gao, R.; Yan, G.; Niu, R.; Chang, W.; Yan, T.; Tang, C. A Novel Spectrum Sensing Method for Multiple Unknown Signal Sources Using Frequency Domain Energy Detection and DBSCAN. IEEE Access 2025, 13, 76811–76837.
10. Radhi, A.A.; Abdullah, H.N.; Akkar, H.A.R. Denoised Jarque-Bera Features-Based K-Means Algorithm for Intelligent Cooperative Spectrum Sensing. Digit. Signal Process. 2022, 129, 103659.
11. Fouda, H.S.; Farghaly, S.I.; Dawood, H.S. Weighted Joint LRTs for Cooperative Spectrum Sensing Using K-Means Clustering. Phys. Commun. 2024, 67, 102528.
12. Tao, B.; Wu, J.; Dou, X.; Wang, J.; Xu, Y. Memorial K-Means Clustering for Cooperative Spectrum Sensing in Cognitive Wireless Sensor Networks at Low SNR Regimes. Sens. Rev. 2025, 45, 443–452.
13. Konink-Donner, E.; Ruen, A.; Jha, R. Clustering RF Signals with the Growing Self-Organizing Map for Dynamic Spectrum Access. In Proceedings of the NAECON 2023—IEEE National Aerospace and Electronics Conference, Dayton, OH, USA, 28–31 August 2023; pp. 249–253.
14. Zhang, W.; Zhang, W. An Efficient UAV Localization Technique Based on Particle Swarm Optimization. IEEE Trans. Veh. Technol. 2022, 71, 9544–9557.
15. Wang, K.; Kooistra, L.; Pan, R.; Wang, W.; Valente, J. UAV-based Simultaneous Localization and Mapping in Outdoor Environments: A Systematic Scoping Review. J. Field Robot. 2024, 41, 1617–1642.
16. Dixit, A.; Devi, M.N.N.; Gazi, F.; Hussain, M.M. OAL-HMT: Optimized AAV Localization Using Hybrid Metaheuristic Techniques. IEEE J. Indoor Seamless Position. Navig. 2025, 3, 142–151.
17. Bandari, S.; Nirmala Devi, L. A Multi-Objective Approach for Optimal Target Coverage UAV Placement: Hybrid Heuristic Formulation. J. Control Decis. 2025, 12, 551–567.
18. Chen, F.; Li, H.; Lin, Z.; Zhu, Q.; Zhong, W.; Chen, X.; Zhou, J.; Li, H. Optimized Genetic Algorithm-Based Multi-UAV Cooperative TDOA Localization for Complex Multipath Scenarios. In Proceedings of the 2024 IEEE 24th International Conference on Communication Technology (ICCT), Chengdu, China, 18–20 October 2024; pp. 1283–1287.
19. Arafat, M.Y.; Moh, S. Localization and Clustering Based on Swarm Intelligence in UAV Networks for Emergency Communications. IEEE Internet Things J. 2019, 6, 8958–8976.
20. Ebrahimi, D.; Sharafeddine, S.; Ho, P.-H.; Assi, C. Autonomous UAV Trajectory for Localizing Ground Objects: A Reinforcement Learning Approach. IEEE Trans. Mob. Comput. 2021, 20, 1312–1324.
21. Shurrab, M.; Mizouni, R.; Singh, S.; Otrok, H. Reinforcement Learning Framework for UAV-Based Target Localization Applications. Internet Things 2023, 23, 100867.
22. Guan, Q.; Cao, H.; Tan, J.; Jia, L.; Yan, D.; Chen, B. Token-Specific Deep Reinforcement Learning for Energy-Efficient Capacitated Electric Vehicle Routing Problems. Appl. Energy 2025, 396, 126314.
23. Hou, Y.; Zhao, J.; Zhang, R.; Cheng, X.; Yang, L. UAV Swarm Cooperative Target Search: A Multi-Agent Reinforcement Learning Approach. IEEE Trans. Intell. Veh. 2024, 9, 568–578.
24. Qin, Y.; Zhang, Z.; Li, X.; Huangfu, W.; Zhang, H. Deep Reinforcement Learning Based Resource Allocation and Trajectory Planning in Integrated Sensing and Communications UAV Network. IEEE Trans. Wirel. Commun. 2023, 22, 8158–8169.
25. Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S. Counterfactual Multi-Agent Policy Gradients. arXiv 2024, arXiv:1705.08926v3.
26. Wang, Q.; Xu, W.; Chen, H.-H. A Heterogeneous-Agent Deep Reinforcement Learning Approach for Dynamic Spectrum Access in Cognitive Wireless Networks. IEEE Trans. Cogn. Commun. Netw. 2025, 12, 2221–2235.
27. Ekechi, C.C.; Elfouly, T.; Alouani, A.; Khattab, T. A Survey on UAV Control with Multi-Agent Reinforcement Learning. Drones 2025, 9, 484.
28. Wang, M.; Chen, P.; Cao, Z.; Chen, Y. Reinforcement Learning-Based UAVs Resource Allocation for Integrated Sensing and Communication (ISAC) System. Electronics 2022, 11, 441.
29. Jiang, K.; Tian, K.; Feng, H.; Zhao, Y.; Wang, D.; Gao, J.; Cao, S.; Zhang, X.; Li, Y.; Yuan, J.; et al. Distributed UAV Swarm Augmented Wideband Spectrum Sensing Using Nyquist Folding Receiver. IEEE Trans. Wirel. Commun. 2024, 23, 14171–14184.
30. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347.
31. Liao, X.; Wang, Y.; Han, Y.; Li, Y.; Lin, C.; Zhu, X. Heterogeneous Multi-Agent Deep Reinforcement Learning for Cluster-Based Spectrum Sharing in UAV Swarms. Drones 2025, 9, 377.
32. Vo, V.N.; Nguyen, L.-M.-D.; Tran, H.; Dang, V.-H.; Niyato, D.; Cuong, D.N.; Luong, N.C.; So-In, C. Outage Probability Minimization in Secure NOMA Cognitive Radio Systems With UAV Relay: A Machine Learning Approach. IEEE Trans. Cogn. Commun. Netw. 2023, 9, 435–451.
33. Mustafa Abro, G.E.; Abdallah, A.M. Graph Attention Networks For Anomalous Drone Detection: RSSI-Based Approach with Real-World Validation. Expert Syst. Appl. 2025, 273, 126913.
34. Chen, L.; You, C.; Wang, Y.; Li, X. Variable-Speed UAV Path Optimization Based on the CRLB Criterion for Passive Target Localization. Sensors 2025, 25, 5297.
35. Vinh Hien, D.; Le Hoang Anh, N. Nonlinear Scalarizations in Set Optimization with Variable Ordering Structures and Applications. J. Appl. Math. Comput. 2025, 71, 1609–1630.
36. Liang, J.; Ban, X.; Yu, K.; Qu, B.; Qiao, K.; Yue, C.; Chen, K.; Tan, K.C. A Survey on Evolutionary Constrained Multiobjective Optimization. IEEE Trans. Evol. Comput. 2023, 27, 201–221.
37. Yu, K.; Yang, Z.; Liang, J.; Qiao, K.; Qu, B.; Suganthan, P.N. An Individual Adaptive Evolution and Regional Collaboration Based Evolutionary Algorithm for Large-Scale Constrained Multiobjective Optimization Problems. Swarm Evol. Comput. 2025, 95, 101925.
38. Li, K.; Lai, G.; Yao, X. Interactive Evolutionary Multiobjective Optimization via Learning to Rank. IEEE Trans. Evol. Comput. 2023, 27, 749–763.
39. Apaza, R.D.; Han, R.; Li, H.; Knoblock, E.J. Intelligent Spectrum and Airspace Resource Management for Urban Air Mobility Using Deep Reinforcement Learning. IEEE Access 2024, 12, 164750–164766.
40. Xing, N.; Zong, Q.; Dou, L.; Tian, B.; Wang, Q. A Game Theoretic Approach for Mobility Prediction Clustering in Unmanned Aerial Vehicle Networks. IEEE Trans. Veh. Technol. 2019, 68, 9963–9973.
Figure 1. Heterogeneous UAV swarm for monitoring UHF radiation source.
Figure 2. Spatial relationship model for link establishment between CH and CMs.
Figure 3. Architecture of the proposed HMUDRL framework for UHF radiation source localization. The system consists of two types of UAV agents: CH and CM, each governed by a dedicated policy network. CM agents observe local signal features, including RSS, RSS variation, AOA, and distance to their associated CH, and execute movement actions using the DDMPPO algorithm to optimize sensing geometry. CH agents employ the HMPPO algorithm, receive aggregated AOA and RSS reports from their cluster members, and adjust their positions to balance three objectives: angular diversity, CM coverage quality, and spatial dispersion among clusters, thereby enhancing both the coverage and accuracy of radiation source localization.
Figure 4. Trajectories of CH and CM during the localization process: (a) Localization trajectories under standard RSS noise of 2 dB and AOA noise of 5°; (b) Localization trajectories with CH and CM randomly deployed and fixed in position; (c) Localization trajectories when clustering optimization of the CH is disabled, with the CH fixed at a predefined location and only the CM moving using the DDMPPO algorithm; (d) Localization trajectories under high-noise conditions with RSS noise of 6 dB and AOA noise of 20°.
Figure 5. Comparison of average reward curves between the proposed algorithm and four baseline methods.
Figure 6. Boxplot comparison of the proposed algorithm with four baseline algorithms.
Figure 7. Comparison of computational efficiency and latency between the proposed algorithm and four baseline algorithms.
Figure 8. Comparison of average internal information exchange frequency among UAV swarms under different network structures.
Figure 9. CDF of localization error for HMUDRL and its ablated versions under UHF radiation source localization scenarios.
Table 1. Comparison of MARL Approaches for Heterogeneous, Scalable UAV Coordination.
Algorithm | Supports Heterogeneous Agents | Supports Hierarchical Architecture | Communication Mechanism | Scalability
MADDPG [23] | Yes (limited support) | Yes (with modifications) | Centralized critic; all-to-all message passing during training | Poor (critic input scales quadratically)
MASAC [24] | Yes (limited support) | Theoretically feasible but inefficient | Centralized training with shared critics; dense inter-agent messaging | Moderate (high parameter overhead)
COMA [25] | Yes (limited support) | Yes (with modifications) | Centralized critic with counterfactual baseline; full joint-action enumeration | Poor (exponential in number of agents)
MAA2C [26] | Yes (limited support) | Yes (with modifications) | Fully decentralized; no explicit coordination | Moderate (weak collaboration)
MAPPO [8] | Yes (highly flexible) | Yes (with modifications) | Global state for centralized critic | Moderate (global critic causes memory bottleneck)
HMUDRL (Ours) | Yes | Yes | Cluster-based aggregation; CH broadcasts fused info | Good (complexity decoupled from swarm size)
Table 2. Hyperparameter setting.
| System Parameter | Numerical Setting | Description |
|---|---|---|
| m | 3 | number of CHs |
| n | 14 | number of CMs |
| episodes | 1000 | number of training episodes |
| T | 100 | time steps per episode |
| α_a | 0.0001 | actor learning rate |
| α_c | 0.001 | critic learning rate |
| γ | 0.99 | discount factor |
| λ | 0.95 | GAE lambda |
| ϵ | 0.1 | policy update clipping parameter |
| α_h | 0.6 | weight of CH reward in joint reward |
| α_m | 0.4 | weight of CM reward in joint reward |
| ω_loc^h | 0.6 | weight of localization confidence reward |
| ω_cov^h | 0.2 | weight of coverage quality reward |
| ω_dir^h | 0.2 | weight of exploration diversity reward |
| ω_ΔPr^m | 0.5 | weight of RSS gain reward |
| ω_d^m | 0.2 | weight of distance-based reward to CH |
| ω_ϕ^m | 0.3 | weight of valid AOA reward |
| P_0 | −20 | reference RSS value at 1 m |
| n | 2.0 | path loss exponent |
| σ_RSS | 2.0 | RSS measurement standard deviation (dB) |
| σ_AOA | 5.0 | AOA measurement noise standard deviation (°) |
| γ_AOA | −70.0 | minimum RSS threshold for AOA estimation (dBm) |
| D_SS′ | 20 | localization success threshold (m) |
| D_safe | 1 | safe distance (m) |
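Each reward-weight group in Table 2 forms a convex combination: the joint-reward weights (α_h, α_m), the CH reward components, and the CM reward components each sum to 1. A minimal sanity check, with variable names invented here for illustration:

```python
# Hypothetical names mirroring Table 2; checks that each reward-weight
# group is a convex combination (non-negative, sums to 1).
joint = {"alpha_h": 0.6, "alpha_m": 0.4}                  # CH vs. CM joint-reward weights
ch = {"loc": 0.6, "cov": 0.2, "dir": 0.2}                 # CH reward components
cm = {"rss_gain": 0.5, "dist_to_ch": 0.2, "aoa": 0.3}     # CM reward components

for name, group in [("joint", joint), ("CH", ch), ("CM", cm)]:
    total = sum(group.values())
    assert all(w >= 0 for w in group.values()), f"{name} has a negative weight"
    assert abs(total - 1.0) < 1e-9, f"{name} weights sum to {total}, not 1"
```

Keeping each group normalized means rescaling one component weight trades off directly against the others, without changing the overall reward magnitude.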
Table 3. Comparison of RMSE, localization success rate, and number of steps to first detection among the proposed algorithm and four baseline methods over 1000 episodes.
| Algorithm | RMSE Mean (m) | RMSE Median (m) | RMSE Max (m) | RMSE Min (m) | Localization Success Rate (%) | Steps to First Detection |
|---|---|---|---|---|---|---|
| HMUDRL | 39.34 | 22.83 | 678.11 | 0.89 | 96.1 | 6 |
| MADDPG [23] | 620.43 | 666.15 | 1125.30 | 12.06 | 93.7 | 5 |
| MASAC [24] | 395.88 | 297.76 | 1774.58 | 13.15 | 94.6 | 8 |
| COMA [25] | 135.02 | 30.07 | 890.99 | 0.36 | 96.3 | 6 |
| MAA2C [26] | 82.70 | 28.89 | 992.92 | 0.76 | 94.6 | 8 |
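For reference, the RMSE reported in Table 3 is the root of the mean squared distance between estimated and true source positions. A minimal sketch (assuming 2-D positions in metres; the function name is ours, not from the paper):

```python
import math

def rmse(estimates, truths):
    """Root-mean-square localization error over paired (x, y) positions."""
    sq_dists = [(ex - tx) ** 2 + (ey - ty) ** 2
                for (ex, ey), (tx, ty) in zip(estimates, truths)]
    return math.sqrt(sum(sq_dists) / len(sq_dists))

# A single estimate 10 m off along x gives an RMSE of exactly 10 m.
print(rmse([(10.0, 0.0)], [(0.0, 0.0)]))  # → 10.0
```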
Table 4. Average number of internal information exchanges per episode and corresponding communication savings ratio under different network structures.
| Cluster Configuration | Heterogeneous (CH + CM) (times/episode) | Homogeneous, k-NN, k = 5 (times/episode) | Homogeneous, Fully Connected (times/episode) | Savings vs. k-NN, k = 5 (%) | Savings vs. Fully Connected (%) |
|---|---|---|---|---|---|
| 1 CH + 8 CM | 900 | 4500 | 7200 | 80.0 | 87.5 |
| 3 CH + 14 CM | 1500 | 8500 | 27,200 | 82.4 | 94.5 |
| 5 CH + 20 CM | 2100 | 12,500 | 60,000 | 83.2 | 96.5 |
| 7 CH + 30 CM | 3100 | 18,500 | 133,200 | 83.2 | 97.7 |
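The savings ratios in Table 4 follow directly from the exchange counts as 1 − (heterogeneous exchanges) / (homogeneous exchanges). A quick reproduction using the table's figures:

```python
# Reproduce Table 4's communication-savings ratios from the raw
# per-episode exchange counts: savings = 1 - heterogeneous / homogeneous.
rows = [
    # (heterogeneous CH+CM, homogeneous k-NN k=5, homogeneous fully connected)
    (900, 4500, 7200),
    (1500, 8500, 27200),
    (2100, 12500, 60000),
    (3100, 18500, 133200),
]
for hetero, knn, full in rows:
    vs_knn = 100 * (1 - hetero / knn)
    vs_full = 100 * (1 - hetero / full)
    print(f"{vs_knn:.1f}% / {vs_full:.1f}%")
# → 80.0% / 87.5%
# → 82.4% / 94.5%
# → 83.2% / 96.5%
# → 83.2% / 97.7%
```

Note that the per-episode cost of the heterogeneous structure grows roughly linearly with swarm size, while the fully connected baseline grows quadratically, which is why the savings ratio widens with larger swarms.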
Table 5. Performance evaluation of HMUDRL and ablated variants in terms of median error, 90% error, and mean absolute error (MAE).
| Method | Median Error (m) | 90% Error (m) | MAE (m) |
|---|---|---|---|
| HMUDRL | 22.83 | 66.86 | 43.54 |
| HMUDRL w/o HC | 32.11 | 97.54 | 67.95 |
| HMUDRL w/o DC | 22.75 | 66.96 | 50.12 |
| HMUDRL w/o AOA | 574.91 | 833.92 | 559.53 |

Share and Cite

MDPI and ACS Style

Sun, Y.; Zhang, X.; Wang, M.; Yang, Y.; Xia, T.; Zhu, X.; Cui, T. Hierarchical Role-Based Multi-Agent Reinforcement Learning for UHF Radiation Source Localization with Heterogeneous UAV Swarms. Drones 2026, 10, 54. https://doi.org/10.3390/drones10010054

