1. Introduction
With the rapid advancement of unmanned aerial vehicle (UAV) technology and artificial intelligence, UAV swarms are evolving from simple formations characterized by “homogeneity and single-task capabilities” to intelligent systems featuring “heterogeneity and multifunctionality.” Within next-generation non-terrestrial networks (NTNs) and 6G application scenarios, diverse UAVs—including communication relays, infrared monitors, navigation units, spectrum scouts, and radar detectors—are organically integrated into a cluster system characterized by unified “communication, sensing, remote control, and guidance” capabilities. Such systems not only execute dedicated private missions (e.g., infrared UAVs performing target identification and navigation nodes conducting positioning calibration) but also collectively undertake shared system tasks (e.g., large-scale situational awareness, joint reasoning, and emergency communication support) through information sharing and resource coordination, significantly enhancing overall effectiveness [1,2,3,4]. A growing body of research and applications indicates that heterogeneous UAV swarms are emerging as a critical trend in unmanned system development and a vital component of future integrated air–ground–space networks [5,6].
Notably, with the emergence of large-scale artificial intelligence models, UAV clusters are gaining unprecedented capabilities. These models enable heterogeneous UAV nodes to perform more complex collaborative tasks—such as cross-modal target recognition, intelligent navigation, emergency communications, and scenario simulation—driven by multi-source sensor data. This is achieved through unified semantic representation, cross-modal information fusion, and powerful reasoning capabilities. This trend is propelling UAV swarms toward “large-modelization”: some nodes focus on private tasks, while others collectively support public tasks through knowledge sharing and unified reasoning powered by large models. However, the introduction of foundation models also exposes a significant algorithmic gap between the computational demands of large-model inference and the limited onboard resources available in UAV clusters. Bridging this gap requires a new generation of cooperative optimization algorithms that can maintain decision quality while drastically reducing computational and communication costs.
Large-model inference typically demands substantial computational power and high-throughput communication bandwidth. In practice, executing a single inference of a foundation model often requires billions of floating-point operations and extensive memory access, which already pose challenges even for edge servers [7]. In UAV swarm systems, however, individual aerial nodes are generally equipped with lightweight processors and limited battery capacity, making it difficult to sustain the intensive workload of on-board inference. At the same time, wireless links among UAVs usually offer constrained spectrum resources and are vulnerable to channel fading and interference, resulting in insufficient throughput for the massive parameter exchanges required by large-model inference. These constraints not only hinder the direct deployment of large AI models on UAV nodes, but also amplify the energy–latency trade-off: increasing computational frequency accelerates energy depletion, while expanding transmission bandwidth intensifies spectrum contention and inter-node interference. Overall, reconciling the inherent resource scarcity of UAV swarms with the heavy demands of large-model inference is one of the most critical challenges in practical UAV intelligence. Therefore, addressing the scalability bottleneck between large-model reasoning and swarm-level coordination has become an open research problem that current UAV optimization frameworks cannot adequately solve.
In recent years, the application of multi-agent deep reinforcement learning (MADRL) in drone swarms has gained momentum, offering novel solutions to the aforementioned challenges. Existing studies demonstrate that MADRL outperforms traditional optimization methods in typical scenarios such as integrated air–ground networks, vehicle-to-everything edge computing, drone-assisted internet of things (IoT) systems, and drone edge resource management [8,9,10]. Reinforcement learning frameworks enable the concurrent enhancement of task offloading, resource scheduling, and energy efficiency optimization in complex dynamic environments, demonstrating superior adaptability and scalability compared to traditional approaches [11,12,13]. Concurrently, advancements in core algorithms such as centralized training–distributed execution (CTDE), counterfactual policy gradient, and soft actor–critic have significantly improved training stability and decision accuracy [14,15,16]. Nevertheless, most existing MADRL frameworks are designed under full-model assumptions and overlook the constraints of computation, bandwidth, and latency that dominate large-model-driven UAV clusters.
More recent research has further integrated emerging technologies such as reconfigurable intelligent surfaces (RISs), 3D trajectory planning, and energy-efficient scheduling, continuously advancing the evolution of UAV cooperative optimization [17,18,19]. Against this backdrop, MADRL and large AI models form a natural complementarity: the former provides dynamic scheduling and online optimization capabilities for UAV swarms, while the latter plays a central role in semantic understanding and cross-modal reasoning. Their integration holds promise as a key direction for the intelligent evolution of heterogeneous UAV systems.
Nevertheless, existing approaches such as weighted multi-agent deep deterministic policy gradient (WMADDPG) and multi-agent soft actor–critic (MASAC) still suffer from high computational complexity, slow convergence, and significant deployment overhead, particularly when the number of devices increases or node heterogeneity intensifies [20]. These limitations constrain their application in resource-constrained UAV clusters. This motivates the present work: we propose a low-complexity multi-agent optimization method based on the MASAC framework that preserves MASAC’s training stability while achieving near-constant parameter scalability. Through parameter sharing, lightweight network architecture design, and a resource proportional normalization mechanism, it significantly reduces the computational burden of training and deployment while maintaining performance. In summary, the contributions of this work are threefold: (1) a scalable MASAC architecture integrating parameter sharing and twin-critic regularization; (2) a lightweight design that achieves over 14× parameter compression; and (3) a set of adaptive training mechanisms ensuring stability and rapid convergence in heterogeneous UAV swarms. The method also naturally accommodates a dual-layer structure of “private tasks–public tasks,” providing scalable algorithmic support for the future evolution of heterogeneous UAV swarms toward “large-modelization.”
As summarized in Table 1, prior MADRL-based approaches—such as MADDPG, WMADDPG, and MASAC—have progressively improved UAV coordination through CTDE mechanisms. However, they still face scalability and efficiency challenges when extended to large-scale heterogeneous UAV clusters. In contrast, our proposed low-complexity MASAC integrates parameter sharing, twin-critic regularization, and adaptive entropy scheduling to achieve near-constant parameter scalability while maintaining training stability. This design enables efficient deployment in large-model-driven UAV systems, bridging the gap between algorithmic feasibility and real-world applicability.
5. Complexity Reduction Methods
To intuitively present the structural adjustments made in this section, the overall algorithm framework with complexity reduction mechanisms is illustrated in
Figure 3. The figure highlights how parameter sharing, structural lightweighting, and training stabilization are integrated into the MASAC framework. Specifically, the left-hand side depicts the training loop with entropy scheduling, the replay buffer, and optimization; the middle block illustrates the shared actor enhanced by device embeddings and projection; and the right-hand side shows the centralized critic with twin Q-heads and the min aggregator. This visual overview provides context for the following sections, which detail each component and its role in reducing computational and memory overhead while maintaining training robustness.
This section systematically outlines three approaches—parameter sharing, structural lightweighting, and training robustness—to reduce computational complexity without altering problem assumptions or interfaces. These adaptations address the real-time and resource-constrained demands of UAV swarms. The core idea is to achieve maximum parameter and computational savings with minimal structural modifications within the CTDE framework. A set of training-side mechanisms aligned with optimization objectives ensures that policy performance and convergence stability are maintained or even enhanced under low-cost conditions.
5.1. Parameter Sharing at the Model Level
5.1.1. Device-Conditioned Shared Actor (DCSA)
In traditional multi-agent reinforcement learning, each agent typically maintains an independent policy network. This approach leads to linear growth in parameter size and computational overhead in large-scale UAV scenarios. To address this, we employ parameter sharing by retaining only one global actor network. By introducing device identity embedding vectors, different UAVs can generate differentiated actions within the same network. Specifically, for the $i$th UAV, its input is concatenated from the local observation $o_i$ and the device embedding $e_i$:
$$x_i = [\,o_i \,\|\, e_i\,].$$
Here, $o_i$ represents local features such as channel gain and computational power, while $e_i$ denotes a low-dimensional, learnable representation of the UAV node identity. The shared actor network $\pi_\theta$ processes the input $x_i$ to output continuous actions:
$$a_i = (\alpha_i, \beta_i) = \pi_\theta(x_i),$$
where $\alpha_i$ denotes the public task allocation ratio and $\beta_i$ represents the bandwidth allocation ratio.
Through this design, the parameter scale of the actor network no longer increases linearly with the number of heterogeneous UAVs I, but remains at a constant level. Simultaneously, the device embedding vector ensures that different drones can still learn differentiated strategies under the shared architecture, thereby balancing model lightweighting and strategy personalization [21].
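A minimal PyTorch sketch of the device-conditioned shared actor described above; the observation dimension, embedding size, hidden width, and sigmoid output squashing are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

class SharedActor(nn.Module):
    def __init__(self, num_uavs: int, obs_dim: int, embed_dim: int = 8, width: int = 128):
        super().__init__()
        # One learnable identity vector per UAV; this is the only component
        # whose size grows with the swarm size I.
        self.device_embed = nn.Embedding(num_uavs, embed_dim)
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim + embed_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        # Two continuous outputs per UAV: task allocation ratio and bandwidth ratio.
        self.head = nn.Linear(width, 2)

    def forward(self, obs: torch.Tensor, uav_ids: torch.Tensor) -> torch.Tensor:
        # obs: (batch, obs_dim), uav_ids: (batch,) integer device identities.
        x = torch.cat([obs, self.device_embed(uav_ids)], dim=-1)
        # Sigmoid keeps both ratios in [0, 1]; the environment later projects
        # the joint action onto the feasible set (e.g., the bandwidth budget).
        return torch.sigmoid(self.head(self.backbone(x)))

# Usage: a single network serves all I UAVs.
actor = SharedActor(num_uavs=30, obs_dim=6)
obs = torch.randn(30, 6)      # one observation per UAV (illustrative features)
ids = torch.arange(30)        # device identities 0..29
actions = actor(obs, ids)     # (30, 2) -> (task ratio, bandwidth ratio)
```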
5.1.2. Centralized Twin Critic (Shared Backbone + Dual Head)
In multi-agent reinforcement learning, value function estimation often suffers from overestimation bias, leading to unstable policy updates. To address this issue, this paper introduces a twin-critic architecture based on centralized critics. This approach employs a shared backbone network for global feature extraction, which then branches into two independent Q-value heads.
Specifically, the centralized critic receives the global state $s$ and the joint action $a = (a_1, \dots, a_I)$. It first obtains representations through the shared feature extractor $\phi$:
$$h = \phi(s, a).$$
Subsequently, two independent Q-heads provide estimates:
$$Q_1(s, a) = f_1(h), \qquad Q_2(s, a) = f_2(h).$$
During policy updates, a minimum value decision mechanism is employed:
$$Q(s, a) = \min\{Q_1(s, a),\, Q_2(s, a)\},$$
effectively mitigating value function overestimation.
This “shared-backbone + dual-head” design avoids the redundant overhead of multiple independent critics while maintaining estimation robustness. This approach enhances training stability while ensuring model compactness.
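A minimal PyTorch sketch of the centralized twin critic with a shared backbone and two linear Q-heads; all dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TwinCritic(nn.Module):
    def __init__(self, global_state_dim: int, joint_action_dim: int, width: int = 256):
        super().__init__()
        # Shared backbone extracts one global feature used by both heads.
        self.backbone = nn.Sequential(
            nn.Linear(global_state_dim + joint_action_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        # Two lightweight linear Q-heads branch off the shared features.
        self.q1 = nn.Linear(width, 1)
        self.q2 = nn.Linear(width, 1)

    def forward(self, state: torch.Tensor, joint_action: torch.Tensor):
        h = self.backbone(torch.cat([state, joint_action], dim=-1))
        return self.q1(h), self.q2(h)

# During policy/target updates, the minimum of the two heads is used,
# which mitigates Q-value overestimation.
critic = TwinCritic(global_state_dim=180, joint_action_dim=60)
s = torch.randn(64, 180)   # batch of global states (e.g., 30 UAVs x 6 features)
a = torch.rand(64, 60)     # batch of joint actions (30 UAVs x 2 ratios)
q1, q2 = critic(s, a)
q_min = torch.min(q1, q2)  # conservative value estimate for the update
```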
5.1.3. Qualitative Comparison of Parameters and Computational Complexity
The parameter scale of this paper converges from O(I) networks to “a single shared actor + a single shared-backbone critic + two linear Q-heads.” The drone identity embedding introduces only a minimal increment proportional to I × embedding dimension. Forward computation similarly converges from multiple parallel branches to a single reused backbone. The primary increase with drone count stems from input concatenation and lightweight loops required to generate actions for each drone, with an overall growth rate significantly lower than that of independent network schemes. This organization proves particularly effective in scenarios with I = 30 devices, keeping both inference latency and memory overhead within manageable ranges.
5.2. Structural Lightweighting
5.2.1. Fixed-Width MLP Backbone
For the feature extractors in both actor and critic backbone networks, this paper employs fixed-width Multi-Layer Perceptrons (MLPs). Unlike variable-width or low-rank approximation designs, this approach maintains structural simplicity and avoids additional implementation complexity. Between the input and output layers, the network width is fixed at the preset value $W$, with each layer structured as
$$h^{(l+1)} = \sigma\!\left(\mathbf{W}^{(l)} h^{(l)} + \mathbf{b}^{(l)}\right),$$
where $h^{(l)}$ denotes the input to layer $l$, and $\mathbf{W}^{(l)}$ and $\mathbf{b}^{(l)}$ represent the weight matrix and bias of the linear layer, respectively.
This fixed-width MLP backbone maintains sufficient expressive power while avoiding network size inflation as the number of drones increases. Combined with parameter sharing strategies, its overall complexity remains within acceptable limits, providing a stable foundation for subsequent normalization and residual mechanisms.
5.2.2. Normalization and Residual Mechanisms
To further enhance training stability, this paper introduces normalization and residual structures in key layers of the MLP backbone. First, Layer Normalization standardizes the output of linear layers:
$$u^{(l)} = \mathrm{LayerNorm}\!\left(\mathbf{W}^{(l)} h^{(l)} + \mathbf{b}^{(l)}\right).$$
This is then directly added to the input vector to form a residual connection, followed by nonlinear activation:
$$h^{(l+1)} = \sigma\!\left(h^{(l)} + u^{(l)}\right).$$
This architecture effectively mitigates gradient vanishing or exploding issues in moderately deep networks, ensuring numerical stability across drones of varying scales while sharing parameters. Experiments demonstrate that this mechanism significantly reduces training jitter and accelerates convergence, serving as a crucial auxiliary means for complexity optimization.
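A minimal PyTorch sketch combining the fixed-width backbone of Section 5.2.1 with the Layer Normalization and residual connection described here; the width and activation are illustrative choices.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, width: int = 128):
        super().__init__()
        self.linear = nn.Linear(width, width)  # fixed width: input and output dims match
        self.norm = nn.LayerNorm(width)        # standardizes the linear layer's output
        self.act = nn.ReLU()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Normalize the linear output, add the input back (residual), then activate.
        return self.act(h + self.norm(self.linear(h)))

# Stacking such blocks gives a backbone whose size is independent of the number
# of UAVs, since the width never changes with I.
backbone = nn.Sequential(nn.Linear(14, 128), ResidualBlock(), ResidualBlock())
out = backbone(torch.randn(30, 14))
```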
5.3. Training Stability and Efficiency
5.3.1. Experience Replay and Target Network
In reinforcement learning, training often encounters two common issues: first, correlated sampling data leads to unstable training; second, value function updates are prone to oscillation. This paper employs two mechanisms—replay buffer and target network—to address these challenges.
The concept behind experience replay is straightforward: trajectories generated by the agent during interactions are stored in an “experience pool.” During training, a batch of data is randomly sampled from this pool:
$$\{(s_j, a_j, r_j, s'_j)\}_{j=1}^{B} \sim \mathcal{D}.$$
This approach breaks the temporal correlation between samples, stabilizing training while improving data utilization.
The target network prevents overly aggressive Q-value updates. When updating the critic, a delayed target network calculates the temporal difference objective:
$$y = r + \gamma\, Q_{\bar{\theta}}(s', a'),$$
where $Q_{\bar{\theta}}$ represents the target network’s estimate and $\gamma$ denotes the discount factor. Since the target network’s parameters update more slowly, it acts as a “smoother” during training, preventing significant oscillations in the value function.
In summary, experience replay resolves data correlation issues, while the target network reduces volatility in value function updates. Together, they make the training process more robust and reliable in large-scale unmanned scenarios.
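A minimal sketch of the two mechanisms, assuming PyTorch; the buffer capacity, batch size, and soft-update rate tau are illustrative, and whether target updates are soft or periodic is an assumption, since the text only states that the target parameters change slowly.

```python
import random
from collections import deque
import torch

class ReplayBuffer:
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        # Transitions are assumed to be stored as tensors of fixed shape.
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int = 256):
        # Random sampling breaks the temporal correlation between transitions.
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s2 = (torch.stack(x) for x in zip(*batch))
        return s, a, r, s2

def soft_update(target: torch.nn.Module, source: torch.nn.Module, tau: float = 0.005):
    # The target network tracks the online network slowly, smoothing the TD target
    # y = r + gamma * Q_target(s', a') and damping value-function oscillations.
    with torch.no_grad():
        for p_t, p in zip(target.parameters(), source.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)

# Usage: the target critic starts as a deep copy of the online critic and is
# softly updated after every gradient step, e.g. soft_update(target_critic, critic).
```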
5.3.2. Entropy Coefficient Scheduling Mechanism
A core challenge in policy optimization is balancing exploration and exploitation: insufficient exploration traps the policy in local optima (e.g., uniform distribution), while excessive exploration hinders training convergence. To address this, this paper introduces an entropy regularization term within the MASAC framework and employs piecewise entropy coefficient scheduling to dynamically adjust exploration intensity.
In the standard SAC algorithm, the policy objective function is defined as follows:
$$J(\pi) = \mathbb{E}_{(s_t, a_t) \sim \pi}\!\left[\sum_{t} r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\right],$$
where $\alpha$ represents the entropy coefficient controlling the exploration weight. A larger coefficient encourages a more uniform action distribution, while a smaller coefficient prioritizes pursuing high Q-values.
This model does not fix $\alpha$ but employs a piecewise scheduling approach: during the early training phase, a large $\alpha$ is set to ensure sufficient exploration; in the middle and late phases, $\alpha$ is gradually reduced, allowing the policy to leverage learned experience more effectively. This approach is analogous to equipping exploration with a “gearbox”: more experimentation early on, more utilization later, and avoiding getting stuck in the middle.
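A minimal sketch of a piecewise "decline-then-increase" schedule; the breakpoints 0.20 → 0.05 → 0.25 follow the A0 configuration reported in Section 6.4.1, while the linear interpolation and the 50% phase split are assumptions.

```python
def entropy_coefficient(step: int, total_steps: int,
                        alpha_start: float = 0.20,
                        alpha_mid: float = 0.05,
                        alpha_end: float = 0.25) -> float:
    """Return the entropy coefficient alpha for the current training step."""
    frac = step / max(total_steps, 1)
    if frac < 0.5:
        # Early phase: decay from alpha_start to alpha_mid to taper exploration.
        return alpha_start + (alpha_mid - alpha_start) * (frac / 0.5)
    # Late phase: raise alpha again to help escape suboptimal solutions.
    return alpha_mid + (alpha_end - alpha_mid) * ((frac - 0.5) / 0.5)

# Example: alpha decreases over the first half of training, then increases.
alphas = [entropy_coefficient(t, 5000) for t in (0, 2500, 5000)]
# -> approximately [0.20, 0.05, 0.25]
```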
5.3.3. Reward Normalization and Clipping
In large-scale drone optimization problems, the magnitude difference between the delay component and energy consumption component of reward values is often substantial. Feeding these raw rewards directly to the algorithm can easily lead to numerical instability during training, such as gradient explosion or convergence oscillations.
To address this, this paper introduces a reward standardization and clipping mechanism during training.
First, rewards are standardized to maintain a relatively stable scale:
$$\tilde{r} = \frac{r - \mu_r}{\sigma_r},$$
where $\mu_r$ and $\sigma_r$ denote the mean and standard deviation of the reward, respectively. This step smooths the reward distribution, preventing rewards of vastly different magnitudes from “canceling each other out.”
Second, to prevent extreme values from disrupting training, the standardized rewards are further passed through a Softplus-based soft clipping function. This ensures that even extremely large reward values are gently “compressed,” so they cannot destabilize the training process.
In summary, reward normalization stabilizes the training process, while reward clipping mitigates extreme cases. Together, these techniques enable the algorithm to converge faster and more smoothly in complex unmanned aerial scenarios.
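A minimal sketch of the two steps, assuming PyTorch; the running-statistics update and the exact Softplus-based soft-clip form (and its bound) are assumptions, since the text specifies only standardization by the reward mean and standard deviation followed by Softplus compression of extreme values.

```python
import torch
import torch.nn.functional as F

def soft_clip(x: torch.Tensor, bound: float = 5.0) -> torch.Tensor:
    # Smooth two-sided clip to roughly [-bound, +bound]: extreme values are
    # compressed gently by Softplus instead of being cut off sharply.
    upper = bound - F.softplus(bound - x)       # soft upper bound at +bound
    return -bound + F.softplus(upper + bound)   # soft lower bound at -bound

class RewardNormalizer:
    """Keeps running estimates of the reward mean/std and rescales each batch."""

    def __init__(self, momentum: float = 0.01, eps: float = 1e-6):
        self.mean, self.var = 0.0, 1.0
        self.momentum, self.eps = momentum, eps

    def __call__(self, r: torch.Tensor) -> torch.Tensor:
        self.mean += self.momentum * (r.mean().item() - self.mean)
        self.var += self.momentum * (r.var(unbiased=False).item() - self.var)
        r_norm = (r - self.mean) / (self.var + self.eps) ** 0.5   # (r - mu) / sigma
        return soft_clip(r_norm)

# Usage on a batch of raw rewards with very different magnitudes.
normalizer = RewardNormalizer()
rewards = torch.tensor([-0.008, -0.5, -120.0, 0.002])
print(normalizer(rewards))
```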
5.4. Complexity Analysis
To rigorously evaluate the complexity characteristics of the proposed “shared policy + shared value backbone + lightweight embedding” framework, we analyze both parameter scale and computational cost as functions of the swarm size I.
For the actor network, since parameters are shared across all UAVs, the majority of the model parameters remain constant regardless of $I$. The only scaling component originates from the device identity embedding, whose size grows linearly with the swarm size $I$ and the embedding dimension $d_e$. Therefore, the parameter complexity of the actor network is as follows:
$$P_{\text{actor}} = O(1) + \Theta(I \cdot d_e).$$
Here, $O(\cdot)$ refers to the asymptotic upper bound in algorithmic complexity analysis, while $\Theta(\cdot)$ denotes the asymptotically tight bound.
For the critic network, the centralized twin-critic architecture with a shared backbone ensures that its main parameter scale also remains constant. The only additional growth arises from concatenating $I$ device observations of dimension $d_o$, yielding
$$P_{\text{critic}} = O(1) + \Theta(I \cdot d_o).$$
Accordingly, the overall parameter complexity can be expressed as follows:
$$P_{\text{total}} = C + \Theta\big(I\,(d_e + d_o)\big),$$
where $C$ is the dominant constant term contributed by the shared actor and critic backbones. Empirical parameter counts confirm this formulation, decomposing into a dominant constant plus a small per-UAV increment. This indicates that the parameter overhead is primarily constant, with only a negligible linear increment per additional UAV, which is far more efficient than the traditional “one network per UAV” approach, whose parameter count grows linearly in $I$ with large coefficients.
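A small standalone sketch that makes the scaling argument concrete by counting parameters for a few swarm sizes; the dimensions (observation dim 6, embedding dim 8, action dim 2, widths 128/256) and the two-layer backbone layout are illustrative assumptions, not the paper's configuration.

```python
def shared_actor_params(I: int, d_o: int = 6, d_e: int = 8, W: int = 128) -> int:
    backbone = (d_o + d_e) * W + W + W * W + W   # two fixed-width linear layers
    head = W * 2 + 2                             # two continuous outputs per UAV
    embedding = I * d_e                          # the only I-dependent term
    return backbone + head + embedding

def twin_critic_params(I: int, d_o: int = 6, d_a: int = 2, W: int = 256) -> int:
    input_layer = I * (d_o + d_a) * W + W        # grows linearly with I (concatenated input)
    hidden = W * W + W                           # shared backbone, independent of I
    heads = 2 * (W + 1)                          # two linear Q-heads
    return input_layer + hidden + heads

for I in (5, 15, 30):
    total = shared_actor_params(I) + twin_critic_params(I)
    print(f"I = {I:2d}: {total} parameters (constant backbone + small linear term)")
```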
6. Experiments and Comparative Analysis
6.1. Experimental Setup and Platform
All experiments in this study were conducted within a high-fidelity heterogeneous UAV swarm simulation platform developed by our research team. At the current stage, physical UAV flight experiments are not yet feasible due to the lack of an available real-world testbed. Nevertheless, the simulation environment is constructed using real UAV system parameters—including transmission power limits, bandwidth constraints, channel fading, and computational capacity—to closely reflect realistic operational conditions. This design ensures that the simulation results provide meaningful insights into practical UAV swarm deployment scenarios.
The objective of the experiments is to minimize the system utility function U while jointly optimizing the allocation ratio of shared tasks and the downlink bandwidth.
6.1.1. Environment and Task Parameters
The experiment employs a CTDE single-step simulation environment with a fixed number of heterogeneous UAV nodes set to I = 30. Each round, the environment generates observations for each UAV node based on the current round’s task load, UAV node computational power, and wireless channel conditions. Actions consist of two-dimensional continuous decisions: “task allocation ratio” and “bandwidth allocation ratio,” with values constrained within [0, 1]. The channel model accounts for shadow fading and is constrained by each UAV node’s bandwidth and downlink power. Before executing actions, the environment projects them onto the feasible domain and normalizes them to satisfy the overall system bandwidth constraint. Rounds are single-step to avoid interference from cumulative errors across steps. A global uniform reward signal is used, incorporating task allocation variance regularization to encourage load balancing. The evaluation phase uniformly employs deterministic execution (removing exploration noise). Specific environment and task parameters are detailed in Table 4.
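A minimal NumPy sketch of the feasibility step described above; interpreting "project and normalize" as clipping each ratio to [0, 1] and rescaling the bandwidth ratios when their sum exceeds the system budget is an assumption about the exact projection used.

```python
import numpy as np

def project_actions(actions: np.ndarray) -> np.ndarray:
    """actions: (I, 2) array of (task ratio, bandwidth ratio) per UAV."""
    a = np.clip(actions, 0.0, 1.0)      # keep both ratios within [0, 1]
    bw_sum = a[:, 1].sum()
    if bw_sum > 1.0:                    # enforce the total system bandwidth budget
        a[:, 1] /= bw_sum
    return a

# Usage: raw policy outputs for I = 30 UAVs are mapped to a feasible joint action.
raw = np.random.rand(30, 2) * 1.2 - 0.1     # may fall slightly outside [0, 1]
feasible = project_actions(raw)
assert feasible[:, 1].sum() <= 1.0 + 1e-9
```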
6.1.2. Model and Algorithm Configuration
Beyond structural differences and corresponding training mechanisms, the original and optimized models share identical training hyperparameters and evaluation metrics. To align with current implementations,
Table 5 lists key structural and training/evaluation parameters used for the optimized model in this section; the unoptimized model is annotated only where parameters differ.
6.1.3. Hardware and Software Platforms
Hardware and software platforms are as shown in
Table 6.
6.2. Performance Guarantees for Optimized Models at Small Scale (I = 5)
Under large-scale heterogeneous UAV clusters, the analytical optimal solution for the utility function U is difficult to obtain directly. To establish a theoretical baseline, we derived an analytical solution for the problem under the small-scale condition I = 5. This solution serves as a performance upper bound for comparison with the results of the optimization model.
Based on the modeling in
Section 3, under the configuration I = 5, the problem can be decomposed into the following three subproblems:
- (1)
Optimal Task Allocation Structure
From the KKT conditions, the active UAV nodes in the optimal solution satisfy a common stationarity condition over the active set $\mathcal{A}$, while inactive nodes adopt the boundary solution.
- (2)
Downlink Bandwidth Allocation
Under the optimal allocation, downlink transmission delays are equal across the active set $\mathcal{A}$, which yields a closed-form bandwidth allocation.
- (3)
Uplink Bandwidth Allocation
Applying the KKT conditions, the optimal uplink allocation satisfies a stationarity condition involving the Lagrange multiplier $\lambda$. The problem ultimately reduces to a one-dimensional equation in $\lambda$, whose root yields the unique optimal uplink bandwidth allocation; a generic solution sketch follows this list.
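A generic, hypothetical sketch of how such a one-dimensional multiplier equation can be solved numerically by bisection; g is a placeholder for the actual KKT equation (omitted above) and is assumed to be monotone in the multiplier, so a sign change brackets the unique root.

```python
def bisect(g, lo: float, hi: float, tol: float = 1e-10) -> float:
    """Find lam with g(lam) = 0, assuming g(lo) and g(hi) have opposite signs."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(lo) * g(mid) <= 0.0:   # root lies in the lower half-interval
            hi = mid
        else:                       # root lies in the upper half-interval
            lo = mid
    return 0.5 * (lo + hi)

# Example with a stand-in monotone function (not the paper's actual equation).
lam_star = bisect(lambda lam: lam**2 - 2.0, 0.0, 2.0)   # ~= sqrt(2)
```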
Combining the three components yields the theoretical optimal value of U for I = 5. Experimental comparisons under identical conditions, summarized in Table 7, show that the optimized model closely approaches this value. When I = 5, the uniformly distributed resource allocation yields U = 0.005764, and the optimized model attains a result close to the theoretical optimum. This demonstrates that the proposed lightweight optimization framework can approximate theoretical limits in small-scale UAV clusters, providing robust support for subsequent complexity analysis and performance validation in large-scale scenarios.
6.3. Training Efficiency Comparison Across Different Architectures
To evaluate the impact of optimization strategies on training efficiency and convergence performance, this study conducted comparative experiments between the unoptimized original MASAC architecture and the optimized MASAC architecture. Experiments were run under identical environmental parameters, task configurations, and random seed conditions to ensure reproducibility and fairness of results.
As shown in the training results in
Table 8 and
Figure 4, the unoptimized model exhibits significant fluctuations during training, with noticeable performance regression in the mid-to-late stages. It ultimately fails to converge within 5000 training steps, achieving its best result only during a brief convergence phase early in training, demonstrating unstable performance. In contrast, the optimized model converged faster and more smoothly near the optimal value of U = 0.008210, outperforming the unoptimized version. Furthermore, the training time for the optimized version was significantly reduced—from 4839.33 s for the original structure to 332.92 s—demonstrating that the optimized model substantially lowers computational complexity while maintaining performance.
To compare the optimization effectiveness of both models within the same timeframe, we plotted a comparison of
U-convergence curves over identical durations, as shown in
Figure 5. On the time axis scaled to the maximum total training duration of the optimized model, the unoptimized curve exhibits no significant exploration transition within the initial 0–500 s, indicating that exploration has not yet commenced. In contrast, the optimized curve achieves primary convergence and enters a low-fluctuation steady state within the same timeframe. This demonstrates that under identical time budgets, the optimized architecture achieves superior performance more rapidly.
The above comparison demonstrates that the optimized scheme offers significant advantages in training time without compromising convergence performance. Furthermore, its convergence process is more stable and smoother, indicating a substantial reduction in computational complexity for this problem.
6.4. Ablation Studies
All ablation experiments in this section compare with the optimized model described above, aiming to demonstrate the effectiveness of reducing computational complexity while improving performance. Specific ablation experiment configurations are shown in
Table 9.
Models A1–A3 retain all other configurations consistent with A0 except for the modifications listed above. Each model was independently trained three times with different random seeds to strengthen the statistical reliability of the comparisons in this section.
6.4.1. Entropy Temperature Strategy Ablation (A1)
The objective of this subsection’s experiments is to compare the effects of two entropy temperature designs on training stability and convergence efficiency. The A0 model’s entropy coefficient $\alpha$ was manually scheduled with a decay-then-increase pattern (first decreasing from 0.20 to 0.05 and then increasing to 0.25). To validate the benefit of this strategy, the A1 model’s $\alpha$ was set to a constant 0.05. The convergence quality is measured in
Figure 6. The convergence speed and time efficiency are measured in
Figure 7.
From Figure 6, we compare the median U values of models A0 and A1 during the final portion of training iterations across three scales (I = 5, 15, 30). Results show the following: At a small scale I = 5, A1 exhibits significantly poorer convergence quality than A0, with a higher median and markedly increased variance. At medium–large scales I = 15 and I = 30, differences are not significant (labeled “n.s.”). However, A1’s median remains systematically higher than A0’s, indicating that a fixed entropy temperature easily leads to “under-exploration/premature convergence,” leaving a performance gap in the tail. Across all three scales, the manually adjusted “decline-then-increase” entropy temperature schedule provides sufficient exploration early on and boosts perturbations later to escape suboptimal solutions. This approach enhances stability at small scales while maintaining a final U value no worse than that of the constant strategy at larger scales.
Figure 7 shows that across all three scales (I = 5, 15, 30), A1 consistently achieves a higher AUC than A0, with the gap narrowing as scale increases. A smaller AUC indicates that the model reaches a lower U faster and more stably within the same timeframe; thus, A0 demonstrates overall superiority over A1 in temporal efficiency and convergence speed. Error bars reveal that A1 exhibits greater volatility, suggesting that a fixed entropy temperature fails to adapt well to training demands across different phases. In contrast, A0’s “decline-then-increase” scheduling achieves rapid convergence early on while maintaining moderate exploration later, resulting in a lower integral error within the same timeframe.
Summarizing this section’s experiments, altering only the entropy temperature strategy resulted in A1’s performance deteriorating compared to A0 in both temporal efficiency and convergence robustness. This confirms that the current entropy temperature adjustment strategy is essential for ensuring model performance.
6.4.2. Removing Device Identity Embeddings (A2)
This subsection removes the learnable embedding of discrete UAV device IDs to assess its impact on model performance.
Figure 8 measures the effect on convergence, while
Figure 9 evaluates model complexity.
As shown in
Figure 8, for simulations across three scales, the overall jitter amplitude of A0 is smaller than that of A2 during the mid-to-late stages, with this difference becoming more pronounced as I increases. The best U of A0 is typically slightly lower than that of A2, though the difference is not significant.
From Figure 9, it is evident that the A2 model converges to the optimal U significantly slower, and the gap with the A0 model widens as the device scale I increases: at the small scale I = 5, A2 and A0 are essentially consistent, indicating that the benefits of identity embedding are limited under weak heterogeneity and low interaction intensity; when the scale increases to I = 15, A2’s time cost rises noticeably compared to A0, and the gap widens further at I = 30. Concurrently, A2 exhibits higher variance at medium-to-large scales.
Overall, identity embedding primarily accelerates convergence to optimal solutions rather than improving final optimal values, with more pronounced effects in large-scale and highly heterogeneous scenarios.
6.4.3. Removing Twin Critic (A3)
This subsection replaces the twin critic with a single critic in A0 to evaluate the practical contribution of the “dual critic” in suppressing overestimation bias and enhancing convergence stability and efficiency.
Figure 10 shows that A3 maintains convergence patterns similar to those of A0, indicating that replacing the twin critic with a single critic does not fundamentally alter the model’s convergence behavior.
As shown in
Figure 11, when I = 5, A3 and A0 exhibit similar convergence times to the optimal U value, but A3 demonstrates significantly higher variance. As the scale increases to I = 15 and I = 30, A3’s average optimization time consistently exceeds that of A0, with variance further amplifying at I = 30, revealing distinct “slow-converging tails” in individual runs. Overall, the single-critic approach fails to deliver training efficiency advantages, instead introducing longer convergence times and higher uncertainty at medium-to-large scales. This demonstrates that the twin critic provides faster and more stable convergence characteristics at comparable computational complexity.
As shown in
Figure 12, across all three scales, A3’s box and mean triangle consistently exceed those of A0, indicating larger U values for A3 during late convergence. A3’s taller box reflects greater inter-seed variability. As the number of devices increases, A3’s mean and median rise faster than A0’s, suggesting that the single critic accumulates more pronounced bias and variance at scaled-up scales.
The above results demonstrate that the twin-critic network significantly enhances model convergence performance, improves stability, reduces computational complexity, and increases model efficiency.
6.5. Visualization Results
6.5.1. Utility Function Curve
To visually demonstrate the convergence characteristics of the optimized MASAC algorithm during training, we plot the utility function U as a function of training iterations.
We reiterate the convergence curve of the utility function for the optimized model in this section, as shown in
Figure 13. The curve exhibits rapid descent during the initial phase (approximately the first few hundred episodes), completing the transition from exploration to exploitation. Subsequently, it enters a prolonged, gradual cooling phase where U fluctuates minimally around a stable interval while steadily declining, demonstrating the training process’s stability and reproducibility. The figure marks the optimal utility value throughout the process: Best U = 0.005216. The tail section where this annotation resides reveals that the optimal point is very close to multiple local minima in the vicinity. This indicates that the final performance is not an isolated point “accidentally picked up,” but rather resides in a relatively flat trough region. This implies a certain robustness to small perturbations in parameters such as learning rate and entropy temperature. Overall, the optimized model exhibits a “fast-then-stable” convergence characteristic, with U ultimately entering a plateau phase characterized by low noise and minimal drift.
6.5.2. Scalability and Complexity Evaluation
To evaluate computational overhead across different device scales I, we measured both the per-iteration training time (end-to-end training cost) and the per-step forward latency during inference (policy inference cost only). Multiple independent measurements were conducted at each scale point, with the standard deviation plotted as error bars. Results are shown in
Figure 14.
From the training perspective (left figure), the “time per round” for both architectures increases approximately linearly with I, demonstrating good scalability. The optimized model’s curve lies slightly above that of the unoptimized model: at I = 5, it requires approximately 0.75 ms/ep compared to 0.65 ms/ep for the unoptimized model; at I = 30, it reaches 3.71 ms/ep versus 3.36 ms/ep. The gap primarily stems from constant-term overhead introduced by shared embeddings and dual critics, yet overall remains at a low millisecond-per-episode level. Short error bars indicate a stable and controllable training duration.
From the inference side (right figure), the per-step forward delay also increases linearly with I. At I = 5, the optimized version has a delay of approximately 0.46 ms/step, while that for the unoptimized version is 0.33 ms/step; at I = 30, these values are 2.73 ms/step and 2.23 ms/step, respectively. The latency gap gradually increases with scale but remains within the millisecond range, with minimal error fluctuations, meeting real-time decision requirements. While ensuring training and inference complexities scale linearly with size and remain controllable, the optimized model trades a slight computational overhead for faster convergence and a lower utility function U.
Overall, the optimized model maintains good scalability while incurring only a minimal constant-term overhead in training and inference complexity, achieving faster convergence and a superior utility function U.
7. Conclusions and Outlook
This work introduces several key innovations in multi-agent optimization for heterogeneous UAV swarms, summarized as follows:
Low-Complexity MASAC Framework: A low-complexity multi-agent soft actor–critic (MASAC) framework is developed under the centralized training and decentralized execution (CTDE) paradigm. It effectively mitigates the challenges of high computational complexity and long training time in UAV swarm task allocation and communication resource scheduling, achieving scalable and deployable multi-agent coordination.
Parameter Sharing and Structural Lightweighting: The proposed architecture integrates a shared actor network with device identity embeddings and a centralized twin critic featuring a shared backbone and dual Q-heads. This eliminates linear parameter growth with respect to the number of agents while maintaining strategy diversity. Combined with fixed-width MLP backbones, Layer Normalization, and residual connections, the design achieves over 14-fold parameter compression and significantly enhances training stability.
Training Stabilization and Performance Enhancement: Multiple mechanisms—piecewise entropy coefficient scheduling, reward normalization and clipping, experience replay, and delayed target networks—are incorporated to balance exploration and exploitation, suppress overestimation bias, and accelerate convergence. These mechanisms jointly achieve over a 90% reduction in training time while improving the delay–energy trade-off utility, demonstrating superior robustness and efficiency.
Building upon these innovations, this paper proposes a low-complexity multi-agent soft actor–critic method to address the challenges of high computational complexity, long training times, and deployment difficulties in heterogeneous unmanned aerial vehicle swarm task allocation and resource distribution. The approach integrates parameter sharing, lightweight network design, and a centralized training–decentralized execution framework to ensure scalability while preserving decision accuracy.
Simulation results demonstrate that the proposed method achieves more than 14-fold parameter compression and reduces training time by approximately 93%, while improving the system utility function U from 0.008431 to 0.008210. These results confirm that the method significantly lowers computational overhead and accelerates convergence without compromising optimization performance, effectively alleviating the bottlenecks of traditional multi-agent soft actor–critic in large-scale unmanned aerial vehicle clusters.
This approach shows promising applications across various unmanned aerial vehicle swarm tasks. In communication relay, it enables efficient bandwidth and power allocation for emergency links. In infrared monitoring and radar reconnaissance, it balances task partitioning to reduce single-node loads and improve real-time sensing. In navigation and spectrum monitoring, it ensures reliable operation under strict energy constraints.
Looking ahead, two directions are worth exploring: validating robustness under dynamic environments such as mobility, channel variations, and task fluctuations and integrating emerging paradigms like federated learning and meta-learning to enhance transferability and rapid adaptation. These extensions will further promote the practical deployment of low-complexity multi-agent soft actor–critic in future large-scale unmanned aerial vehicle swarm systems.
Although current experiments are conducted in high-fidelity simulations due to the lack of an available physical UAV testbed, the environment parameters are carefully designed to reflect real-world conditions. In future work, small-scale physical UAV flight experiments will be implemented to further validate the proposed algorithm’s practicality and deployment readiness.