Article

Design of Swarm Intelligence Control Based on Double-Layer Deep Reinforcement Learning

AVIC China Helicopter Design and Research Institute, Jingdezhen 333001, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(8), 4337; https://doi.org/10.3390/app15084337
Submission received: 3 March 2025 / Revised: 9 April 2025 / Accepted: 12 April 2025 / Published: 14 April 2025

Abstract

Traditional methods have limitations regarding efficient collaboration and dynamic response in complex dynamic environments. Although existing swarm intelligence control methods possess certain adaptive optimization capabilities, they still face challenges in individual and global collaborative optimization and adaptability. To address this challenge, a swarm intelligence control design method based on double-layer deep reinforcement learning (D-DRL) is proposed. This method uses a double-layer architecture in which the inner layer is responsible for dynamic decision-making and behavior optimization, and the outer layer manages resource allocation and strategy optimization. The dynamic interaction between the inner and outer layers, coupled with multi-level collaborative optimization, enhances the system’s adaptability and operating performance. The results of an unmanned aerial vehicle (UAV) swarm case study show that our method achieves effective convergence and outperforms existing swarm intelligence control approaches. Specifically, it simultaneously optimizes energy efficiency and task completion amount, delivering superior performance and significantly enhancing the comprehensive task effectiveness of the swarm.

1. Introduction

Relying on emerging technologies, such as big data, cloud computing, and artificial intelligence, swarm intelligence control has been widely applied in fields such as industrial production, smart cities, environmental monitoring, and unmanned transportation. Consequently, swarm intelligence control has become a crucial component of modern control systems [1]. The swarm intelligence control system comprises numerous heterogeneous agents with perception, learning, and decision-making capabilities. Through a multi-level collaborative optimization mechanism, it achieves autonomous decision-making at the individual level, collaborative interaction at the group level, and macro-control at the system level. As a result, it enhances both task completion efficiency and overall system performance [2]. Figure 1 illustrates the operational framework of swarm intelligence control within complex dynamic environments. Heterogeneous agents, equipped with perception, learning, and decision-making capabilities, collaborate across multiple levels through information sharing and coordinated optimization strategies. This enhances the system’s adaptability, scalability, and robustness, ensuring efficient operation in dynamic settings [3]. Heterogeneous multi-agents interact through wireless communication, information sharing, and coordinated decision-making, forming an efficient swarm intelligence system. The diagram encompasses typical application scenarios, such as smart cities, unmanned transportation, and industrial production, highlighting the extensive application and significance of swarm intelligence control in complex environments. Therefore, swarm intelligence control provides new ideas and tools for solving coordination and optimization problems in complex dynamic environments. Its wide applicability and practical significance make it a valuable solution across various fields.
In practical applications, swarm intelligence control needs to achieve multi-objective collaborative optimization of key indicators such as energy efficiency and task completion amount. However, as system scale increases and interactions grow more complex, traditional swarm control methods gradually reveal their limitations [4]. The specific reasons can be summarized in two aspects. On the one hand, simple agents lack sufficient adaptive adjustment capabilities in dynamic environments due to limitations in computing, perception, and action capabilities. This shortcoming intensifies the conflict between local and global optimization, ultimately hindering overall system performance improvement. On the other hand, there is a strong coupling between the behavior optimization of agents and the global optimization of the system. Existing methods lack a dynamic adaptive mechanism for multi-objective collaborative optimization, and it is difficult to effectively decouple individual behavior optimization from global optimization. In dynamic and complex environments, the inability to balance the relationship between the individual optimal solution and the global optimal solution affects the overall performance of the system [5]. For example, in UAV swarm control, it is necessary to ensure the task completion amount while minimizing energy cost and enhancing system intelligence. Efficient energy utilization not only extends the system operation time, but also reduces the cost and enhances the task completion efficiency [6]. However, when scheduling UAV swarms in complex dynamic environments, balancing the conflict between individual behaviors and global objectives, as well as enhancing UAV adaptability, remains a challenge that affects the system’s performance [7]. Therefore, a key challenge remains in designing an adaptive optimization framework for swarm intelligence control systems. This framework should dynamically coordinate individual and global optimization while balancing task completion and energy efficiency under multi-objective conflicts.
This paper focuses on dynamic adaptive adjustment and the collaborative learning optimization of agents and control strategies within the swarm, proposing a swarm intelligence control method based on double-layer deep reinforcement learning. This method effectively addresses the challenge of optimizing both energy efficiency and task completion in complex dynamic environments. It overcomes the limitations caused by the insufficient separation between individual and global optimization, as well as inadequate dynamic adaptability. Consequently, the overall system performance is significantly enhanced. In comparison to the existing research, the main innovations of this paper are as follows: (1) A swarm intelligence control framework based on double-layer deep reinforcement learning is proposed. Through dynamic interactions between the inner and outer layers, along with multi-level collaborative optimization, this method achieves an effective separation between individual and global optimization. This separation facilitates the seamless coordination between individual behavior optimization and global resource allocation. As a result, the system’s adaptability and overall operational performance are significantly improved. (2) A learning algorithm designed for complex dynamic environments is developed. This algorithm uses deep reinforcement learning to enhance agents’ self-adjustment capabilities, enabling them to adapt dynamically to both environmental changes and the behaviors of other agents. Thus, the system achieves a dynamic equilibrium between multi-agents and their environment. (3) Through simulation validation, this method demonstrates significant advantages in multi-objective collaborative optimization, effectively balancing energy efficiency and task completion, further enhancing the swarm’s comprehensive task effectiveness.

2. Related Work

Swarm intelligence control is a control paradigm focused on multi-agent collaboration mechanisms. It enables the efficient execution of complex tasks by coordinating collaboration among heterogeneous agents to achieve common goals or optimize overall system effectiveness [6,7]. Currently, swarm intelligence control is primarily classified into three categories: centralized control, distributed control, and hybrid control [8,9,10], as shown in Table 1.
Centralized control involves the unified management of agents by a central controller to simplify the decision-making process and achieve global optimization. However, this control method has the risks of insufficient optimization capability, single point failure, and processing capacity bottleneck. To solve these problems, Peng et al. [11] proposed a fuzzy rule-based neural network method to enhance control optimization capabilities, thereby improving the scheduling efficiency of railway systems. Hatata et al. [12] proposed a central controller-based gorilla troops optimization algorithm to address the voltage coordination problem caused by the limited computational capacity of the central controller. Meng et al. [13] proposed the energy management optimization model based on reinforcement learning to minimize energy storage operational costs. Mostaani et al. [14] employed a behavior-based state aggregation algorithm to jointly design control and communication strategies in multi-agent systems, aiming to prevent single-point failures. Wang et al. [15] proposed a combination of a rule-based control architecture method to improve control efficiency and response speed, adapting to complex working scenarios. Although these studies have advanced solutions in specific scenarios, they still face limitations inherent to centralized control, particularly in large-scale systems and dynamic environments, where system robustness and adaptability are difficult to ensure.
Distributed control performs tasks through communication and collaboration among agents. This control approach provides better fault tolerance and robustness, granting agents greater autonomy and decision-making capabilities, but also faces the challenges of increased decision-making complexity, load balancing issues, and a lack of information coordination mechanisms. To solve these problems, Guo et al. [16] proposed a distributed voltage restoration and power allocation control scheme for DC microgrid systems to achieve energy savings in dynamically complex environments. Guo et al. [17] addressed actuator fault issues by proposing a distributed adaptive fault-tolerant controller for high-speed trains. Li et al. [18] proposed a distributed event-triggered transmission strategy based on periodic sampling, which alleviates communication burdens through an information coordination mechanism. Zheng et al. [19] proposed a distributed stochastic algorithm based on the enhanced genetic algorithm (GA) to solve the cooperative area search problem for UAV swarms, improving the search efficiency by enhancing information communication. Han et al. [20] proposed a path planning method based on the multi-strategy evolutionary learning artificial bee colony algorithm to enhance the global optimization capability of path planning. While distributed control improves agent autonomy, it still struggles with coordination and global optimization, especially in large-scale complex tasks, where information flow and decision processes are difficult to synchronize effectively.
Hybrid control combines the advantages of centralized and distributed control, integrating central controller management and agent autonomy to improve system robustness and scalability. However, it also introduces challenges, including increased collaboration complexity, insufficient global optimization capabilities, and reduced control reliability. To solve these problems, Song et al. [21] proposed a hybrid aggregation adaptive control algorithm to resolve UAV swarm conflicts by determining optimal hybrid weights, thereby optimizing the performance in complex environments. Ibrahim et al. [22] combined linear active disturbance rejection control with a hybrid particle swarm optimization algorithm to ensure high controller reliability while optimizing the global maximum power output. Yakout et al. [23] proposed a gorilla troops optimizer-based optimal reinforcement learning controller to enhance the frequency control of hybrid power systems. Liu et al. [24] investigated coordination mechanisms in multi-objective deep reinforcement learning to minimize energy costs and delay conflicts in UAV swarms. Hou et al. [25] proposed a distributed collaborative search method based on a multi-agent reinforcement learning algorithm, which improves global optimization capabilities in complex environments and is applicable to large-scale scenarios. Although hybrid control can combine the advantages of both, it still faces the problem of balancing task optimization and energy efficiency. Especially in complex tasks and dynamic environments, the effectiveness and flexibility of hybrid control are still insufficient.
A comparison of existing research approaches, their contributions, and their limitations is summarized in Table 1, which highlights the key differences between prior studies and the proposed method. Despite significant progress in optimizing complex scenarios, current swarm intelligence control methods still face several limitations. These include incomplete decoupling between local and global optimization, as well as limited dynamic adaptability. Hence, achieving the simultaneous optimization of energy efficiency and task completion remains challenging. To resolve these challenges, this paper proposes a swarm intelligence control design method based on double-layer deep reinforcement learning. The method utilizes a double-layer architecture, with the inner layer responsible for the dynamic decision-making and behavior optimization of the agents, while the outer layer manages global resource allocation and strategy optimization. By leveraging the powerful function approximation capability of deep reinforcement learning (DRL), which has been demonstrated to outperform classical methods, such as the artificial bee colony algorithm (ABC), in similar contexts [26], the proposed framework enhances the decision-making capacity of agents in high-dimensional, complex environments.
Additionally, parameter settings and training strategies adopted in this study draw from widely recognized practices and guidelines discussed in several studies [27,28,29,30]. These references provide useful insights into the design of reward functions, policy update mechanisms, and architecture selection, which collectively contribute to the stable and efficient implementation of our DRL-based control framework. By employing adaptive and collaborative optimization mechanisms among multi-agents within the swarm, this approach enhances individual agents’ adaptability and flexibility in complex dynamic environments. As a result, the overall effectiveness of the swarm in accomplishing tasks is further improved.

3. Double-Layer Deep Reinforcement Learning Design for Swarm Intelligence Control

3.1. Overall Architecture of Double-Layer Deep Reinforcement Learning Framework

To help readers understand the symbols used in this paper and their meanings, a symbol reference (Table 2) is presented below for quick consultation in the subsequent method description.
Swarm intelligence control involves multi-level decision optimization, including individual-level autonomous decision-making, group-level collaborative interaction, and system-level macro-control. In complex dynamic environments, general multi-agent systems are constrained by local information and fail to achieve a global optimal decision. Consequently, a higher-level manager is often introduced to control swarm behavior, optimize resource allocation, and enhance task effectiveness [31]. The method proposed in this paper uses double-layer deep reinforcement learning to construct a hierarchical architecture consisting of inner-layer execution modules (i.e., agents) and outer-layer coordination modules (i.e., manager). Each layer predicts the expected reward based on real-time feedback from the environment and optimizes the behavior strategy accordingly. In complex dynamic environments, multi-agents can be flexibly deployed as needed to accommodate control tasks of different scales and complexities. The overall model architecture is illustrated in Figure 2 and primarily includes the following three core mechanisms:
(1) Inner-layer agent optimization mechanism: This mechanism focuses on the optimization of autonomous decision-making at the individual level and collaborative optimization at the group level, covering the following aspects:
(a)
Input Information (labeled as 1 and 2):
  • Each agent extracts spatial input values from the environment within its field of view. The input values are generated by the swarm control system map and represent local information, such as task location and environment status.
  • The agent network input value consists of public observations, manager and agent observations, and agent observations. Public observations provide the swarm control system status and manager control strategy; manager and agent observations include task income, task amount, and swarm income; and agent observations include energy cost, ability, and local space information. This input integrates global and local information to provide a basis for optimization decisions for the policy network.
(b)
Optimizing Strategy Network (labeled as 3):
  • These spatial inputs are processed through a Convolutional Neural Network (CNN) to extract spatial features. Extracted features are fed into a Deep Neural Network (DNN) and Long Short-Term Memory (LSTM) network for feature fusion and time-series prediction. This enables agents to optimize their decision-making processes.
  • The Agent Reinforcement Learning Optimizer adjusts the action strategy based on the output of the policy optimization network.
  • To enhance task efficiency, multi-agents share local information, enabling collaborative interactions.
(c)
Agent Actions (labeled as 4):
  • The inner policy network finally outputs specific action instructions, such as search, move, complete tasks, etc. This behavior acts on the environment or task scenario.
(d)
Feedback Rewards (labeled as 5):
  • When the agent performs an action, the environment provides a certain reward or penalty signal to update the parameters of the inner strategy and complete the closed loop of reinforcement learning.
The inner-layer agent optimization mechanism ensures effective decision-making at the individual level while achieving a group-level collaboration. Detailed explanations are provided in Section 3.2.
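To make the inner-layer policy network described above more concrete, the following is a minimal PyTorch sketch of the CNN–DNN–LSTM pipeline; the layer sizes, channel counts, and action dimension are illustrative assumptions, not the architecture actually used in the experiments.

```python
import torch
import torch.nn as nn

class AgentPolicyNet(nn.Module):
    """Sketch of the inner-layer policy network: a CNN extracts spatial features from
    the local field-of-view map, a DNN fuses them with the flat observation vector,
    and an LSTM adds time-series context before the policy and value heads.
    All sizes here are illustrative assumptions."""
    def __init__(self, view_channels=4, flat_dim=32, hidden_dim=128, num_actions=5):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(view_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),            # -> 32 * 4 * 4 = 512 features
        )
        self.dnn = nn.Sequential(nn.Linear(512 + flat_dim, hidden_dim), nn.ReLU())
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.policy_head = nn.Linear(hidden_dim, num_actions)  # action logits
        self.value_head = nn.Linear(hidden_dim, 1)              # state-value estimate for PPO

    def forward(self, view, flat_obs, hidden=None):
        # view: (B, C, H, W) local map; flat_obs: (B, flat_dim) public/manager/agent observations
        spatial = self.cnn(view)
        fused = self.dnn(torch.cat([spatial, flat_obs], dim=-1))
        out, hidden = self.lstm(fused.unsqueeze(1), hidden)     # single-step recurrent rollout
        out = out.squeeze(1)
        return self.policy_head(out), self.value_head(out), hidden
```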
(2) Outer-layer manager optimization mechanism: This mechanism focuses on macro-control at the system level, responsible for optimizing global resource allocation and control strategies:
(a)
Input Information (labeled as 1 and 2):
  • The manager extracts global input values from the observation data of the environment and the agent, including the state of the environment, the distribution of tasks, and the action state of the agent. This information is integrated into the manager network input value as the input for subsequent feature extraction.
  • The manager network input values include public observations and the shared manager and agent observations; compared with the agent network input values, the agent-specific observations are not included. This input integrates the manager’s observations, observed environmental states, global information, and the observable states of the agents, providing a basis for the optimization decisions of the policy network.
(b)
Optimizing Strategy Network (labeled as 3):
  • The manager comprehensively analyzes the environmental state, global information, and observable values of the agents. The CNN is used to extract features from the environment and agent distribution, followed by feature fusion and time series prediction using the DNN and LSTM to optimize resource allocation and control strategies.
  • The manager reinforcement learning optimizer captures the dynamic changes in the swarm in both temporal and spatial dimensions. Based on this, it predicts swarm behavior trends, thereby enabling the efficient optimization of control strategies and resource allocation.
(c)
Manager Actions (labeled as 4):
  • The manager generates multiple control strategies (e.g., strategy 1, strategy 2, …, strategy N) and selects the optimal one to guide agent behavior.
(d)
Feedback Rewards (labeled as 5):
  • The manager also receives feedback rewards based on global performance indicators and multi-agent collaboration effects, thereby continuously optimizing its decision-making strategy.
The outer-layer manager optimization mechanism ensures comprehensive system-level control by utilizing global information to optimize task allocation and strategy formulation, thereby enhancing the comprehensive task effectiveness of the system. Detailed explanations are provided in Section 3.3.
(3) Bilateral dynamic interaction and evolutionary game mechanism: This mechanism lies at the intersection of the inner-layer agents and the outer-layer manager, enabling dynamic collaboration and optimization:
(a)
Bilateral Dynamic Interaction:
  • The manager formulates control strategies based on the global state to guide agents’ decision-making. After executing tasks, agents provide feedback on local states, influencing the manager’s strategy adjustments. Additionally, agents share information through local interactions to enhance collaborative effects.
(b)
Evolutionary Game Mechanism:
  • Agents adjust their actions in dynamic environments based on reward maximization to adapt to optimal task allocation and resource utilization. The manager optimizes global strategies through reinforcement learning, guiding the swarm toward better solutions and ensuring overall performance improvement. By facilitating mutual learning and dynamic equilibrium between agents and the manager, a collaborative optimization of the swarm control system is achieved.
This bilateral dynamic interaction and evolutionary game mechanism ensure dynamic adaptability and a balance between global and local decision-making. Detailed explanations are provided in Section 3.4.

3.2. Inner-Layer Agent Optimization Mechanism for Individual and Group Control

The individual and group control algorithm based on deep reinforcement learning in the inner layer constitutes a hybrid approach. The agents continuously compete and evolve through the interaction with the environment to identify the optimal strategy. Simultaneously, a collaboration among the agents ensures the optimization of the swarm’s overall performance. The behavioral strategy objective of agent i is to maximize its expected cumulative discounted reward, as expressed in the following formula:
$$\forall i:\; \max_{\pi_i} \; \mathbb{E}_{a_i \sim \pi_i,\; a_{-i} \sim \pi_{-i},\; s \sim \tau}\left[\sum_{t=1}^{H} \gamma^{t}\,\underbrace{\mathrm{Task}_{t,i}}_{r_{i,t}} + \mathrm{Task}_{1,i}\right]$$
where the instantaneous reward, $r_{i,t}$, is the task completion amount, $\mathrm{Task}_{t,i}$, of agent $i$ from time $t-1$ to $t$. The agent’s behavioral strategy depends not only on its observations, $o_{i,t}$, and hidden states, $h_{i,t}$, but also on the behavior strategies, $\pi_{-i}$, of the other agents and the environment transition, $\tau$. Under the discount factor $\gamma \in (0,1)$, the agent maximizes its discounted expected return by utilizing historical data from parallel environments and accumulating the reward, $r_{i,t}$, through the strategy network parameters, $\theta_{i,t}$. The joint strategy $\pi = \left(\pi_1, \ldots, \pi_N\right)$ represents the set of strategies for all agents, while $\pi_{-i} = \left(\pi_1, \ldots, \pi_{i-1}, \pi_{i+1}, \ldots, \pi_N\right)$ represents the strategy set excluding agent $i$.
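A minimal numerical sketch of the discounted objective above is given below; the discount factor and the per-step task rewards are illustrative values only.

```python
def discounted_return(task_rewards, gamma=0.99):
    """Compute sum_{t=1..H} gamma^t * r_t for one agent's episode, where r_t is the
    task completion amount gained between t-1 and t (illustrative inputs)."""
    return sum((gamma ** t) * r for t, r in enumerate(task_rewards, start=1))

# Example: an agent completes tasks worth 0, 2, 1, and 3 over four steps.
print(discounted_return([0.0, 2.0, 1.0, 3.0]))  # 0.99^2*2 + 0.99^3*1 + 0.99^4*3 ≈ 5.81
```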
During the training process, the Proximal Policy Optimization (PPO) algorithm [32] is used to map observations to actions through the neural network $\pi_i(a_i \mid o_i, h_i; \theta)$ and update the strategy weights, $\theta_i$. The specific optimization mechanism is described as follows:
$$\nabla_{\theta_i}\, \mathbb{E}_{a_1 \sim \pi_1, \ldots, a_N \sim \pi_N,\; s \sim \tau}\left[\sum_{t=1}^{H} \gamma^{t} r_{i,t}\right] = \mathbb{E}_{a_1 \sim \pi_1, \ldots, a_N \sim \pi_N,\; s \sim \tau}\left[\sum_{t=1}^{H} \gamma^{t} r_{i,t}\, \nabla_{\theta_i} \log \pi_i\!\left(a_{i,t} \mid o_{i,t}; \theta_i\right)\right]$$
where $N$ represents the number of agents in the inner layer, which are of various types and have different fields of view and hidden states. This approach is inspired by concepts similar to MADDPG [33], using a centralized training, decentralized execution model, where multi-agents of different types share a parameter set, $\theta$. Each agent’s strategy, $\pi_i(a_i \mid o_i, h_i; \theta)$, is determined based on its observations, $o_i$, and hidden states, $h_i$, enabling cooperation between agents to improve the swarm’s task completion efficiency. This mechanism enables agents to learn new behaviors from agents in different regions, facilitating cross-agent behavior sharing and enhancing learning efficiency. Simultaneously, to prevent collisions between different agents in the inner layer, soft constraints are incorporated into the reward function, as demonstrated by the following mechanism [34]:
$$R_{\mathrm{total}} = R_{\mathrm{original}} - \omega \cdot \max\!\left(0,\; d_{\mathrm{safe}} - \left\lVert p_A - p_B \right\rVert_2\right)^{2}$$
where $R_{\mathrm{original}}$ is the original reward function, $R_{\mathrm{total}}$ is the reward function after the constraint is added, and $\omega$ is the weight used to adjust the importance of collision penalties. $d_{\mathrm{safe}}$ denotes the safety distance, typically set to 0, while $p_A$ and $p_B$ represent the positions of any two agents.
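The soft collision constraint above can be implemented directly as a reward-shaping step; the sketch below assumes illustrative values for the safety distance and penalty weight rather than the settings used in the experiments.

```python
import numpy as np

def collision_penalised_reward(r_original, pos_a, pos_b, d_safe=0.5, weight=1.0):
    """Soft collision penalty: R_total = R_original - w * max(0, d_safe - ||p_A - p_B||_2)^2.
    d_safe and weight are illustrative values, not the paper's settings."""
    gap = d_safe - np.linalg.norm(np.asarray(pos_a, dtype=float) - np.asarray(pos_b, dtype=float))
    return r_original - weight * max(0.0, gap) ** 2

# Agents closer than the safety distance are penalised; distant agents are unaffected.
print(collision_penalised_reward(1.0, [0.0, 0.0], [0.2, 0.0]))  # 1.0 - 0.3^2 = 0.91
print(collision_penalised_reward(1.0, [0.0, 0.0], [3.0, 0.0]))  # 1.0 (no penalty)
```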
Strategy optimization at the individual level focuses on the agent’s perception, memory, and decision-making capabilities, highlighting how agents adjust their behavior based on observations and internal states to adapt to both environmental conditions and task requirements. At the group level, it involves both competitive and cooperative interactions. Agents may either compete for the same task or achieve collaborative optimization through information sharing, with strategy selection determined by their perception and decision-making mechanisms. Dynamic interaction at the group level enhances system efficiency in competitive scenarios and strengthens overall task effectiveness in cooperative settings, jointly driving the evolution and optimization of swarm intelligence.

3.3. Outer-Layer Manager Optimization Mechanism for System Control

The outer-layer manager optimizes the control strategy by learning from its hidden states and historical data. The manager observes the agents’ rewards, positions, task completion status, and competitive states, but cannot directly assess the agents’ intrinsic abilities or attributes. The manager employs a unified strategy, selecting the optimal control strategy based on historical experience to optimize the performance. The manager’s objective is to maximize the system’s overall task effectiveness, as shown in the following formula:
$$\max_{\pi_c} \; \mathbb{E}_{\tau \sim \pi_c,\; a \sim \pi,\; s \sim \tau}\left[\sum_{t=1}^{H} \gamma^{t}\,\underbrace{\mathrm{CTE}_{t}}_{r_{c,t}} + \mathrm{CTE}_{1}\right]$$
$$\mathrm{Crra}(z) = \frac{z^{1-\eta} - 1}{1-\eta}, \quad \eta > 0$$
$$\mathrm{TCE}_{t,i} = \mathrm{Crra}\!\left(\mathrm{Reward}_{t,i}\right) - \mathrm{Cost}_{t,i}$$
$$\mathrm{CTE}_{t} = \frac{\sum_{i=1}^{N} \mathrm{TCE}_{t,i}}{N} \cdot \frac{\sum_{i=1}^{N} \mathrm{Task}_{t,i}}{N}$$
where the $\mathrm{Crra}$ function exhibits marginal effects, and the parameter $\eta$ controls the degree of nonlinearity: the higher $\eta$ is, the more pronounced the nonlinearity. $\mathrm{Reward}_{t,i}$ represents the reward of agent $i$ from $t-1$ to $t$; $\mathrm{Cost}_{t,i}$ is the corresponding energy cost; and the task completion efficiency, $\mathrm{TCE}_{t,i}$, is a concave function that increases with the agent’s reward while nonlinearly reducing the energy cost. In this paper, task completion efficiency is also referred to as energy efficiency. To balance the task completion amount and efficiency in the system, this paper draws on the definition of social welfare [35] in sociology and proposes an evaluation metric, the comprehensive task effectiveness, $\mathrm{CTE}_{t}$, at time $t$, whose value is equivalent to the manager’s instantaneous reward, $r_{c,t}$.
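The sketch below illustrates how these evaluation quantities could be computed for one time step; it assumes the reading $\mathrm{TCE}_{t,i} = \mathrm{Crra}(\mathrm{Reward}_{t,i}) - \mathrm{Cost}_{t,i}$ of the formula above, and the value of $\eta$ and the sample numbers are illustrative only.

```python
import numpy as np

def crra(z, eta=1.5):
    """Isoelastic (Crra) utility: (z^(1 - eta) - 1) / (1 - eta), with eta > 0
    controlling the degree of nonlinearity (eta = 1.5 is an illustrative value)."""
    z = np.asarray(z, dtype=float)
    return (np.power(z, 1.0 - eta) - 1.0) / (1.0 - eta)

def comprehensive_task_effectiveness(rewards, costs, tasks, eta=1.5):
    """CTE_t = mean(TCE_{t,i}) * mean(Task_{t,i}), with
    TCE_{t,i} = Crra(Reward_{t,i}) - Cost_{t,i} (assumed form of the TCE definition)."""
    tce = crra(rewards, eta) - np.asarray(costs, dtype=float)
    return float(tce.mean() * np.mean(tasks))

# Example with three agents (illustrative numbers).
print(comprehensive_task_effectiveness(rewards=[4.0, 6.0, 5.0],
                                       costs=[0.2, 0.5, 0.3],
                                       tasks=[3, 5, 4]))
```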
At the system level, the manager is responsible for the global management and performance optimization of the entire swarm intelligence control system. It utilizes historical control strategies and interaction behavior data to adjust the strategy network, altering the agents’ income structure to indirectly influence their rewards and maximize comprehensive task effectiveness. Under this strategy, the manager and agents each pursue the maximization of comprehensive task effectiveness and the optimization of their behavior to adapt to the environment. This process reflects a Stackelberg game, where the manager acts as the leader and the agents as followers responding to strategies [36].

3.4. Bi-Directional Dynamic Interaction Process and Evolutionary Game Mechanism

The double-layer reinforcement learning framework provides a structured learning approach to simultaneously optimize the behavior of inner agents and the control strategy of the outer manager, thereby enhancing the system’s comprehensive task effectiveness. This framework integrates inner and outer loop learning, with the strategy network acquiring knowledge through experience, without relying on prior modeling knowledge or assumptions. The specific details are provided in Algorithm 1.
Algorithm 1 of the proposed method is detailed below:
Algorithm 1: Design of Swarm Intelligence Control Based on Double-Layer Deep Reinforcement Learning
Input: Initial solution; sampling range $H$; control policy change period $T$; stopping criteria for policy optimization $C$; state, observations, and hidden vectors $s, o, h, o_c, h_c$
Output: Trained agent policy network weights $\theta$ and trained manager policy network weights
1. Initialization: agent and manager policy weights $\theta \leftarrow 0$; agent and manager data buffers $D, D_C \leftarrow \{\}, \{\}$
2. while network policy weights are being trained do
3.   for $t = 1$ to $H$ do
4.     $a, h \leftarrow \pi_i(\cdot \mid o, h, \theta)$  ▷ the agent makes a decision, executes the action, and updates its hidden layer parameters
5.     if $t \bmod T = 0$ then
6.       $\tau, h_c \leftarrow \pi_c(\cdot \mid o_c, h_c, \theta)$  ▷ adjust the control policy and update the manager’s hidden layer parameters
7.     else
8.       $h_c \leftarrow \pi_c(\cdot \mid o_c, h_c, \theta)$  ▷ the control policy remains unchanged; update the manager’s hidden layer parameters
9.     end if
10.    $D \leftarrow D \cup \{(o, a, r, o')\}$  ▷ update the agent’s data buffer
11.    $D_C \leftarrow D_C \cup \{(o_c, \tau, r_c, o_c')\}$  ▷ update the manager’s data buffer
12.    $\{s, o, o_c\} \leftarrow \{s', o', o_c'\}$  ▷ update the state and observations
13.  end for
14.  Update $\theta$ using the buffer data; $D, D_C \leftarrow \{\}, \{\}$  ▷ reset the buffer data
15.  if the training cycle ends then
16.    $s, o, h, o_c, h_c \leftarrow s_0, o_0, h_0, o_{c,0}, h_{c,0}$  ▷ reset the training cycle
17.  end if
18.  if the stopping criteria $C$ are met then
19.    return the trained agent and manager policy network parameters $\theta$
20.  end if
21. end while
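For illustration, the nested loop of Algorithm 1 can be written as a schematic training skeleton; the `env`, `agent_policy`, and `manager_policy` interfaces below are hypothetical stand-ins introduced only to show the control flow, not the actual implementation.

```python
def train_double_layer(env, agent_policy, manager_policy,
                       horizon=1000, manager_period=100, max_cycles=12000):
    """Schematic skeleton of the double-layer training loop (Algorithm 1).
    env, agent_policy and manager_policy are hypothetical interfaces."""
    agent_buffer, manager_buffer = [], []
    strategy = None  # the manager issues its first strategy at t = manager_period
    for cycle in range(max_cycles):
        state, obs, obs_c = env.reset()
        h, h_c = agent_policy.init_hidden(), manager_policy.init_hidden()
        for t in range(1, horizon + 1):
            # Inner layer: every agent selects and executes an action.
            actions, h = agent_policy.act(obs, h)
            if t % manager_period == 0:
                # Outer layer: the manager periodically adjusts its control strategy.
                strategy, h_c = manager_policy.act(obs_c, h_c)
            else:
                h_c = manager_policy.observe(obs_c, h_c)  # strategy unchanged
            next_state, next_obs, next_obs_c, r, r_c = env.step(actions, strategy)
            agent_buffer.append((obs, actions, r, next_obs))
            manager_buffer.append((obs_c, strategy, r_c, next_obs_c))
            state, obs, obs_c = next_state, next_obs, next_obs_c
        # Joint PPO updates for both layers, then reset the buffers.
        agent_policy.update(agent_buffer)
        manager_policy.update(manager_buffer)
        agent_buffer, manager_buffer = [], []
        if agent_policy.converged() and manager_policy.converged():
            break
    return agent_policy, manager_policy
```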
In the inner loop, the agent optimizes its behavior by completing tasks and receiving rewards, thus maximizing the objective function. Due to the dynamic changes in the manager’s control strategy, the agent operates in a non-stationary Markov decision environment, requiring continuous adaptation to environmental changes and behavior adjustment to respond to new strategies.
In the outer loop, the conflict between the agents in the swarm, each seeking to maximize its benefits, and the system’s overall objectives forces the manager to periodically adjust the control strategy according to the learning objectives, thereby improving comprehensive task effectiveness. Changes in agent behavior induce non-stationarity issues for the manager, requiring joint training and synchronous updates of both the agents’ and the manager’s policy weights.
In the double-layer reinforcement learning framework, the collaborative optimization between the agents and manager, as well as extreme control strategies, may lead to non-stationarity and partial observability issues during the learning process, causing both agents and managers to deviate from optimal behavior. To solve these issues, this paper uses phase learning, entropy regularization, and simulated annealing algorithms to stabilize the learning process [37,38]. The learning process is structured as follows:
(1) Pre-training in a no-manager control environment: Agents undergo an initial training phase in a flexible setting, allowing them to adapt to diverse strategies before the introduction of control constraints.
(2) Control strategy integration and entropy regularization: A control strategy is introduced, and entropy regularization is applied to promote strategy diversity and prevent premature convergence, thereby stabilizing the optimization process.
(3) Gradual integration of the manager’s control strategy using simulated annealing: To prevent unstable learning dynamics caused by the sudden introduction of the manager’s control strategy, a simulated annealing algorithm during the early phase of training is employed. Specifically, for the first one-quarter of the training duration, the influence of intervention strategies is reduced to zero. This gradual introduction ensures a smoother adaptation process, mitigating abrupt disruptions in agent learning.
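As an illustration of point (3), the manager's influence can be annealed with a simple schedule; the linear ramp after the initial quarter is an assumption, since the text above only specifies that the influence is held at zero during the first quarter of training.

```python
def manager_influence(step, total_steps):
    """Annealed weight on the manager's control strategy: zero for the first quarter
    of training, then ramped up to full influence (the linear ramp is an assumption)."""
    warmup = total_steps // 4
    if step < warmup:
        return 0.0
    return min(1.0, (step - warmup) / max(1, total_steps - warmup))

# Example: no manager influence early on, full influence by the end of training.
print(manager_influence(1_000, 12_000))   # 0.0
print(manager_influence(6_000, 12_000))   # 0.333...
print(manager_influence(12_000, 12_000))  # 1.0
```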
Entropy regularization is achieved by incorporating a weighted entropy term into the strategy gradient objectives of both the agents and the manager, as shown in the following formula:
$$\mathrm{Entropy}(\pi) = -\mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[\log \pi(a \mid s)\right]$$
where entropy regularization enhances strategy exploration, stabilizes the solving process, and prevents training from failing to converge due to strategy mutations.
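A minimal sketch of adding this entropy term to a policy loss is shown below; the entropy coefficient is an illustrative value, and the categorical action distribution is assumed to match the discrete action setting used later in the UAV case study.

```python
import torch

def entropy_regularised_loss(policy_loss, action_logits, entropy_coef=0.01):
    """Add a weighted entropy bonus, H(pi) = -E_{a~pi}[log pi(a|s)], to the policy loss
    so that minimising the loss also encourages exploration (coefficient is illustrative)."""
    dist = torch.distributions.Categorical(logits=action_logits)
    entropy = dist.entropy().mean()          # -E[log pi(a|s)], averaged over the batch
    return policy_loss - entropy_coef * entropy
```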

4. A Case Study of Intelligence Control of Unmanned Aerial Vehicle Swarm

4.1. Design of Experimental System Scenarios

This paper constructs a UAV swarm control system to validate the effectiveness of the swarm intelligence control method based on double-layer deep reinforcement learning. The UAV swarm is chosen as the research subject not only because it shares similar operating mechanisms with other swarm control systems, but also due to its highly complex dynamic interaction characteristics. In practical applications, UAV swarms must operate in complex environments while addressing multi-dimensional tasks and managing dynamic agent collaboration [39]. Such operational complexities demand advanced intelligent decision-making capabilities from the control system. By analyzing the evolutionary process of a UAV swarm control system through simulations, conclusions are derived that are applicable not only to UAV swarms but also potentially extendable to other swarm intelligence systems. This broad applicability underscores the universality and reliability of the research results.
In complex and dynamically changing environments, UAV swarms performing strike missions often struggle to distinguish between high-value and low-value targets. The difficulty stems from technical limitations, such as endurance constraints, limited storage capacity, restricted communication ability, and the algorithmic complexity of target recognition systems. The assessment of high-value targets requires more energy, which in turn impacts task prioritization and the optimization of energy costs [40]. The manager can adjust strategies through data analysis and incentive mechanisms to increase the task completion amount. However, the overwhelming number of low-value targets compared to high-value ones leads UAVs to prioritize attacking low-value targets to maintain a competitive advantage. This results in challenges in simultaneously optimizing both task completion amount and task completion efficiency.
The UAV swarm intelligence control system developed in this paper includes three roles: UAVs, a manager, and tasks, with tasks allocated to target areas based on predefined rules, as shown in Figure 3. The system is divided into three task areas: i, ii, and iii (low-, medium-, and high-value task areas). The UAVs autonomously explore targets within their field of view and strike them, while dynamically increasing energy costs. A task is regarded as successfully completed when at least one UAV reaches the task location (i.e., the corresponding grid cell) before the task’s availability window expires. During this period, no other UAVs are permitted to interfere with or compete for the task. Once these conditions are satisfied, the task is removed from the environment and a corresponding reward is assigned to the agent responsible. This definition ensures that task completion is constrained both spatially and temporally, and is directly linked to each agent’s ability to manage limited energy and make strategic decisions in a decentralized multi-agent setting. Additionally, new tasks are automatically generated during processing. The manager selects the control strategy based on the UAV’s performance and allocates resources accordingly. Various experimental scenarios are created through different control strategies and parameters to analyze the evolution patterns of the UAV swarm control system and the optimization of strategies.
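The completion rule described above can be expressed as a simple check; the sketch below simplifies "no other UAVs interfere" to "exactly one UAV occupies the task cell", and the data structures are hypothetical placeholders.

```python
def is_task_completed(task, uav_positions, t):
    """Task-completion check: at least one UAV occupies the task's grid cell before the
    availability window expires, with no other UAV on the same cell (a simplification of
    the non-interference condition). `task` is a hypothetical dict with 'cell' and 'deadline'."""
    if t > task["deadline"]:
        return False
    occupants = [uid for uid, pos in uav_positions.items() if pos == task["cell"]]
    return len(occupants) == 1

# Example (illustrative data): UAV 2 alone reaches the task cell before the deadline.
task = {"cell": (4, 7), "deadline": 120}
print(is_task_completed(task, {1: (0, 0), 2: (4, 7), 3: (10, 3)}, t=95))  # True
```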
In the experimental system, UAVs make decisions based on environmental changes, task locations, control strategies, and the behavior of neighboring UAVs within their field of view. They choose a direction to move (up, down, left, or right) or remain stationary. Tasks are randomly generated, and incomplete tasks may disappear, with new tasks continuously being generated. UAVs with varying performance levels yield different rewards and consume varying amounts of energy when completing the same task (high-performance UAVs provide higher rewards and incur lower energy costs). The state space consists of multiple feature components, including environmental dynamics (task locations, task rewards, and energy requirements within the UAV’s field of view), agent-level context (the UAV’s own energy level and performance capability), neighboring agents (the behavior and status of other UAVs in nearby grid cells), and high-level policy input (a control signal from the manager representing the global strategic directive). The UAVs’ strategy network outputs decisions in the form of a probability distribution, as shown in the following formula:
$$a_{i,t} \sim \pi_a\!\left(o_{i,t}^{\mathrm{vision}},\, o_{i,t}^{\mathrm{outer}},\, o_{i,t}^{\mathrm{inner}},\, o_{i,t}^{\mathrm{policy}},\, h_{i,t-1};\, \theta\right)$$
where the strategy network simultaneously updates the hidden state, $h_{i,t-1}$, with inputs including the spatial observation values, $o_{i,t}^{\mathrm{vision}}$, within the UAV’s field of view; the rewards and task information of UAVs in the field of view, $o_{i,t}^{\mathrm{outer}}$; the energy cost and capabilities of UAVs in the field of view, $o_{i,t}^{\mathrm{inner}}$; and the manager’s control strategy, $o_{i,t}^{\mathrm{policy}}$.
The manager influences the UAV swarm’s decision-making through the control strategy and adjusts the reward mechanism to optimize overall behavior. The manager acts at a higher decision-making level and influences the swarm’s behavior by adjusting the reward structure and providing strategic policy guidance. The manager observes the global map state, task location, and UAV rewards. Its output is the probability distribution of its control strategy, as shown in the following formula:
$$\tau \sim \pi_c\!\left(o_{c,t}^{\mathrm{vision}},\, o_{c,t}^{\mathrm{outer}},\, o_{c,t}^{\mathrm{policy}},\, h_{c,t-1}\right)$$
where the strategy network simultaneously updates the manager’s hidden state, $h_{c,t-1}$, with inputs including the rewards and task information of all UAVs in the swarm, $o_{c,t}^{\mathrm{outer}}$, and the control strategy from the start of the system to its implementation, $o_{c,t}^{\mathrm{policy}}$.

4.2. Experimental System Setting

In the UAV swarm control system, the cyclical game between the manager and the UAVs may lead to two trends: (1) intensified competition, reducing task completion efficiency, harming UAV income, and threatening the swarm’s sustainable development; (2) insufficient task completion amount, leading to a decline in swarm performance. To analyze the performance differences of different swarm control algorithms, this paper designs a comparative experiment, where the UAV’s performance, task initialization conditions, and total funding are identical to ensure a fair comparison. To quantify the performance of the algorithms in terms of resource allocation and reward balance, the UAV reward function is formulated as follows:
$$\mathrm{Reward}_{t,i} = \beta\,\frac{\mathrm{Revenue}_{t}}{N} + (1-\beta)\,\mathrm{Revenue}_{t}\,\mathrm{Proportion}_{t,i}$$
where $\mathrm{Proportion}_{t,i}$ represents the fraction of tasks completed by UAV $i$ relative to the total tasks in the swarm at time $t$, $N$ denotes the total number of UAVs, $\mathrm{Revenue}_{t}$ represents the total funds within the period, and $\beta$ is the ratio of basic income to performance incentive, dynamically adjusted according to the control strategy. In this context, the agent’s task income can be defined as the benefit in each cycle, determined by the fraction of tasks completed by the UAV relative to the total tasks in the swarm and the manager’s control strategy. Furthermore, this paper evaluates the algorithms from three levels: individual, group, and system. At the individual level, the focus is on task income and task completion efficiency. At the group level, trajectory heatmaps are analyzed to examine agent interactions. At the system level, the mean values of task income, energy cost, task completion amount, task completion efficiency, task throughput, and comprehensive task effectiveness are measured [41].
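A small numerical sketch of this reward split is given below; the revenue, proportion, and β values are illustrative only.

```python
def uav_reward(revenue_t, proportion_i, beta, n_uavs):
    """Reward_{t,i} = beta * Revenue_t / N + (1 - beta) * Revenue_t * Proportion_{t,i}:
    beta mixes an equal basic-income share with a performance-based incentive."""
    return beta * revenue_t / n_uavs + (1.0 - beta) * revenue_t * proportion_i

# Example: 10 UAVs share a period revenue of 100; UAV i completed 20% of the tasks.
print(uav_reward(revenue_t=100.0, proportion_i=0.2, beta=0.5, n_uavs=10))  # 5.0 + 10.0 = 15.0
```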
In this study, we compare five representative and advanced UAV swarm control algorithms, as summarized in Table 3. The compared algorithms are as follows: the centralized rule-based control architecture method proposed by Wang et al. [15] (referred to as RBCA), the distributed random algorithm based on an improved genetic algorithm proposed by Zheng et al. [19] (referred to as GA), the distributed artificial bee colony path planning method based on multi-strategy evolutionary learning proposed by Han et al. [20] (referred to as ABC), the hybrid distributed collaborative search method based on multi-agent reinforcement learning proposed by Hou et al. [25] (referred to as DRL), and the method combining artificial bee colony algorithm with single-layer reinforcement learning (referred to as ABC-DRL). The inclusion of ABC-DRL as a comparative method aims to demonstrate the necessity of adopting a double-layer architecture with deep reinforcement learning. Given that DRL outperforms ABC in handling function fitting [26], ABC-DRL adopts the previously mentioned distributed ABC method as the inner model of the double-layer architecture, while the outer model employs a DRL structure consistent with D-DRL.
Table 3 clearly presents a side-by-side comparison of these methods in terms of the design, operational characteristics, and performance metrics. The optimization goals of RBCA, GA, ABC, and DRL are to improve comprehensive task efficiency, while ABC-DRL and D-DRL perform multi-level optimizations by optimizing the task completion amount in the inner layer and comprehensive task effectiveness in the outer layer. Under the condition that all reinforcement learning methods converge, the superiority of the double-layer deep reinforcement learning method is verified by comparing the performance metrics of different algorithms.
The experimental environment uses a 25 × 25 grid, where the state is represented as a three-dimensional tensor of size 25 × 25 × c, with c indicating the number of agents in each cell. The system parameters in this study are based on a UAV swarm produced by a collaborating helicopter company and are mapped to real-world application scenarios. To protect the company’s interests, certain parameters were adapted from established literature sources [27,28,29,30]. Specifically, the UAVs’ movement and task-related energy costs follow [27,28]; the task value distribution is drawn from [28,29]; the task respawn probability is based on [29,30]; and environmental dynamics (e.g., task generation rule) are modeled using sinusoidal functions inspired by [30]. The parameters listed in Table 4 can be categorized into two types: the first comprises core parameters (e.g., the number of UAVs, UAVs’ performance, UAVs’ vision, running time, single-period revenue, and the nonlinear parameter of efficiency), which are directly derived from the actual operational data of the UAV swarm to ensure consistency with real-world mission environments; the second consists of secondary parameters, which are set based on the existing literature to prevent the disclosure of sensitive data. Since the core parameters remain unchanged and the replacement of secondary parameters does not affect the overall modeling approach or experimental design, this adjustment has a limited impact on the validity and applicability of this study’s conclusions while ensuring the scientific rigor and verifiability of the model. To enhance the realism and traceability of the simulation environment, this study adopts a hybrid parameterization approach that combines empirical data with established modeling principles. Compared to the existing works that often utilize simplified or static formulations, the proposed model introduces dynamic and context-aware mechanisms. This integration of real-system-informed and literature-guided components enables a parameter setting framework that is both practically grounded and academically rigorous.
Here, U denotes Uniform Distribution and N denotes Normal Distribution. Specifically, the PPO algorithm in this project was implemented using the RLlib library. The environment for simulating the UAV swarm control system is set up using the Gym library. The detailed parameters of D-DRL training are shown in Table 5.
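For readers who wish to reproduce a comparable setup, the following is a minimal Gym-style environment skeleton for the 25 × 25 grid, written against the classic Gym API; the observation channels, spaces, and the empty step logic are illustrative placeholders rather than the actual multi-agent environment used with RLlib.

```python
import numpy as np
import gym
from gym import spaces

class UAVSwarmGridEnv(gym.Env):
    """Skeleton of the 25x25 grid environment (single-agent view for brevity).
    Spaces, channels, and step logic are illustrative placeholders only."""
    def __init__(self, grid_size=25, n_channels=3):
        super().__init__()
        self.grid_size = grid_size
        # State: a grid_size x grid_size x c tensor, with c feature channels per cell.
        self.observation_space = spaces.Box(low=0.0, high=np.inf,
                                            shape=(grid_size, grid_size, n_channels),
                                            dtype=np.float32)
        # Actions: up, down, left, right, or stay.
        self.action_space = spaces.Discrete(5)

    def reset(self):
        self.state = np.zeros(self.observation_space.shape, dtype=np.float32)
        return self.state

    def step(self, action):
        reward, done = 0.0, False
        # ... move the UAV, update tasks, apply energy costs, and compute rewards here ...
        return self.state, reward, done, {}
```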

4.3. Experimental Results and Analysis

This paper selected 10 different random seed groups in the experiments. By averaging the results, it analyzed the convergence of the three reinforcement learning methods (DRL, ABC-DRL, and D-DRL) during the training process. Figure 4a,b depict the convergence curves of the mean reward in the inner and outer layers, respectively. These curves not only reflect the trend of reward changes over time, but also correspond to the improvement in task completion amount and comprehensive task effectiveness. It can be observed that all three methods exhibit a gradual increase in rewards followed by stabilization, with D-DRL demonstrating the highest stability and convergence level, particularly in terms of comprehensive task effectiveness. Figure 4c,f illustrate the changes in policy entropy in the inner and outer layers, representing the process of the agent exploring high-value actions. The policy entropy of the D-DRL method is slightly lower and fluctuates less in the later stages, indicating that it locks onto high-value strategies more quickly, thereby reducing policy uncertainty. Figure 4d,e compare the convergence of policy gradient loss, where the gradient loss of all methods shows a downward trend and eventually stabilizes. The D-DRL method exhibits a smoother decline in gradient loss, suggesting more stable policy optimization and the avoidance of large gradient fluctuations. Figure 4g,h analyze the value loss in the inner and outer layers, showing that value loss is initially high due to significant estimation errors in state value assessments. As training progresses, the value loss gradually decreases and stabilizes, with D-DRL demonstrating superior convergence and estimation stability. Finally, Figure 4i further reveals the changes in the two methods, ABC-DRL and D-DRL, when dynamically adjusting the scaling factor for regulation. The regulation strategy is updated every 100 steps, and the scaling factor is usually greater than 0.5 to ensure high task completion efficiency. In some cases, the scaling factor is less than or equal to 0.5 to stimulate competition among UAVs, thereby increasing the task completion amount. In the ABC-DRL method, the duration of the scaling factor being greater than 0.5 is longer, indicating that there might be deficiencies in the task completion amount, and it needs to be optimized by reducing the scaling factor.
In addition to the step-based performance comparison, we report the approximate wall time required for each algorithm to complete 12,000 episodes (1000 steps per episode) under the same hardware configuration. The estimated training durations are: D-DRL (~71 h), DRL (~67 h), and ABC-DRL (~70 h). While D-DRL incurs a slightly higher time cost, this is attributable to its hierarchical control structure and inter-agent coordination. The additional computational overhead is considered worthwhile, given the method’s superior task completion performance and greater robustness in dynamic environments.
Figure 5, Figure 6 and Figure 7 show the impact of six swarm intelligence control algorithms on the following indicators during the learning evolution process of the UAV swarm control system: individual UAV task income, individual UAV task completion efficiency, the mean values of task income, energy cost, task completion amount, task completion efficiency, task throughput, and comprehensive task effectiveness. The experimental results are analyzed from individual, group, and system levels, as follows:
(1) Figure 5a–f show the distribution of individual UAV task income and task completion efficiency under different control methods. From the individual-level analysis, the conclusions are as follows: (1) Under RBCA control, most UAVs have a task completion efficiency below zero, indicating intense competition within the swarm and large task income disparities; (2) Compared to RBCA, GA and ABC show an increased proportion of negative task completion efficiency, with ABC showing the largest increase. The task income disparity is reduced in both algorithms, with ABC showing the smallest difference, and task income is normally distributed; (3) Under DRL and ABC-DRL control, about half of the UAVs have a task completion efficiency greater than zero, reducing internal competition, enhancing the task completion efficiency and task income, and narrowing the income gap, with ABC-DRL showing the smallest gap and task income following a normal distribution; (4) Under D-DRL control, most UAVs have a task completion efficiency greater than zero, significantly enhancing the swarm task completion efficiency and internal collaboration; the task income of individual UAVs is normally distributed, and the swarm achieves the highest task income.
(2) At the group level, Figure 6a–f analyze the UAV swarm trajectory heatmaps under different control algorithms, from which the following conclusions can be drawn: (1) Under RBCA control, the UAV swarm trajectory heatmap mainly shows light colors, indicating strong randomness, high exploration, and no significant aggregation effect; (2) Under GA, ABC, and DRL control, UAVs exhibit a clear aggregation effect in areas with higher task generation probability, reducing random exploration and improving efficiency, especially in the ABC algorithm, where trajectory overlaps and the aggregation effect are most significant; and (3) Under ABC-DRL and D-DRL control, aggregation areas are dispersed, trajectories are relatively independent, exploration is high, and the trajectories overlap with the task distribution. Under D-DRL, the overlap between trajectories and the task distribution is more significant, improving the task completion efficiency.
(3) Figure 7a–f show the mean values of task income, energy cost, task completion amount, task completion efficiency, task throughput, and comprehensive task effectiveness under different swarm control algorithms. From the system-level analysis, the following conclusions can be drawn: (1) For task income and energy cost, a higher task income does not necessarily accompany a higher energy cost. The D-DRL and DRL methods exhibit strong learning capabilities and achieve a higher income with a low energy cost. RBCA initially has a high income with a higher energy cost but fails later due to its inability to adjust the strategy. GA and ABC show relatively stable but poor results. ABC-DRL outperforms ABC in terms of learning ability. (2) For task completion amount and task throughput, D-DRL and DRL achieve higher task throughput values compared to other methods, indicating stronger task search and completion capabilities. ABC-DRL improves the task completion efficiency by reducing high-energy tasks and ineffective competition within the swarm. The RBCA method is effective initially, but its performance weakens over time. (3) For task completion efficiency, RBCA outperforms GA and ABC, suggesting that simpler models avoid undesirable evolutionary directions to some extent. ABC-DRL alleviates the conflict between energy efficiency and task completion amount in the double-layer learning framework, validating the effectiveness of the double-layer learning architecture. An appropriate choice of UAV swarm control algorithm can delay the turning point of task completion efficiency and improve it overall. (4) For comprehensive task effectiveness, D-DRL performs the best in enhancing comprehensive task effectiveness, ensuring that the UAV swarm control system maximizes the task completion amount on the basis of task completion efficiency (energy efficiency). Compared to other intelligent control algorithms without adjustment mechanisms, RBCA exhibits greater stability and resistance to change in complex dynamic environments. Deep reinforcement learning not only has a high learning ability, but also self-regulation and collaboration capabilities. While RBCA partially mitigates the trade-off between energy efficiency and task completion amount, its learning mechanism struggles under dynamic conditions. Specifically, changes in agent behavior or environmental variations may cause delays in adaptation responses, ultimately hindering the swarm’s ability to achieve desired outcomes during collaborative tasks.
The experimental results shown in Figure 5, Figure 6 and Figure 7 indicate that D-DRL maximizes the task completion amount while ensuring system task completion efficiency (energy efficiency), leading to the highest comprehensive task effectiveness for the swarm. Simple models (such as RBCA) can alleviate undesirable evolution to some extent but cannot adapt to complex task requirements in the long term. In contrast, the double-layer learning framework effectively mitigates the conflict between energy efficiency and task completion amount, improving the adaptability of the UAV swarm in complex environments.
To ensure the reliability of the results, this paper conducts experiments with 10 different random seeds and averages the results, as shown in Table 6. The table lists key metrics related to the task, including the mean values of task income, energy cost, task completion amount, task completion efficiency, task throughput, and comprehensive task effectiveness. The experimental results demonstrate that D-DRL outperforms other methods in all metrics, confirming its superiority in UAV swarm control.

5. Conclusions and Future Work

This paper proposes a swarm intelligence control design method based on double-layer deep reinforcement learning. By constructing a double-layer deep reinforcement learning framework, this method addresses the challenge of simultaneously optimizing energy efficiency and task completion. This challenge arises from the incomplete separation of individual and global optimization, as well as limited dynamic adaptability. This approach significantly enhances the swarm’s comprehensive task effectiveness. The research in this paper not only provides a new solution for UAV swarm intelligence control, but also offers new research ideas and tools for optimization in the field of swarm intelligence control.
In future work, further improvements will be made to the double-layer deep reinforcement learning-based swarm intelligence control design method, focusing on four main aspects: (1) comparing the simulation results of the swarm intelligence control system with experimental results in real scenarios; (2) extending the swarm intelligence control algorithm to three-dimensional spaces, among other settings; (3) using simpler architectures (e.g., attention-based models) instead of multiple deep networks (CNN, LSTM, and DNN) to reduce computational costs; and (4) applying the swarm intelligence control algorithm to more fields, such as unmanned transportation and industrial intelligence.

Author Contributions

Conceptualization, G.Y.; methodology, X.Y.; software, G.H. and R.Z.; validation, X.Y.; formal analysis, L.H.; investigation, X.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sharing does not apply to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Perez-Cerrolaza, J.; Abella, J.; Borg, M.; Donzella, C.; Cerquides, J.; Cazorla, F.J.; Englund, C.; Tauber, M.; Nikolakopoulos, G.; Flores, J.L. Artificial Intelligence for Safety-Critical Systems in Industrial and Transportation Domains: A Survey. ACM Comput. Surv. 2024, 56, 1–40. [Google Scholar] [CrossRef]
  2. Wang, G.-Y.; Cheng, D.-D.; Xia, D.-Y.; Jiang, H.-H. Swarm Intelligence Research: From Bio-Inspired Single-Population Swarm Intelligence to Human-Machine Hybrid Swarm Intelligence. Mach. Intell. Res. 2023, 20, 121–144. [Google Scholar] [CrossRef]
  3. Senapati, M.K.; Al Zaabi, O.; Al Hosani, K.; Al Jaafari, K.; Pradhan, C.; Ranjan Muduli, U. Advancing Electric Vehicle Charging Ecosystems with Intelligent Control of DC Microgrid Stability. IEEE Trans. Ind. Appl. 2024, 60, 7264–7278. [Google Scholar] [CrossRef]
  4. Wareham, T.; de Haan, R.; Vardy, A.; van Rooij, I. Swarm Control for Distributed Construction: A Computational Complexity Perspective. ACM Trans. Hum. Robot Interact. 2023, 12, 1–45. [Google Scholar] [CrossRef]
  5. Katal, A.; Dahiya, S.; Choudhury, T. Energy Efficiency in Cloud Computing Data Centers: A Survey on Software Technologies. Cluster Comput. 2023, 26, 1845–1875. [Google Scholar] [CrossRef]
  6. Qin, X.; Song, Z.; Hou, T.; Yu, W.; Wang, J.; Sun, X. Joint Optimization of Resource Allocation, Phase Shift, and UAV Trajectory for Energy-efficient RIS-assisted UAV-enabled MEC Systems. IEEE Trans. Green Commun. Netw. 2023, 7, 1778–1792. [Google Scholar] [CrossRef]
  7. Wang, Y.; Chen, H.; Xie, L.; Liu, J.; Zhang, L.; Yu, J. Swarm Autonomy: From Agent Functionalization to Machine Intelligence. Adv. Mater. 2025, 37, 2312956. [Google Scholar] [CrossRef]
  8. Cetinsaya, B.; Reiners, D.; Cruz-Neira, C. From PID to Swarms: A Decade of Advancements in Drone Control and Path Planning—A Systematic Review (2013–2023). Swarm Evol. Comput. 2024, 89, 101626. [Google Scholar] [CrossRef]
  9. Liu, Y.; Liu, J.; He, Z.; Li, Z.; Zhang, Q.; Ding, Z. A Survey of Multi-Agent Systems on Distributed Formation Control. Unmanned Syst. 2024, 12, 913–926. [Google Scholar] [CrossRef]
  10. Saha, D.; Bazmohammadi, N.; Vasquez, J.C.; Guerrero, J.M. Multiple Microgrids: A Review of Architectures and Operation and Control Strategies. Energies 2023, 16, 600. [Google Scholar] [CrossRef]
  11. Peng, F.; Zheng, L. Fuzzy Rule-based Neural Network for High-speed Train Manufacturing System Scheduling Problem. Neural Comput. Appl. 2023, 35, 2077–2088. [Google Scholar] [CrossRef]
  12. Hatata, A.Y.; Hasan, E.O.; Alghassab, M.A.; Sedhom, B.E. Centralized Control Method for Voltage Coordination Challenges with OLTC and D-STATCOM in Smart Distribution Networks based IoT Communication Protocol. IEEE Access 2023, 11, 11903–11922. [Google Scholar] [CrossRef]
  13. Meng, Q.; Hussain, S.; Luo, F.; Wang, Z.; Jin, X. An Online Reinforcement Learning-based Energy Management Strategy for Microgrids with Centralized Control. IEEE Trans. Ind. Appl. 2025, 61, 1501–1510. [Google Scholar] [CrossRef]
  14. Mostaani, A.; Vu, T.X.; Chatzinotas, S.; Ottersten, B. Task-effective Compression of Observations for the Centralized Control of a Multiagent System over Bit-budgeted Channels. IEEE Internet Things J. 2023, 11, 6131–6143. [Google Scholar] [CrossRef]
  15. Wang, Y.; Xing, L.; Wang, J.; Xie, T.; Chen, L. Multi-objective Rule System Based Control Model with Tunable Parameters for Swarm Robotic Control in Confined Environment. Complex Syst. Model. Simul. 2024, 4, 33–49. [Google Scholar] [CrossRef]
  16. Guo, F.; Huang, Z.; Wang, L.; Wang, Y. Distributed Event-Triggered Voltage Restoration and Optimal Power Sharing Control for an Islanded DC Microgrid. Int. J. Electr. Power Energy Syst. 2023, 153, 109308. [Google Scholar] [CrossRef]
  17. Guo, Y.; Wang, Q.; Sun, P.; Feng, X. Distributed Adaptive Fault-Tolerant Control for High-Speed Trains Using Multi-Agent System Model. IEEE Trans. Veh. Technol. 2023, 73, 3277–3286. [Google Scholar] [CrossRef]
  18. Li, Y.; Wang, X.; Sun, J.; Wang, G.; Chen, J. Data-Driven Consensus Control of Fully Distributed Event-Triggered Multi-Agent Systems. Sci. China Inf. Sci. 2023, 66, 152202. [Google Scholar] [CrossRef]
  19. Zheng, J.; Ding, M.; Sun, L.; Liu, H. Distributed Stochastic Algorithm Based on Enhanced Genetic Algorithm for Path Planning of Multi-UAV Cooperative Area Search. IEEE Trans. Intell. Transp. Syst. 2023, 24, 8290–8303. [Google Scholar] [CrossRef]
  20. Han, Z.; Chen, M.; Shao, S.; Wu, Q. Improved Artificial Bee Colony Algorithm-Based Path Planning of Unmanned Autonomous Helicopter Using Multi-Strategy Evolutionary Learning. Aerosp. Sci. Technol. 2022, 122, 107374. [Google Scholar] [CrossRef]
  21. Song, Y.; Lim, S.; Myung, H.; Lee, H.; Jeong, J.; Lim, H.; Oh, H. Distributed Swarm System with Hybrid-Flocking Control for Small Fixed-Wing UAVs: Algorithms and Flight Experiments. Expert Syst. Appl. 2023, 229, 120457. [Google Scholar] [CrossRef]
  22. Ibrahim, A.-W.; Fang, Z.; Cai, J.; Hassan, M.H.F.; Imad, A.; Idriss, D.; Tarek, K.; Abdulrahman, A.A.-S.; Fahman, S. Fast DC-Link Voltage Control Based on Power Flow Management Using Linear ADRC Combined with Hybrid Salp Particle Swarm Algorithm for PV/Wind Energy Conversion System. Int. J. Hydrogen Energy 2024, 61, 688–709. [Google Scholar]
  23. Yakout, A.H.; Hasanien, H.M.; Turky, R.A.; Abu-Elanien, A.E.B. Improved Reinforcement Learning Strategy of Energy Storage Units for Frequency Control of Hybrid Power Systems. J. Energy Storage 2023, 72, 108248. [Google Scholar] [CrossRef]
  24. Liu, X.; Chai, Z.-Y.; Li, Y.-L.; Cheng, Y.-Y.; Zeng, Y. Multi-Objective Deep Reinforcement Learning for Computation Offloading in UAV-Assisted Multi-Access Edge Computing. Inf. Sci. 2023, 642, 119154. [Google Scholar] [CrossRef]
  25. Hou, Y.; Zhao, J.; Zhang, R.; Cheng, X.; Yang, L. UAV Swarm Cooperative Target Search: A Multi-Agent Reinforcement Learning Approach. IEEE Trans. Intell. Veh. 2024, 9, 568–578. [Google Scholar] [CrossRef]
  26. Chen, J.; Zhao, Y.; Wang, M.; Yang, K.; Ge, Y.; Wang, K.; Lin, H.; Pan, P.; Hu, H.; He, Z.; et al. Multi-Timescale Reward-Based DRL Energy Management for Regenerative Braking Energy Storage System. IEEE Trans. Transp. Electrif. 2025. [Google Scholar] [CrossRef]
  27. Cabuk, U.C.; Tosun, M.; Dagdeviren, O.; Ozturk, Y. Modeling Energy Consumption of Small Drones for Swarm Missions. IEEE Trans. Intell. Transp. Syst. 2024, 25, 10176–10189. [Google Scholar] [CrossRef]
  28. Laghari, A.A.; Jumani, A.K.; Laghari, R.A.; Nawaz, H. Unmanned aerial vehicles: A review. Cogn. Robot. 2023, 3, 8–22. [Google Scholar] [CrossRef]
  29. Horyna, J.; Baca, T.; Walter, V.; Albani, D.; Hert, D.; Ferrante, E.; Saska, M. Decentralized Swarms of Unmanned Aerial Vehicles for Search and Rescue Operations without Explicit Communication. Auton. Robot. 2023, 47, 77–93. [Google Scholar] [CrossRef]
  30. Javed, S.; Hassan, A.; Ahmad, R.; Ahmed, W.; Ahmed, R.; Saadat, A.; Guizani, M. State-of-the-Art and Future Research Challenges in UAV Swarms. IEEE Internet Things J. 2024, 11, 19023–19045. [Google Scholar] [CrossRef]
  31. Larkin, E.V.; Akimenko, T.A.; Bogomolov, A.V. The Swarm Hierarchical Control System. In International Conference on Swarm Intelligence; Lecture Notes in Computer Science; Springer Nature: Cham, Switzerland, 2023; pp. 30–39. ISBN 9783031366215. [Google Scholar]
  32. Gu, Y.; Cheng, Y.; Chen, C.L.P.; Wang, X. Proximal Policy Optimization with Policy Feedback. IEEE Trans. Syst. Man Cybern. Syst. 2022, 52, 4600–4610. [Google Scholar] [CrossRef]
  33. Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. arXiv 2017, arXiv:1706.02275. [Google Scholar]
  34. Ye, M.; Han, Q.-L.; Ding, L.; Xu, S. Distributed Nash Equilibrium Seeking in Games with Partial Decision Information: A Survey. Proc. IEEE Inst. Electr. Electron. Eng. 2023, 111, 140–157. [Google Scholar] [CrossRef]
  35. Barman, S.; Khan, A.; Maiti, A.; Sawarni, A. Fairness and Welfare Quantification for Regret in Multi-Armed Bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 6762–6769. [Google Scholar]
  36. Stein, A.; Salvioli, M.; Garjani, H.; Dubbeldam, J.; Viossat, Y.; Brown, J.S.; Staňková, K. Stackelberg Evolutionary Game Theory: How to Manage Evolving Systems. Philos. Trans. R. Soc. Lond. B Biol. Sci. 2023, 378, 20210495. [Google Scholar] [CrossRef]
  37. Liu, F.; Zhang, T.; Zhang, C.; Liu, L.; Wang, L.; Liu, B. A Review of the Evaluation System for Curriculum Learning. Electronics 2023, 12, 1676. [Google Scholar] [CrossRef]
  38. Makri, S.; Charalambous, P. Curriculum Based Reinforcement Learning for Traffic Simulations. Comput. Graph. 2023, 113, 32–42. [Google Scholar] [CrossRef]
  39. Bu, Y.; Yan, Y.; Yang, Y. Advancement Challenges in UAV Swarm Formation Control: A Comprehensive Review. Drones 2024, 8, 320. [Google Scholar] [CrossRef]
  40. Tang, J.; Duan, H.; Lao, S. Swarm Intelligence Algorithms for Multiple Unmanned Aerial Vehicles Collaboration: A Comprehensive Review. Artif. Intell. Rev. 2023, 56, 4295–4327. [Google Scholar] [CrossRef]
  41. Bai, Y.; Zhao, H.; Zhang, X.; Chang, Z.; Jäntti, R.; Yang, K. Toward Autonomous Multi-UAV Wireless Network: A Survey of Reinforcement Learning-Based Approaches. IEEE Commun. Surv. Tutor. 2023, 25, 3038–3067. [Google Scholar] [CrossRef]
Figure 1. The operational framework of swarm intelligence control in complex dynamic environments. The unmanned transportation scene on the upper left and the industrial production scene on the upper right represent typical application fields of swarm intelligence control. Through information sharing, the dynamic interaction module in the center forms a closed loop between the urban environment below and the swarm intelligence control scene above.
Figure 2. Architecture diagram of the swarm intelligence control design method based on double-layer deep reinforcement learning.
Figure 3. Design diagram of the UAV swarm intelligence control experimental system.
Figure 4. Training convergence curves of D-DRL, DRL, and ABC-DRL, and the trapezoidal diagram of the scaling factor for regulation. In (a–h), the x-axis represents the training steps of the D-DRL method, while in (i), it denotes the steps executed in the UAV swarm control simulation environment. All training curves are smoothed using an exponential moving average. Subfigures (a–d,g) use a smoothing factor of 0.8, and subfigures (e,f,h) use a smoothing factor of 0.6.
Figure 5. Distribution of UAV income and efficiency under different methods (individual level). (a) shows the distribution of income and efficiency in RBCA, (b) shows the distribution of income and efficiency in GA, (c) shows the distribution of income and efficiency in ABC, (d) shows the distribution of income and efficiency in DRL, (e) shows the distribution of income and efficiency in ABC-DRL, (f) shows the distribution of income and efficiency in D-DRL.
Figure 6. Heatmap of UAV motion trajectories under different control methods (group level). (a) shows the RBCA trajectory heatmap, (b) shows the GA trajectory heatmap, (c) shows the ABC trajectory heatmap, (d) shows the DRL trajectory heatmap, (e) shows the ABC-DRL trajectory heatmap, (f) shows the D-DRL trajectory heatmap.
Figure 7. Comparative analysis chart of indicators of the UAV swarm control system (system level). (a) shows the task income, (b) shows the energy cost, (c) shows the task throughput, (d) shows the task completion amount, (e) shows the task completion efficiency, (f) shows the comprehensive task effectiveness.
Table 1. Comparison of related works.
Method Category | Related Work | Strengths | Weaknesses | Comparison to This Work
Centralized Control | Peng et al. [11], Hatata et al. [12], Meng et al. [13], Mostaani et al. [14], Wang et al. [15] | Simplifies decision-making, achieves global optimization | Risk of insufficient optimization capability, single point of failure, and processing capacity bottleneck | This work addresses the shortcomings of centralized control by decoupling local and global optimization, enhancing scalability and adaptability in complex scenarios
Distributed Control | Guo et al. [16], Guo et al. [17], Li et al. [18], Zheng et al. [19], Han et al. [20] | Better fault tolerance and autonomy, increases agent decision-making capabilities | Increased decision complexity, load balancing issues, lack of effective information coordination | This work resolves the coordination problem in distributed control by introducing a double-layer architecture for better global optimization
Hybrid Control | Song et al. [21], Ibrahim et al. [22], Yakout et al. [23], Liu et al. [24], Hou et al. [25] | Combines the advantages of centralized and distributed control | Higher collaboration complexity, insufficient collaborative optimization capabilities, and reduced control flexibility | This work optimizes the balance between task completion and energy efficiency using a double-layer framework that improves both local and global optimization
Table 2. Mathematical symbols and their corresponding meanings.
Symbol | Meaning | Symbol | Meaning
t | Time | i, j, k | Agent indices
θ | Agent policy weights | Task | Task completion amount
 | Manager policy weights | ω | Collision penalty weight
a_i | Agent i action | TCE | Task completion efficiency
τ | Manager action | CTE | Comprehensive task effectiveness
s_i | Agent i state | Reward | Agent income
s_c | Manager state | Cost | Energy cost
o_i | Agent i observation | Revenue | Total funds
o_c | Manager observation | β | The ratio of basic income to performance incentive
h_i | Agent i hidden state | Proportion | The fraction of tasks completed relative to the total tasks
h_c | Manager hidden state | H | Sampling horizon
r_i | Agent i reward | N | The number of agents
r_c | Manager reward | η | Degree of nonlinearity
π_i | Agent i policy | d_safe | Safety distance
π_c | Manager policy | γ | Discount factor
Table 3. Comparative summary of UAV swarm control algorithms.
Algorithm | Architecture | Optimization Goal | Double-Layer Structure | Characteristics/Performance Comparison
RBCA [15] | Centralized | Comprehensive task efficiency | Single layer | Suitable for coordinated tasks but lacks flexibility
GA [19] | Distributed | Comprehensive task efficiency | Single layer | High randomness, unsuitable for complex searches
ABC [20] | Distributed | Comprehensive task efficiency | Single layer | Efficient local search, weaker global optimization
DRL [25] | Hybrid | Comprehensive task efficiency | Single layer | Strong learning ability, but incomplete separation of individual and global optimization
ABC-DRL [20,26] | Hybrid | Inner: task completion amount; Outer: comprehensive task efficiency | Double layer | Good for individual optimization, but weak in global optimization
D-DRL | Hybrid | Inner: task completion amount; Outer: comprehensive task efficiency | Double layer | Combines individual and global optimization, superior performance
Table 4. Important parameter settings in the UAV swarm intelligence control system.
Category | Variable | Parameter Settings | Origin
Core | Number of UAVs (Region 1 : Region 2 : Region 3) | 6:6:6 | Industrial UAV system
 | UAV's vision | U(1, 4) |
 | Running time | 1000 |
 | UAV's performance | N(0, 1) |
 | Nonlinearity parameter of efficiency | η = 0.21 |
 | Single-period revenue | Revenue_t = 20 |
Secondary | UAV's energy cost for movement | k·x, where k = 1 and x is the movement distance | [27,28]
 | UAV's exploration energy cost | U(1, 3) | [27,28]
 | UAV's completion energy cost | U(10, 15) | [27,28]
 | Task value type | U(1, 3) | [28,29]
 | Task respawn probability | 0.1 | [29,30]
 | Region 1 task generation rule | 5 × sin(t) + 40 | [30]
 | Region 2 task generation rule | 3 × sin(t) + 30 | [30]
 | Region 3 task generation rule | 2 × sin(t) + 20 | [30]
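To illustrate how the Table 4 settings could be instantiated in a simulator, here is a minimal Python sketch. It assumes that U(a, b) denotes a uniform distribution, that the region rules are evaluated at the simulation step t, and that the linear movement-cost model uses k = 1; none of the names correspond to the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def movement_energy_cost(distance, k=1.0):
    """Linear movement cost k * x with k = 1 (Table 4)."""
    return k * distance

def exploration_energy_cost():
    """Per-exploration energy cost drawn from U(1, 3)."""
    return rng.uniform(1, 3)

def completion_energy_cost():
    """Task-completion energy cost drawn from U(10, 15)."""
    return rng.uniform(10, 15)

TASK_RULES = {                      # expected number of tasks per region at step t
    1: lambda t: 5 * np.sin(t) + 40,
    2: lambda t: 3 * np.sin(t) + 30,
    3: lambda t: 2 * np.sin(t) + 20,
}

def tasks_at_step(region, t):
    """Round the sinusoidal rule to a non-negative integer task count."""
    return max(0, int(round(TASK_RULES[region](t))))

print(tasks_at_step(1, 10), movement_energy_cost(3.5))
```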
Table 5. Important parameter settings of D-DRL.
Category | Variable | Parameter Settings
General Settings | Reinforcement learning training algorithm | PPO
 | First-stage training steps : second-stage training steps (million) | 150:150
 | Training duration (million) | 150
 | Second-stage manager strategy annealing duration (million) | 37.5
 | Second-stage entropy regularization duration (million) | 70
 | Value/policy networks share weights | False
 | Discount factor | 0.998
 | Generalized Advantage Estimation (GAE) lambda | 0.98
 | Gradient clipping norm threshold | 10
 | Value function loss coefficient | 0.05
 | Whether the advantage is normalized | True
 | SGD sequence length | 50
 | Number of parallel environment replicas | 60
 | Number of fully connected layers | 2
 | Sampling horizon (steps per replica) | 200
 | Episode length | 1000
 | Episodes | 12,000
 | Epochs | 1000
 | Activation function | ReLU, Softmax
 | CPUs | 16
Agent Settings | Agent CNN dimension | 128
 | Agent LSTM unit size | 128
 | Entropy regularization coefficient | 0.025
 | Agent learning rate | 0.0003
 | All agents share weights | True
 | Number of convolutional layers | 2
 | Policy updates per horizon (agents) | 16
 | Agent SGD mini-batch size | 3000
Manager Settings | Policy updates per horizon (manager) | 4
 | Manager CNN dimension | 256
 | Manager LSTM unit size | 256
 | Entropy regularization coefficient | 0.2
 | Manager learning rate | 0.0001
 | Manager SGD mini-batch size | 3000
 | Number of convolutional layers | 2
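For readers who want to reproduce a comparable setup, the hyperparameters in Table 5 can be gathered into a single configuration object, as in the sketch below. The key names are generic placeholders chosen by us and are not tied to the authors' code or to any particular reinforcement learning library's API.

```python
# Illustrative grouping of the Table 5 hyperparameters; key names are placeholders.
D_DRL_CONFIG = {
    "general": {
        "algorithm": "PPO",
        "discount_factor": 0.998,
        "gae_lambda": 0.98,
        "grad_clip_norm": 10,
        "value_loss_coeff": 0.05,
        "normalize_advantage": True,
        "sgd_sequence_length": 50,
        "num_env_replicas": 60,
        "sampling_horizon": 200,      # steps per replica
        "episode_length": 1000,
        "episodes": 12_000,
        "activations": ("relu", "softmax"),
    },
    "agent": {
        "cnn_dim": 128,
        "lstm_units": 128,
        "entropy_coeff": 0.025,
        "learning_rate": 3e-4,
        "share_weights": True,
        "conv_layers": 2,
        "updates_per_horizon": 16,
        "sgd_minibatch_size": 3000,
    },
    "manager": {
        "cnn_dim": 256,
        "lstm_units": 256,
        "entropy_coeff": 0.2,
        "learning_rate": 1e-4,
        "conv_layers": 2,
        "updates_per_horizon": 4,
        "sgd_minibatch_size": 3000,
    },
}
```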
Table 6. Comparison of indicators after averaging the results of 10 random experimental groups.
Method | Income | Amount | Cost | Throughput | Efficiency | Effectiveness
D-DRL | 1111.11 ± 0.00 | 417.77 ± 4.98 | 1229.89 ± 6.29 | 0.42 ± 0.00 | 27.66 ± 1.31 | 11,650.87 ± 457.51
ABC-DRL | 1082.61 ± 4.44 | 186.57 ± 9.27 | 1325.13 ± 20.69 | 0.19 ± 0.01 | 1.55 ± 3.77 | 268.73 ± 698.61
DRL | 1111.11 ± 0.00 | 404.99 ± 2.25 | 1241.00 ± 9.55 | 0.40 ± 0.00 | 22.87 ± 2.20 | 9260.62 ± 884.89
ABC | 1062.44 ± 13.93 | 192.68 ± 13.54 | 1341.52 ± 24.42 | 0.19 ± 0.01 | −8.39 ± 3.26 | −1629.32 ± 691.03
GA | 1050.78 ± 13.85 | 182.48 ± 13.65 | 1321.32 ± 21.98 | 0.18 ± 0.01 | −6.87 ± 2.59 | −1269.14 ± 538.49
RBCA | 1016.56 ± 14.43 | 159.19 ± 10.04 | 1283.91 ± 18.20 | 0.16 ± 0.01 | −4.24 ± 2.68 | −686.86 ± 453.65
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
