Article

Research on Swarm Control Based on Complementary Collaboration of Unmanned Aerial Vehicle Swarms Under Complex Conditions

1 College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
2 Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210016, China
* Author to whom correspondence should be addressed.
Drones 2025, 9(2), 119; https://doi.org/10.3390/drones9020119
Submission received: 19 December 2024 / Revised: 30 January 2025 / Accepted: 2 February 2025 / Published: 6 February 2025

Abstract:
Under complex conditions, the collaborative control capability of UAV swarms is key to ensuring the stability and safety of swarm flight. However, in complex environments such as forest firefighting, traditional swarm control methods struggle to meet the differentiated needs of UAVs that have distinct behavior characteristics and mutually coupled constraints, so adjustments and feedback to the control policy during training are prone to erroneous judgments, leading to decision-making dissonance. This study proposes a swarm control method based on the complementary collaboration of UAVs under complex conditions. The method first generates training data through the interaction between the UAV swarm and the environment; it then captures the potential patterns of UAV behaviors, extracts their differentiated behavior characteristics, and explores diversified behavior combinations with complementary advantages; on this basis, dynamic behavior allocations are made according to differences in perception accuracy and action capability to achieve collaborative cooperation; and finally, it optimizes the neural network parameters through behavior learning to improve the decision-making policy. The experimental results show that the proposed UAV swarm control method achieves high formation stability and integrity in collaborative missions involving multiple types of UAVs.

1. Introduction

UAV swarm refers to a group of multiple identical or different types of drones that form an aerial intelligent system through local interactions and self-organizing mechanisms. By simulating the intelligent behaviors of biological populations in nature, the swarm achieves coordinated operations [1,2,3]. Under ideal conditions, assuming no constraints such as resource limitations, cost, or technical requirements, a UAV swarm can consist of entirely identical drones. However, driven by objective requirements such as optimized configuration, cost reduction, and diverse business needs, heterogeneous UAV swarms make full use of their distinct device functions and data characteristics, leveraging the system's efficiency and flexibility in collaboration. This avoids the limitations of a homogeneous swarm, such as a single architecture and redundant computational resources, which could prevent it from performing complex and varied flight tasks. For instance, in noisy environments, UAVs equipped with anti-jamming communication devices within a heterogeneous swarm can ensure the communication stability of the other UAVs in the swarm, thereby improving the overall system's task execution efficiency and operational stability [4]. As UAV swarms continue to expand into new scenarios such as environmental monitoring and search and rescue, leveraging the diversity and complementarity of heterogeneous UAV swarms will fully exploit the overall effectiveness of the heterogeneous system to meet complex business demands [5,6,7]. Currently, traditional swarm control methods are usually implemented through hierarchical planning control. However, because different types of UAVs differ significantly in capabilities such as model learning and data characterization, independent control models must be deployed for each UAV type, which reduces the scalability of these models in swarms of multiple UAV types. This makes it difficult to accurately align the differences in data distribution between the behavior and state information of the various UAV types, leading to divergent behavior habits and decision-making preferences. In turn, convergence and the search for feasible solutions become harder during iteration, the generalization ability of model training suffers, and the adjustments and feedback of the different control models are easily misjudged, producing decision-making dissonance and exposing UAV swarms to more complex formation control problems.
Compared to homogeneous multi-agent swarm systems composed of identical drones, heterogeneous UAV swarms demonstrate greater advantages under certain practical conditions. This is primarily because heterogeneous UAV swarms can fully exploit the diverse functions and performance characteristics of different types of drones, avoiding functional redundancy and the waste of computational resources. At the same time, by dynamically adjusting to specific business needs, heterogeneous drones can achieve more efficient collaboration, significantly enhancing the flexibility and adaptability of the swarm system in complex scenarios. By carrying diverse payloads and sensors, the UAV platforms in the swarm enable each UAV to play to its own strengths and collaboratively utilize swarm resources. For example, UAVs can be equipped with LiDAR and Time-of-Flight (ToF) sensors to enhance environmental awareness, or vision sensors and Global Positioning System (GPS) sensors to improve navigation capabilities [8,9]. Suppressing interference from noise signals is a critical component of UAV swarm systems, as it can significantly enhance the reliability and stability of the system. By using effective anti-jamming measures such as filtering methods and neural networks, noise interference with normal signals can be continuously suppressed, ensuring that the drones operate normally [10,11]. With the help of multi-source sensor fusion technology, the swarm can use sensor data from different members for mutual correction and compensation, achieving precise control over the speed, direction, and other states of each drone. This ensures that the swarm can maintain formation stability while quickly and safely reaching its destination. However, UAVs in complex environments differ significantly in dynamical characteristics and load capacity, and additional collaboration mechanisms need to be introduced to effectively coordinate the behavior of individual agents. Meanwhile, the diversity of behavior patterns and state information affects the generalization ability during model training.
Common swarm control methods each have advantages and disadvantages when applied to UAV swarms executing missions under complex conditions. Optimization-based swarm control methods rely on established mathematical frameworks for the UAV swarm in either the time or frequency domain to ensure that the UAVs operate within preset constraints. Although such methods fully account for differences in UAV dynamics, their application to online control missions with strict real-time requirements is limited by high computational complexity and the difficulty of interpreting the algorithms' principles. Graph theory-based swarm control methods, on the other hand, construct a topological graph in the airspace based on UAV dynamics and differing sensing capabilities, generate formation structures, and dynamically adjust node positions and edge connection weights to flexibly reshape the formation. However, these methods have limitations in handling diverse constraints: they struggle to integrate various types of dynamic constraints effectively and easily fall into locally optimal solutions. In addition, machine learning-based swarm control methods, such as reinforcement learning, transform the UAV formation problem into a decision-making problem, adapting to the perception capabilities, decision-making styles, and action characteristics of various UAVs by learning from empirical data. Overall, existing swarm control methods fall short in real-time performance, adaptability to complex environments, and computing resource requirements [12,13,14]. The difficulty of effectively reusing what has been learned between different types of UAVs, together with the still-complex parameter tuning required for different missions and environments, leads to long computation times. Research has shown that as UAV swarm complexity continues to increase, the swarm control problem of coordination imbalance will become increasingly prominent.
In recent years, the wide application of multi-agent reinforcement learning methods has shown great potential in areas such as distributed decision-making tasks. Leveraging an end-edge-cloud collaborative framework, multi-agent deep reinforcement learning is employed to train a UAV swarm control model, with the following core contributions highlighted:
  • A self-adaptive behavior matching method based on dual-layer imitation learning is proposed to enhance the formation collaboration capability of UAV swarms. Unlike traditional multi-objective optimization, this method combines implicit and explicit alignment processes and makes full use of expert knowledge by dynamically balancing the policy search, so that policy generation can better balance multiple objectives, which in turn improves the student network's understanding and imitation of formation behavior. In the behavior allocation process, it adopts an adaptive feature embedding mechanism to ensure that agents are flexibly assigned roles according to their respective capabilities in the swarm formation, giving full play to the collaborative advantages of each UAV. This method effectively mitigates the nonlinear coupling effects induced by heterogeneity and enhances mission execution capability in complex environments.
  • A behavior learning method based on cognitive dissonance optimization is designed to improve the behavior learning efficiency of multi-agent systems under complex conditions by balancing individual and team cognitive dissonance losses. Combined with the individual behaviors assigned to each agent in the behavior allocation phase, this method helps to exploit the overall advantages of the UAV swarm formation, strengthens the complementary capabilities between platforms, effectively mitigates decision-making misjudgments caused by inconsistent adjustment and feedback across the differentiated formation models, meets the collaborative decision-making needs of UAV swarms in complex environments, and ultimately realizes highly efficient collaborative control of the swarm.
Section 1 provides an overview of the background knowledge. Section 2 reviews relevant studies. Section 3 introduces the proposed hybrid multi-agent swarms control decision-making approach. Section 4 evaluates the effectiveness of the model through comparative experiments and analysis of experimental results. Section 5 summarizes the findings and outlines potential directions for future research.

2. Related Work

In recent years, UAV swarms have been increasingly applied in diverse fields due to their notable advantages. With the rise of artificial intelligence in decision-making, swarm control has been recognized as a vital research area, drawing extensive attention and analysis from global scholars.
Wu et al. [15] proposed a multi-step particle swarm optimization algorithm for formations composed of different types of UAVs, which predicts the particle motion states at multiple time steps and adjusts the particle update rules according to the dynamic targets and environments, ensuring that the relative distances between UAVs satisfy the preset formation shape requirements. Meng et al. [16] proposed a swarm control method based on an elite ant colony that combines the kinematic characteristics of different types of UAVs and, through global pheromone sharing and a dynamic adjustment mechanism, realizes collaborative formation among multiple UAVs in complex terrain. Kumar et al. [17] proposed a geometric structure-based formation control algorithm that dynamically adjusts the shape of the collision cone based on UAV speed and position information, enabling UAVs with different characteristics to adapt to different formation constraints. However, the control effect of these methods is highly dependent on parameter selection. Because UAVs differ in flight performance and in sensor type and accuracy, key parameters such as path point spacing, safe distance threshold, and algorithm computation frequency need to be individually adjusted according to the specific characteristics of the UAVs and the actual situation. In addition, parameter settings usually rely on experts' accumulated experience and extensive experimental validation, making parameter adjustment and algorithm optimization in UAV systems a complex and challenging process. To overcome environmental limitations, swarm control methods based on graph theory [18] and the artificial potential field method [14] have been proposed, such as the Heterogeneous Multi-Agent Artificial Potential Field (HMAPF) method based on generalized predictive control [19]. The artificial potential field method achieves formation maintenance through attractive fields and utilizes information sharing between individuals to dynamically adjust UAV speed. However, this method struggles to realize flexible adjustments of direction or speed. Because different types of UAVs differ in turning radius, power system response time, and other respects, smaller UAVs may experience excessive fluctuations due to frequent path adjustments, while larger UAVs struggle to adjust in time due to motion inertia, which may cause them to deviate from the global coordination goal or affect the overall stability of the swarm. Such heterogeneous characteristics require fine optimization and adjustment for each type of UAV. Deep reinforcement learning methods for heterogeneous multi-agents can effectively coordinate different types of UAVs. For example, the fusion and sharing of information through cooperative sensing mechanisms can facilitate efficient communication and sensing between different UAVs [20,21]. Alternatively, by treating the UAV swarm control problem as a multi-stage decision-making process, decomposing it into multiple interrelated sub-problems, and coordinating the different types of UAVs at each stage, the solutions of earlier sub-problems can feed into subsequent stages, e.g., the Hierarchical Actor-Critic Value Function (HACVF) method based on a hierarchical Actor–Critic structure [22,23]. However, the limited onboard computing power of UAVs requires offloading missions to servers with greater computing capacity.
To address this, end-edge-cloud architectures have been introduced to optimize resource consumption and minimize delay overhead [24,25]. Because UAVs under complex conditions differ in perception accuracy, decision-making style, and operational capability, individual control models must be deployed for each UAV type, which makes it difficult to efficiently eliminate differences in their data distributions of behavior patterns and state information. In addition, the adjustment and feedback of the different control models are prone to erroneous judgments, which leads to inconsistent UAV behavior during formation flight and degrades the coordination of swarm control. Meanwhile, operational complexity has been increasing with the diversity of UAV types, making the problem of controlling UAV swarms in the presence of decision dissonance increasingly severe. In view of this, for the formation control scenario of UAV swarms, this study proposes a UAV swarm control method suitable for complex conditions: the Hybrid Multi-Agent Collaborative Adaptive Swarm Decision-making Method (HACASDM).

3. UAV Swarm Control Methods Under Complex Conditions

This section outlines the proposed collaborative formation control method, details the key technologies involved, and concludes with the pseudo-code for UAV swarm decision-making.

3.1. Problem Statement

UAV swarm formation is a collaboration-based multi-agent system that can be composed of UAVs under ideal conditions as well as UAVs under complex conditions, and its core objective is to efficiently accomplish flight missions through collaborative decision-making [26,27]. Using various types of UAVs in formation for collaborative decision-making not only helps reduce the functional redundancy of swarms under ideal conditions, but also enables more flexible and autonomous decisions by exploiting the differentiated behavior patterns of the UAVs. By giving full play to the respective advantages of different UAV types, combined with precise environmental sensing and efficient information-sharing mechanisms, the swarm can achieve the comprehensive goal of multiplying its effectiveness and significantly improving its response capability in complex environments. To more fully utilize swarm advantages, improve system robustness, and enhance the complementary capabilities between platforms, collaborative decision-making must be enabled within the UAV swarm. The end-edge-cloud architecture is composed of three main components: the UAV swarm in the complex environment, the edge server, and the cloud server. UAV swarms under complex conditions consist of UAVs equipped with sensors of different accuracies; short-term model tuning is performed by edge servers deployed on command vehicles, while cloud servers are responsible for long-term model optimization [28,29,30]. Depending on the geographical location of the devices and the division of roles, these three parts take on different functions and work together to realize intelligent decision-making for UAV swarms. Figure 1 depicts the formation control problem of the UAV swarm.
In complex environments, the extensibility of a unified policy model for formation missions is limited by the significant differences in performance and sensor configuration among the UAV types in a swarm. Meanwhile, in the face of frequently changing temporal and spatial requirements, the flight policy of each UAV must be reasonably coordinated despite differences in the data distributions of behavior and state information, so as to effectively reduce the loss values of individual and inter-individual flight strategies within the swarm. For this reason, multi-modal heterogeneous data must be fully utilized during model training to meet the collaborative decision-making needs of UAV swarms under diverse conditions. However, due to the tightly coupled constraints between UAVs, the adjustment and feedback of the different control models may produce decision misjudgments, weakening the models' adaptive adjustment capability and real-time co-optimization performance in the inference deployment phase. This triggers uncoordinated flight behaviors in the UAV swarm; coupled with the operational complexity that grows with the diversity of UAV types, it produces decision-making dissonance, further exacerbating the difficulty of coordinating and executing policies for UAV swarm formation flight under complex conditions.

3.2. Problem Modeling and Analysis

A swarm system composed of different types of UAVs needs to fully integrate the relevant characteristics of each UAV type in order to realize autonomous decision-making and collaborative working mechanisms that ensure efficient formation flight [31]. To solve the formation flying mission of UAV swarms under complex conditions, the reinforcement learning process of the multi-agent system is usually modeled as a Heterogeneous Multi-Agent Partially Observable Markov Decision Process (HMA-POMDP). Each agent independently chooses actions based on local observations, aiming to optimize the global goal [32]. The HMA-POMDP is represented by the 8-tuple $\langle N, C, S, A, P, R, O, \gamma \rangle$. Here, $N$ is the number of agents; $C$ is the finite set of agent categories, with index $j \in C$ denoting an agent's category; $S = (s_1, s_2, \ldots, s_N)$ is the joint state, where the state $s_i$ of each agent $i$ belongs to its corresponding state space $S_i$; $A$ is the action space of the agents, with $a_i^j$ denoting the action of agent $i$ of category $j$; $P$ is the state transition probability; the reward function $R$ guides the agents' collaborative formation process, where $r$ denotes the reward obtained after executing an action; $O$ is the joint observation set; and $\gamma$ is the discount factor. In this swarm system, each agent obtains rewards by interacting with the environment and maximizes the global expected cumulative reward through collaboration, as expressed in Equation (1):
$$V_h^{\pi}(s) = \mathbb{E}_{\pi = \{\pi_{n,h}\}_{n \in N,\, h \in [H]}} \left[ \sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}) \,\middle|\, s_h = s \right], \tag{1}$$
where $V_h^{\pi}(s)$ represents the expected future return of the agents from state $s$ at time step $h$; $h$ is the current time step; $H$ is the maximum time step of the entire decision-making process; and $\pi$ is the collection of policies describing the decisions made by the agents at each time step. Since the agents have different reward functions and state spaces, they need to adjust their policies through continuous information sharing and collaboration during policy selection and learning in order to realize optimal system performance.
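Equation (1) can be approximated empirically by Monte Carlo rollouts. The following Python sketch illustrates this under an assumed simulator interface (set_state, observe, step) and per-agent policy interface (act); none of these names come from the paper.

    import numpy as np

    def estimate_value(env, policies, state, h, H, num_rollouts=100):
        # Monte Carlo estimate of V_h^pi(s): average return from step h to
        # horizon H under the joint policy collection (Equation (1)).
        # env.set_state / env.observe / env.step and pi.act are an assumed
        # simulator/policy interface, not the paper's API.
        returns = []
        for _ in range(num_rollouts):
            env.set_state(state)          # reset the simulator to state s
            obs = env.observe()           # joint observations, one per agent
            total = 0.0
            for _ in range(h, H):
                actions = [pi.act(o) for pi, o in zip(policies, obs)]
                obs, reward, done = env.step(actions)
                total += reward
                if done:
                    break
            returns.append(total)
        return float(np.mean(returns))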
UAV swarms under complex conditions need to be stabilized in formation through multi-UAV collaborative decision-making, giving full play to the specific capabilities of each UAV in the UAV swarm in order to concentrate the advantageous forces to enhance the flexibility and robustness of the system. When facing complex environments, the adjustment and feedback of control models in UAV swarm are prone to erroneous judgement, leading to dissonance in decision-making and exposing UAV swarms to complex formation control problems. Therefore, the formation control model needs to balance UAV capabilities to coordinate group and individual goals for collaboration.
$$\min \; \sum_{j \in C} \alpha_j L_{scd}^{j}(w_j) + \sum_{j,k \in C} \beta_{j,k} L_{tcd}^{j,k}(w_j, w_k) \quad \text{s.t.} \quad \lim_{t \to \infty} \left\| pos_i^t - pos_u^t \right\| = \delta_{iu}, \;\; \forall u \in N_i, \tag{2}$$
where $\alpha$ and $\beta$ adjust the weights of the individual Self-Cognitive Dissonance (SCD) loss and the Team Cognitive Dissonance (TCD) loss in the overall loss function: $\alpha_j$ is the weighting coefficient of agents of category $j$, and $\beta_{j,k}$ is the weighting coefficient between categories $j$ and $k$. $L_{scd}$ is the individual cognitive dissonance loss, which helps individual agents remain consistent with decision-making goals at different levels by reducing the differences between local and global perceptions. By reducing cognitive inconsistencies between agents, $L_{tcd}$ promotes collaborative work. $w_j$ denotes the model parameters of the independent neural network of category $j$. $pos_i$ and $pos_u$ represent the position information of UAV $i$ and UAV $u$, respectively, and $\delta_{iu}$ is the expected relative position between UAV $i$ and UAV $u$.
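A minimal sketch of how the weighted objective in Equation (2) can be assembled in Python is given below. The dictionary-based interface for the per-category losses and weights is an illustrative assumption, and the formation constraint is assumed to be enforced through the environment reward rather than inside this function.

    def swarm_objective(scd_losses, tcd_losses, alpha, beta):
        # Weighted objective of Equation (2): per-category SCD losses plus
        # pairwise TCD losses. scd_losses[j] = L_scd^j(w_j);
        # tcd_losses[(j, k)] = L_tcd^{j,k}(w_j, w_k). The dictionary
        # interface is an illustrative assumption, not the paper's API.
        total = sum(alpha[j] * loss for j, loss in scd_losses.items())
        total += sum(beta[pair] * loss for pair, loss in tcd_losses.items())
        return total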
In UAV swarms, UAVs with different capabilities, sensors, and maneuverability are typically assigned different behavior roles. For example, larger UAVs can carry more sensor equipment, while smaller UAVs specialize in reconnaissance or traversing narrow spaces. Appropriate utilization of variability can enhance the efficiency and flexibility of UAVs in mission execution. The problem of decision dissonance in multi-agent collaborative decision-making under complex conditions is rooted in the decision-maker’s unconscious or conscious modeling of various mission roles with biases. When each individual agent makes decisions, it is often influenced by its own experience and previous behavior, including historical choices and feedback. The deviation of behavioral roles affects the degree of matching between individual choices and inner beliefs, which leads to misjudgment of the control policy model in the adjustment and feedback process. Therefore, the decision-making of multi-agents is based on the accurate allocation of different task roles, and behavior representation is particularly important in this process.
$$z_i \sim \mathcal{N}\big(f(s_i; \theta_p)\big), \tag{3}$$
where the behavior role $z_i$ of each agent $i$ is sampled from a multivariate Gaussian distribution $\mathcal{N}$ whose mean and variance are determined by the agent's state information $s_i$ and the parameters $\theta_p$ of the neural network $f$.
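The sampling step of Equation (3) corresponds to a reparameterized draw from a state-conditioned Gaussian. The following PyTorch sketch shows one possible encoder; the layer sizes and the diagonal-covariance choice are assumptions for illustration, not the paper's architecture.

    import torch
    import torch.nn as nn

    class RoleEncoder(nn.Module):
        # Maps an agent's state s_i to the mean and log-variance of a
        # diagonal Gaussian, then samples a behavior role z_i (Equation (3)).
        # Hidden sizes are illustrative assumptions.
        def __init__(self, state_dim, role_dim, hidden=64):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
            self.mu = nn.Linear(hidden, role_dim)
            self.log_var = nn.Linear(hidden, role_dim)

        def forward(self, s):
            h = self.net(s)
            mu, log_var = self.mu(h), self.log_var(h)
            std = torch.exp(0.5 * log_var)
            return mu + std * torch.randn_like(std)   # reparameterized sample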
In swarm systems of agents under complex conditions, improper distribution of behaviors can lead to the failure of agents to achieve results matching their capabilities or requirements when performing actions, thus increasing the decision-making burden. Behavior deviations will be gradually accumulated and amplified in the feedback loop of the agents, which ultimately negatively affects the overall performance of the system. Therefore, a reasonable role allocation is crucial to ensure that each agent chooses the most appropriate behavior based on its capabilities and current state, which in turn optimizes the system to complete the overall decision-making process.

3.3. System Structure

To realize the intelligence of UAV swarms and enhance the efficiency of each agent in group decision-making, it is necessary to define functional roles, based on swarm state information, as the behavior logic and specific application modes that different types of agents exhibit in the complex flight space. These behavior characteristics reflect the responsiveness and adaptability of the various UAV devices to internal and external environmental stimuli. Then, based on the mission requirements and the capability differences of the agents, each individual in the UAV swarm is allocated a reasonable role to ensure that it undertakes duties matching its capabilities and can switch roles flexibly in different scenarios. Finally, through an adaptive shared learning mechanism, the multi-agent system continuously performs trial-and-error and policy adjustment to learn the optimal modeling policy, thereby realizing dynamic role allocation and collaborative division of labor in multi-UAV swarms more efficiently.
In the training phase of the model, efficient formation control of UAV swarms requires coordinating three key links, behavior representation, behavior allocation, and behavior learning, while taking full account of each UAV's variability in perception accuracy, decision-making mode, and action capability. In the behavior representation phase, the system generates latent behavior codes by optimizing each agent's modeling policy, accurately delineating each UAV's capability boundary and behavior pattern, and provides inputs for the behavior combinations of the swarm formation. The behavior allocation phase is oriented to the overall formation goal, combining the diversity and complexity of the input data to reasonably assign behavior roles that fit the agents' characteristics and optimize formation efficiency. In the behavior learning phase, based on the allocation results, the system dynamically adjusts the neural network parameters by balancing individual and team cognitive dissonance to maximize the overall system return and ensure that individual behaviors align collaboratively with the swarm formation goals. Figure 2 depicts the model training process of the algorithm.

3.4. Adaptive Behavior Matching Based on Dual-Layer Imitation Learning

In agent swarm systems under complex conditions, reliable behavior matching can optimize the collaboration efficiency of swarm formations in complex mission objectives by coordinating the behavior patterns among individuals within the swarm, and maintain the stability of the swarm system by flexibly adjusting the behavior matching policy, thus enhancing the adaptive ability of the swarm system to environmental changes. Behavior matching covers the complete process from behavior representation to behavior allocation. Figure 3 depicts the behavior representation process based on dual-layer imitation learning.
Imitation learning achieves superior performance in UAV swarm formation missions by leveraging expert samples to learn effective behavior strategies [33,34]. For the UAV swarm formation mission, behavior representation imitates expert behaviors in the formation by drawing on expert knowledge in the form of behavior cloning. However, under complex conditions where UAV swarms collaborate in formation to accomplish missions, and subject to the limited coverage of expert data, the learner may generate actions that are not applicable to all types of UAVs, entering state regions not covered by the expert data; this triggers distributional bias and gradually accumulates errors. Although imitation learning assumes that the learner's data distribution will eventually converge to that of the expert, existing methods lack effective mechanisms to identify and correct the distributional biases between the two. To this end, this study designed a dual-layer imitation learning structure that dynamically adjusts the optimization direction through explicit guidance strategies to efficiently generate collaborative strategies that meet the global objective. Meanwhile, to minimize the policy performance degradation caused by distributional bias, this study introduced an improved cross-entropy to bring the learners' state representations as close as possible to the behavior patterns of the expert samples.
In the UAV swarm formation mission, behavior representation constructs the teacher–student dataset using an offline sampling method. Specifically, the teacher dataset is generated from expert demonstrations and covers collaborative interaction data such as formation pattern information, mission trajectory information, and individual state information, while the dataset generated by the UAVs' own operation is treated as the student dataset, recording the behavior characteristics of the individuals and their interactions with other members during execution of the formation mission. Figure 4 demonstrates the process of a swarm composed of different types of UAVs completing the formation under expert guidance.
Multi-objective optimization is slow and complex in high-dimensional spaces. From a given set of teacher data $D_{tea}$ and student data $D_{stu}$, sub-trajectories $\tau_{tea}$ and $\tau_{stu}$ are sampled. The potential behavior representations $z_{lat}^{tea}$ and $z_{lat}^{stu}$ of teachers and students are obtained by mapping the discrete skills $z_j$ to the continuous latent space through the encoding function $f_e(\cdot; \theta_e)$. Then, the observation $o$, action $a$, and potential behavior representation $z_{lat}$ are combined into input vectors, which are passed to the decoders $f_r(\cdot)$ and $f_o(\cdot)$, respectively, to estimate the environmental reward $\hat{r}$ and environmental observation $\hat{o}$. Equation (4) balances the accuracy and diversity of behavior generation by coordinating multi-objective constraints in the latent space, aiming to improve overall fitness.
$$\begin{aligned} L_{imp}(\theta_e, \theta_r, \theta_o) = \; & \mathbb{E}\Big[ \textstyle\sum_i \big( \mathrm{KL}\big(f_o(z_{lat}^{tea}, o_i^{tea}, a_i^{tea}) \,\big\|\, \hat{o}_i^{tea}\big) + \mathrm{KL}\big(f_o(z_{lat}^{stu}, o_i^{stu}, a_i^{stu}) \,\big\|\, \hat{o}_i^{stu}\big) \big) \Big] \\ & + \lambda_1 \textstyle\sum_i \Big( \big( f_r(z_{lat}^{tea}, o_i^{tea}, a_i^{tea}) - \hat{r}_i^{tea} \big)^2 + \big( f_r(z_{lat}^{stu}, o_i^{stu}, a_i^{stu}) - \hat{r}_i^{stu} \big)^2 \Big) \\ & - \lambda_2 \textstyle\sum_{j} \cos\big( f_e(z_j^{tea}), f_e(z_j^{stu}) \big), \end{aligned} \tag{4}$$
where the environmental reward $\hat{r}_i$ and environmental observation $\hat{o}_i$ are the teacher network's predicted values for the environmental dynamics and behavioral outcomes of agent $i$. $L_{imp}$ is a loss function designed to improve the accuracy of behavior representation learning and the alignment performance. The first term computes the KL divergence between the teacher network and the student network, encouraging the decoder $f_o(\cdot)$ to accurately predict observations from the latent representations $z_{lat}^{tea}$ and $z_{lat}^{stu}$; the second term controls the reward prediction error through the weighting parameter $\lambda_1$, prompting the student network to imitate teacher behavior; and the third term, weighted by $\lambda_2$, maintains the consistency and diversity of the latent behavior representations. Finally, the behavior characteristics are captured by optimizing the encoder $f_e(\cdot; \theta_e)$ with respect to the prediction errors. Through repeated training and iterative optimization, the behavior characteristics of the teacher network are gradually transformed into forms that the student network can understand and imitate, so as to align the latent behaviors in the implicit space and improve the model's accuracy in modeling environmental changes and the differentiated behaviors of agents.
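The three terms of Equation (4) can be assembled as below. This is a sketch under the assumption that the observation decoder outputs log-probabilities (so the KL terms can use torch.nn.functional.kl_div); tensor shapes, argument names, and the default weights are illustrative, not the paper's implementation.

    import torch.nn.functional as F

    def implicit_loss(obs_logp_tea, obs_tgt_tea, obs_logp_stu, obs_tgt_stu,
                      r_pred_tea, r_tgt_tea, r_pred_stu, r_tgt_stu,
                      z_tea, z_stu, lam1=1.0, lam2=0.1):
        # Sketch of L_imp in Equation (4). obs_logp_* are assumed to be
        # log-probabilities from the decoder f_o; obs_tgt_* are the target
        # observation distributions.
        kl = (F.kl_div(obs_logp_tea, obs_tgt_tea, reduction='batchmean')
              + F.kl_div(obs_logp_stu, obs_tgt_stu, reduction='batchmean'))
        mse = (F.mse_loss(r_pred_tea, r_tgt_tea)       # reward term, weighted by lam1
               + F.mse_loss(r_pred_stu, r_tgt_stu))
        cos = F.cosine_similarity(z_tea, z_stu, dim=-1).mean()  # skill consistency
        return kl + lam1 * mse - lam2 * cos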
By adopting a dynamic balancing policy, the agents can search for the optimal solution in a unified policy space and coordinate the mission requirements, thus realizing dynamically balanced behavior generation and goal adaptation. This avoids action anomalies caused by inconsistencies between multiple goals, makes the generated actions smoother, improves global adaptability, and reduces the probability of entering uncovered state regions.
During the explicit alignment process, the teacher behavior representation $z_{lat}^{tea}$ and student behavior representation $z_{lat}^{stu}$ are mapped to a set of learnable behavior prototypes $\{z_k\}_{k=1}^{K}$, where each behavior prototype $z_k$ represents a different skill category. By minimizing the cross-entropy loss $L_{exp} = -\mathbb{E}_{ij} \sum_{k=1}^{K} p_{ij}^{k} \log q_{ij}^{k}$, the difference between the predicted probability distribution $q_s = \{q_s^k\}_{k=1}^{K}$ and the target probability distribution $p_t = \{p_t^k\}_{k=1}^{K}$ is minimized, achieving an effective alignment of the student and teacher behavior representations [35].
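A common way to realize this prototype-based alignment is to project both representations onto the prototype matrix and minimize the cross-entropy between the student's softmax distribution and a teacher-derived target, as sketched below; the temperature parameter and the stop-gradient on the teacher branch are assumptions borrowed from standard prototype-alignment practice, not details given in the paper.

    import torch
    import torch.nn.functional as F

    def explicit_loss(z_stu, z_tea, prototypes, temperature=0.1):
        # Score each representation against K learnable prototypes, then
        # minimize cross-entropy between the student distribution q and
        # the teacher-derived target p (cf. L_exp above).
        logits_stu = z_stu @ prototypes.t() / temperature        # (B, K)
        with torch.no_grad():                                    # assumed stop-gradient
            p = F.softmax(z_tea @ prototypes.t() / temperature, dim=-1)
        log_q = F.log_softmax(logits_stu, dim=-1)
        return -(p * log_q).sum(dim=-1).mean()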
Combining the dual-layer optimization structure with implicit and explicit alignment enhances the expressive ability of the overall model: it both improves the student network's ability to imitate the teacher network at the feature level and further optimizes its behavior representation at the output level through explicit alignment. Equation (5) describes the dual-layer optimization structure.
$$\min_{\theta} \; \underbrace{L_{exp}\big(\theta, \omega^*(\theta)\big)}_{\text{Explicit optimization}} \quad \text{subject to} \quad \omega^*(\theta) = \underbrace{\arg\min_{\omega} L_{imp}(\theta, \omega)}_{\text{Implicit optimization}}, \tag{5}$$
where $\theta$ denotes the network parameters of the explicit optimization problem, $\omega$ denotes the network parameters of the implicit optimization problem, and $\omega^*(\theta)$ is the optimal solution of the implicit alignment process.
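In practice, bilevel problems of the form of Equation (5) are often approximated by alternating optimization, truncating the inner arg min to a few gradient steps. The sketch below assumes closures implicit_loss_fn and explicit_loss_fn that evaluate $L_{imp}$ and $L_{exp}$ on a batch; the truncation and optimizer choices are assumptions, not the paper's stated procedure.

    import torch

    def train_dual_layer(theta_params, omega_params, loader,
                         implicit_loss_fn, explicit_loss_fn, inner_steps=5):
        # Alternating approximation of Equation (5): a few inner gradient
        # steps on L_imp over omega, then one outer step on L_exp over
        # theta. The loss closures and learning rates are illustrative.
        opt_omega = torch.optim.Adam(omega_params, lr=1e-3)
        opt_theta = torch.optim.Adam(theta_params, lr=1e-3)
        for batch in loader:
            for _ in range(inner_steps):           # implicit: min over omega
                opt_omega.zero_grad()
                implicit_loss_fn(batch).backward()
                opt_omega.step()
            opt_theta.zero_grad()                  # explicit: min over theta
            explicit_loss_fn(batch).backward()
            opt_theta.step()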
Behavior allocation must consider UAV dynamics and adaptability, which are often overlooked in current schemes. In view of this, this study proposed a behavior allocation method based on an adaptive feature embedding mechanism.
Figure 5 demonstrates the behavior allocation process, which maps different types of input data into shared high-dimensional representations while dynamically adjusting each agent's attention weights and assigning the most appropriate combination of behaviors to each agent in an autoregressive manner. The observation sequences $(o_1, o_2, \ldots, o_n)$ of the agents are first embedded into vector representations of the same dimension $(e_1, e_2, \ldots, e_n)$ and promoted to different levels of characterization by a state encoder with a multi-layer attention mechanism [36]. By processing each embedding vector $e_i$, the encoded state vectors $(\hat{o}_1, \hat{o}_2, \ldots, \hat{o}_n)$ are generated for subsequent value function approximation. Then, the autoregressive process combines the behavior choices $(z_{t-m}, z_{t-m+1}, \ldots, z_t)$ from the previous $m$ steps with the current observation sequence to participate in the behavior allocation decision. The output of the behavior decoder is used to generate a behavior allocation policy $\pi_h$, which ensures that the behavior sequence $\hat{Z} = (z_1, z_2, \ldots, z_n)$ has historical consistency and dynamic adaptive capability, thus realizing collaborative decision-making in UAV swarms. Equation (6) gives the behavior allocation policy for heterogeneous multi-agents, as shown below:
$$\pi_h\big(z_{1:N} \mid O, z_{t-m:t-1}\big) = \prod_{i=1}^{N} \pi_h\big(z_i \mid O, z_{1:i-1}, z_{t-m:t}\big), \tag{6}$$
where $z_{1:N}$ denotes the latent behavior variables assigned to all agents, $O$ is the observation sequence of all agents, $z_{1:i-1}$ denotes the latent behavior variables of the 1st through $(i-1)$th agents, $z_{t-m:t}$ denotes the historical behavior assignments of the previous $m$ steps, and $\pi_h$ is the generated behavior allocation policy. To enhance the behavior diversity of UAV swarms, this paper introduces a behavior diversity index $r_i^{int}$ to assess the rationality of behavior allocation based on global implicit information and local observations [37].
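The factorization in Equation (6) can be realized with an attention-based encoder and an autoregressive skill head, as sketched below. The network sizes, the way history is summarized, and the categorical skill head are illustrative assumptions rather than the paper's exact architecture.

    import torch
    import torch.nn as nn

    class BehaviorAllocator(nn.Module):
        # Encode all agents' observations with self-attention, then assign
        # a discrete behavior to each agent conditioned on earlier
        # assignments z_{1:i-1} and the last m steps of history (Eq. (6)).
        def __init__(self, obs_dim, n_skills, d_model=64, m_hist=4):
            super().__init__()
            self.embed = nn.Linear(obs_dim, d_model)
            self.encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
                num_layers=2)
            self.skill_embed = nn.Embedding(n_skills, d_model)
            self.head = nn.Linear(2 * d_model, n_skills)

        def forward(self, obs, hist_skills):
            # obs: (B, N, obs_dim); hist_skills: (B, m) past skill ids
            enc = self.encoder(self.embed(obs))          # (B, N, d)
            ctx = self.skill_embed(hist_skills).mean(1)  # history context (B, d)
            skills = []
            for i in range(obs.size(1)):                 # autoregressive over agents
                logits = self.head(torch.cat([enc[:, i], ctx], dim=-1))
                z_i = torch.distributions.Categorical(logits=logits).sample()
                skills.append(z_i)
                ctx = ctx + self.skill_embed(z_i)        # condition on z_{1:i}
            return torch.stack(skills, dim=1)            # (B, N)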
The adaptive behavior matching method based on dual-layer imitation learning can ensure that the agents take on the appropriate responsibilities according to their own abilities, and provide clear learning goals for behavior learning of multi-agents. This method effectively enhances the comprehensive understanding of state information and the rapid decision-making ability of UAV swarms under complex conditions.

3.5. Behavior Learning Based on Cognitive Dissonance Optimization

In UAV swarm formation missions, each UAV under complex conditions may perceive and understand state information differently, and such differences lead to cognitive inconsistency within the swarm, which in turn weakens the overall collaborative effect. To address this problem, this study proposed a behavior learning method based on cognitive dissonance optimization.
Figure 6 shows the specific behavior learning process. This method aims to effectively align the global information of different agents and reduce cognitive dissonance among them by reducing the difference between the cognitive results from the local and global perspectives, ensuring that the agents can make optimal decisions based on their own characteristics and thus enhancing the cooperative capability of UAV swarm formations in complex environments.
In the behavior allocation phase, the multi-agent system assigns differentiated behaviors $\hat{Z}$ to each agent. The behavior learning phase then clarifies each agent's learning goals and makes full use of its unique strengths. The Critic network generates a global target probability distribution $p(g_{glb}^i \mid O, A, Z)$ for agent $i$ based on the global state $S$, the observations $O = (o_1, o_2, \ldots, o_N)$, and the joint action $A = (a_1, a_2, \ldots, a_N)$. Since each agent cannot know the real target distribution in advance, the probability distribution $p(s_t \mid g_t, z_t)$ generated from the global target $g_{glb}^i$ is used as an approximate estimate, and the network parameters are updated by minimizing the team cognitive dissonance loss $L_{TCD}^{cld}$:
$$\min \sum_{i \neq u} \mathrm{KL}\Big( p\big(g_{glb}^i \mid O, A, z_j; \varrho_i\big) \,\Big\|\, p\big(g_{glb}^u \mid O, A, z_k; \varrho_u\big) \Big), \tag{7}$$
where $cld$ indicates that the process runs on the cloud server, aiming to enhance the consistency of team collaboration by ensuring that all agents unify the target cognition of the heterogeneous multi-agents from a global perspective. The global target probability distributions of agents $i$ and $u$ are $p(g_{glb}^i \mid O, A, z_j; \varrho_i)$ and $p(g_{glb}^u \mid O, A, z_k; \varrho_u)$, where $z_j$ and $z_k$ are the results of the behavior allocation phase and $\varrho$ denotes the parameters of the independent neural network model for each type of agent.
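Equation (7) sums pairwise KL divergences between the agents' global-target distributions. A direct sketch, under the assumption that each distribution is available as an explicit probability vector:

    import torch.nn.functional as F

    def tcd_loss(global_dists):
        # Team cognitive dissonance (Equation (7)): sum of pairwise KL
        # divergences between the agents' global-target distributions.
        # Each entry is assumed to be an explicit probability vector.
        loss = 0.0
        n = len(global_dists)
        for i in range(n):
            for u in range(n):
                if i != u:
                    # KL(p_i || p_u); F.kl_div takes log-probs as first argument
                    loss = loss + F.kl_div(global_dists[u].log(),
                                           global_dists[i], reduction='sum')
        return loss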
When agents in a multi-agent system encounter cognitive inconsistencies, they rebuild internal consistency and stability by minimizing these inconsistencies. To leverage the behavioral advantages of coordinated collaboration among the multi-agents from a local perspective, agent $i$ selects an action $a_i$ based on its assigned behavior $z_j$ and local observation $o_i$ using its policy parameters $\theta_i$. Through real-time interaction with the environment, the agent obtains immediate rewards $r_i$ and local observations $o_i'$ of the next state, which are used for policy updates. Each agent's policy update occurs from a local perspective and is reflected through a self-cognitive dissonance loss function. To measure the discrepancy between the agent's local behavioral choices and the globally expected behavior, ensuring the local behavior remains as consistent with the global behavior as possible, the self-cognitive dissonance loss $L_{SCD}^{cld}$ is expressed as
$$\min \; \mathrm{KL}\Big( p\big(g_{loc}^i \mid o_i, z_j; \sigma_i\big) \,\Big\|\, p\big(g_{glb}^i \mid O, A, z_j; \sigma_i\big) \Big), \tag{8}$$
where $p(g_{loc}^i \mid o_i, z_j; \sigma_i)$ is the local target probability distribution of agent $i$, $p(g_{glb}^i \mid O, A, z_j; \sigma_i)$ is its global target probability distribution, $z_j$ is the behavior allocation result of agent $i$, and $\sigma_i$ denotes the independent neural network model parameters of agent $i$. Agent $i$'s local policy updates gradually approach optimality by improving the local $Q_i$ value, with the ultimate goal of enhancing overall system performance via policy learning, despite considering only its own state and behavior from a local perspective.
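Equation (8) is a single KL term per agent between its local and global target distributions; a corresponding sketch under the same probability-vector assumption:

    import torch.nn.functional as F

    def scd_loss(local_dist, global_dist):
        # Self cognitive dissonance (Equation (8)):
        # KL(p(g_loc | o_i, z_j) || p(g_glb | O, A, z_j)), with both
        # distributions assumed to be explicit probability vectors.
        return F.kl_div(global_dist.log(), local_dist, reduction='sum')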
During training, agent $i$ computes the individual and team cognitive dissonance losses $L_{SCD}^{cld}$ and $L_{TCD}^{cld}$ through continuous interaction with the environment. During optimization, the policy gradient $\nabla_{\sigma_i}^{edg} J_{tot}(\sigma_i)$ incorporates local policy gains and cognitive dissonance uniformly into the parameter update, thus balancing individual and team behavior constraints under local observation conditions. Here, $edg$ denotes the edge server execution process, and $\nabla_{\sigma_i} J(\sigma_i)$ is the policy gradient updated by agent $i$ based on the policy parameters $\sigma_i$, mainly used to optimize the total rewards obtained by the agent under local conditions.
In summary, this study proposed a UAV swarm control method applicable to complex conditions for the formation control problem of UAV swarms under decision-making dissonance. Through the implicit and explicit alignment processes of dual-layer imitation learning, it enhances the student network's understanding and imitation of formation behaviors, and achieves flexible division of labor and stronger swarm formation collaboration through the adaptive feature embedding mechanism. Meanwhile, by combining cognitive dissonance optimization, it balances individual and team cognitive dissonance losses, reinforces complementary capabilities, effectively mitigates decision-making dissonance, and meets the demand for collaborative control in complex environments. Through the organic integration of an end-edge-cloud collaborative architecture, a unified decision-making process is established, enhancing the coordinated control of UAV swarms.

3.6. Algorithm Pseudo-Code

A swarm of $N$ UAVs operates in a complex environment, with sensors capturing the state observations $O$ and action sequences $A$. Algorithm 1 describes the formation control decision-making method for hybrid UAV swarms.
Algorithm 1. Formation control decision-making method for Hybrid UAV swarm.
function EdgeServer($O$, $A$):
      Compute swarm latent behavior variables $Z$ from the teacher and student datasets, with respect to Equation (5)
      for k = 0 to max_episode do
            env.reset()
            for t = 0 to train_steps_limit do
                  Collect global state $S$ and agents' partial observations $O$ from the environment
                  if t mod k = 0 then
                        Sample skills $z_{1:N} \sim \pi_h(z_{1:N} \mid O, z_{t-m:t-1})$, with respect to Equation (6)
                  for each agent $i$, choose action $a_i \sim \pi_i$, then extract the global action feature from the environment
                  Concatenate $a_t^i$, $i \in [1, \ldots, N]$ into $a_t$
                  Apply $a_t$ to the UAV swarm graph to obtain $s_{t+1}$, and save the state–action history $\tau$
                  Compute the intrinsic reward $r_i^{int}$ for each UAV
                  Store $(\tau, \hat{Z}, a, r, \tau')$ in replay buffer $D$
            $\pi$ = CloudServer($D$)
      return $a_t$ based on policy $\pi$

ray.init(address=CloudServer_config['cloud_node_ip_address'])
@ray.remote
function CloudServer($D$):
      if $|D|$ > batch_size then
            for t = 1 to T do
                  Sample minibatch $B$ from $D$; generate flight state information $s$
                  Update $\varrho$ by minimizing $L_{TCD}(\varrho)$, with respect to Equation (7)
                  Update $\sigma$ by minimizing $L_{SCD}(\sigma)$, with respect to Equation (8)
                  Update the policy network $\pi$
      return $\pi$
The algorithm consists of two functions, EdgeServer and CloudServer, which run on edge and cloud servers, respectively, and it utilizes the Ray framework to implement mission-based parallel computing [39]. The method first uses the teacher–student dataset to capture potential patterns in UAV behaviors, extracts differentiated behavior features, and explores diverse behavior combination scenarios to meet the dynamic demands of UAV swarms in formation missions. To handle the differences in perception accuracy and action capability among UAVs, the algorithm dynamically assigns behavior schemes to ensure the stability and collaborative efficiency of the swarm formation. Finally, the Critic network is trained by sampling from the experience replay buffer and minimizing the TD error. This process refines the neural network parameters and enhances the swarm system's decision-making efficiency and overall performance in complex scenarios, aligning with the mission objectives of swarm formation. It is worth noting that during collaborative decision-making, the behavior of the UAV swarm often exhibits non-determinism and complexity, manifested mainly in dynamic changes in task requirements, the gradual accumulation of perception errors, and the dynamic adjustment of collaborative relationships. Meanwhile, the effectiveness of optimization algorithms in addressing these challenges depends heavily on the specific characteristics of the problem, such as the problem scale, the diversity of the objective functions, and the non-linearity of the constraints [40,41]. Designing optimization strategies tailored to different problem types is crucial for enhancing the efficiency of collaborative decision-making in UAV swarms.

4. Experimental Analysis

This study focuses on the model training phase, using UAV swarm formation control as a research case. A complex simulation environment was designed to evaluate the performance of formation control based on the multi-agent swarm decision-making approach [42,43], with a focus on the following two research indicators:
  • Formation stability: evaluates the relative position and attitude stability of the various UAV types in the swarm. Because different UAV types differ in flight characteristics and other respects, stability is measured by position deviation and attitude deviation, ensuring that the swarm remains smooth and steady during flight (a sketch of the position-deviation computation follows this list).
  • Formation integrity: evaluates whether the UAV swarm maintains its formation during flight, aiming to ensure that, under interference from the external environment, the various UAV types can still hold the overall formation, collaborate to accomplish the mission, and improve execution efficiency.
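As an illustration of the formation stability indicator, the sketch below computes a mean position deviation against the desired relative offsets $\delta_{iu}$. This concrete formula is an assumption consistent with the description above, not the paper's exact definition; attitude deviation would be computed analogously from orientation logs.

    import numpy as np

    def formation_stability(positions, desired_offsets):
        # Mean deviation of pairwise relative positions from the desired
        # offsets delta_iu. positions[t, i] is UAV i's 3-D position at
        # time t; desired_offsets[i, u] is the expected relative position.
        T, N, _ = positions.shape
        errs = []
        for t in range(T):
            for i in range(N):
                for u in range(N):
                    if i != u:
                        rel = positions[t, i] - positions[t, u]
                        errs.append(np.linalg.norm(rel - desired_offsets[i, u]))
        return float(np.mean(errs))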
To validate the proposed HACASDM, its performance was compared with five swarm control algorithms: two similar swarm control methods (HMAPF and HACVF) and three heterogeneous multi-agent deep reinforcement learning baselines, Heterogeneous-Agent Proximal Policy Optimization (HAPPO), Heterogeneous-Agent Twin Delayed Deep Deterministic Policy Gradient (HATD3), and Heterogeneous-Agent Deep Deterministic Policy Gradient (HADDPG), using the open-source simulation software Pybullet-3.2.5 under identical experimental conditions [44].

4.1. Experimental Setup

Using the UAV simulator toolkit from Pybullet-3.2.5, this study developed a UAV swarm simulation platform on an i7-1260P 2.4 GHz CPU, 16 GB RAM, GeForce RTX 3060 GPU, and Ubuntu 18.04. Python 3.9 was used to create intelligent decision models for collaborative training on edge and cloud servers, each with differing hardware and network setups. Data synchronization and collaborative processing were achieved through communication protocols connecting the three system components. Table 1 outlines the hardware and software configurations used in the experiments.
Pybullet-3.2.5 supports simulating UAVs with diverse hardware, sensors, and flight capabilities, along with a reinforcement learning-friendly API. Compared to real-world UAV training, the virtual simulation environment offers faster training and greater flexibility. In complex scenarios, UAV speed is constrained by energy limits and physical laws, making data collection time-intensive and hindering rapid model iteration and convergence. In addition, the simulation environment not only allows for the simulation of heterogeneous UAV swarms composed of different parameter configurations, but also provides a more comprehensive assessment of the swarm’s flight performance by adding obstacles. These elements effectively simulate the flight behavior of UAV swarms in complex environments. For example, sensors with varying levels of accuracy may impact the detection range and recognition capabilities of obstacles, especially in narrow spaces or complex terrain. Heterogeneous UAV swarms, through the complementary collaboration of their sensors, can compensate for the limitations of single-type sensors, enhancing overall performance. By configuring heterogeneous UAV swarms in the Pybullet-3.2.5 virtual simulation environment, it is possible to replicate their flight performance and decision-making capabilities in complex scenarios.
In a UAV swarm formation consisting of three low-resolution camera UAVs and two high-resolution camera UAVs, the low-resolution camera UAVs have difficulty accurately recognizing smaller or longer-range obstacles; however, they offer faster image processing and data transfer for mission scenarios that require a rapid response. In contrast, UAVs with high-resolution cameras can recognize smaller obstacles and complex terrain details, but the larger data volumes they generate make them less suited to missions with strict real-time demands [45]. During swarm flight, the low-resolution and high-resolution camera UAVs can effectively complement each other's strengths and weaknesses through collaborative sensing and information sharing, thereby improving the mission completion efficiency of the entire swarm system. Table 2 lists the parameter settings of each UAV in the UAV swarm.
The UAV swarm experiment designed in this study consists of five multi-rotor UAVs ($UAV_1$ to $UAV_5$). Among them, the mass of a UAV directly affects its flight performance, and lighter UAVs are usually easier to control flexibly. The heterogeneity of UAV sizes places different requirements on swarm control; e.g., small UAVs can maneuver in tight spaces, while large UAVs require more room for obstacle avoidance. The camera resolution (in megapixels, MP) determines image clarity: the higher the resolution (e.g., 0.3 MP or 0.72 MP), the more detail the camera can capture. The camera viewing angle reflects the breadth of the area it can cover, with wide-angle cameras suitable for wide-area surveillance and narrow-angle cameras better suited to capturing detailed information. The maximum angular velocity is an important indicator of whether a UAV can quickly adjust its heading in response to changes in the external environment.
In complex environments, multi-agent collaborative control significantly enhances the operational efficiency of UAV swarms. Equipped with diverse sensors, UAVs can gather essential data, including their position, heading, and external environmental information. Such multi-source heterogeneous information is crucial for the swarm control policy of multi-agent systems. It encodes differentiated behaviors through a variety of external information collected and rationally assigns behaviors through collaborative mechanisms within the swarm. Finally, the swarm system formulates an independent formation control policy for each UAV through behavior learning, thereby effectively solving the formation control problem of multi-agent under complex conditions.

4.2. Experimental Results

According to the experimental results, when the UAV swarm performs flight missions under complex conditions, the various UAV types can cooperate to maintain the established formation and fly smoothly; when the flight environment changes, the UAVs dynamically adjust the formation structure according to their own sensor detection abilities and flight performance; and once the changes subside, the swarm quickly restores the established formation and continues flying smoothly to the destination.
UAV swarms under complex conditions can continuously learn different behavior models through trial and error based on the differentiated characteristics and historical learning information of each agent. To shorten the time spent exploring the environment and learning, the swarm can imitate and exploit schematic trajectories provided by human experts as guidance, combined with supervised learning on existing empirical data. Figure 7 demonstrates that, guided by imitation learning, the agents learn a greater variety of potential behaviors, enhancing their ability to adapt to more diverse scenarios.
Figure 7a contains only three potential skills (No. 0 to 2), while the potential skills in Figure 7b increase to six (No. 0 to 5), indicating a significant increase in the diversity of behaviors learned by the agent swarm after guidance through imitation learning. Moreover, the skill effects are more refined and closer to the human expert's schematic trajectories, allowing the model to learn finer-grained skills. Such skill diversity enables the agent swarm to adapt to more complex mission requirements: each agent can use different behaviors to adapt to complex environments and better meet specific mission needs. By comprehensively considering the differentiated behavior preferences and habits of each agent, swarm combinations can use the learned skills to enhance their environmental exploration capabilities and ultimately obtain the optimal formation control policy that maximizes the future cumulative reward.
There may be complex correlations between the behavioral variables of multi-agent systems, and analyzing a single behavior variable in isolation often fails to fully uncover the information contained in the data. Furthermore, excessively simplifying the analysis dimensions may result in the loss of key features, affecting the comprehensive understanding of the agents’ collaborative mechanisms and even leading to misleading conclusions. To address this, this study employs principal component analysis (PCA) to reduce the dimensionality of the multi-dimensional data, extract the primary trends of variation, and reveal the behavioral differences of multi-agents in task scenarios, thereby visually demonstrating the individualized behavior patterns between agents [46]. Figure 8 shows the distribution characteristics of different types of agents and their differentiated behavior patterns, reflecting their collaborative and independent traits during task execution.
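For reference, the projection used for such visualizations can be produced with a few lines of scikit-learn; the feature matrix below is a placeholder assumption standing in for the recorded per-agent behavior features.

    import numpy as np
    from sklearn.decomposition import PCA

    # Project per-agent behavior features (one row per recorded sample)
    # onto the first two principal components for plotting.
    features = np.random.randn(500, 16)        # placeholder behavior features
    coords = PCA(n_components=2).fit_transform(features)
    # coords[:, 0] / coords[:, 1] can then be scattered per behavior label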
In Figure 8b, the newly added behaviors 4 and 5 are clearly distributed in the feature space, indicating that imitation learning has introduced new behavioral patterns, significantly enriching the agent swarm's behavioral repertoire during task execution. Additionally, behavior 3 in Figure 8b is positioned more toward the lower right compared to its location in Figure 8a, suggesting an increased independence of this behavior in the feature space. This may correspond to optimizations in strategy or task responsiveness in specific scenarios, making the division of labor among agents more clearly defined. Overall, imitation learning has enhanced the behavioral distribution characteristics of the agents, making the behavioral patterns clearer and more distinguishable, effectively improving the agents' collaborative efficiency and task adaptability.
By using time series analysis, the changes in potential behaviors at different time steps can be visualized, which helps to observe trends, fluctuations, and patterns of change in the data [47]. In addition, comparing the behavior differences of different agents makes it possible to analyze their dynamic relationships and mutual influences, so as to optimize the collaborative policies of the agents. Figure 9 illustrates the time series of potential skill usage, which provides a reliable basis for decision-making in the collaborative optimization of multi-agent swarms.
Figure 9 demonstrates the potential behavior changes of the five agents at different time steps. In the swarm formation mission, depending on the mission requirements, each agent selects specific potential behaviors at different time steps, demonstrating the diversity of behavior selection patterns. For example, the agent UAV1 showed frequent behavior adjustments in the first 60 time steps, displaying a high degree of flexibility, while UAV5 was more stable in its behavior choices, showing a focus on specific behavior patterns. In addition, there are significant differences in behavior patterns between different agents. For example, UAV2 showed significantly different behavior choices than UAV3 and UAV4 in a given time interval. Such diversity of behavior choices not only reflects the differences in the roles of the agents in the swarm formation mission, but also provides a guarantee for UAV swarms to achieve efficient collaboration and flexible control under complex conditions.
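As a minimal illustration of such a time-series view, the sketch below plots a piecewise-constant skill index per agent over time; the five skill sequences are randomly generated stand-ins for the logged selections shown in Figure 9.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in data: the latent skill index chosen by each of five agents
# at every time step (randomly generated for illustration only).
rng = np.random.default_rng(1)
T, n_agents, n_skills = 120, 5, 6
skills = rng.integers(0, n_skills, size=(n_agents, T))

fig, axes = plt.subplots(n_agents, 1, sharex=True, figsize=(8, 6))
for i, ax in enumerate(axes):
    ax.step(range(T), skills[i], where="post")  # piecewise-constant usage
    ax.set_ylabel(f"UAV{i + 1}")
    ax.set_yticks(range(n_skills))
axes[-1].set_xlabel("time step")
fig.suptitle("Latent skill selection over time (synthetic example)")
plt.tight_layout()
plt.show()
```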
In a task where a UAV swarm explores an unknown area with random obstacles and reaches multiple target points, the swarm must maintain the desired formation during flight. The primary metric for efficiency is the maximum total time for all drones to complete the task. Given the heterogeneous swarm’s varying performance metrics, such as flight speed, endurance, and response time, a constraint coefficient is introduced to account for these differences and ensure fair drone comparisons [48]. Furthermore, this paper also focuses on the total additional flight distance of the UAV swarm. This metric is defined as the total extra detour distance accumulated by the swarm during multiple task executions due to differences in the response speed and maneuverability of various drone types, particularly during the obstacle avoidance process.
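A sketch of how these two metrics might be computed is given below. The multiplicative use of the constraint coefficient and all numeric values are assumptions made for illustration; the exact weighting scheme follows [48].

```python
import numpy as np

def swarm_completion_time(raw_times, constraint_coeffs):
    """Constraint-weighted completion time for a heterogeneous swarm.

    raw_times         -- per-drone task completion times (s)
    constraint_coeffs -- per-drone coefficients compensating for platform
                         differences (speed, endurance, response time)
    The swarm metric is the maximum weighted time, since the mission
    ends only when the slowest drone finishes.
    """
    weighted = np.asarray(raw_times) * np.asarray(constraint_coeffs)
    return float(weighted.max())

def extra_flight_distance(flown, shortest):
    """Total detour distance accumulated by the swarm over a run."""
    return float(np.sum(np.asarray(flown) - np.asarray(shortest)))

# Example with made-up numbers for a three-drone heterogeneous swarm.
print(swarm_completion_time([118.0, 130.5, 95.2], [1.0, 0.9, 1.15]))
print(extra_flight_distance([412.0, 398.5, 430.1], [380.0, 380.0, 380.0]))
```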
Figure 10a shows the variation in task completion time for the UAV swarm under different algorithms as the number of obstacles increases. The experimental results indicate that as the number of obstacles grows, the obstacle avoidance process for the heterogeneous UAV swarm becomes more complex, and the task execution time accordingly increases. Among the algorithms, the average task completion time of the HACASDM is reduced by 23.92%, 15.69%, 11.52%, 28.63%, and 27.04% compared to HMAPF, HACVF, HAPPO, HATD3, and HADDPG, respectively. Even when the number of obstacles reaches 150, the HACASDM still outperforms the other algorithms, demonstrating higher efficiency and reliability. Figure 10b illustrates the change in the additional flight distance of the UAV swarm as the number of obstacles increases. When navigating around obstacles, the heterogeneous UAV swarm experiences a decrease in obstacle avoidance efficiency and an increase in additional flight distance due to performance differences among the drones. For example, slower drones or those with weaker maneuverability require more time to adjust their paths, leading to an overall increase in the swarm’s additional flight distance. The experimental results show that the HACASDM consistently maintains a lower additional flight distance across all test scenarios, demonstrating superior obstacle avoidance capabilities. Meanwhile, some algorithms experience rapid increases in additional flight distance when the number of obstacles exceeds 100, which may be related to inefficient task allocation strategies.
This paper applies multi-agent reinforcement learning and establishes evaluation metrics for UAV swarm formation control.
  • Research indicator 1: swarm stability
In experiments assessing the stability of UAV swarm formations, fully accounting for the differing kinematic and dynamic constraints of the UAVs helps simulate their motion characteristics in complex environments more realistically, thereby enhancing the validity of the simulation results. In complex flight environments, UAVs of different shapes, types, and dynamics must satisfy navigational constraints such as flight paths, airborne beacons, and collision avoidance rules.
The trajectory data generated by the swarm need to conform to the kinematics and dynamics of each UAV node to ensure the rationality of path planning. The trajectory data include time, position, and attitude, expressed as a dense sequence or continuous function that can be used for path tracking and control. Path curvature is an important geometric property that quantifies how sharply a trajectory bends: a larger curvature indicates a more curved trajectory, while a smaller curvature indicates a straighter path [49,50]. In flight, different types of UAVs exhibit different curvature characteristics owing to differences in their kinematic properties, and such differences can further affect the overall trajectory stability of the swarm.
$$ l(\rho) = l_0 + \int_0^{\rho} v(x)\,\mathrm{d}x, \qquad \kappa(\rho) = \left| l'(\rho) \times l''(\rho) \right|, $$
where $l(\rho)$ represents the trajectory of a UAV in three-dimensional space, $l_0$ denotes the initial position vector, $x$ is the integration variable, $v(x)$ is the velocity vector at any point along the trajectory and is constrained by dynamic factors such as mass, thrust, and air resistance, $\kappa(\rho)$ is the path curvature at trajectory parameter $\rho$, and $l'(\rho)$ and $l''(\rho)$ denote the first- and second-order derivatives of the position vector with respect to the trajectory parameter. If the UAV performs maneuvers such as hovering during its movement, the curvature can be approximated using the turning angles between discrete points rather than relying directly on continuous derivatives.
The curvature at each UAV trajectory point is calculated and recorded as a data sequence. Owing to differences in size, mass, power, and maneuvering sensitivity, UAVs differ in their turning radius and flight curvature. To eliminate scale differences, min–max normalization is applied to obtain $\kappa_{nor}$. Figure 11 illustrates the variation in path curvature.
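A minimal numerical sketch of this computation is shown below, using finite differences in place of analytic derivatives and scaling the result to [−1, 1] as described next; the helix trajectory is a toy stand-in for logged UAV positions.

```python
import numpy as np

def discrete_curvature(points):
    """Approximate kappa(rho) = |l'(rho) x l''(rho)| on a sampled 3-D path.

    points -- (N, 3) array of positions ordered along the trajectory.
    Finite differences stand in for the continuous derivatives, which
    also covers hover segments where an angle-based approximation applies.
    """
    l1 = np.gradient(points, axis=0)   # first derivative l'(rho)
    l2 = np.gradient(l1, axis=0)       # second derivative l''(rho)
    return np.linalg.norm(np.cross(l1, l2), axis=1)

def min_max_normalize(kappa):
    """Min-max scale a curvature sequence to [-1, 1] (kappa_nor)."""
    k_min, k_max = kappa.min(), kappa.max()
    if np.isclose(k_max, k_min):       # effectively straight path
        return np.zeros_like(kappa)
    return 2.0 * (kappa - k_min) / (k_max - k_min) - 1.0

# Toy helix trajectory as a stand-in for a logged UAV flight path.
t = np.linspace(0.0, 4.0 * np.pi, 200)
trajectory = np.stack([np.cos(t), np.sin(t), 0.1 * t], axis=1)
kappa_nor = min_max_normalize(discrete_curvature(trajectory))
print(kappa_nor.min(), kappa_nor.max())
```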
The value of $\kappa_{nor}$ lies in the range $[-1, 1]$. A value closer to 0 indicates higher flight stability of the UAV swarm, while values farther from 0 suggest lower stability. Formation stability is the ability of a swarm of UAVs to maintain the desired relative positions and geometry as far as possible in the face of disturbances from the external environment. When the formation disperses, each sub-swarm plays to its respective strengths through collaborative cooperation, makes rapid adjustments according to its assigned behavior policy, re-establishes the desired formation configuration, and maintains overall stability during the flight. In complex environments, the advantages of HACASDM over similar formation control algorithms in terms of formation stability stand out significantly. The simulation data show that the formation stability of HACASDM is improved by 20.74%, 25.34%, 53.10%, 65.27%, and 72.49% compared to HMAPF, HACVF, HAPPO, HATD3, and HADDPG, respectively.
  • Research indicator 2: swarm integrity
In the experiment of UAV swarm formation integrity assessment, the swarm needs to keep its relative position and formation patterns stable during flight to ensure that different types of UAVs can perform their missions safely and efficiently. The UAV swarm is designed to move in formation along a predetermined route and maintain formation integrity by constantly translating and rotating to avoid various obstacles in the environment. The topology of a UAV swarm can be abstracted as a heterogeneous graph consisting of heterogeneous nodes and multiple types of edges [51]. In this heterogeneous graph, various types of UAV nodes are represented as different vertices in the graph, and the connectivity relationships between nodes are represented as multiple types of edges. In order to effectively assess the similarity of UAV swarm topology transformations during formation control, it can create specific labels for different types of nodes through the label propagation mechanism, thus distinguishing the positions of different nodes in the topology. In this study, the method of Weisfeiler–Lehman Subtree Kernel (WL) was used for measurement [52].
$$ GK_{WL}(G, G') = \sum_{t=0}^{T} \sum_{y \in Y} \sum_{y' \in Y'} \psi\big(\phi_t(y), \phi_t(y')\big), $$
where $GK_{WL}$ is the kernel function used to measure the similarity between two heterogeneous graphs, $G$ and $G'$ denote the actual and desired heterogeneous graphs, $T$ is the number of iterations, and $Y$ and $Y'$ are the node sets of the actual and desired heterogeneous graphs. $y'$ represents a node of the desired heterogeneous graph, used for constructing entities and relationships. The function $\phi_t(\cdot)$ returns the label of a node at the $t$-th iteration, and the matching function $\psi$ counts occurrences of identical labels in the two heterogeneous graphs. Since the $GK_{WL}$ values of different heterogeneous graphs usually differ widely owing to differences in the numbers of nodes and edges and in topology, normalization scales these values to a standard range of $[-1, 1]$, making the similarity results between different graphs more comparable. The normalized $GK_{WL}^{nor}$ values help quickly assess the difference between the actual swarm topology and the expected topology.
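To illustrate the measurement, the sketch below implements a plain WL subtree kernel over type-labeled graphs. The adjacency structure, the UAV-type initial labels, and the cosine-style normalization to [0, 1] (rather than the signed scaling above) are illustrative assumptions.

```python
from collections import Counter

def wl_kernel(adj_a, labels_a, adj_b, labels_b, T=3):
    """Weisfeiler-Lehman subtree kernel between two labeled graphs.

    adj_*    -- dict: node -> iterable of neighbor nodes
    labels_* -- dict: node -> initial label (here, the UAV type)
    Sums, over iterations 0..T, the number of matching label pairs,
    i.e., the dot product of the two label-count histograms.
    """
    def refine(adj, labels):
        # New label = old label plus the multiset of neighbor labels.
        return {v: (labels[v], tuple(sorted(labels[u] for u in adj[v])))
                for v in adj}

    la, lb, total = dict(labels_a), dict(labels_b), 0
    for _ in range(T + 1):
        ca, cb = Counter(la.values()), Counter(lb.values())
        total += sum(ca[label] * cb[label] for label in ca)
        la, lb = refine(adj_a, la), refine(adj_b, lb)
    return total

def wl_similarity(adj_a, labels_a, adj_b, labels_b, T=3):
    """Normalized kernel for comparing actual vs. desired topologies."""
    k_ab = wl_kernel(adj_a, labels_a, adj_b, labels_b, T)
    k_aa = wl_kernel(adj_a, labels_a, adj_a, labels_a, T)
    k_bb = wl_kernel(adj_b, labels_b, adj_b, labels_b, T)
    return k_ab / (k_aa * k_bb) ** 0.5

# Tiny example: a three-node formation with UAV types as initial labels.
adj = {0: [1, 2], 1: [0], 2: [0]}
actual = {0: "relay", 1: "scout", 2: "scout"}
desired = {0: "relay", 1: "scout", 2: "scout"}
print(wl_similarity(adj, actual, adj, desired))  # 1.0 for identical graphs
```

Identical actual and desired graphs yield similarity 1, and the value decays as node types or connectivity diverge.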
The UAV simulation data show significant timing characteristics. To fully leverage the capabilities of each UAV, enhance functional complementarity of heterogeneous platforms, and achieve effectiveness multiplication, the relative positions of UAVs must exhibit strong spatial correlation. Consequently, UAV formations display unique spatiotemporal distribution characteristics. To quantify spatial completeness, the similarity between UAV flight data and the target formation at each moment is compared; greater similarity reflects higher formation completeness.
Figure 12 demonstrates the effect of different algorithms on UAV swarm formation integrity control in a complex simulation environment. Due to the differences in UAV platforms, the relative positions of various types of UAVs in a UAV swarm are affected to different degrees by changes in the mission environment. Observing value trends reveals differences in the algorithms’ ability to maintain formation integrity. HACASDM is able to make full use of the different characteristics of UAVs in complex environments to achieve consistent constraints and flexible control of UAV group behaviors, so that the formation similarity can still be maintained at a high level during the flight. The computing data from the experiment simulations show that the formation similarity of HACASDM is improved by 14.65%, 17.08%, 20.59%, 24.89% and 36.15% compared to HMAPF, HACVF, HAPPO, HATD3 and HADDPG, respectively.
The model training process is critical for algorithm optimization and adaptability to complex environments. Analyzing training curves offers an intuitive understanding of neural network learning performance and helps evaluate their scalability across scenarios. To further assess the proposed algorithm’s performance, a comparative analysis of various training processes is conducted.
Figure 13a shows the relationship between the number of simulation steps and the average reward value for the three methods. As the number of training steps increases, the average reward values for all three methods show an upward trend. Among them, the HACASDM converges around 900,000 steps, with the average reward stabilizing around 0. Compared to the HACVF and HATD3 methods, the proposed method improves training speed by 17.24% and 28.38%, respectively. This improvement is attributed to the dual-layer imitation learning, which provides strategic insights to the agents, effectively narrowing the policy search space and enhancing the stability of policy optimization. Consequently, it accelerates model convergence, improves training efficiency, and reduces reward fluctuations during training. Figure 13b illustrates the average collision frequency of the HACASDM across different swarm sizes during model training. In the early stages of training, as the number of agents increases, the collision frequency initially rises from approximately 3 collisions per episode with 5 agents to 13 collisions per episode with 20 agents. However, as training progresses, the method effectively reduces the collision frequency, ultimately leading to convergence across different swarm sizes. Specifically, in a heterogeneous swarm system composed of three different types of UAVs, the collision frequency stabilizes at approximately 2 collisions per episode. This result demonstrates that the proposed method maintains stable obstacle avoidance capabilities in collaborative environments involving multi-type UAVs of different sizes.

5. Conclusions

This study explored the formation control problem of UAV swarms under complex conditions and, building on a swarm system designed for ideal conditions, proposed a decision-making method for UAV swarm control applicable to complex conditions. By capturing the underlying patterns of UAV behavior, the proposed method accurately extracts differentiated behavior characteristics and explores diverse behavior combination schemes. It dynamically assigns the optimal behavior scheme to each UAV based on differences in sensing accuracy and action capability. Finally, it continuously optimizes the neural network parameters using training data and action goals, improving the decision-making policy through behavior learning. In the simulation environment, this study evaluated the key performance indicators of formation stability and integrity and verified that HACASDM effectively mitigates the decision misjudgments that arise when the adjustments of differentiated control models are inconsistent with their feedback, thus significantly reducing decision-making dissonance. This method not only improves the efficiency of UAV swarm formation control in complex environments, but also enhances its accuracy. However, given the complexity and diversity of practical applications, future work will incorporate additional environmental factors, such as signal interference and strong wind disturbances, to broaden the research scope and enhance the diversity of the simulation environment.

Author Contributions

Conceptualization, L.Z.; methodology, L.Z.; software, L.Z.; validation, L.Z. and B.C.; formal analysis, L.Z., B.C. and F.H.; investigation, L.Z.; resources, B.C.; data curation, L.Z.; writing—original draft preparation, L.Z.; writing—review and editing, L.Z. and B.C.; visualization, L.Z. and B.C.; supervision, L.Z., B.C. and F.H.; project administration, L.Z., B.C. and F.H.; funding acquisition, B.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the General Program of National Natural Science Foundation of China (Grant No. 62176122): Research on Security Mechanism of Edge Federated Learning for Unmanned Swarm Systems and the A3 Program of National Natural Science Foundation of China (Grant No. 62061146002): Future Internet of Things Technologies and Services Based on Artificial Intelligence.

Data Availability Statement

The original contributions presented in this study are included in the article material.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Şahin, E. Swarm robotics: From sources of inspiration to domains of application. In International Workshop on Swarm Robotics; Şahin, E., Spears, W.M., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2005; pp. 10–20. [Google Scholar]
  2. Şahin, E.; Girgin, S.; Bayindir, L.; Turgut, A.E. Swarm robotics. In Swarm Intelligence: Introduction and Applications; Blum, C., Merkle, D., Eds.; Natural Computing Series; Springer: Berlin/Heidelberg, Germany, 2008; pp. 87–100. [Google Scholar]
  3. Brambilla, M.; Ferrante, E.; Birattari, M.; Dorigo, M. Swarm robotics: A review from the swarm engineering perspective. Swarm Intell. 2013, 7, 1–41. [Google Scholar] [CrossRef]
  4. Adnan, M.H.; Zukarnain, Z.A.; Amodu, O.A. Fundamental design aspects of UAV-enabled MEC systems: A review on models, challenges, and future opportunities. Comput. Sci. Rev. 2024, 51, 100615. [Google Scholar] [CrossRef]
  5. Javaid, S.; Saeed, N.; Qadir, Z.; Fahim, H.; He, B.; Song, H.; Bilal, M. Communication and control in collaborative UAVs: Recent advances and future trends. IEEE Trans. Intell. Transp. Syst. 2023, 24, 5719–5739. [Google Scholar] [CrossRef]
  6. Peng, X.J.; He, Y. Aperiodic sampled-data consensus control for homogeneous and heterogeneous multi-agent systems: A looped-functional method. Int. J. Robust Nonlinear Control. 2023, 33, 8067–8086. [Google Scholar] [CrossRef]
  7. Lu, Y.; Xu, Z.; Li, L.; Zhang, J.; Chen, W. Formation preview tracking for heterogeneous multi-agent systems: A dynamical feedforward output regulation approach. ISA Trans. 2023, 133, 102–115. [Google Scholar] [CrossRef] [PubMed]
  8. Zhao, L.; Chen, B.; Hu, F. Research on cooperative obstacle avoidance decision making of unmanned aerial vehicle swarms in complex environments under end-edge-cloud collaboration model. Drones 2024, 8, 461. [Google Scholar] [CrossRef]
  9. Khoei, T.T.; Al Shamaileh, K.; Devabhaktuni, V.K.; Kaabouch, N. A comparative assessment of unsupervised deep learning models for detecting GPS spoofing attacks on unmanned aerial systems. In Proceedings of the 2024 Integrated Communications, Navigation and Surveillance Conference (ICNS), Herndon, VA, USA, 23–25 April 2024; pp. 1–10. [Google Scholar]
  10. Zhuang, Y.; Sun, X.; Li, Y.; Huai, J.; Hua, L.; Yang, X.; Cao, X.; Zhang, P.; Cao, Y.; Qi, L.; et al. Multi-sensor integrated navigation/positioning systems using data fusion: From analytics-based to learning-based approaches. Inf. Fusion 2023, 95, 62–90. [Google Scholar] [CrossRef]
  11. Sery, T.; Shlezinger, N.; Cohen, K.; Eldar, Y. Over-the-air federated learning from heterogeneous data. IEEE Trans. Signal Process. 2021, 69, 3796–3811. [Google Scholar] [CrossRef]
  12. Li, Y.; Wu, Y.; Xue, X.; Liu, X.; Xu, Y.; Liu, X. Efficiency-first spraying mission arrangement optimization with multiple UAVs in heterogeneous farmland with varying pesticide requirements. Inf. Process. Agric. 2024, 11, 237–248. [Google Scholar] [CrossRef]
  13. Sun, L.; Wang, J.; Wan, L.; Li, K.; Wang, X.; Lin, Y. Human-UAV interaction assisted heterogeneous UAV swarm scheduling for target searching in communication denial environment. IEEE Trans. Autom. Sci. Eng. 2024. early access. [Google Scholar] [CrossRef]
  14. Adderson, R.; Pan, Y.-J. Continuously varying formation for heterogeneous multi-agent systems with novel potential field avoidance. IEEE Trans. Ind. Electron. 2024, 72, 1774–1783. [Google Scholar] [CrossRef]
  15. Wu, Y.; Liang, T.; Gou, J.; Tao, C.; Wang, H. Heterogeneous mission planning for multiple UAV formations via metaheuristic algorithms. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 3924–3940. [Google Scholar] [CrossRef]
  16. Meng, X.; Zhu, X.; Zhao, J. Obstacle avoidance path planning using the elite ant colony algorithm for parameter optimization of unmanned aerial vehicles. Arab. J. Sci. Eng. 2023, 48, 2261–2275. [Google Scholar] [CrossRef]
  17. Kumar, H.; Datta, D.; Pushpangathan, J.V.; Kandath, H.; Dhabale, A. AGVO: Adaptive geometry-based velocity obstacle for heterogenous UAVs collision avoidance in UTM. In Proceedings of the IECON 2023—49th Annual Conference of the IEEE Industrial Electronics Society, Singapore, 16–19 October 2023; pp. 1–7. [Google Scholar]
  18. Sellers, T.; Lei, T.; Luo, C.; Liu, L.; Carruth, D.W. Enhancing human-robot cohesion through hat methods: A crowd-avoidance model for safety aware navigation. In Proceedings of the 2024 IEEE 4th International Conference on Human-Machine Systems (ICHMS), Toronto, ON, Canada, 15–17 May 2024; pp. 1–6. [Google Scholar]
  19. Ghaderi, F.; Toloei, A.; Ghasemi, R. Heterogeneous formation sliding mode control of the flying robot and obstacles avoidance. Int. J. ITS Res. 2024, 22, 339–351. [Google Scholar] [CrossRef]
  20. Geng, M.; Pateria, S.; Subagdja, B.; Tan, A.-H. HiSOMA: A hierarchical multi-agent model integrating self-organizing neural networks with multi-agent deep reinforcement learning. Expert Syst. Appl. 2024, 252, 124117. [Google Scholar] [CrossRef]
  21. Wang, C.; Wei, Z.; Jiang, W.; Jiang, H.; Feng, Z. Cooperative sensing enhanced UAV path-following and obstacle avoidance with variable formation. IEEE Trans. Veh. Technol. 2024, 73, 7501–7516. [Google Scholar] [CrossRef]
  22. Hou, P.; Jiang, X.; Wang, Z.; Liu, S.; Lu, Z. Federated deep reinforcement learning-based intelligent dynamic services in UAV-assisted MEC. IEEE Internet Things J. 2023, 10, 20415–20428. [Google Scholar] [CrossRef]
  23. Zhou, J.; Zhang, H.; Hua, M.; Wang, F.; Yi, J. P-DRL: A framework for multi-UAVs dynamic formation control under operational uncertainty and unknown environment. Drones 2024, 8, 475. [Google Scholar] [CrossRef]
  24. Xia, X.; Chen, F.; He, Q.; Cui, G.; Grundy, J.; Abdelrazek, M.; Bouguettaya, A.; Jin, H. OL-MEDC: An online approach for cost-effective data caching in mobile edge computing systems. IEEE Trans. Mob. Comput. 2023, 22, 1646–1658. [Google Scholar] [CrossRef]
  25. Chen, Q.; Meng, W.; Quek, T.Q.S.; Chen, S. Multi-tier hybrid offloading for computation-aware IoT applications in civil aircraft-augmented SAGIN. IEEE J. Sel. Areas Commun. 2023, 41, 399–417. [Google Scholar] [CrossRef]
  26. Tao, M.; Li, X.; Feng, J.; Lan, D.; Du, J.; Wu, C. Multi-agent cooperation for computing power scheduling in UAVs empowered aerial computing systems. IEEE J. Sel. Areas Commun. 2024, 42, 3521–3535. [Google Scholar] [CrossRef]
  27. Wu, R.-Y.; Xie, X.-C.; Zheng, Y.-J. Firefighting drone configuration and scheduling for wildfire based on loss estimation and minimization. Drones 2024, 8, 17. [Google Scholar] [CrossRef]
  28. Poursiami, H.; Jabbari, B. On multi-task learning for energy efficient task offloading in multi-UAV assisted edge computing. In Proceedings of the 2024 IEEE Wireless Communications and Networking Conference (WCNC), Dubai, United Arab Emirates, 21–24 April 2024; pp. 1–6. [Google Scholar]
  29. Tang, J.; Wu, G.; Jalalzai, M.M.; Wang, L.; Zhang, B.; Zhou, Y. Energy-optimal DNN model placement in UAV-enabled edge computing networks. Digit. Commun. Netw. 2024, 10, 827–836. [Google Scholar] [CrossRef]
  30. Ma, M.; Wang, Z.; Guo, S.; Lu, H. Cloud–edge framework for AoI-efficient data processing in multi-UAV-assisted sensor networks. IEEE Internet Things J. 2024, 11, 25251–25267. [Google Scholar] [CrossRef]
  31. Raja, G.; Essaky, S.; Ganapathisubramaniyan, A.; Baskar, Y. Nexus of deep reinforcement learning and leader–follower approach for AIoT enabled aerial networks. IEEE Trans. Ind. Inform. 2023, 19, 9165–9172. [Google Scholar] [CrossRef]
  32. Mao, S.; Jin, J.; Xu, Y. Routing and charging scheduling for EV battery swapping systems: Hypergraph-based heterogeneous multiagent deep reinforcement learning. IEEE Trans. Smart Grid 2024, 15, 4903–4916. [Google Scholar] [CrossRef]
  33. Schegg, P.; Ménager, E.; Khairallah, E.; Marchal, D.; Dequidt, J.; Preux, P.; Duriez, C. SofaGym: An open platform for reinforcement learning based on soft robot simulations. Soft Robot. 2023, 10, 410–430. [Google Scholar] [CrossRef]
  34. Chaysri, P.; Spatharis, C.; Blekas, K.; Vlachos, K. Unmanned surface vehicle navigation through generative adversarial imitation learning. Ocean Eng. 2023, 282, 114989. [Google Scholar] [CrossRef]
  35. Spatharis, C.; Blekas, K.; Vouros, G.A. Modelling flight trajectories with multi-modal generative adversarial imitation learning. Appl. Intell. 2024, 54, 7118–7134. [Google Scholar] [CrossRef]
  36. Grecov, P.; Prasanna, A.N.; Ackermann, K.; Campbell, S.; Scott, D.; Lubman, D.I.; Bergmeir, C. Probabilistic causal effect estimation with global neural network forecasting models. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 4999–5013. [Google Scholar] [CrossRef] [PubMed]
  37. Yao, H.; Song, Z.; Zhou, Y.; Ao, T.; Chen, B.; Liu, L. MoConVQ: Unified physics-based motion control via scalable discrete representations. ACM Trans. Graph. 2024, 43, 1–21. [Google Scholar] [CrossRef]
  38. Chen, X.; Liu, X.; Zhang, S.; Ding, B.; Li, K. Goal consistency: An effective multi-agent cooperative method for multistage tasks. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, Vienna, Austria, 23–29 July 2022; pp. 172–178. [Google Scholar]
  39. Zhang, C.; Meng, Y.; Prasanna, V. A framework for mapping DRL algorithms with prioritized replay buffer onto heterogeneous platforms. IEEE Trans. Parallel Distrib. Syst. 2023, 34, 1816–1829. [Google Scholar] [CrossRef]
  40. Wolpert, D.H.; Macready, W.G. No free lunch theorems for optimization. IEEE Trans. Evol. Comput. 1997, 1, 67–82. [Google Scholar] [CrossRef]
  41. Loscos, D.; Martí-Oliet, N.; Rodríguez, I. Generalization and completeness of stochastic local search algorithms. Swarm Evol. Comput. 2022, 68, 100982. [Google Scholar] [CrossRef]
  42. Zhang, Y.; Wang, Q.; Shen, Y.; Dai, N.; He, B. Multi-AUV cooperative control and autonomous obstacle avoidance study. Ocean Eng. 2024, 304, 117634. [Google Scholar] [CrossRef]
  43. Bahaidarah, M.; Marjanovic, O.; Rekabi-Bana, F.; Arvin, F. Improving formation in swarm robotics with a leader-follower approach. In Proceedings of the 2024 IEEE International Conference on Mechatronics and Automation (ICMA), Tianjin, China, 4–7 August 2024; pp. 1447–1452. [Google Scholar]
  44. Zhong, Y.; Kuba, J.G.; Feng, X.; Hu, S.; Ji, J.; Yang, Y. Heterogeneous-agent reinforcement learning. J. Mach. Learn. Res. 2024, 25, 1–67. [Google Scholar]
  45. Mesías-Ruiz, G.A.; Peña, J.M.; de Castro, A.I.; Borra-Serrano, I.; Dorado, J. Cognitive computing advancements: Improving precision crop protection through UAV imagery for targeted weed monitoring. Remote Sens. 2024, 16, 3026. [Google Scholar] [CrossRef]
  46. Swathi, P.; Pothuganti, K. Overview on principal component analysis algorithm in machine learning. Int. Res. J. Mod. Eng. Technol. 2020, 2, 241–246. [Google Scholar]
  47. Wang, Y.; Zhang, H.; Shi, Z.; Zhou, J.; Liu, W. Nonlinear time series analysis and prediction of general aviation accidents based on multi-timescales. Aerospace 2023, 10, 714. [Google Scholar] [CrossRef]
  48. Zhu, Y.; Liang, Y.; Jiao, Y.; Ren, H.; Li, K. Multi-type task assignment algorithm for heterogeneous UAV cluster based on improved NSGA-II. Drones 2024, 8, 384. [Google Scholar] [CrossRef]
  49. Sun, T.; Sun, W.; Sun, C.; He, R. Multi-UAV formation path planning based on compensation look-ahead algorithm. Drones 2024, 8, 251. [Google Scholar] [CrossRef]
  50. Celestini, D.; Primatesta, S.; Capello, E. Trajectory planning for UAVs based on interfered fluid dynamical system and Bézier curves. IEEE Robot. Autom. Lett. 2022, 7, 9620–9626. [Google Scholar] [CrossRef]
  51. Zhang, S.; Wu, Y.; Zhang, X.; Feng, Z.; Wan, L.; Zhuang, Z. Relation-Aware heterogeneous graph network for learning intermodal semantics in textbook question answering. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 11872–11883. [Google Scholar] [CrossRef]
  52. Yin, Y.; Xie, K.; He, S.; Li, Y.; Wen, J.; Diao, Z.; Zhang, D.; Xie, G. GraphIoT: Lightweight IoT device detection based on graph classifiers and incremental learning. IEEE Trans. Serv. Comput. 2024, 17, 3758–3772. [Google Scholar] [CrossRef]
Figure 1. UAV swarm formation control problem with the end-edge-cloud collaboration model.
Figure 2. Training of multi-agent deep reinforcement learning algorithm.
Figure 3. Behavior representation process based on dual-layer imitation learning.
Figure 4. Expert template.
Figure 5. Behavior allocation process based on adaptive feature embedding mechanism.
Figure 6. The behavior learning process based on cognitive dissonance optimization.
Figure 7. Overall behavior utilization. (a) Without imitation learning; (b) with imitation learning.
Figure 8. Principal component analysis. (a) Without imitation learning; (b) with imitation learning.
Figure 9. Time series of latent behavior changes.
Figure 10. Swarm obstacle avoidance. (a) Average task completion time; (b) average extra flight distance.
Figure 11. Flight trajectory curvature of the UAV swarm.
Figure 12. UAV swarm integrity analysis results.
Figure 13. Scalability. (a) The curve of average reward values; (b) average number of collisions.
Table 1. Parameter settings for computing resources.

Type                    | Parameter        | Value
Cloud Server            | Operating system | Ubuntu 22.04
                        | Processor        | Intel Core i7-1260P
                        | Memory           | 372 GB
                        | Hard disk        | 10 TB
                        | Network card     | I350-US
                        | Graphics card    | NVIDIA T4
Edge Server             | Operating system | Ubuntu 20.04
                        | Processor        | Intel Core i7-10700
                        | Memory           | 64 GB
                        | Hard disk        | 100 GB
                        | Network card     | I219-V
                        | Graphics card    | NVIDIA Tesla K80
UAV Simulation Platform | Operating system | Ubuntu 18.04
                        | Processor        | Intel Core i7-1260P
                        | Memory           | 16 GB
                        | Hard disk        | 50 GB
                        | Network card     | I219-V
                        | Graphics card    | NVIDIA GeForce RTX 3060
Table 2. Drone parameter settings.

Type        | ID   | max_V      | Mass   | Size             | Resolution | FOV  | max_angle_Velocity
Multi-rotor | UAV1 | 12 km·h⁻¹  | 0.8 kg | 25 × 31 × 12 cm³ | 0.72 MP    | 150° | 150°/s
            | UAV2 | 12 km·h⁻¹  | 0.8 kg | 25 × 31 × 12 cm³ | 0.72 MP    | 150° | 150°/s
            | UAV3 | 15 km·h⁻¹  | 0.4 kg | 14 × 14 × 55 cm³ | 0.3 MP     | 85°  | 200°/s
            | UAV4 | 15 km·h⁻¹  | 0.4 kg | 14 × 14 × 55 cm³ | 0.3 MP     | 85°  | 200°/s
            | UAV5 | 15 km·h⁻¹  | 0.4 kg | 14 × 14 × 55 cm³ | 0.3 MP     | 85°  | 200°/s
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
