Research on Cooperative Obstacle Avoidance Decision Making of Unmanned Aerial Vehicle Swarms in Complex Environments under End-Edge-Cloud Collaboration Model
Abstract
1. Introduction
- This paper proposes an environmental information prediction network leveraging dynamic spatiotemporal graph convolution to address the low obstacle-prediction accuracy caused by frequent spatiotemporal variations in complex environments. By integrating the spatial relationships among drone nodes and applying constraints from dynamic models, the network achieves precise predictions of complex environmental states, enhancing both understanding and prediction in these challenging scenarios (a minimal graph-convolution sketch follows this list).
- Utilizing the state information generated by the prediction network, this paper proposes a reward decomposition strategy based on hybrid credit allocation, which transforms the sparse rewards of the flight environment into dense rewards. This provides finer-grained feedback for intelligent decision making, enabling more accurate assessment of multi-agent decisions and optimizing the strategy update process, ultimately yielding more reliable swarm control (see the reward-shaping sketch after this list).
- To expedite the learning and optimization of decision-making processes, this paper proposes a dual experience pool with a prioritized parallel-replay structure. By categorizing experience data into safe and dangerous samples and training preferentially on high-priority samples, this structure makes full use of historical data, improving data learnability and accelerating model convergence (a buffer sketch follows this list).
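For the first contribution, the following is a minimal sketch of one dynamic graph-convolution step over the swarm: the adjacency matrix is rebuilt at every timestep from the current UAV positions, so spatial aggregation tracks the changing topology. The 100 m sensing radius, feature sizes, and plain-NumPy formulation are illustrative assumptions, not the paper's exact network (which additionally stacks a temporal model and dynamics-model constraints).

```python
# Minimal sketch of one dynamic graph-convolution step (illustrative only:
# the sensing radius, feature sizes, and tanh nonlinearity are assumptions).
import numpy as np

def dynamic_adjacency(positions: np.ndarray, radius: float = 100.0) -> np.ndarray:
    """Rebuild the normalized swarm adjacency from current UAV positions."""
    dist = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    adj = (dist < radius).astype(float)       # connect UAVs within sensing range
    np.fill_diagonal(adj, 1.0)                # self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(adj.sum(axis=1)))
    return d_inv_sqrt @ adj @ d_inv_sqrt      # symmetric degree normalization

def graph_conv(features: np.ndarray, positions: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One spatial aggregation over the current (time-varying) topology."""
    return np.tanh(dynamic_adjacency(positions) @ features @ weight)

# usage: 7 UAVs with 16-dimensional node features
rng = np.random.default_rng(0)
feats = graph_conv(rng.normal(size=(7, 16)),
                   rng.uniform(0.0, 300.0, size=(7, 3)),
                   0.1 * rng.normal(size=(16, 16)))
print(feats.shape)  # (7, 16); a temporal model would consume these per-step features
```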
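For the second contribution, here is a hedged sketch of how a sparse team reward can be densified and blended per agent. The specific shaping terms (goal progress, obstacle proximity, formation error) and the mixing weight `lam` are assumptions standing in for the paper's hybrid credit allocation, whose exact form is defined by the paper's own equations.

```python
# Hedged sketch of sparse-to-dense reward shaping with a hybrid credit split.
import numpy as np

def dense_reward(pos, goal, obstacles, formation_err,
                 w_goal=0.1, w_obs=0.5, w_form=0.3, safe_dist=5.0):
    """Per-step, per-agent dense reward built from the predicted state."""
    r_goal = -w_goal * np.linalg.norm(goal - pos)           # progress toward goal
    d_obs = min(np.linalg.norm(pos - np.asarray(o)) for o in obstacles)
    r_obs = -w_obs * max(0.0, safe_dist - d_obs)            # obstacle-proximity penalty
    r_form = -w_form * np.linalg.norm(formation_err)        # formation-keeping penalty
    return r_goal + r_obs + r_form

def hybrid_credit(r_team, r_local, lam=0.5):
    """Blend the shared (often sparse) team reward with local shaped rewards."""
    r_local = np.asarray(r_local, dtype=float)
    return lam * r_team / len(r_local) + (1.0 - lam) * r_local

# usage: even with a zero team reward mid-flight, each agent gets dense feedback
r_i = dense_reward(np.zeros(3), np.array([50.0, 0.0, 10.0]),
                   obstacles=[np.array([3.0, 0.0, 0.0])], formation_err=np.ones(3))
print(hybrid_credit(r_team=0.0, r_local=[r_i] * 7))
```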
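For the third contribution, a minimal sketch of a dual-pool replay buffer that draws dangerous (e.g., near-collision) transitions at a higher rate. The pool capacities match the hyperparameter table later in the paper (50,000 each); the fixed `danger_ratio` is an assumed stand-in for the paper's priority scheme, and the parallel-replay machinery is omitted.

```python
# Minimal sketch of the dual experience pool: safe and dangerous transitions
# are kept apart, and dangerous ones are sampled preferentially.
import random
from collections import deque

class DualReplayBuffer:
    def __init__(self, capacity=50_000, danger_ratio=0.5):
        self.safe = deque(maxlen=capacity)      # ordinary transitions
        self.danger = deque(maxlen=capacity)    # near-collision transitions
        self.danger_ratio = danger_ratio

    def store(self, transition, is_dangerous):
        (self.danger if is_dangerous else self.safe).append(transition)

    def sample(self, batch_size):
        n_danger = min(int(batch_size * self.danger_ratio), len(self.danger))
        n_safe = min(batch_size - n_danger, len(self.safe))
        return random.sample(self.danger, n_danger) + random.sample(self.safe, n_safe)

# usage
buf = DualReplayBuffer()
buf.store(("s", "a", -1.0, "s_next"), is_dangerous=True)
buf.store(("s", "a", 0.2, "s_next"), is_dangerous=False)
print(buf.sample(batch_size=2))
```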
2. Related Work
3. UAV Swarm Obstacle Avoidance Decision-Making Method Based on End-Edge-Cloud Collaboration Model
3.1. Problem Statement
3.2. Problem Modeling and Analysis
3.3. System Structure
3.4. Resources and Environment
3.5. Reward Calculation
3.6. Policy Update
3.7. Pseudo-Code of Collaborative Obstacle Avoidance Decision-Making Algorithm
Algorithm 1. Obstacle avoidance decision-making method for UAV swarms.

    function EdgeServer():
        Observe a state s that satisfies the constraints of the drone dynamics model
        for k = 0 to train_steps_limits do
            env.reset()
            for t = 0 to max_episode do
                For each agent i, choose action a_i
                Extract the global action feature with GCN() w.r.t. Equation (5)
                Concatenate the per-agent features into a joint representation
                Feed it into the UAV swarm graph and obtain the next state
                Save the state-action history
                for agent i in N do
                    Compute the total reward value for each drone w.r.t. Equation (6)
                end for
                Iteratively solve the master problem and sub-problems w.r.t. Equations (7) and (8)
                Store (s, a, r, s') in replay buffer D w.r.t. Equations (9) and (10)
            end for
            network parameters <- CloudServer(D)
        end for
        Select action according to the current policy
        return
    end function

    ray.init(address = CloudServer_config['cloud_node_ip_address'])

    @ray.remote
    function CloudServer(D):
        if |D| > batch_size then
            for t = 1 to T do
                Sample minibatch B from D
                Generate flight state information s
                Update critic network w.r.t. Equation (12)
                Update policy network
                Update encoding network
                Update temperature parameter w.r.t. Equation (13)
                if time_to_update_target_network then
                    Update target network
                end if
            end for
        end if
        return updated network parameters
    end function
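Complementing Algorithm 1, the following is a hedged, runnable skeleton of the edge-cloud split using Ray, which the pseudocode's `ray.init` / `@ray.remote` lines imply: an edge process collects transitions and ships them to a remote cloud actor that performs the network updates. The class and method names and the placeholder "gradient step" are illustrative assumptions, not the authors' implementation.

```python
# Hedged skeleton of the end-edge-cloud split with Ray (illustrative only).
import ray

ray.init()  # deployed: ray.init(address=CloudServer_config['cloud_node_ip_address'])

@ray.remote
class CloudServer:
    """Cloud-side learner: owns the networks and trains on edge-collected data."""
    def __init__(self):
        self.params = {"actor": 0, "critic": 0}  # placeholder parameter store

    def train(self, replay):
        # critic/policy/encoder/temperature updates would run here
        self.params["critic"] += 1               # stand-in for a gradient step
        return self.params

def edge_server(episodes=3):
    cloud = CloudServer.remote()
    replay, params = [], None
    for _ in range(episodes):
        replay.append(("s", "a", "r", "s_next"))       # collect a transition
        params = ray.get(cloud.train.remote(replay))   # pull updated parameters
    return params

print(edge_server())
```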
4. Experimental Analysis
- Obstacle avoidance efficiency: The obstacle avoidance efficiency of the UAV swarm is evaluated, specifically in terms of the swarm’s ability to quickly and safely navigate around obstacles within a predefined operational timeframe.
- Formation stability: The stability of the relative positions and attitudes among the UAVs in the swarm is evaluated, measured by the variability of the swarm's position and attitude deviations (a metric sketch follows this list).
- Formation integrity: The degree to which the swarm maintains its overall structure during flight is evaluated, ensuring the swarm remains safe and effective while performing its tasks.
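One plausible operationalization of the stability and integrity indicators is sketched below; the standard-deviation and link-maintenance-ratio definitions are assumptions, since the paper gives its exact metric definitions in the experimental section.

```python
# Assumed metric definitions for the two formation indicators above.
import numpy as np

def formation_stability(deviation_trace: np.ndarray) -> float:
    """Variability of per-step position (or attitude) deviation; lower = steadier."""
    return float(np.std(deviation_trace))

def formation_integrity(active_links: int, required_links: int) -> float:
    """Fraction of required inter-UAV links maintained during flight."""
    return active_links / required_links

# usage with made-up deviation samples for a 7-UAV formation
print(formation_stability(np.array([0.12, 0.15, 0.11, 0.14])))  # small is good
print(formation_integrity(active_links=20, required_links=21))  # near 1 is good
```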
4.1. Experimental Setup
4.2. Experimental Results
- Research indicator 1: Obstacle avoidance efficiency
- Research indicator 2: Swarm stability
- Research indicator 3: Swarm integrity
5. Future Work
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
Method | Advantages | Limitations |
---|---|---|
APPATT | Uses the elliptical tangent graph method to quickly generate two obstacle avoidance trajectories when encountering obstacles. | Requires many parameters and relies on expert experience and experimental validation, leading to difficulties in parameter tuning. |
APFCP | Uses normalization and high-order exponential scaling transformations to address the oscillation failure of potential field forces. | Repulsive field effects make it difficult for UAVs to adjust the direction near trajectory points when avoiding obstacles. |
PICO | Adaptively reshapes the formation in narrow spaces by measuring formation similarity. | Requires a large storage space, leading to issues such as flight trajectory fluctuations and paths that are not flyable. |
MA2D3QN | Improves model learning efficiency through adaptive mechanisms and reference point guidance strategies. | Complex neural network structures and multi-agent parallel learning require significant computational resources. |
COADM | Proposes a cluster obstacle avoidance method with high efficiency and strong flexibility, reducing decision biases. | Advanced perception algorithms are needed to enhance UAV perception accuracy in complex environments. |
Type | Parameter | Value
---|---|---
Edge server | Operating system | Ubuntu 18.04
 | Processor | Intel Core i7-1260P
 | Memory | 8 GB
 | Hard disk | 50 GB
 | Network card | I219-V
 | Graphics card | NVIDIA Tesla K80
Cloud server | Operating system | Ubuntu 20.04
 | Processor | Intel Xeon Gold 6328H
 | Memory | 372 GB
 | Hard disk | 10 TB
 | Network card | I350-US
 | Graphics card | NVIDIA T4
UAV simulation platform | Operating system | Ubuntu 18.04
 | Processor | Intel Core i7-1260P
 | Memory | 8 GB
 | Hard disk | 50 GB
 | Network card | I219-V
 | Graphics card | NVIDIA Tesla K80
Type | ID | | Mass |
---|---|---|---|---
Multi-rotor | | 100 m | 1.05 kg | 0.48
 | | 100 m | 1.05 kg | 0.48
 | | 100 m | 1.05 kg | 0.48
 | | 100 m | 1.05 kg | 0.48
 | | 100 m | 1.05 kg | 0.48
 | | 100 m | 1.05 kg | 0.48
 | | 100 m | 1.05 kg | 0.48
Fixed-wing | | 250 m | 3.2 kg | 0.054
 | | 250 m | 3.2 kg | 0.054
 | | 250 m | 3.2 kg | 0.054
 | | 250 m | 3.2 kg | 0.054
 | | 250 m | 3.2 kg | 0.054
 | | 250 m | 3.2 kg | 0.054
 | | 250 m | 3.2 kg | 0.054
Parameter | Value | Description
---|---|---
actor_learning_rate | 0.0003 | Actor network learning rate
critic_learning_rate | 0.0003 | Critic network learning rate
alpha_learning_rate | 0.0003 | Temperature parameter (alpha) learning rate
agent_num | 7 | Number of agents
safety_buffer | 50,000 | Safe-sample experience pool capacity
danger_buffer | 50,000 | Dangerous-sample experience pool capacity
network_noise | 0.8 | Network noise
rnn_hidden_dim | 128 | RNN hidden-layer dimension
qmix_hidden_dim | 128 | Mixing-network hidden-layer dimension
gamma | 0.99 | Discount factor
entropy | 0.2 | Policy entropy coefficient
episodes | 60,000 | Number of training episodes
net_update_rate | 0.001 | Target-network update rate
stepsize | 3,000,000 | Maximum steps per experiment
simulation_stepsize | 0.01 | Simulation step size (s)
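For reproduction convenience, the table's hyperparameters can be mirrored as a plain config mapping; the dictionary form and its use in a training script are assumptions, and key names simply follow the table.

```python
# The table's hyperparameters as a plain config mapping (assumed form, not a
# published config file).
CONFIG = {
    "actor_learning_rate": 3e-4,
    "critic_learning_rate": 3e-4,
    "alpha_learning_rate": 3e-4,
    "agent_num": 7,
    "safety_buffer": 50_000,
    "danger_buffer": 50_000,
    "network_noise": 0.8,
    "rnn_hidden_dim": 128,
    "qmix_hidden_dim": 128,
    "gamma": 0.99,
    "entropy": 0.2,
    "episodes": 60_000,
    "net_update_rate": 1e-3,
    "stepsize": 3_000_000,
    "simulation_stepsize": 0.01,  # seconds
}
```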
Type | Algorithm |
---|---|---
Multi-rotor | APPATT | 0.173
 | APFCP | 0.171
 | MA2D3QN | 0.162
 | PICO | 0.160
 | QMIX | 0.193
 | MATD3 | 0.210
 | MADDPG | 0.232
 | COADM | 0.130
Fixed-wing | APPATT | 0.332
 | APFCP | 0.283
 | MA2D3QN | 0.294
 | PICO | 0.243
 | QMIX | 0.356
 | MATD3 | 0.425
 | MADDPG | 0.399
 | COADM | 0.197
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).