Article

Distributed Unmanned Aerial Vehicle Cluster Testing Method Based on Deep Reinforcement Learning

Software and Systems Research Institute, China Electronic Product Reliability and Environmental Testing Research Institute, Guangzhou 511370, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(23), 11282; https://doi.org/10.3390/app142311282
Submission received: 6 September 2024 / Revised: 18 October 2024 / Accepted: 21 October 2024 / Published: 3 December 2024
(This article belongs to the Section Aerospace Science and Engineering)

Abstract

In the collaborative operation of Unmanned Aerial Vehicle (UAV) clusters, cluster communication nodes are often tested one node at a time. This single-node approach leads to poor topology and robustness in the overall network system, an imbalanced communication load, and high testing complexity, which in turn limits the ability to meet the diverse needs of current users and to process large-scale tasks efficiently. To solve this problem, a distributed method for UAV cluster testing, called UTDR (distributed UAV cluster Testing method using Deep Reinforcement learning), based on the Deep Deterministic Policy Gradient (DDPG), is proposed in this work. A system management node monitors the status and bandwidth resources of the UAV test-task execution nodes. Through continuous interaction between the agent and the environment, the future state of a node after it processes the task currently awaiting assignment is predicted and evaluated from the perspective of interpretability, so that UAV cluster testing tasks are executed collaboratively in an effective and stable manner. The experimental results show that the proposed method can ensure the stable operation of the UAV cluster, accurately predict future states, and reduce the load and bandwidth resource consumption of large-scale test-task network systems.

1. Introduction

In recent years, the improvement of Unmanned Aerial Vehicle (UAV) [1,2,3,4] performance and the increase in task complexity have promoted the development of distributed cluster technology [5,6,7]. The UAV flight control system [8] is a hard real-time embedded system with high computational complexity. It includes not only the software processing system [9] composed of flight control software and an operating system, but also the hardware circuit system [10] composed of an optical fiber communication chip and a high-performance processor. Therefore, in the process of UAV flight control task testing, the system imposes strict constraints on the combinational logic and sequential logic when software and hardware operate in coordination. In particular, panoramic testing of a UAV cluster that operates in a complex, changeable environment over long working cycles becomes very important.
In the UAV cluster test system, two important problems need to be solved urgently: single-UAV testing and testing task scheduling. First, the test system contains a large number of scenarios in which the values of the same sensor parameter, or of different sensor parameters, are mutually constrained. Directly ignoring these constraints renders the test stimulus data invalid and can even reduce the overall test coverage of the system. At the same time, the UAV cluster control system has strict timing requirements for multi-task scheduling: the input sequence and parameter values of task data produce different test results at different times. If a single test node is used to process the testing tasks, the overall performance requirements of the UAV cluster test system cannot be satisfied because of the capability ceiling of a single-ended test tool. On the other hand, the scheduling method for testing tasks is an important part of the UAV cluster test system. An ideal scheduling method can greatly improve the processing efficiency and resource utilization of test nodes and thus the overall performance of the system. Testing task scheduling is generally applied in scenarios where the number of test cases is large and their mutual constraints are complex, or where the system has high real-time requirements. An unreasonable scheduling method may therefore lead to unreasonable test case allocation, imbalanced resource allocation, and lower work efficiency and reliability of the test nodes. It is therefore necessary to research and propose a reasonable and reliable testing task scheduling method.
In order to explore the strategy of large-scale testing task execution and resource scheduling, this work designs and proposes an innovative distributed test method for UAV clusters. Firstly, a testing task scheduling problem in the UAV cluster environment is proposed based on the current UAV cluster testing problem, and then the mathematical model of testing task execution in a distributed network environment is established by using the Deep Deterministic Policy Gradient (DDPG). The deep reinforcement learning network is constructed by means of iterative training, and better testing task deployment strategy decisions are derived. The experimental results show that the method proposed in this work can achieve load balancing of the test system while minimizing bandwidth resource consumption and assisting the system to obtain long-term benefits.
The main contributions of this work are as follows:
(1)
A distributed testing task execution problem in the UAV cluster environment is proposed;
(2)
By taking advantage of the DDPG, the mathematical problem model of testing task execution in a distributed network environment is established. Deep reinforcement learning networks are iteratively trained to derive better testing task deployment strategy decisions. Through continuous interaction with the state environment, the reinforcement learning network is continuously optimized to achieve system load balancing and minimize bandwidth resource costs;
(3)
Compared with cutting-edge testing methods in distributed environments, the advantages of the method proposed in this work in task execution are evaluated.
The rest of this paper is organized as follows: Section 2 introduces the related work of this paper. In Section 3, we formalize the problems raised in this work and establish the corresponding mathematical model. In Section 4, the architecture of the method is presented, and then the design and implementation of the main algorithm are introduced in detail. In Section 5, the experimental results are obtained through several groups of experiments, proving that UTDR can effectively overcome the proposed problem. Finally, this work is summarized in Section 6, and future research work is proposed.

2. Related Work

Recently, there have been numerous studies on UAV testing based on emerging technologies such as artificial intelligence. In this section, UAV cluster testing, distributed task scheduling, and deep reinforcement learning are discussed, respectively.
Stroner et al. [11] proposed a method using precisely georeferenced targets covered with highly reflective foils, which can easily be extracted from the point cloud while efficiently determining the central position of the target, and used it to calculate the systematic displacement of lidar point clouds. Sartaj [12] proposed a method for automated system-level testing of UAVs. The proposed approach utilizes model-based testing and artificial intelligence techniques to automatically generate, execute, and evaluate various test scenarios. In response to the need for robustness and integrity in UAV design and operation, Kim et al. [13] proposed a robotic-operating-system-based multi-degree-of-freedom flight test framework for the safe development, verification, and validation of UAVs. Li et al. [14] proposed a cooperative control mechanism for UAV-UGV systems. In order to realize collaborative trajectory tracking, a leader–follower strategy based on a centralized control structure was established based on application scenarios. Santoso et al. [15] proposed a hybrid nonlinear control system composed of a traditional proportional-derivative controller and a PD fuzzy logic autopilot for UAV trajectory tracking. Li et al. [16] proposed a solution that optimizes UAV deployment locations and user resource allocation, the goal of which was to maximize traffic offloading and minimize UAV energy consumption simultaneously. They introduced a hierarchical intelligent traffic offloading network optimization framework based on Deep Federated Learning (DFL). Xi et al. [17] proposed a lightweight, Reinforcement-Learning-based (RL-based) real-time path-planning method for UAVs, which optimizes the training process, network architecture, and algorithmic models. Although the above work has made some progress and achieved performance gains in automation and robustness, it still does not consider improving the overall performance of the system from the perspectives of distributed testing and test coverage.
Arunarani et al. [18] provided a comprehensive review of task scheduling strategies and related metrics applicable to cloud computing environments. They discussed the various problems associated with scheduling methods and the limitations that need to be overcome. Different schedulers were then studied to discover which features should be included in a given system and which can be ignored. Yuan et al. [19] proposed a two-objective differential evolution algorithm based on simulated annealing and solved it mathematically to obtain an approximate Pareto optimal set. The minimum Manhattan distance method was used to select a feasible solution, which specifies the Pareto-optimal task service rate for each time slot and the task distribution among network providers. Saleem et al. [20] proposed a Mobile Edge Computing (MEC) scheme based on Device-to-Device (D2D) cooperation to accelerate task execution by mobile users through proximity-aware task offloading. By comprehensively considering user mobility, distributed resources, task attributes, user device energy constraints, and other factors, a joint task allocation and power allocation scheme was formulated to minimize the total task execution delay. Hosseinioun et al. [21] proposed an energy-aware approach that utilizes Dynamic Voltage and Frequency Scaling (DVFS) technology to reduce energy consumption. In order to construct effective task sequences, a hybrid invasive weed optimization and culture evolutionary algorithm was used to achieve reasonable and efficient task scheduling. The above work has achieved some system benefits in distributed task scheduling, but existing advanced artificial intelligence methods are not naturally adapted to these distributed task scheduling problems, so there is still room for optimization.
Deep reinforcement learning (DRL) is an artificial intelligence technology that combines deep learning and reinforcement learning. It optimizes the reinforcement learning algorithm through deep learning techniques such as neural networks, so that the agent can learn the optimal behavior strategy in a complex environment. Deep learning has strong perception ability but lacks decision-making ability; reinforcement learning can make decisions but handles perception poorly. Combining the two therefore lets them complement each other and provides a solution to the perception and decision problems of complex systems. Ladosz et al. [22] reviewed exploration techniques in deep reinforcement learning. The importance of exploration techniques is obvious when solving the sparse reward problem, in which rewards are rare and the agent seldom obtains a reward by acting randomly. In this context, learning the link between reward and behavior becomes challenging, so more sophisticated exploration methods need to be designed to optimize the algorithm. Xiao et al. [23] first introduced a Markov Decision Process (MDP) model to capture the dynamic state transitions of the network. Aiming to collectively reduce operating costs for NFV providers and maximize the total throughput of requests, the researchers proposed an adaptive, online, deep reinforcement learning approach to automatically deploy service function chains for requests with different QoS requirements. Zhou et al. [24] studied computing task scheduling in the Space-Air-Ground Integrated Network (SAGIN) for latency-oriented Internet of Things (IoT) services. The online scheduling problem was first described as an energy-constrained Markov Decision Process (MDP). Then, considering the task arrival dynamics, a new risk-sensitive deep reinforcement learning algorithm was developed. The algorithm performs a risk assessment for each state, that is, it measures energy consumption that exceeds the constraints, and searches for optimal parameters that trade off delay and risk minimization while learning the optimal strategy. Aiming at the limited computing power and energy of UAVs, Zhao et al. [25] studied a multi-UAV [26,27] multi-edge collaborative mobile edge computing system. Through the joint design of the UAV flight trajectories, computation task allocation, and communication resource management, the task offloading problem was solved, and the execution delay and energy consumption were minimized. To solve the resulting non-convex optimization problem, the Markov Decision Process of the multi-UAV-assisted mobile edge computing system was established, and a multi-agent cooperative deep reinforcement learning framework was studied to obtain joint strategies for trajectory design, task assignment, and dynamic management; the twin-delayed DDPG algorithm was used for the high-dimensional continuous action space. Zhou et al. [28] proposed a UAV-swarm-based cooperative tracking architecture to systematically improve UAV tracking performance and designed an intelligent UAV-swarm-based cooperative algorithm for consecutive target tracking and physical collision avoidance. Further, Zhou et al. [29] proposed a cyber-twin-based distributed tracking algorithm to update and optimize a trained digital model for real-time MTT and then designed a distributed cooperative tracking framework to promote MTT performance.
The above optimization and application of deep reinforcement learning have made some progress; however, in many practical applications, little attention has been paid to the training process of deep reinforcement learning itself or to deeper study and improvement of the network architecture.

3. The Formalization of UTDR

The problem presented in this work can be formalized as allocating the received testing task requests to the UAV cluster network within the time window Δt, so as to achieve node load balance and minimize the bandwidth resource cost in the system. To solve this problem, this work defines a quintuple W = {ND, TK, Ccpu, Cmem, Clink} to describe the problem scenario. ND indicates the set of available processing nodes, ND(n, t) = {nd1, nd2, …, ndn}, and t indicates the start time of task assignment. TK represents the set of task requests from users in the Δt time window, TK(m, Δt, t) = {tk1, tk2, …, tkm}. Ccpu is the set of remaining CPU resources of the n UAV nodes in ND, $C_{cpu}(n, t) = \{C_{cpu}^1, C_{cpu}^2, \dots, C_{cpu}^n\}$. Cmem is the set of memory resources currently remaining in the n UAV nodes in ND, $C_{mem}(n, t) = \{C_{mem}^1, C_{mem}^2, \dots, C_{mem}^n\}$. If the DAG is known, bw(i, i′) represents the set of bandwidth resource costs for the communication between node i and its neighbors i′ in the cluster. For the UAV test task assignment problem proposed in this work, the final solution should be the optimal network task assignment scheme that jointly accounts for both computing resources and bandwidth resources. A description of the main functions is given in Table 1. The first objective function in this section is based on load balancing. We use the load balancing degree to measure the load balancing effect in a multi-UAV cluster environment. The load balancing degree of the ith UAV node is calculated as follows:
$$WL_i^f = \alpha\, WL_{cpu,i}^f + \beta\, WL_{mem,i}^f,\quad i \in \{1,2,3,\dots,n\},\ f \in \{1,2,3,\dots,z\},\ \alpha + \beta = 1.$$
where $WL_{cpu,i}^f$ and $WL_{mem,i}^f$, respectively, represent the CPU resource utilization and memory resource utilization of the ith UAV processing node after the task is completed according to the solution vector f. Thus, the resource utilization of the ith node can be expressed as follows:
$$U_i^f = WL_i^f / R_i,\quad i \in \{1,2,3,\dots,n\},\ f \in \{1,2,3,\dots,z\}.$$
Ri can be expressed as follows:
$$R_i = \alpha R_{cpu,i} + \beta R_{mem,i},\quad i \in \{1,2,3,\dots,n\}.$$
where $R_{cpu,i}$ and $R_{mem,i}$ represent the total CPU resources and memory resources of the ith node, respectively. This work uses the standard deviation of resource utilization, denoted A, to measure load balancing; the load balancing effect can be reflected by this load balancing degree. $U_i^f$ indicates the current resource utilization of the ith UAV node, and n indicates the number of processing nodes. The formula for the average resource utilization over a certain period of time is as follows:
$$\mathrm{Mean}(U) = \frac{1}{n}\sum_{i=1}^{n} U_i^f.$$
We can obtain the formula of the load balancing degree, which is as follows:
$$A = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(U_i^f - \mathrm{Mean}(U)\right)^2}.$$
By the above formulas, the first objective function presented in this work can be minimized and defined as follows:
$$A_f = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{\alpha WL_{cpu,i}^f + \beta WL_{mem,i}^f}{\alpha R_{cpu,i} + \beta R_{mem,i}} - \frac{1}{n}\sum_{e=1}^{n}\frac{\alpha WL_{cpu,e}^f + \beta WL_{mem,e}^f}{\alpha R_{cpu,e} + \beta R_{mem,e}}\right)^{2}}.$$
Thus, we can achieve the goal of load balancing by minimizing Af.
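To make the load-balancing objective concrete, the following is a minimal sketch (in Python/NumPy, not taken from the paper) of how $A_f$ could be computed from per-node CPU and memory loads and capacities; the function name, the array-based inputs, and the example weights α = β = 0.5 are illustrative assumptions.

```python
import numpy as np

def load_balancing_degree(wl_cpu, wl_mem, r_cpu, r_mem, alpha=0.5, beta=0.5):
    """Load-balancing degree A_f: the standard deviation of per-node
    resource utilization under a candidate task assignment f.

    wl_cpu, wl_mem : per-node CPU/memory load after the assigned tasks complete
    r_cpu, r_mem   : per-node total CPU/memory capacity
    alpha, beta    : weights with alpha + beta = 1 (illustrative values)
    """
    utilization = (alpha * wl_cpu + beta * wl_mem) / (alpha * r_cpu + beta * r_mem)
    return float(np.sqrt(np.mean((utilization - utilization.mean()) ** 2)))

# Example with three UAV nodes (hypothetical numbers)
a_f = load_balancing_degree(
    wl_cpu=np.array([2.0, 3.5, 1.0]),
    wl_mem=np.array([1.5, 2.0, 0.5]),
    r_cpu=np.array([8.0, 8.0, 8.0]),
    r_mem=np.array([4.0, 4.0, 4.0]),
)
print(f"A_f = {a_f:.4f}")
```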
The second objective function of this work is to reduce the cost of the bandwidth resources required for inter-node communication. The mapping between tasks and processing nodes can be expressed as KM[n, m, q, t(e)], q ∈ {1, 2, 3, …, z}, where q represents any mapping in the set and e represents an integer that increases with task deployment, representing the testing task deployment status over a certain period of time. In a dense network environment, with the aim of minimizing bandwidth resource costs and based on the repeated analysis carried out in this work, we can define the total bandwidth resource cost required for the collaborative processing of multi-UAV testing tasks as follows:
$$B(L_{DAG}, t) = \sum_{l(i,i') \in L_{DAG}} w_{j,k} \sum_{j,k \in ND} f_{j,k}^{i,i'}\, bw(i,i').$$
where wj,k represents the weight of the wireless network link. When physical node j and physical node k belong to the same base station range, wj,k is set to α; when node j and node k are not in the same base station range, wj,k is set to β. It can be expressed as follows:
$$w_{j,k} = \begin{cases} \alpha, & j,k \in InP_x;\ 1 \le x \le D \\ \beta, & j \in InP_x,\ k \in InP_y,\ x \ne y;\ 1 \le x, y \le D \end{cases}$$
When the DAG communication link $l_{DAG}(i, i')$ can establish a data communication link and the network link $l(j, k)$ is allocated the required bandwidth resource, $f_{j,k}^{i,i'}$ is set to 1; otherwise, it is set to 0. With these definitions, the problem presented in this work can be formalized as a constrained multi-objective optimization problem:
$$\begin{aligned} \min\ A_f &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{\alpha WL_{cpu,i}^f + \beta WL_{mem,i}^f}{\alpha R_{cpu,i} + \beta R_{mem,i}} - \frac{1}{n}\sum_{e=1}^{n}\frac{\alpha WL_{cpu,e}^f + \beta WL_{mem,e}^f}{\alpha R_{cpu,e} + \beta R_{mem,e}}\right)^{2}}; \\ \min\ B(L_{DAG}, t) &= \sum_{l(i,i') \in L_{DAG}} w_{j,k} \sum_{j,k \in ND} f_{j,k}^{i,i'}\, bw(i,i'); \\ & i \in \{1,2,3,\dots,n\},\ f \in \{1,2,3,\dots,z\}. \end{aligned}$$
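Similarly, the bandwidth-cost term $B(L_{DAG}, t)$ could be evaluated for a given placement of DAG communication links onto physical links, as in the following sketch; the dictionary-based representation of DAG edges, the placement map standing in for $f_{j,k}^{i,i'}$, and the base-station grouping are illustrative assumptions rather than the paper's data structures.

```python
def bandwidth_cost(dag_links, placement, bw, base_station, alpha=1.0, beta=2.0):
    """Total bandwidth cost B(L_DAG, t) of a candidate placement.

    dag_links    : iterable of DAG communication edges (i, i_prime)
    placement    : maps each DAG edge to the physical link (j, k) that carries it,
                   i.e., the link for which f_{j,k}^{i,i'} = 1
    bw           : bw[(i, i_prime)] bandwidth demand of the DAG edge
    base_station : base_station[node] -> base-station id, used for the weight w_{j,k}
    alpha, beta  : intra- / inter-base-station link weights
    """
    total = 0.0
    for edge in dag_links:
        j, k = placement[edge]
        w_jk = alpha if base_station[j] == base_station[k] else beta
        total += w_jk * bw[edge]
    return total

# Example: two DAG edges mapped onto physical links (hypothetical numbers)
cost = bandwidth_cost(
    dag_links=[("t1", "t2"), ("t2", "t3")],
    placement={("t1", "t2"): ("nd1", "nd2"), ("t2", "t3"): ("nd2", "nd4")},
    bw={("t1", "t2"): 3.0, ("t2", "t3"): 1.5},
    base_station={"nd1": 0, "nd2": 0, "nd4": 1},
)
```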

4. The Proposed Algorithm

4.1. System Architecture

Aiming at the current scheduling and distribution problem of large-scale UAV testing task data in a distributed system, this work explores a collaborative processing method of distributed testing tasks based on reinforcement learning for the task assignment problem of big data processing, combined with existing research results, load balancing, and bandwidth minimization analysis. The rational assignment of testing tasks is realized through DDPG, so as to realize the comprehensive management of multi-task computing and bandwidth resources.
Compared with the limitation of a single-machine test system, this work adopts a distributed test system which includes multiple processing nodes. As shown in Figure 1, multiple test nodes are connected to the management node and the standby management node through the bus, assuming that there is no information interaction between the processing nodes and they are independent of each other. When the test case data are sent to the management node, the management node execution engine simulates the task assignment scheme and performs assignment scheduling through the reinforcement learning algorithm and the current UAV survival state, and allocates a reasonable number of test cases to each test node in the current system. At the same time, each test node maintains status and information synchronization through communication strategies. When the system fails or a node breaks down, the standby management node will immediately adjust the test case scheduling scheme to ensure the effectiveness and reliability of the UAV cluster test to the greatest extent.

4.2. Main Idea

Aiming at the computing and communication resource problems encountered in the actual operation of UAVs, this work simplifies the engineering complexity of the communication and testing task execution of UAV cluster ground stations and each machine in the cluster, and provides a resource optimization scheme for large-scale adaptive UAV cluster tests. In this study, the architecture of a UAV cluster communication simulation system is designed and constructed to simulate communication and match the testing tasks performed by the whole cluster, so as to reduce the workload of single and overall debugging during the flight of Unmanned Aerial Vehicles. Specifically, we divide the overall system into a ground controller and a cluster of UAVs containing multiple individual machines. In the system, we study and design a distributed testing task allocation method for UAV clusters based on reinforcement learning, aiming to achieve an optimal task allocation scheme, system load balancing, and bandwidth minimization during the UAV cluster test. The central server at the ground controller is responsible for training the distributed algorithm and model, calculating the current optimal mission deployment scheme according to different computing resources and communication quality, and monitoring the current UAV survival condition. After receiving the signal instruction, the UAV terminal plans the current task to be executed through its own calculation and planning component to ensure the safe flight and predetermined trajectory of the UAV while completing the task. Finally, the UAV transmits the communication and task status information during the mission execution back to the ground controller for fault handling and model updating.

4.3. Implementation of UTDR

This work uses intelligent experiential learning and decision strategies to design and define a deep reinforcement learning framework for virtual network mapping. Our goal is to minimize the overhead of virtual link mapping while satisfying system load balancing. The reinforcement learning system is shown in Figure 1. A deep reinforcement learning framework can be defined as follows:
State: The state of deep reinforcement learning represents a space that accurately and objectively reflects the current environment. The mapping process can be performed according to a given directed acyclic graph (DAG). The status of UAV testing task processing node i is expressed as follows:
$$S_i^n = \left(lq_i^n,\ ld_i^n,\ ab_i^n,\ nq_{ij}^n \mid j \in Nb_i\right).$$
where $lq_i^n$ is the link quality of the whole network where task execution node i resides, $ld_i^n$ is the link load of the current network where node i resides, $ab_i^n$ is the currently remaining available network resources in the network where node i resides, and $nq_{ij}^n$ is the queue size of adjacent node j. $Nb_i$ is the set of nodes adjacent to i.
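As an illustration, the per-node state $S_i^n$ could be packed into a fixed-length vector before being fed to the Actor network. The field names below follow the definitions above, while the dataclass layout and the fixed neighbor count are assumptions made for the sketch.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class NodeState:
    link_quality: float          # lq_i^n, link quality of node i's network
    link_load: float             # ld_i^n, link load of the current network
    available_bandwidth: float   # ab_i^n, remaining available network resources
    neighbor_queues: np.ndarray  # nq_ij^n for each adjacent node j in Nb_i

    def to_vector(self, max_neighbors: int = 4) -> np.ndarray:
        # Pad or truncate neighbor queues so the state has a fixed dimension,
        # which the Actor/Critic networks require.
        q = np.zeros(max_neighbors)
        n = min(len(self.neighbor_queues), max_neighbors)
        q[:n] = self.neighbor_queues[:n]
        return np.concatenate(
            ([self.link_quality, self.link_load, self.available_bandwidth], q))
```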
Action: In this work, the action taken by node i at the current moment is recorded as $a_i^n(S_i^n) \in A_i$, which is related to its current state. $A_i$ is defined as
$$A_i = \left\{ lk_i^{Ch} \right\},\quad Ch \in C_{avl}.$$
where $lk_i^{Ch}$ represents node i selecting the path selection scheme Ch to complete the entire testing task assignment.
Immediate Cost: The Agent can use the current state and action to obtain a value of immediate cost from the environment as feedback. The cost function in this work is closely related to the multi-objective function, so the bandwidth resource and link energy consumption are integrated into the final cost function.
$$R_i^n\left(S_i^n, a_i^n(S_i^n)\right) = \begin{cases} k_1 \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{\alpha WL_{cpu,i}^f + \beta WL_{mem,i}^f}{\alpha R_{cpu,i} + \beta R_{mem,i}} - \frac{1}{n}\sum_{e=1}^{n}\frac{\alpha WL_{cpu,e}^f + \beta WL_{mem,e}^f}{\alpha R_{cpu,e} + \beta R_{mem,e}}\right)^{2}} + k_2 \sum_{l(i,i') \in L_{DAG}} w_{j,k} \sum_{j,k \in ND} f_{j,k}^{i,i'}\, bw(i,i'), & \text{task allocation succeeds}; \\ k_{fail}, & \text{task allocation fails}. \end{cases}$$
where k1 and k2 represent weight factors. The algorithm considers both the network cost and the load factor, which greatly helps the agent learn to minimize the bandwidth cost while balancing the system load.
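A minimal sketch of how this immediate cost could be assembled from the two objective terms is shown below; the helper signature and the example values of $k_1$, $k_2$, and $k_{fail}$ are illustrative assumptions.

```python
def immediate_cost(load_balance_degree, bandwidth_cost,
                   k1=1.0, k2=0.1, k_fail=100.0, success=True):
    """Immediate cost R_i^n(S_i^n, a_i^n(S_i^n)): a weighted sum of the
    load-balancing degree and the bandwidth cost when the task allocation
    succeeds, and a fixed penalty k_fail when it fails."""
    if not success:
        return k_fail
    return k1 * load_balance_degree + k2 * bandwidth_cost
```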
As shown in Figure 2, the main process of the routing strategy design based on the DDPG is described. Generally, the overall flow of the DDPG algorithm can be divided into three main parts: sampling, training, and parameter update. In the sampling process, the algorithm randomly selects state s, inputs it into the current Actor network, selects action a according to state s, and inputs action a into the environment, which returns a reward r and the next state s′. The quintuple (S, A, R, S′, γ) is stored in the experience replay buffer for subsequent training. When the data in the experience replay buffer reach a certain amount, the algorithm enters the training process, takes out batch data from the replay buffer, and inputs them into the current Critic network and Target-Critic network, respectively, then calculates the loss, and updates the network parameters. Finally, the parameter update process adopts the soft update method, and the network parameters are updated every turn to ensure the stability and convergence of the algorithm. When implementing the DDPG algorithm, four networks need to be constructed first: Actor network, Target-Actor network, Critic network, and Target-Critic network. The current Actor network inputs the current state, and outputs the current action. The Target-Actor network inputs the next state and outputs the next action. The current Critic network inputs the current status and current action, and outputs the current Q value. The Target-Critic network inputs the next state and next action, and outputs the next Q value. These four networks work together to enable the algorithm to learn the optimal policy.
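As an illustration of these four networks, a minimal PyTorch sketch follows; the layer sizes, activations, and the tanh-bounded action output are assumptions for the sketch, not the paper's architecture. The two target networks are simply deep copies of the main networks at initialization.

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a state to a continuous action (path-selection decision)."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh())  # action bounded in [-1, 1]

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Maps a (state, action) pair to a scalar Q value."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# The target networks start as exact copies of the main networks
actor, critic = Actor(state_dim=7, action_dim=1), Critic(state_dim=7, action_dim=1)
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
```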
In this work, we establish the target network used to obtain the target value for training the Critic network. The target network includes the Target-Actor network and the Target-Critic network. The next state (i.e., St+1) stored in the experience replay buffer is fed to the target network as input, and the resulting critic value serves as the target output for Critic training. The experience replay buffer stores experience tuples consisting of the current state, the action selected for the task assignment decision, the immediate cost, and the next state. Well-formed experience tuples can be randomly sampled for training the main network and the target network, which helps reduce the influence of data correlation.
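A compact replay-buffer sketch (illustrative, not the paper's implementation) showing the stored tuple and the uniform random sampling used to reduce data correlation:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, immediate cost, next state) tuples and returns
    uniformly random mini-batches, which weakens correlation between samples."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, cost, next_state):
        self.buffer.append((state, action, cost, next_state))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)
        states, actions, costs, next_states = zip(*batch)
        return states, actions, costs, next_states

    def __len__(self):
        return len(self.buffer)
```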
In Figure 2, we can also see that the agent in the UTDR method proposed in this paper can explore the environment according to the current network conditions and design the corresponding path selection decision actions. The path selection scheme is then sent to the selected available nodes to achieve connectivity across the network. Using the path selection scheme in action, the UAV nodes can realize network resource planning after the task assignment process and cooperate with the subsequent link scheme selection. Finally, the system feedback is used to calculate the cost of the distributed testing task assignment strategy adopted. The procedure is as follows:
Step 1: Pass the state of the network environment to the Actor of the main network and perform experience replay buffering. In general, during a training cycle, the transition information of each interaction with the environment is stored in the experience replay buffer. At the same time, the learning batch of the neural network consists of transition information sampled from the experience replay buffer.
Step 2: Based on the current state and the experience tuples, the agent uses the deep models of the main network and the target network to decide on the next action. The agent's action is to select, in order, the set of paths used to transmit the computation results according to the given DAG. The Target-Actor and Target-Critic of the target network are constructed, and the target policy parameters θμ′ and the target Q parameters θQ′ are updated from the experience tuples, respectively. Therefore, using the Target-Critic of the target network, we can calculate the target Q as
$$\mathrm{Target\_Q} = r_t + \gamma\, Q'\left(s_{t+1},\ \mu'\left(s_{t+1} \mid \theta^{\mu'}\right) \mid \theta^{Q'}\right).$$
This result is then used as the critic target value for the main network. To update the Critic module of the deep network, after receiving Target_Q the Critic network updates the primary Q parameters θQ by solving the optimization problem of minimizing the loss function Loss(θQ). Using the tuples in the experience replay buffer, the policy gradient is computed through the Actor of the main network to determine the next action.
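Putting Step 2 together, one DDPG training step could look like the following PyTorch sketch, covering the Target_Q computation, the Critic loss, the Actor policy gradient, and the soft update of the target networks; the hyperparameters γ and τ and the optimizer objects are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    """One DDPG training step: Critic regression toward Target_Q, an Actor
    policy-gradient step, then a soft update of both target networks."""
    s, a, r, s_next = batch  # tensors sampled from the experience replay buffer

    # Target_Q = r + gamma * Q'(s', mu'(s'))
    with torch.no_grad():
        target_q = r + gamma * target_critic(s_next, target_actor(s_next))

    # Critic: minimize the mean squared error between Q(s, a) and Target_Q
    critic_loss = F.mse_loss(critic(s, a), target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: maximize Q(s, mu(s)), i.e., minimize its negative
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the Target-Actor and Target-Critic parameters
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * p.data)
```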
Step 3: The UTDR agent uses the determined actions to try out the available policy on each communication network path of the UAV cluster. After executing the path selection, the agent passes from one state to another. The agent then receives feedback from the environment in the form of an immediate system cost based on the new state. When the current link scheme fails to be established, the agent receives the cost return value kfail. Otherwise, the agent whose link scheme is successfully established receives an immediate cost with a specific value and updates the state of the environment. The whole reinforcement learning process is shown in Algorithm 1.
Algorithm 1: Reinforcement Learning Procedure of UTDR
 1: Initialize the parameters of all networks in the UTDR model: θQ, θQ′, θμ, θμ′;
 2: Initialize the experience replay buffer B;
 3: for episode = 1, 2, … Max, Max is the number of training cycles
 4:    Initialize all agents’ state st;//Initialize the state
 5:    Each agent selects action at according to the current policy;//Select current action
 6:    Perform action at to obtain reward rt and new state st+1;//Obtain the reward and the next state after performing the action
 7:    Store (st, at, rt, st+1) in B;//Store the quadruple in B
 8:    st = st+1;//Update the state
 9:    for agent i = 1, 2, …, N
10:      Randomly extract some experience samples from B;//Random sampling
11:      Calculate the loss of the Critic network via the equation $Loss = \frac{1}{N}\sum_{x}\left(y_x - Q(s_x, a_x \mid \theta^Q)\right)^{2}$;//Calculate the Critic loss
12:      Calculate the gradient of the Actor network via the equation $\nabla_{\theta^\mu} J \approx \frac{1}{N}\sum_{x} \nabla_{a} Q(s, a \mid \theta^Q)\big|_{s=s_x,\ a=\mu(s_x)}\, \nabla_{\theta^\mu}\mu(s \mid \theta^\mu)\big|_{s_x}$;//Calculate the Actor gradient
13:      Update parameters for the Actor network and Critic network;//Update parameters
14:      Update parameters of Target network;//Update parameters
15:    end for
16: end for
17: until convergence

5. Results

In this section, to verify the capabilities of the proposed method in terms of load balancing and bandwidth resource saving, we compare the algorithm proposed in this paper with two distributed intelligence testing algorithms, LogiScope [30] and AutoRunner [31], in the following aspects: (1) load balancing effect; (2) bandwidth consumption; (3) system external service performance; (4) testing task allocation failure rate.
In order to verify the performance of the distributed test algorithm for UAV clusters, this paper adopts a Master–Slave structure to run the test algorithm. The method proposed in this work is compared with LogiScope and AutoRunner in a UAV cluster. We evaluate and test them through four different sets of experiments. Finally, the experimental results show that the proposed method can not only make the system have a better load balancing effect, but also have higher bandwidth resource utilization in data communication, and greatly reduce the probability of distributed system test failure. Compared with other test methods, the method proposed in this work has advantages in the effectiveness and stability of testing task allocation.

5.1. Experimental Setup

In this work, the UAV flight control code is simplified within a distributed system, and the UAV initialization function module of the flight control system is selected to test and verify the performance of each method. This module covers functions such as the power-on operation button, motor status, power, position information, and other flight control parameters, which are used to verify UAV test mission scheduling and adjustment. We use five hosts to simulate the experimental environment. Two of the hosts serve as the management node and the standby management node. The remaining hosts act as processing nodes in the distributed system and are used to set up virtual nodes. The test cases are obtained from UML models and a state machine, the results of test sequence instantiation are then integrated, and the generated test case data set is used as the data input of the distributed system test.

5.2. Load Balancing Effect

In this section, we compare the capabilities of LogiScope, AutoRunner, and the method proposed in this work in terms of the degree of system load balancing. This work uses the standard deviation of the workload of each testing task execution node to represent the load balancing degree. It can be seen from Figure 3 that the standard deviation of each method gradually decreases over time. The standard deviation values of LogiScope are consistently higher than those of the other two methods. The load balancing effect of the method proposed in this work is basically the same as that of AutoRunner in the initial stage and gradually becomes superior to both of the other methods in the later stage. This is mainly because the proposed method can calculate the effect and overall cost of each feasible scheme with the deep reinforcement learning network and, in the process of searching for the optimal scheme, starts from the current best scheme obtained in each iteration and gradually searches for schemes with a lower load balancing degree and lower bandwidth resource cost. An appropriate task processing node is thus found for each testing task, which effectively safeguards the resource utilization of the whole cluster. In general, we can observe that the method proposed in this work performs well in load balancing. It reduces the additional consumption of valuable computing resources, enables efficient processing of distributed testing tasks, and improves the overall benefits of the system in the long run.

5.3. Bandwidth Consumption

In this set of experiments, we compare the performance of the three methods in terms of bandwidth resource cost in distributed networks. As shown in Figure 4, all three methods, including LogiScope, consume relatively few bandwidth resources in the early stages of testing task execution. However, as time passes, the bandwidth resource cost of all three increases, and the bandwidth cost of LogiScope in the later stage gradually becomes higher than that of the other two methods. For the AutoRunner method, the bandwidth resource cost of UAV communication in a distributed environment is higher in the initial stage, but in the later stage its value falls below that of LogiScope while always remaining greater than the cost of the method proposed in this work. At t = 600 s, the growth rate of the bandwidth resource cost of the latter two methods begins to decline gradually, and their later cost values are lower than that of LogiScope. For the method proposed in this work, the bandwidth cost does not differ much from the other two methods in the early stage and gradually becomes lower than both in the later stage. In terms of the economy of multi-UAV cluster communication bandwidth resources, this method therefore has certain advantages over the other two. The method proposed in this work is based on deep reinforcement learning and considers not only the economy of the communication bandwidth resources of multiple UAVs but also the load balancing of the cluster. The algorithm implements the optimization strategies proposed in this paper and can effectively search for the optimal deployment scheme for the current system testing tasks. In the long run, we can conclude that the proposed method is very effective in saving bandwidth resources during distributed testing task processing.

5.4. System External Service Performance

In this section, the method presented in this work is compared with the other two methods in terms of the external service performance of the system. In this paper, system throughput is regarded as an important evaluation criterion of external service performance. As shown in Figure 5, the external service performance of the three compared methods differs. For the LogiScope approach, the external service performance of the system appears relatively good in the early stages but is not very stable later on. For AutoRunner, the throughput is low in the initial phase; over time, its system performance gradually stabilizes and outperforms the LogiScope method. The algorithm proposed in this work can reasonably assign UAV cluster testing tasks, and its external service performance is generally better than the other two distributed test methods. As external service time passes, its throughput gradually stabilizes and, in the later period, is slightly higher than that of LogiScope and AutoRunner. On the whole, we can conclude that the method proposed in this work is relatively efficient and stable in terms of external service effects, which helps UAVs efficiently perform large-scale distributed testing tasks.

5.5. Testing Task Allocation Failure Rate

In the actual deployment of testing tasks, because some selected testing task processing nodes cannot meet the corresponding task requests, the deployment of some distributed testing tasks often fails. In this section, we use CloudSim to simulate dynamic node failures and compare the proposed method with the other two approaches in terms of the number of task deployment failures. As shown in Figure 6, the number of task deployment failures of LogiScope and AutoRunner increases rapidly as the task size increases. For the algorithm proposed in this work, as the task scale expands, the number of task deployment failures grows significantly more slowly than for the other two methods and remains the lowest. This is mainly because the other two methods cannot conduct a reasonable and comprehensive analysis of the current remaining resource information of each testing task processing node in the distributed UAV cluster network together with the testing task requirements. If the resource requirement of a task is greater than or equal to the remaining resources of the node that processes it, the testing task deployment may fail. In contrast, the method proposed in this work can ensure that the remaining resources of each selected physical node exceed the resource demand of the corresponding task and can obtain certain system benefits through trial and feedback comparison with the current environment based on the deep reinforcement network model. Therefore, to a large extent, it can dynamically and adaptively find an appropriate task processing node in the cluster for most testing task requests and complete the efficient execution of the testing tasks.

6. Conclusions

This work designs and proposes an innovative distributed test method for UAV clusters, which is used to explore large-scale testing task execution and resource scheduling strategies. In this work, on the basis of proposing the task execution problem in the UAV cluster environment, we construct the task assignment method by using the DDPG, and establish the mathematical model with the goal of optimizing the system load balance degree and the communication bandwidth resources among UAVs. Through the continuous interaction of agents with the state environment, the system feedback is obtained to achieve the iterative optimization of the testing task allocation scheme.
Since the DDPG is still inefficient and unstable in the actual testing process, it is necessary to study a periodic updating mechanism and its influence on the final result. At the same time, we plan to use mobile edge computing to extend the algorithm proposed in this work to a larger-scale experimental environment to perform more testing tasks to verify its effectiveness and stability.

Author Contributions

Conceptualization, D.L. and P.Y.; methodology, D.L.; software, P.Y.; validation, D.L. and P.Y.; formal analysis, P.Y.; investigation, D.L.; resources, D.L.; data curation, P.Y.; writing—original draft preparation, D.L.; writing—review and editing, P.Y.; visualization, P.Y.; supervision, D.L.; project administration, D.L.; funding acquisition, D.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the project “Artificial Intelligence Software and Hardware to Adapt Migration Technology and Tools”, grant number NIVY227201160.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the findings of this study are available within the paper. For additional information, please contact the corresponding author.

Acknowledgments

The authors are grateful to China Electronic Product Reliability and Environmental Testing Research Institute for the multifaceted support provided in this study.

Conflicts of Interest

The authors were employed by China Electronic Product Reliability and Environmental Testing Research Institute. All authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Mu, J.; Zhang, R.; Cui, Y.; Gao, N.; Jing, X. UAV meets integrated sensing and communication: Challenges and future directions. IEEE Commun. Mag. 2023, 61, 62–67. [Google Scholar] [CrossRef]
  2. Azari, M.M.; Geraci, G.; Garcia-Rodriguez, A.; Pollin, S. UAV-to-UAV communications in cellular networks. IEEE Trans. Wirel. Commun. 2020, 19, 6130–6144. [Google Scholar] [CrossRef]
  3. Zhang, C.; Zhang, L.; Zhu, L.; Zhang, T.; Xiao, Z.; Xia, X.-G. 3D deployment of multiple UAV-mounted base stations for UAV communications. IEEE Trans. Commun. 2021, 69, 2473–2488. [Google Scholar] [CrossRef]
  4. Zhao, J.; Gao, F.; Jia, W.; Yuan, W.; Jin, W. Integrated sensing and communications for UAV communications with jittering effect. IEEE Wirel. Commun. Lett. 2023, 12, 758–762. [Google Scholar] [CrossRef]
  5. Cheng, X.; Shi, W.; Cai, W.; Zhu, W.; Shen, T.; Shu, F.; Wang, J. Communication-efficient coordinated RSS-based distributed passive localization via drone cluster. IEEE Trans. Veh. Technol. 2021, 71, 1072–1076. [Google Scholar] [CrossRef]
  6. Ramdane, Y.; Boussaid, O.; Boukraà, D.; Kabachi, N.; Bentayeb, F. Building a novel physical design of a distributed big data warehouse over a Hadoop cluster to enhance OLAP cube query performance. Parallel Comput. 2022, 111, 102918. [Google Scholar] [CrossRef]
  7. Myint, K.N.; Zaw, M.H.; Aung, W.T. Parallel and distributed computing using MPI on raspberry Pi cluster. Int. J. Future Comput. Commun. 2020, 9, 18–22. [Google Scholar] [CrossRef]
  8. Gu, W.; Valavanis, K.P.; Rutherford, M.J.; Rizzo, A. UAV model-based flight control with artificial neural networks: A survey. J. Intell. Robot. Syst. 2020, 100, 1469–1491. [Google Scholar] [CrossRef]
  9. Huang, J.; Tian, G.; Zhang, J.; Chen, Y. On unmanned aerial vehicles light show systems: Algorithms, software and hardware. Appl. Sci. 2021, 11, 7687. [Google Scholar] [CrossRef]
  10. Susanto, T.; Setiawan, M.B.; Jayadi, A.; Rossi, F.; Hamdhi, A.; Sembiring, J.P. Application of Unmanned Aircraft PID Control System for Roll, Pitch and Yaw Stability on Fixed Wings. In Proceedings of the 2021 IEEE International Conference on Computer Science, Information Technology, and Electrical Engineering (ICOMITEE), Banyuwangi, Indonesia, 27–28 October 2021. [Google Scholar]
  11. Štroner, M.; Urban, R.; Línková, L. A new method for UAV Lidar precision testing used for the evaluation of an affordable DJI ZENMUSE L1 scanner. Remote Sens. 2021, 13, 4811. [Google Scholar] [CrossRef]
  12. Sartaj, H. Automated approach for system-level testing of unmanned aerial systems. In Proceedings of the 2021 IEEE 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), Melbourne, Australia, 15–19 November 2021. [Google Scholar]
  13. Kim, S.; Philip, O.A.; Tullu, A.; Jung, S. Development and verification of a ros-based multi-dof flight test system for unmanned aerial vehicles. IEEE Access 2023, 11, 37068–37081. [Google Scholar] [CrossRef]
  14. Li, Y.; Zhu, X. Design and testing of cooperative motion controller for UAV-UGV system. Mechatron. Intell. Transp. Syst. 2022, 1, 12–23. [Google Scholar] [CrossRef]
  15. Santoso, F.; Garratt, M.A.; Anavatti, S.G. Hybrid PD-fuzzy and PD controllers for trajectory tracking of a quadrotor unmanned aerial vehicle: Autopilot designs and real-time flight tests. IEEE Trans. Syst. Man Cybern. Syst. 2019, 51, 1817–1829. [Google Scholar] [CrossRef]
  16. Li, F.; Zhang, K.; Wang, J.; Li, Y.; Xu, F.; Wang, Y.; Tong, N. Multi-UAV Hierarchical Intelligent Traffic Offloading Network Optimization Based on Deep Federated Learning. IEEE Internet Things J. 2024, 11, 21312–21324. [Google Scholar] [CrossRef]
  17. Xi, M.; Dai, H.; He, J.; Li, W.; Wen, J.; Xiao, S.; Yang, J. A lightweight reinforcement learning-based real-time path planning method for unmanned aerial vehicles. IEEE Internet Things J. 2024, 11, 21061–21071. [Google Scholar] [CrossRef]
  18. Arunarani, A.; Manjula, D.; Sugumaran, V. Task scheduling techniques in cloud computing: A literature survey. Future Gener. Comput. Syst. 2018, 91, 407–415. [Google Scholar] [CrossRef]
  19. Yuan, H.; Bi, J.; Zhou, M.; Liu, Q.; Ammari, A.C. Biobjective task scheduling for distributed green data centers. IEEE Trans. Autom. Sci. Eng. 2020, 18, 731–742. [Google Scholar] [CrossRef]
  20. Saleem, U.; Liu, Y.; Jangsher, S.; Li, Y.; Jiang, T. Mobility-aware joint task scheduling and resource allocation for cooperative mobile edge computing. IEEE Trans. Wirel. Commun. 2020, 20, 360–374. [Google Scholar] [CrossRef]
  21. Hosseinioun, P.; Kheirabadi, M.; Tabbakh, S.R.K.; Ghaemi, R. A new energy-aware tasks scheduling approach in fog computing using hybrid meta-heuristic algorithm. J. Parallel Distrib. Comput. 2020, 143, 88–96. [Google Scholar] [CrossRef]
  22. Ladosz, P.; Weng, L.; Kim, M.; Oh, H. Exploration in deep reinforcement learning: A survey. Inf. Fusion 2022, 85, 1–22. [Google Scholar] [CrossRef]
  23. Xiao, Y.; Zhang, Q.; Liu, F.; Wang, J.; Zhao, M.; Zhang, Z.; Zhang, J. NFVdeep: Adaptive online service function chain deployment with deep reinforcement learning. In Proceedings of the International Symposium on Quality of Service, Phoenix, AZ, USA, 24–25 June 2019. [Google Scholar]
  24. Zhou, C.; Wu, W.; He, H.; Yang, P.; Lyu, F.; Cheng, N.; Shen, X. Deep reinforcement learning for delay-oriented IoT task scheduling in SAGIN. IEEE Trans. Wirel. Commun. 2020, 20, 911–925. [Google Scholar] [CrossRef]
  25. Zhao, N.; Ye, Z.; Pei, Y.; Liang, Y.-C.; Niyato, D. Multi-agent deep reinforcement learning for task offloading in UAV-assisted mobile edge computing. IEEE Trans. Wirel. Commun. 2022, 21, 6949–6960. [Google Scholar] [CrossRef]
  26. Campos-Martínez, S.-N.; Hernández-González, O.; Guerrero-Sánchez, M.-E.; Valencia-Palomo, G.; Targui, B.; López-Estrada, F.-R. Consensus Tracking Control of Multiple Unmanned Aerial Vehicles Subject to Distinct Unknown Delays. Machines 2024, 12, 337. [Google Scholar] [CrossRef]
  27. Galicia-Galicia, L.-A.; Hernández-González, O.; Garcia-Beltran, C.D.; Valencia-Palomo, G.; Guerrero-Sánchez, M.-E. Distributed Observer for Linear Systems with Multirate Sampled Outputs Involving Multiple Delays. Mathematics 2024, 12, 2943. [Google Scholar] [CrossRef]
  28. Zhou, L.; Leng, S.; Liu, Q.; Wang, Q. Intelligent UAV swarm cooperation for multiple targets tracking. IEEE Internet Things J. 2021, 9, 743–754. [Google Scholar] [CrossRef]
  29. Zhou, L.; Leng, S.; Wang, Q.; Liu, Q. Integrated sensing and communication in UAV swarms for cooperative multiple targets tracking. IEEE Trans. Mob. Comput. 2022, 22, 6526–6542. [Google Scholar] [CrossRef]
  30. Kapoor, M.; Parikh, N.; Jhaveri, V.; Mehta, V.; Sikka, P. LogiScope: Low Cost & Portable Logical Analyzer For Single Device and Multiple Platforms. Int. J. Emerg. Technol. Adv. Eng. 2014, 4, 695–701. [Google Scholar]
  31. Ding, Y.; Xiang, R. An Automated Test Scheme based on AutoRunner and TestCenter. In Wireless Communication and Sensor Network, Proceedings of the International Conference on Wireless Communication and Sensor Network (WCSN 2015), Changsha, China, 12–13 December 2015; World Scientific: Singapore, 2016. [Google Scholar]
Figure 1. The view of system architecture.
Figure 2. UAV cluster testing task processing method based on deep reinforcement learning.
Figure 3. The performance comparison of three methods in terms of load balancing effect.
Figure 4. The performance comparison of three methods in terms of bandwidth consumption.
Figure 5. The performance comparison of three methods in terms of system external service performance.
Figure 6. The performance comparison of three methods in terms of testing task allocation failure rate.
Table 1. Description of main functions.

Symbol | Definition
$WL_i^f$ | Load balancing degree of the ith UAV node
$U_i^f$ | Resource utilization of the ith node
$R_i$ | Total resources of the ith node
Mean(U) | Average resource utilization in a certain period of time
A | Load balancing degree of the whole system
$B(L_{DAG}, t)$ | Total bandwidth resource cost required in the collaborative processing of multi-UAV testing tasks
$w_{j,k}$ | Weight of the wireless network link