1. Introduction
In recent years, the technology and applications of unmanned aerial vehicles (UAVs) have developed rapidly. Compared with manned aerial vehicles, UAVs have higher maneuverability and better operational sustainability. Using UAVs for mission execution can effectively reduce personnel risk and maintenance costs, so UAVs can replace manned vehicles in harsh, dangerous, and tedious missions. A typical military application of UAVs is the UAV swarm saturation attack [1]. Specifically, the attacking side adopts a high-density, continuous attack strategy to cause the enemy's defense system to collapse. The UAV swarm then has a much higher probability of breaking through the defenses and destroying the enemy's high-value targets. To achieve cooperation and control of multiple UAVs, task assignment for the UAV swarm saturation attack is of great significance.
Task assignment for a UAV swarm allocates multiple UAVs to specific tasks according to the number and types of the vehicles, the tasks to be performed, and the conditions of the environment; it is essentially an optimization problem under multiple constraints [2]. Moreover, the task assignment problem is NP-hard [3], which means that finding the optimal solution requires searching the whole solution space. In practical scenarios, the task assignment problem for a UAV swarm is challenging, especially in complex dynamic environments. The traditional approach is to use deterministic algorithms. Darrah et al. considered the multiple-UAV dynamic task allocation problem in a Suppression of Enemy Air Defense (SEAD) mission in [4]. They used Mixed Integer Linear Programming (MILP) to assign vehicles to specific tasks. Schumacher et al. also proposed a MILP-based method to solve the constrained optimization problem for UAV task assignment in [5]. In [6], Nygard et al. proposed dynamic network flow optimization models and used a centralized optimization algorithm to solve the air vehicle resource allocation problem. Ye et al. developed an extended consensus-based bundle algorithm with task coupling constraints (CBBA-TCC) in [7] to solve the multi-task assignment problem with task coupling constraints in a heterogeneous multi-UAV system.
However, the computational complexity of the task assignment problem increases exponentially with the number of targets and UAVs. Exact algorithms that pursue the optimal solution can hardly complete the search in an acceptable time. To speed up the solving process, another commonly used approach is heuristic algorithms. Shima et al. proposed genetic algorithms to solve the multiple task assignment problem for cooperating UAVs in [3,8]. Jia et al. considered the cooperative multiple task assignment problem with stochastic velocities and time windows for heterogeneous UAVs in [9] and proposed a modified genetic algorithm to solve it. In [10], Zhu et al. focused on the multirobot task allocation problem (MRTAP) and proposed a self-adaptive evolutionary game particle swarm optimization algorithm to solve it. Zhao et al. considered the search and rescue scenario in [11] and proposed a heuristic distributed task allocation method to solve the problem. Zhen et al. proposed an intelligent self-organized algorithm to solve a cooperative search–attack mission planning problem for multiple UAVs in [12]. Fan et al. proposed a modified nature-inspired meta-heuristic methodology for the heterogeneous UAV system task assignment problem in [13]. Xia et al. proposed a system framework for multi-UAV cooperative task assignment and track planning for ground moving targets in [14]. Heuristic algorithms can obtain a solution for large-scale task assignment problems in an acceptable time, but they usually fall into a local optimum and stop searching early. Moreover, these algorithms require the solutions to be encoded into a vector; once the problem scale or characteristics change, the old encoding strategy is hard to apply to the new problem.
With the development of artificial intelligence technology in recent years, deep reinforcement learning (DRL) has achieved remarkable breakthroughs in many fields. Deep reinforcement learning is mainly used for sequential decision making, i.e., choosing actions based on the current environmental conditions and continuously adjusting the strategy based on feedback from those actions to achieve the goal. The performance of deep reinforcement learning on problems such as AlphaGo Zero [15] and Atari [16] in recent years has demonstrated its powerful learning and optimization decision-making capabilities. Additionally, deep reinforcement learning techniques have shown a significant advantage on combinatorial optimization problems. Combinatorial optimization, which is the optimal selection of decision variables in a discrete decision space, naturally resembles "action selection" in reinforcement learning. Therefore, deep reinforcement learning methods are a good choice for solving traditional combinatorial optimization problems. Compared with traditional optimization algorithms, DRL-based combinatorial optimization algorithms have a series of advantages such as fast solving speed and high generalization ability, and they have been a hot research topic in recent years. In 2015, Vinyals et al. [17] proposed the pointer network (Ptr-Net) model for solving combinatorial optimization problems by treating them as a machine translation process. The network was trained with supervised learning and achieved good optimization results on the TSP. Bello et al. [18] used reinforcement learning to train the pointer network model. They treated each problem instance as a training sample, used the REINFORCE algorithm for training, and introduced a critic network as a baseline to reduce the training variance. Kool et al. [19] proposed a new method for solving multiple combinatorial optimization problems using the attention mechanism, based on the Transformer model [20]; their algorithm outperforms previous pointer network models on multiple optimization problems.
Furthermore, deep reinforcement learning techniques have been applied to many practical optimization problems. Zhao et al. [21] considered a task allocation problem for UAVs under environmental uncertainty and proposed a Q-learning-based fast task allocation (FTA) algorithm. Tian et al. [22] presented a multi-robot task allocation algorithm for fire disaster response based on reinforcement learning. Yang et al. [23] focused on a resource management problem for ultra-reliable and low-latency internet-of-vehicles communication networks and presented a decentralized actor–critic reinforcement learning model with a new reward function to learn the optimal policy. Luo et al. [24] focused on the missile target assignment (MTA) problem and proposed data-driven policy optimization with deep reinforcement learning (PODRL) for the adversarial MTA. Liang et al. [25] proposed a deep reinforcement learning model for traffic light control. Huang et al. [26] focused on online computation offloading in wireless powered mobile–edge computing networks and proposed a deep reinforcement learning-based online offloading (DROO) framework.
In practice, task assignment for a UAV swarm saturation attack usually occurs in a hostile environment, which is complex and stochastic. In our paper, we propose a deep reinforcement learning method to solve the task assignment problem that meets the real-time and flexibility requirements of the actual situation. Simulations and experiments show that our deep-neural-network-based reinforcement learning agent converges rapidly and stably using the policy gradient method. Additionally, the solutions obtained from our policy network are effective for both large- and small-scale problems. The contributions of our work are listed as follows.
We construct a mathematical model to formulate the task assignment problem for the UAV swarm saturation attack. Furthermore, we model the task assignment process as a Markov Decision Process (MDP) from the reinforcement learning perspective;
We build a task assignment framework based on the deep neural network to generate solutions for adversarial scenarios. The policy network uses the attention mechanism to pass information and guarantees the effectiveness and flexibility of our algorithm under different problem scales;
We propose a training algorithm based on the policy gradient method so that our agent can learn an effective task assignment policy from the simulation data. We also design a critic baseline to reduce the variance of the gradients and increase the learning speed.
The rest of our article is organized as follows. The problem formulation of task assignment for the UAV swarm saturation attack is provided in Section 2. In Section 3, a task assignment framework based on deep reinforcement learning is constructed to provide solutions for the problem. The simulation and analysis of the proposed method are presented in Section 4. Finally, conclusions and future work are given in Section 5.
3. Proposed Method
In this section, a reinforcement learning agent is proposed for the task assignment problem. We construct the decision agent using a deep neural network with a stack-based architecture that uses attention layers to pass information. To obtain a well-trained network, we resort to policy gradient methods to train the policy network.
3.1. Network Architecture for Task Assignment
We parameterize our reinforcement learning agent with a deep neural network. Based on the MDP model constructed above, the input to the network is the state of each decision process and the output is the action. Inspired by the Transformer architecture [20,29], the network proposed in this paper is a stack-based architecture that uses attention layers to pass information between different potential assignments. In our architecture, an initial embedding layer encodes the state matrix. Following this, a set of "stacks" with the same structure but different parameters implement similar operations. Each stack consists of two communication layers: an Agent-wise Communication Layer and a Task-wise Communication Layer. To gain a more holistic understanding of the current situation, we use these stacks to pass information between different agents and tasks. Then, a linear feed-forward layer transforms the hidden representation of each UAV–Target pair into a scalar. Finally, a Softmax layer processes the final results to obtain the output. The full network architecture for task assignment is shown in Figure 2.
Input: As in the MDP model constructed above, we use the state of each decision process as the input to the deep neural network. A three-dimensional array $X \in \mathbb{R}^{M \times N \times d}$ is used in our paper to represent the state information, where the first dimension indexes the $M$ UAVs, the second dimension indexes the $N$ targets, and $d$ is the total number of variables we use to indicate the state information. In the three-dimensional array, we can index $X$ through $i$ and $j$ to view the information of a specific UAV–Target pair: the entry $x_{ij} \in \mathbb{R}^{d}$ concatenates the features that represent the $i$-th UAV's properties, the state information of the $j$-th target, and the UAV–Target pairwise state information.
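To make this layout concrete, the following NumPy sketch assembles such a state array; the feature names, sizes, and random values are our own illustrative placeholders, not the paper's actual state variables:

```python
import numpy as np

M, N = 4, 6                           # number of UAVs and targets (illustrative)
uav_feats = np.random.rand(M, 3)      # e.g., position x/y and remaining payload
tgt_feats = np.random.rand(N, 3)      # e.g., position x/y and target value
pair_feats = np.random.rand(M, N, 1)  # e.g., UAV-target distance

# Broadcast UAV features across targets and target features across UAVs,
# then concatenate along the feature axis to obtain X with shape (M, N, d).
X = np.concatenate(
    [
        np.repeat(uav_feats[:, None, :], N, axis=1),   # (M, N, 3)
        np.repeat(tgt_feats[None, :, :], M, axis=0),   # (M, N, 3)
        pair_feats,                                    # (M, N, 1)
    ],
    axis=-1,
)
print(X.shape)  # (4, 6, 7), so d = 7 in this sketch
```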
Embedding Layer: The input $X$ is first processed by the embedding layer, which is a feed-forward neural network. We use the embedding layer to project the last dimension of the input $X$ into the dimension $d_h$. In other words, every $x_{ij}$ is processed element-wise through a feed-forward layer with parameters $W^{\mathrm{emb}}$ and $b^{\mathrm{emb}}$ as follows:

$$h_{ij} = W^{\mathrm{emb}} x_{ij} + b^{\mathrm{emb}}, \qquad h_{ij} \in \mathbb{R}^{d_h}.$$
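Because the projection acts only on the last dimension, it can be written as a single linear layer shared across all UAV–Target pairs. A minimal PyTorch sketch, with illustrative sizes:

```python
import torch
import torch.nn as nn

d, d_h = 7, 128               # input feature size and hidden size (illustrative)
embed = nn.Linear(d, d_h)     # shared across all UAV-target pairs

X = torch.rand(4, 6, d)       # state array with M = 4 UAVs and 6 targets
H = embed(X)                  # nn.Linear acts element-wise on the last dimension
print(H.shape)                # torch.Size([4, 6, 128])
```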
Communication Layer: Inspired by the Transformer architecture [20], we construct the communication layer in our paper. Each communication layer is composed of two sublayers: a multi-head attention layer and a feed-forward layer. We also add a skip-connection [30] and batch normalization [31] for each sublayer.
We utilize the attention mechanism [20] as a message-passing method between different UAVs or targets in our task assignment problem. There are two kinds of communication layers: Agent-wise communication and Task-wise communication. The two kinds perform the same operations; only the dimensions they operate on are different. Here, we illustrate Task-wise communication as an example; the other is analogous. For the embedded input $H \in \mathbb{R}^{M \times N \times d_h}$, which is a three-dimensional array, we can divide it into $M$ groups according to the different UAVs. Each group can be represented as a two-dimensional matrix $H_i \in \mathbb{R}^{N \times d_h}$. Our network architecture applies a self-attention operation on each group separately using the same parameters. Self-attention first performs three feed-forward layers separately to generate the queries $Q = H_i W^{Q}$, keys $K = H_i W^{K}$, and values $V = H_i W^{V}$. Then, we operate the attention mechanism as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V.$$
We utilize multi-head attention to operate multiple attention mechanisms in parallel. In our paper, we use $h$ heads with parameters $W_l^{Q} \in \mathbb{R}^{d_h \times d_k}$, $W_l^{K} \in \mathbb{R}^{d_h \times d_k}$, and $W_l^{V} \in \mathbb{R}^{d_h \times d_v}$ for $l = 1, \dots, h$. Then, all the self-attention heads are concatenated together:

$$\mathrm{MultiHead}(H_i) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O},$$

where

$$\mathrm{head}_l = \mathrm{Attention}(H_i W_l^{Q}, H_i W_l^{K}, H_i W_l^{V}).$$

Thus, $d_k = d_v = d_h / h$ and $W^{O} \in \mathbb{R}^{h d_v \times d_h}$.
After the multi-head attention sublayer, we use a fully connected feed-forward (FF) layer to process the data. Furthermore, we add a skip-connection and batch normalization (BN) for each sublayer:

$$\hat{H} = \mathrm{BN}\big(H + \mathrm{MultiHead}(H)\big), \qquad C = \mathrm{BN}\big(\hat{H} + \mathrm{FF}(\hat{H})\big),$$

where $H$ is the result of the embedding layer and $C$ is the result of the communication layer.
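The sketch below shows one possible PyTorch implementation of a Task-wise communication layer under these definitions; the hidden size, head count, and the choice to apply BatchNorm over flattened UAV–Target pairs are our own illustrative assumptions:

```python
import torch
import torch.nn as nn

class TaskWiseCommLayer(nn.Module):
    """One communication layer: multi-head self-attention over the target
    axis, then a feed-forward sublayer, each with skip-connection + BN.
    A minimal sketch; an Agent-wise layer would attend over the UAV axis."""

    def __init__(self, d_h=128, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_h, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_h, 4 * d_h), nn.ReLU(),
                                nn.Linear(4 * d_h, d_h))
        self.bn1 = nn.BatchNorm1d(d_h)
        self.bn2 = nn.BatchNorm1d(d_h)

    def _bn(self, bn, x):                 # BatchNorm1d expects (batch, channels)
        m, n, d = x.shape
        return bn(x.reshape(m * n, d)).reshape(m, n, d)

    def forward(self, h):                 # h: (M, N, d_h)
        # Each UAV's row of targets is treated as one attention "sequence".
        a, _ = self.attn(h, h, h)         # self-attention within each group
        h = self._bn(self.bn1, h + a)     # skip-connection + batch norm
        return self._bn(self.bn2, h + self.ff(h))

H = torch.rand(4, 6, 128)
print(TaskWiseCommLayer()(H).shape)       # torch.Size([4, 6, 128])
```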
Linear and Softmax Layer: After the operation of the N stacks, where each stack performs the self-attention operations, we construct a Linear layer. The Linear layer is simply a fully connected feed-forward layer that further processes the state information and transforms the hidden representation of each UAV–Target pair into a scalar $u_{ij}$. In the end, we use the Softmax layer to produce the final result. However, before the Softmax operation, the infeasible UAV–Target pairs need to be masked out. We perform the mask operation on the temporary output $u_{ij}$:

$$\tilde{u}_{ij} = \begin{cases} u_{ij}, & \text{if the pair } (i, j) \text{ is feasible}, \\ u_{ij} - \mathcal{M}, & \text{otherwise}, \end{cases}$$

where $\mathcal{M}$ is a very large positive constant. After the mask operation, we use the Softmax layer to obtain the final result.
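A small sketch of this masking step, assuming the Softmax is taken jointly over all UAV–Target pairs and using a hypothetical feasibility mask:

```python
import torch

M_CONST = 1e9                       # the large positive constant M

u = torch.rand(4, 6)                # scalar score per UAV-target pair
feasible = torch.ones(4, 6, dtype=torch.bool)
feasible[:, 2] = False              # e.g., target 2 is already destroyed

u_masked = u - M_CONST * (~feasible).float()
probs = torch.softmax(u_masked.flatten(), dim=0).reshape(4, 6)
print(probs[:, 2])                  # ~0 probability for all infeasible pairs
```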
3.2. Optimization with Policy Gradients
In our paper, we resort to policy gradient methods to train the policy network. Our goal is to find an optimal policy that maximizes the expected return in the environment. We define the objective function of policy learning as follows:

$$J(\theta) = V^{\pi_\theta}(s_0).$$

Here, $s_0$ denotes the initial state and $V^{\pi_\theta}(s_0)$ indicates the state–value function, which represents the expected return starting from the state $s_0$ and following the policy $\pi_\theta$. We differentiate the objective function with respect to $\theta$; the gradient ascent method can then be used to maximize this objective function and obtain the optimal policy. The gradient of the objective function is formulated as follows:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\big],$$

where $Q^{\pi_\theta}(s, a)$ represents the action–value function, which indicates the expected return of performing action $a$ in state $s$ when the MDP follows the policy $\pi_\theta$.
We use the REINFORCE algorithm [32] to estimate the gradient of the objective function. The REINFORCE algorithm utilizes the Monte Carlo method to sample trajectories and estimate $Q^{\pi_\theta}(s, a)$. Considering the large variance of the gradient estimates of the original REINFORCE algorithm, we introduce a baseline function in the training algorithm to reduce the variance of the gradients and thereby increase the learning speed:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, \big(Q^{\pi_\theta}(s, a) - b(s)\big)\big],$$

where $b(s)$ indicates the baseline function, which does not depend on the action. In our paper, we utilize a parametric baseline function to estimate the expected return. An auxiliary network, called a critic, is introduced during training to accelerate learning. The critic network $b_\phi(s)$, parameterized by $\phi$, has a similar architecture to the policy network, but the end of the network differs: the output of the critic network is a scalar estimating the expected return. We train the critic baseline with stochastic gradient descent on a mean squared error objective between the returns sampled by the most recent policy and its predictions $b_\phi(s)$. The pseudocode of the training algorithm is provided in Algorithm 1.
Algorithm 1: REINFORCE with critic baseline
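Since the body of the algorithm box did not survive extraction, the following sketch shows one plausible shape of a single REINFORCE update with a critic baseline, consistent with the description above; `policy_net`, `critic_net`, `env`, and the optimizers are placeholders rather than the paper's actual implementation:

```python
import torch

def reinforce_step(policy_net, critic_net, env, opt_p, opt_c):
    """One REINFORCE update with a critic baseline (minimal sketch).
    policy_net maps a state to action probabilities; critic_net maps the
    initial state to a scalar return estimate; env is a placeholder
    simulator with reset()/step()."""
    state = env.reset()
    log_probs, total_return, done = [], 0.0, False
    while not done:                       # sample one full assignment episode
        probs = policy_net(state)
        dist = torch.distributions.Categorical(probs.flatten())
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done = env.step(action.item())
        total_return += reward

    baseline = critic_net(env.initial_state)        # b(s): expected return
    advantage = total_return - baseline.detach()    # variance-reduced signal

    # Policy update: gradient ascent on advantage-weighted log-probabilities.
    policy_loss = -advantage * torch.stack(log_probs).sum()
    opt_p.zero_grad()
    policy_loss.backward()
    opt_p.step()

    # Critic update: MSE between prediction and the sampled return.
    critic_loss = (baseline - total_return) ** 2
    opt_c.zero_grad()
    critic_loss.backward()
    opt_c.step()
```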
3.3. Searching Strategy
After the training process, the well-trained policy network can be used to infer the task assignment scheme. During inference, we consider two searching strategies to construct the final assignment scheme: greedy search and sampling.
Greedy Search: Our first approach simply selects the UAV–Target pair with the largest probability at each step. Given the initial state representing a task assignment problem, the well-trained policy network takes the state as input and outputs a probability distribution over the UAV–Target pairs. The greedy search strategy selects the index with the largest probability. Then, the next state is fed into the decision agent, and the above process is repeated until the allocation is complete.
Sampling: Since constructing a task assignment scheme with our reinforcement learning agent is inexpensive, we utilize the sampling approach to sample multiple candidate solutions from the policy network and select the best one. Our sampling strategy obtains each solution by sampling actions from the probability distribution output by the policy network. More candidate solutions may produce a better result, but they also consume more time. We can choose a proper number of candidates to balance performance and efficiency.
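A sketch of both searching strategies, reusing the placeholder `policy_net` and `env` interfaces from the training sketch above; the episode interface is an assumption, not taken from the paper:

```python
import torch

def decode(policy_net, env, strategy="greedy", n_samples=16):
    """Construct assignment schemes from a trained policy (sketch).
    'greedy' picks the argmax pair at every step; 'sampling' draws
    n_samples full schemes and keeps the one with the best return."""
    def rollout(sample):
        state, schedule, ret, done = env.reset(), [], 0.0, False
        while not done:
            probs = policy_net(state).flatten()
            a = (torch.distributions.Categorical(probs).sample()
                 if sample else probs.argmax())
            schedule.append(a.item())
            state, reward, done = env.step(a.item())
            ret += reward
        return schedule, ret

    if strategy == "greedy":
        return rollout(sample=False)
    # Sampling: keep the candidate schedule with the highest return.
    return max((rollout(sample=True) for _ in range(n_samples)),
               key=lambda pair: pair[1])
```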