1. Introduction
With the rapid development of uncrewed aerial vehicle (UAV) technology and autonomous decision-making approaches, air combat has come to be characterized by a fast pace, strong adversarial games, and incomplete information [1]. For autonomous maneuver decision-making in air combat, a UAV uses methods such as optimal control, differential games, and machine learning to generate maneuvering commands based on the environmental situation (including opponent information) acquired by its onboard detection equipment [2]. The autonomous maneuver decision-making method is key to winning dogfights, since it leads the UAV to occupy an advantageous position. How to use the obtained air combat situation to generate decision-making commands accurately is currently the main difficulty of air combat maneuver decision-making [3]. Thus, it is important to investigate intelligent decision-making methods that improve the decision-making speed and quality of UAVs.
The autonomous maneuver decision-making approaches in air combat can be divided into mathematical, search-based, and data-driven approaches. In the mathematical approach, the maneuver decision-making problem is formulated as an optimization problem and solved with analytical methods, such as differential games [4,5], the bi-objective optimization method [6], and the situational function optimization method [7]. Although the analytical solutions obtained from these methods are clear, the calculations can be quite complex. In the search-based approach, the maneuver decision-making problem is modeled as a discrete-variable optimization problem and solved by matrix decision-making [8], heuristic algorithms [9], and dynamic programming [10]. However, for these search-based approaches, finding a satisfactory solution within a finite number of iterations becomes challenging as the problem size increases. The data-driven approach includes neural networks [11], fuzzy algorithms [12,13], and reinforcement learning [14,15]; it models the maneuver decision-making problem as a mapping between different air combat situations and maneuver decision commands.
Reinforcement learning is a good option for solving sequential decision problems, as it gives agents the ability to perform self-supervised learning. To obtain the maximum accumulated reward, the agent interacts with the environment and continuously adjusts its strategy according to the rewards it receives. Deep reinforcement learning (DRL) combines deep neural networks with reinforcement learning, utilizing the powerful representation and mapping capabilities of neural networks to approximate the reward function of state and action and to map from state to action. DRL is an effective approach to solving high-dimensional state–action problems, and it is widely used in fields such as electronic games, recommendation systems, and intelligent control. A parameter-shared Q-network (PS-DQN) method is proposed in Reference [16], where multi-UAV maneuver decision-making applies PS-DQN with virtual self-play to converge the strategy to a Nash equilibrium; however, it is assumed that each UAV flies at the same altitude and can detect the accurate position of enemy UAVs within a certain range. In Reference [17], the Soft Actor–Critic (SAC) approach is used to design the maneuver decision-making algorithm; compared with the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm, the simulation results show that the SAC algorithm has a shorter training time and a higher win rate. In Reference [18], a one-to-one air combat model and a missile attack zone are built, and a Parallel Self-Play SAC algorithm (PSP-SAC) is proposed to share samples and policies across multiple combat environments. In Reference [19], an Asynchronous Advantage Actor–Critic (A3C) algorithm is proposed, employing a multi-threaded asynchronous mechanism to reduce input correlation and accelerate training compared with methods based on experience replay. In Reference [20], a bidirectional recurrent neural network (BRNN) is used to achieve communication between UAV individuals, and a multi-UAV cooperative air combat maneuver decision model under the actor–critic architecture is established; the decision model obtains a cooperative maneuver policy through reinforcement learning and guides the UAVs to gain an overall situational advantage and defeat opponents through tactical cooperation. In Reference [21], a maneuver decision-making approach is proposed by applying an LSTM network to the actor and critic networks of the PPO method, giving it the ability to learn temporal air combat data; the simulation results show that this approach improves the decision-making agent's learning efficiency and decision quality. In Reference [22], a DRL-based maneuver decision-making approach using an LSTM-Dueling DQN network is proposed, and the simulation results show improved agent training efficiency. In Reference [23], an intelligent maneuver planning method for beyond-visual-range (BVR) air combat using an improved deep Q-network (DQN) based on the LSTM network is proposed; the results show that agents can effectively avoid enemy threats and gain tactical advantages. In Reference [24], a data-driven approach using an LSTM neural network is proposed to predict missile trajectories in a model-free manner, leveraging its capacity to learn long-term temporal dependencies and to handle both measurement noise and motion uncertainties.
The Transformer network has achieved great success in dealing with time series data [25], and introducing it into DRL has become a current research hotspot. In [26], the self-attention mechanism of the Transformer network is introduced into DRL for structured state representation inference. In [27], self-attention is applied to representation learning, extracting the relationships between multiple agents to learn and express strategies better. In [28], the Transformer architecture is introduced into offline reinforcement learning and applied directly as a sequence decision-making model.
According to the above references, the following limitations have been identified. Firstly, a completely observable air combat environment is assumed in a variety of approaches, where the situation of both the enemy and our agent is assumed to be fully available. However, in real applications, environment and agent state information is hard to obtain, and usually only partial information is available. Secondly, a fully connected neural network is applied to the DRL algorithm in most approaches. The fully connected network has a simple and intuitive structure with strong data-mapping ability, but it struggles to capture the dependency relationships in temporal data. The LSTM network is applied in some approaches [19,29]; its gating mechanism processes data with temporal dependencies and effectively prevents gradient vanishing. Compared with fully connected networks, the LSTM network can capture the dependency relationships between sequential inputs; compared with traditional RNNs and GRUs, it introduces additional gating units and memory units to solve the problem of gradient vanishing during propagation. However, its data are processed serially and its gating mechanism requires a large amount of computation, resulting in high training time costs. Thus, DRL with fully connected networks or recurrent neural networks does not work well when enemy information is lost. Thirdly, although the Transformer network has been introduced into DRL, its application to air combat decision-making has not been sufficiently studied. Consequently, the above approaches [16,17,18,19,20] cannot provide reliable performance for cases with incomplete information in air combat decision-making, especially when opponent information is lost. Therefore, it is important to investigate intelligent maneuver decision-making methods for the information-loss case to improve decision quality.
In this paper, the Transformer network is introduced to design the actor and critic networks in the DRL architecture. Since the maneuver decision-making problem in air combat is a continuous-variable optimization problem, the Deep Deterministic Policy Gradient (DDPG) method, which is based on the actor–critic framework, is well suited to this problem. Thus, an approach based on DDPG combined with a Transformer network is proposed. The main contributions of this paper are summarized as follows:
- (1)
Considering information loss, the Transformer network is introduced to design the actor and critic networks in DDPG. It can extract hidden relationships in the temporal trajectory samples and use them as a reference for the agent network when opponent information is unavailable, thereby improving the maneuver decision quality in the information-loss situation.
- (2)
The issues of limited experience samples, low sampling efficiency, and poor stability appear when the Transformer network is introduced into the DDPG algorithm. To address the issue of limited experience samples, a more effective reward function is designed by combining global and local rewards, and an exploration mechanism is designed to encourage the agent to explore the decision-variable space thoroughly and accumulate a large number of diverse experience samples. To address the issue of low sampling efficiency, a prioritized sampling mechanism is designed for the episode experience replay, assigning higher sampling probabilities to more informative episodes to improve training efficiency. To address the issue of poor stability, a dynamic learning rate adjustment mechanism is designed to enable rapid gradient descent in the initial training stage and precise parameter tuning in the later training stage.
- (3)
Based on numerous experiments, the proposed approach performs better than the traditional DDPG-based decision-making approach under a 10% probability of information loss, which demonstrates the effectiveness of the proposed maneuver decision-making approach.
3. Opponent UAV Decision-Making Model Based on the Genetic Algorithm Optimizing Matrix Game
It is important to build a competitive opponent UAV for training the DRL-based UAV decision-making model. Here, a maneuver decision-making approach that combines a genetic algorithm with a matrix game is adopted to generate the opponent UAV's actions. A matrix game is often used to solve zero-sum game problems in which the decision-maker can only choose an optimal strategy from a finite strategy set. To use the matrix game, a common basic maneuvering command set is built, and both UAVs are assumed to select a command from this set. The basic command set consists of seven maneuvers: maintaining the current state, maximum-overload acceleration/deceleration, left/right turns, pull-up, and dive.
3.1. Maneuver Decision-Making Using the Matrix Game Method
The matrix game method is based on game theory and is used to solve two-person finite zero-sum game problems, in which the decision-makers can only choose from a limited set of strategies. The blue UAV uses a genetic algorithm to refine the maneuvering strategy generated by the matrix game and adopts this as its air combat maneuver decision-making method, enabling it to quickly form an attack advantage over the target. Firstly, the instructions in the typical maneuver instruction set are used as candidate decisions. Then, the matrix game is used to evaluate the strategies of both sides and select a specific decision instruction as the candidate optimal action. Next, the genetic algorithm iteratively searches the neighborhood of the candidate optimal action for the optimal action instruction, further improving decision quality while ensuring decision speed.
Assuming both sides employ one of the seven typical maneuvering actions, the advantage value of the blue side over the red side can be calculated after one decision step. Suppose the red side selects the nth maneuvering method while the blue side selects the mth maneuvering method, and denote the corresponding advantage value as $\mathrm{adv}_{m,n}$. By traversing all the action sets of both the red and blue sides, the advantage matrix for the blue side can be calculated as follows:

$$\mathbf{Adv}=\left[\mathrm{adv}_{m,n}\right]_{7\times 7}=\begin{bmatrix}\mathrm{adv}_{1,1}&\cdots&\mathrm{adv}_{1,7}\\ \vdots&\ddots&\vdots\\ \mathrm{adv}_{7,1}&\cdots&\mathrm{adv}_{7,7}\end{bmatrix}$$
To ensure that the blue side selects a robust decision value, where the chosen action maximizes its overall advantage after maneuvering regardless of the maneuvering strategy adopted by the red side, the blue side should select the action A* corresponding to the row sum’s maximum value in the advantage matrix as the decision result.
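Formally, the row-sum selection rule above can be written compactly as follows (the notation $A^{*}$ and $\mathrm{adv}_{m,n}$ follows Algorithm 1; this restatement is added for clarity):

$$A^{*}=\arg\max_{m\in\{1,\ldots,7\}}\sum_{n=1}^{7}\mathrm{adv}_{m,n}$$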
3.2. Genetic Algorithm
The maneuver strategy obtained from the above matrix game method is coarse-grained with low control accuracy, indicating that there is still room for improvement in the acceleration decision. Therefore, a genetic algorithm is integrated into the matrix game framework, taking the selected action value as the central reference for determining the decision value. An optimization interval with a granularity of 0.1g is established, extending 1g on either side of the reference overload value. Through continuous iteration, the genetic algorithm outputs the maneuver decision values, which serve as the final decision adopted by the blue side.
To ensure rapid convergence of the genetic algorithm, the population size is set to 50 (p1–p50), and each chromosome is encoded as a real-valued vector containing three genes corresponding to the overloads in three directions. The fitness function is defined as the sum of the advantages of one blue-side maneuver over each of the seven typical red-side maneuvers after a single decision step.
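In symbols, the fitness definition above can be written as follows (the candidate-individual symbol $p$ and the function form $\mathrm{adv}(\cdot,\cdot)$ are introduced here only for illustration):

$$\mathrm{fitness}(p)=\sum_{n=1}^{7}\mathrm{adv}\!\left(p,\,a_{n}\right)$$

where $\mathrm{adv}(p,a_{n})$ is the blue-side advantage when blue executes the maneuver encoded by individual $p$ and red executes the nth typical maneuver $a_{n}$.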
When employing genetic algorithms for optimization, a new population is formed through selection, crossover, and mutation operations. Through continuous iteration, this newly generated population evolves towards higher fitness values, with the individual exhibiting the highest fitness ultimately output as the final result of the optimization search.
- 1.
Selection
During the selection process, the individuals in the population are first sorted by fitness from high to low and ranked from 1 to 50. The top 10 individuals (p1–p10) are retained for the next generation, the individuals ranked 11–40 (p11–p40) undergo crossover operations to generate 30 new individuals, and the remaining 10 individuals (p41–p50) undergo mutation operations.
- 2.
Crossover
During crossover operations, 30 chromosomes are randomly paired, and a gene is randomly selected for position exchange between the paternal and maternal parents in each pair.
- 3.
Mutation
During mutation operations, a gene with a value of c is randomly selected, and a mutation is performed within a uniform distribution range of [c − 0.5, c + 0.5].
The algorithm of opponent UAV decision-making is shown in Algorithm 1.
Algorithm 1: Genetic Algorithm Optimizing Matrix Game
Initialization: Initialize the set of 7 typical maneuver decisions {a1, …, a7} and a 7 × 7 advantage matrix $\mathrm{adv}_{m,n}$ filled with zeros.
Matrix Game:
for the blue side, select a_i = a1, …, a7 do:
  for the red side, select a_j = a1, …, a7 do:
    Calculate the advantage value $\mathrm{adv}_{i,j}$ after one decision step
    Fill the value into the advantage matrix $\mathrm{adv}_{m,n}$
  end for
end for
Calculate the row sums and choose the action A* corresponding to the maximum value
Genetic Algorithm:
Build a population of size 50 centered around decision A*
for iteration = 1, …, 20 do:
  Sort the individuals in the population by fitness, numbered p1–p50
  Select p1 to p10 to directly enter the next generation
  Select p11 to p40 for crossover; the offspring enter the next generation
  Select p41 to p50 for mutation as the next generation
  Calculate the fitness of each individual in the next-generation population
end for
Select the individual with the highest fitness as the final decision
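A minimal Python sketch of Algorithm 1 follows. The functions maneuver_overloads() and advantage() are hypothetical placeholders (a mapping from a typical-maneuver index to its three overload commands, and a one-step simulation returning the blue-side advantage value, respectively); the search bounds simply mirror the 1g interval and 0.5 mutation range described above.

```python
import random

MANEUVERS = list(range(7))            # indices of the 7 typical maneuvers
POP_SIZE, GENERATIONS = 50, 20

def maneuver_overloads(index):
    """Hypothetical placeholder: map a typical-maneuver index to its
    three overload commands (the three genes of a chromosome)."""
    raise NotImplementedError

def advantage(blue_overloads, red_overloads, state):
    """Hypothetical placeholder: simulate one decision step and return
    the blue-side advantage value adv for this action pair."""
    raise NotImplementedError

def fitness(individual, state):
    # Sum of advantages over the seven typical red maneuvers (Section 3.2).
    return sum(advantage(individual, maneuver_overloads(n), state) for n in MANEUVERS)

def opponent_decision(state):
    # --- Matrix game: choose the robust action A* by maximum row sum ---
    adv = [[advantage(maneuver_overloads(m), maneuver_overloads(n), state)
            for n in MANEUVERS] for m in MANEUVERS]
    a_star = max(MANEUVERS, key=lambda m: sum(adv[m]))

    # --- Genetic algorithm: refine around A* within +/- 1g per overload gene ---
    center = maneuver_overloads(a_star)
    pop = [[g + random.uniform(-1.0, 1.0) for g in center] for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        pop.sort(key=lambda p: fitness(p, state), reverse=True)
        elite = pop[:10]                          # p1-p10 enter the next generation directly
        parents = pop[10:40]                      # p11-p40 are paired for crossover
        random.shuffle(parents)
        children = []
        for pa, pb in zip(parents[0::2], parents[1::2]):
            k = random.randrange(3)               # swap one randomly chosen gene
            ca, cb = pa[:], pb[:]
            ca[k], cb[k] = pb[k], pa[k]
            children += [ca, cb]
        mutants = []
        for p in pop[40:50]:                      # p41-p50 undergo mutation
            q = p[:]
            k = random.randrange(3)
            q[k] += random.uniform(-0.5, 0.5)     # uniform mutation in [c-0.5, c+0.5]
            mutants.append(q)
        pop = elite + children + mutants
    return max(pop, key=lambda p: fitness(p, state))
```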
4. Decision-Making Approach Based on the Transformer Network and Deep Reinforcement Learning
The structure of the proposed maneuver decision-making approach is shown in Figure 3, which includes the red and blue UAV agents, the air combat environment, and the adopted performance-enhancement measures. The red UAV agent is built on the AC framework by incorporating a Transformer network into both the actor and critic components. The blue UAV agent adopts the approach combining the matrix game and the genetic algorithm to make its maneuvering decisions. The air combat environment comprises a UAV simulation based on kinematic equations and a victory judgment system, which provides state and reward feedback to the UAV agents. In addition, three measures are adopted to improve the quality of the red UAV agent's decisions: exploration encouragement, priority sampling, and dynamic learning rate adjustment.
4.1. Reward Function Design
The goal of DRL is to learn a policy $\pi_{\theta}$, parameterized by $\theta$, that maximizes the expected cumulative discounted reward defined as follows:

$$J(\theta)=\mathbb{E}_{\pi_{\theta}}\!\left[\sum_{t=0}^{T}\gamma^{t}r_{t}\right]$$

where $\gamma$ is the discount factor that discounts future rewards to the current time step, and $r_{t}$ represents the reward obtained at time step $t$.
The goal of close-range air combat is to achieve an advantageous attacking position, which occurs when the opponent is within the weapon's attack angle and the UAV maintains a comparable speed to counter evasive maneuvers. Meanwhile, to deal with the poor convergence caused by sparse rewards in DRL, a reward function is designed by combining local and global rewards. The reward function of the UAV agent at time $t$ is defined as follows:

$$r_{t}=\omega\, r_{t}^{\mathrm{local}}+\left(1-\omega\right)r_{t}^{\mathrm{global}}$$

where $\omega$ is the weight, and $r_{t}^{\mathrm{local}}$ and $r_{t}^{\mathrm{global}}$ represent the local reward and the global reward, respectively.
The local reward is a process reward that guides the UAV to occupy an advantageous position, whereas the global reward is an outcome reward related to the final combat result, encouraging the UAV agent to win the combat from a global perspective.
The local reward is represented as follows:

$$r_{t}^{\mathrm{local}}=\omega_{a}\, r_{a}+\omega_{v}\, r_{v}$$

where $\omega_{a}$ and $\omega_{v}$ are the weights for the local angle reward and the local speed reward, respectively. The local angle reward $r_{a}$ and the local speed reward $r_{v}$ are designed as follows:
Equation (11) defines the angle reward function, which ensures that the blue UAV remains within the red UAV's attack angle range. According to Figure 2, a smaller velocity leading angle for the red UAV indicates that it is positioned behind the blue UAV, thereby meeting the attack angle condition specified in Equation (2). Equation (12) defines the speed reward function, which aims to keep the speed of the red UAV close to that of the blue UAV. On the one hand, during turning maneuvers in adversarial training, a lower speed is required to achieve a smaller turning radius and thus a smaller velocity leading angle. On the other hand, to satisfy the attack distance requirement, a speed higher than that of the blue UAV is encouraged in order to close the distance.
Also, the global reward is represented as follows:
where the combat result in the above expression is determined by Equation (2).
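A minimal Python sketch of the combined reward is given below. The convex combination with weight w follows the structure above, but the angle and speed shaping functions and the terminal reward values are hypothetical placeholders, not the paper's Equations (11) and (12) or its tuned weights.

```python
import math

def angle_reward(leading_angle_rad):
    # Placeholder shaping: larger reward for a smaller velocity leading angle.
    return 1.0 - leading_angle_rad / math.pi

def speed_reward(own_speed, opp_speed):
    # Placeholder shaping: larger reward when the two speeds are close.
    return math.exp(-abs(own_speed - opp_speed) / max(opp_speed, 1e-6))

def combined_reward(leading_angle_rad, own_speed, opp_speed, outcome,
                    w=0.7, w_angle=0.6, w_speed=0.4):
    """Hypothetical sketch: weights and shaping forms are illustrative only."""
    # Local (process) reward: guide the UAV toward an advantageous position.
    r_local = w_angle * angle_reward(leading_angle_rad) \
              + w_speed * speed_reward(own_speed, opp_speed)
    # Global (outcome) reward: nonzero only when the episode ends (per Eq. (2)).
    r_global = {"win": 1.0, "lose": -1.0}.get(outcome, 0.0)
    return w * r_local + (1.0 - w) * r_global
```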
4.2. The Transformer-Based Actor and Critic Networks
A Transformer network is an efficient model for handling sequential data with temporal dependencies, and it is widely applied in fields including machine translation and text summarization. The core of the Transformer network is the self-attention mechanism, which computes attention weights for different positions in the input sequence. The key component of self-attention is scaled dot-product attention, described as follows:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\mathrm{T}}}{\sqrt{d_{k}}}+M\right)V$$

where $Q$, $K$, and $V$ represent the query matrix, key matrix, and value matrix, respectively; $d_{k}$ denotes the dimension of the query and key matrices; and $M$ is the lower triangular mask matrix used for computing masked self-attention, whose entries above the diagonal are set to $-\infty$ so that each position can only attend to earlier positions. In particular, the query ($Q$), key ($K$), and value ($V$) matrices are linear transformations of the input sequence. Assuming the input sequence is $X$, the matrices are computed as $Q=XW_{Q}$, $K=XW_{K}$, and $V=XW_{V}$, where $W_{Q}$, $W_{K}$, and $W_{V}$ are learnable weight matrices.
The multi-head attention mechanism consists of multiple sets of scaled dot-product self-attention. Each attention head enables the model to focus on a specific feature, while the interaction among multiple heads enhances the representational capacity. By concatenating the matrices produced with different attention focuses, the features of different inputs are obtained. The multi-head attention is described as follows:

$$\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}\!\left(\mathrm{head}_{1},\ldots,\mathrm{head}_{h}\right)W^{O}$$

where each $\mathrm{head}_{i}$ is a scaled dot-product attention as defined above, and $W^{O}$ is a matrix of linear transformation coefficients.
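A short PyTorch sketch of the masked scaled dot-product attention and its multi-head extension described above is shown below. It is a minimal illustration (no dropout, masking implemented by filling future positions with negative infinity), not the paper's exact network code.

```python
import math
import torch
import torch.nn as nn

def masked_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)             # QK^T / sqrt(d_k)
    seq_len = scores.size(-1)
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~causal, float("-inf"))           # mask future positions
    return torch.softmax(scores, dim=-1) @ v

class MaskedMultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)                    # W_Q
        self.w_k = nn.Linear(d_model, d_model)                    # W_K
        self.w_v = nn.Linear(d_model, d_model)                    # W_V
        self.w_o = nn.Linear(d_model, d_model)                    # W^O

    def forward(self, x):                                         # x: (batch, seq_len, d_model)
        b, t, _ = x.shape
        split = lambda y: y.view(b, t, self.n_heads, self.d_k).transpose(1, 2)
        out = masked_attention(split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x)))
        out = out.transpose(1, 2).contiguous().view(b, t, -1)     # concatenate heads
        return self.w_o(out)
```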
The structure of the actor and critic networks based on the Transformer-based GPT model is shown in Figure 4; both networks consist of multi-head attention, residual connections, layer normalization, a feedforward neural network, and other components. In the actor network, a linear layer and a position encoding layer are used to embed the states and to encode the time instants, respectively; then, after passing through dropout, masked multi-head attention, and the other layers, the action output is obtained. Similar to the actor network, the inputs of the critic network are the states, actions, and time instants, and after passing through the linear layer, encoding layer, masked multi-head attention, and other layers, the state–action value $Q(t)$ is obtained.
In the structure of actor and critic networks based on the GPT model, positional encoding plays a crucial role in addressing the position order problem in sequence data. Unlike traditional recurrent neural networks (RNNs), which inherently handle the sequential order of data, the Transformer architecture is entirely based on the self-attention mechanism, which is position-agnostic. Therefore, positional information needs to be explicitly injected into the model.
Positional encoding is typically achieved by adding a sequence of vectors, one for each position in the input sequence, to the input state embeddings, which allows the model to distinguish between states at different steps. Here, a commonly used method that generates positional encodings with sine and cosine functions is adopted. Specifically, each position's encoding is calculated as follows:

$$PE_{(pos,\,2i)}=\sin\!\left(\frac{pos}{10000^{2i/d}}\right),\qquad PE_{(pos,\,2i+1)}=\cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$

where $pos$ is the position of the time step, $i$ is the index within the encoding vector, and $d$ is the dimension of the encoding vector.
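A compact NumPy sketch of the sinusoidal positional encoding above (a generic illustration of the formula, assuming an even encoding dimension, not the paper's implementation):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings
    (d_model assumed even)."""
    pos = np.arange(max_len)[:, None]                 # positions 0..max_len-1
    i = np.arange(d_model // 2)[None, :]              # index of each sin/cos pair
    angle = pos / np.power(10000.0, 2.0 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                       # even indices: sine
    pe[:, 1::2] = np.cos(angle)                       # odd indices: cosine
    return pe

# Example: add the encodings to a sequence of state embeddings.
# embeddings = embeddings + positional_encoding(seq_len, d_model)
```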
4.3. Priority Sampling Mechanism Based on Episode Experience Replay
The Transformer network is trained on continuous time series of state–action data. However, in traditional DRL, the experience replay stores the state–action data of a single time instant as one experience, which cannot be used to train the Transformer network. Furthermore, even if several experience samples are drawn, it is hard to determine their time order while guaranteeing that they are continuous in the time and space dimensions, which leads to incorrect training of the Transformer network. Thus, it is necessary to store a state–action time series as one experience sample in the experience pool, which makes experience sampling for training the Transformer-based actor and critic networks convenient. In addition, random sampling in experience replay cannot ensure that high-quality experience samples are efficiently used and learned, which affects the convergence speed and quality of the DRL algorithm; this problem becomes worse for the Transformer-based actor and critic networks. Therefore, it is necessary to design an effective sampling approach in experience replay to improve the learning efficiency of the proposed DRL method.
To deal with the above problems, an episode-based replay memory is presented, and a priority sampling approach is proposed based on it. The episode-based replay memory collects an episode of state–action transitions as a single experience, which is stored in a sub-memory. When an episode of air combat finishes, the episode experience in the sub-memory is moved to the replay memory and the sub-memory is cleared. In the priority sampling approach, the episode-based experiences are sorted by cumulative reward, and the sampling probability is proportional to the sort priority of each experience. This guarantees that high-quality experience samples are drawn with a larger probability, improving the convergence speed and quality of the proposed DRL algorithm.
Figure 5 illustrates the episode-based replay and priority sampling mechanism. Firstly, a sub-experience memory with capacity $B_{1}$ is designed, which is used to store the experiences generated in an episode. If the end time of an episode is less than the maximum simulation time $t_{max}$, the remaining time steps are padded with zeros. When an episode is finished, the episode of experiences is added to the total replay memory; meanwhile, the sub-memory is cleared and prepared to store the next episode of experiences. The capacity of the total replay memory is defined as $B$. When the total replay memory is full, the first-in-first-out rule is used to update it. The episode-based replay memory improves sampling efficiency, since continuous time samples are required to train the Transformer-based network.
When the replay memory is full, the priority sampling approach is used to obtain samples from the replay memory to train the actor and critic networks. The cumulative reward of each episode-based experience in the replay memory is first calculated, and then all experiences are sorted in ascending order of cumulative reward; as a consequence, the experience with the largest cumulative reward has the maximum sort number. The sampling probability of each experience is computed as follows:

$$P_{j}=\frac{j}{\sum_{k=1}^{M}k}=\frac{2j}{M(M+1)}$$

where $j$ represents the sort number, $M$ is the number of experiences in the replay memory, and $P_{j}$ is the sampling probability.
Then, $N$ samples are drawn from the replay memory according to these sampling probabilities. From each drawn sample, a continuous time series with sequence length $C$ is extracted at a random starting point, and together these constitute a batch of samples. This priority sampling approach collects higher-reward experience samples for training the actor and critic networks, which improves the learning efficiency and accelerates the algorithm's convergence.
4.4. Dynamic Learning Rate Adjustment for Stable Training
For the Transformer-based network, it is necessary to adjust the learning rate as training proceeds. At the initial training stage, a relatively large learning rate is usually used so that the network weights can be adjusted quickly by gradient descent optimization. As training continues, the network acquires a better nonlinear mapping capability, and it becomes more important to focus on the precision of the network output. This means that at the late stage, the network should use a relatively small learning rate to guarantee that the output converges to the optimal solution. Therefore, a dynamic learning rate adjustment using a cosine annealing schedule is designed as follows:

$$lr_{cur}=lr_{min}+\frac{1}{2}\left(lr_{max}-lr_{min}\right)\left(1+\cos\!\left(\frac{T_{cur}}{T_{max}}\pi\right)\right)$$

where $lr_{cur}$ is the current learning rate, $lr_{max}$ and $lr_{min}$ are the preset maximum and minimum learning rates, $T_{cur}$ represents the current number of training episodes, and $T_{max}$ is the total number of training episodes. By analyzing Equation (18), it can be seen that the learning rate decreases as training continues, and the descent speed is small at the initial and late training stages but large at the middle stage. In other words, the network approaches the optimal solution quickly with a large learning rate at the initial stage, the learning rate then decreases quickly at the middle stage, and a small learning rate is used to search for the optimal solution at the late stage. This is beneficial for stabilizing the reinforcement learning training process and obtaining a good decision result.
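A small Python helper implementing the cosine annealing schedule above (the standard form; the rates in the example call are placeholder values):

```python
import math

def cosine_annealing_lr(t_cur, t_max, lr_max, lr_min):
    """Cosine annealing from lr_max down to lr_min over t_max training episodes."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t_cur / t_max))

# Example: the rate is large early, drops fastest mid-training, and flattens near lr_min.
# for episode in range(t_max):
#     lr = cosine_annealing_lr(episode, t_max, lr_max=1e-3, lr_min=1e-5)
```

In PyTorch, torch.optim.lr_scheduler.CosineAnnealingLR provides an equivalent built-in schedule.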
4.5. Trade-Off Between Exploration and Exploitation
A common challenge in deep reinforcement learning (DRL) is balancing the trade-off between exploration and exploitation. Exploration means trying actions that improve the model, whereas exploitation means behaving optimally given the current model. In this paper, the Deep Deterministic Policy Gradient algorithm is adopted, and the actor network outputs deterministic actions. Although the actor network parameters are initialized randomly, the actor outputs differ little for various inputs at the beginning of training, so they are not beneficial for exploring potentially good maneuver strategies. To encourage exploration, Gaussian white noise is added to the actor output, which is expressed as follows:

$$a_{t}=\mathrm{clip}\!\left(\mu\!\left(s_{t}\mid\theta\right)+\mathcal{N}_{t},\,a_{min},\,a_{max}\right)$$

where $a_{t}$ represents the executed action; $\mu(s_{t}\mid\theta)$ denotes the action output by the actor network; $\mathcal{N}_{t}$ is the Gaussian white noise, whose mean and standard deviation are 0 and $\sigma$, respectively; and $\mathrm{clip}(\cdot)$ represents the truncation function that limits the action values to the range $[a_{min},\,a_{max}]$.
Based on Equation (19), it can be seen that the noise influences the degree of exploration. At the initial stage of training, the noise standard deviation is set relatively high to encourage efficient exploration of the action space. As training goes on, the standard deviation of the noise decreases gradually so that the agent achieves a balance between exploration and exploitation. At the end stage, the standard deviation is maintained at a specified small value so that the agent exploits the optimal maneuver strategy given its memory. Thus, inspired by the epsilon-greedy policy, the noise standard deviation is controlled as follows:
where $\sigma_{init}$ and $\sigma_{end}$ are the initial and end values of the noise standard deviation, $T_{cur}$ and $T_{max}$ are the current and maximum numbers of episodes, and a threshold episode number specifies when the noise standard deviation reaches its end value. In the last training stage (i.e., after this threshold episode), the noise standard deviation is held at $\sigma_{end}$, which is not equal to zero. This prevents the experience samples in the replay memory from becoming overly homogeneous and thus protects the critic network from overfitting. Therefore, by adding Gaussian random noise to the actor output and controlling its standard deviation, the balance between exploration and exploitation is achieved.
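A minimal Python sketch of the exploration mechanism: Gaussian noise is added to the deterministic actor output, the result is clipped to the action bounds, and the noise standard deviation is decayed toward a small final value. The linear decay and the threshold fraction used here are assumptions that merely reproduce the qualitative behavior described above.

```python
import numpy as np

def noise_std(episode, max_episodes, sigma_init=0.5, sigma_end=0.05, decay_frac=0.8):
    """Assumed linear decay: sigma falls from sigma_init to sigma_end over the first
    decay_frac of training, then stays at sigma_end (never zero)."""
    t_threshold = decay_frac * max_episodes
    if episode >= t_threshold:
        return sigma_end
    return sigma_init - (sigma_init - sigma_end) * episode / t_threshold

def explore_action(actor_output, sigma, a_min=-1.0, a_max=1.0):
    # Add zero-mean Gaussian white noise and clip to the admissible action range (Eq. (19)).
    noisy = actor_output + np.random.normal(0.0, sigma, size=np.shape(actor_output))
    return np.clip(noisy, a_min, a_max)
```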
4.6. Network Training and Policy Update
Once enough experience samples are stored in the replay memory, the priority sampling mechanism is used to collect a batch of samples. These samples are used to train both the actor and critic networks, and the network parameters are updated through gradient descent and error backpropagation. The proposed maneuver decision-making approach introduces the Transformer network into the DDPG algorithm, which contains four Transformer-based networks: the online and target actor networks with parameters $\theta$ and $\theta'$, and the online and target critic networks with parameters $w$ and $w'$. The online actor network takes the state as input and outputs the maneuvering action $a_{t}=\mu(s_{t}\mid\theta)$. The online critic network takes the joint state–action pair as input and outputs its evaluation value $Q$.
A continuous time experience sample with sequence length $l$ is defined as follows:

$$e=\left(s_{t:t+l},\,a_{t:t+l},\,r_{t:t+l}\right)$$

where $s_{t:t+l}$, $a_{t:t+l}$, and $r_{t:t+l}$ are the continuous time sequences of states, actions, and rewards, and $t$ is a randomly selected time instant in an episode satisfying $t+l\le t_{max}$. The $N$ samples drawn in this way are defined as a batch, which is given as follows:

$$\mathcal{B}=\left\{e_{1},e_{2},\ldots,e_{N}\right\}$$
For each sample, the target $Q$ value at time step $t$ is calculated as follows:

$$y_{t}=r_{t}+\gamma Q'\!\left(s_{t+1},\,\mu'\!\left(s_{t+1}\mid\theta'\right)\mid w'\right)$$

where $Q'$ represents the $Q$ value generated by the target critic network, and $\mu'$ represents the output of the target actor network. By minimizing the mean-square-error loss function defined in Equation (24), the online critic network parameters are updated.
The online actor network is used to produce the maneuvering strategy that maximizes the expected cumulative discounted reward. The policy gradient of the actor network is computed as follows:

$$\nabla_{\theta}J\approx\frac{1}{N}\sum_{i=1}^{N}\nabla_{a}Q\!\left(s,a\mid w\right)\big|_{s=s_{i},\,a=\mu(s_{i}\mid\theta)}\,\nabla_{\theta}\mu\!\left(s\mid\theta\right)\big|_{s=s_{i}}$$
Based on Equations (24) and (25), the parameters of the online critic and actor networks are updated by gradient descent, and the target networks are softly updated as follows:

$$w'\leftarrow\tau w+\left(1-\tau\right)w',\qquad \theta'\leftarrow\tau\theta+\left(1-\tau\right)\theta'$$

where $\tau$ is the soft update parameter that controls the update degree.
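A condensed PyTorch sketch of one update step (critic loss, actor loss, and soft target update) under the assumptions above; actor, critic, and their targets are taken to be the Transformer-based networks, and the optimizers, batch tensors, and hyperparameters are placeholders rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    states, actions, rewards, next_states = batch      # tensors shaped (N, seq_len, ...)

    # Critic update: regress Q(s, a) toward the target y = r + gamma * Q'(s', mu'(s')).
    with torch.no_grad():
        next_actions = target_actor(next_states)
        y = rewards + gamma * target_critic(next_states, next_actions)
    critic_loss = F.mse_loss(critic(states, actions), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: ascend the policy gradient, i.e. maximize Q(s, mu(s)).
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks.
    for target, online in ((target_critic, critic), (target_actor, actor)):
        for tp, p in zip(target.parameters(), online.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * p.data)
```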
The algorithm of the proposed approach is shown in Algorithm 2.
Algorithm 2: Decision-making approach based on the Transformer network and deep reinforcement learning.
Initialization: Initialize a sub-memory R1 with capacity $B_1$ and an episode-based replay memory R with capacity $B$; the online actor and critic networks based on the Transformer network with weights $\theta$ and $w$; the target actor and target critic networks with weights $\theta'$ and $w'$; and a random process with a mean of 0 and an initial standard deviation of $\sigma_{init}$
for episode = 1, 2, … do:
  Initialize the continuous time sequence of states in air combat
  for t = 1, 2, …, $t_{max}$ do:
    Select action $a_{t}=\mu(s_{t}\mid\theta)+\mathcal{N}_{t}$ according to the online actor network and the exploration noise
    Execute action $a_{t}$, get a reward $r_{t}$ and a new state $s_{t+1}$
    Store the transition $(s_{t}, a_{t}, r_{t}, s_{t+1})$ in sub-memory R1
    If the episode-based replay memory R is full:
      Sample a minibatch of $N$ sequence samples from R with priority sampling probability $P_{j}$
      Compute the target value $y_{t}$
      Update the online critic network by minimizing the loss $L(w)$
      Update the online actor network by using the policy gradient $\nabla_{\theta}J$
      Update the target actor and target critic networks: $w'\leftarrow\tau w+(1-\tau)w'$, $\theta'\leftarrow\tau\theta+(1-\tau)\theta'$
    If the air combat is finished:
      break
  end for
  Transfer the sub-memory contents to the episode-based replay memory and clear the sub-memory
  Decay the standard deviation of the random process and the learning rates of the networks
end for