3.1. Deep Deterministic Policy Gradient Algorithm
The DDPG algorithm was proposed by the Google DeepMind team to realize continuous action space control. It is composed of an Actor–Critic structure combined with the Deep Q-learning Network (DQN) algorithm.
Figure 4 shows that the Actor–Critic structure consists of two parts: the actor network and the critic network. The actor is the policy network: it takes the current state as input and, through the neural network, generates the action for that state. It takes advantage of Policy Gradient’s ability to select actions over a continuous interval; however, whereas Policy Gradient samples actions randomly from a learned action distribution, DDPG generates a deterministic action from the output of the actor. The critic is the value network, updated at every step. This single-step update solves the low learning efficiency caused by the per-episode update of the policy gradient. Guided by the reward function, the critic learns the potential reward of the current state: it takes the action output by the actor network as input and outputs an evaluation value. The critic evaluates the action selected by the actor and guides the update direction of the actor’s network parameters, so that after the update the actor chooses higher-value actions as much as possible. The evaluation value
$Q$ is the reward for taking action $a_i$ under $S_i$. The formula is as follows:

$$Q = Q(S_i, a_i \mid \theta^{Q})$$

where $\theta^{Q}$ denotes the parameters of the critic network.
The behavior in each state directly obtains a definite action value through the deterministic policy function $\mu$:

$$a_t = \mu(S_t \mid \theta^{\mu})$$

where $\mu$ represents the deterministic behavior policy, which is defined as a function and simulated by a neural network, and $\theta^{\mu}$ represents the parameters of the policy network, which are used to generate the deterministic action. $a_t$ is the change rate of angular velocity in state $S_t$, and the control input $u$ in (1) is changed by $a_t$.
To give the DDPG algorithm more randomness and greater learning coverage, random noise must be added to the selected action so that its value fluctuates. The action after adding noise can be expressed as:

$$a_t \sim \mathrm{clip}\big(\mathcal{N}(\mu(S_t \mid \theta^{\mu}), \sigma^2),\, a_{low},\, a_{high}\big)$$

where $\mathcal{N}$ denotes the Gaussian noise, which follows a normal distribution with the deterministic action $\mu(S_t \mid \theta^{\mu})$ as its expectation and $\sigma^2$ as its variance; $a_{low}$ is the minimum value of the action, and $a_{high}$ is the maximum value of the action.
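As a minimal sketch of this exploration step, assuming a scalar action and NumPy (the function name and the fixed random seed are ours, not from the paper):

```python
import numpy as np

def noisy_action(mu_action, sigma, a_low, a_high, rng=None):
    """Sample an exploration action from a Gaussian centred on the
    deterministic policy output, then clip it to the action bounds."""
    rng = rng or np.random.default_rng(0)
    a = rng.normal(loc=mu_action, scale=sigma)  # N(mu(S_t), sigma^2)
    return float(np.clip(a, a_low, a_high))     # keep a_t in [a_low, a_high]

# With sigma = 0 the action stays deterministic and only clipping applies.
print(noisy_action(0.5, 0.0, -1.0, 1.0))  # 0.5
```

Clipping matters because the sampled action feeds directly into the plant model, where out-of-range control inputs would be physically meaningless.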
The design of DDPG is based on the off-policy approach, which separates the behavior policy from the evaluation policy. Both the actor and the critic contain an estimated network and a target network. The estimated network parameters are trained, while the target network is soft-updated. The two network structures of the actor and the critic are therefore identical, but their parameters are updated asynchronously. The soft update formulas of the target networks of the actor and the critic are as follows:

$$\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau)\theta^{Q'}$$
$$\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau)\theta^{\mu'}$$

where $\tau$ represents the soft update rate, $\theta^{Q}$ and $\theta^{\mu}$ are the estimated network parameters of the critic and the actor, and $\theta^{Q'}$ and $\theta^{\mu'}$ are the target network parameters of the critic and the actor.
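The soft update can be sketched as follows; this is a NumPy illustration under the assumption that each network's parameters are stored as a list of arrays (the function name is ours):

```python
import numpy as np

def soft_update(target_params, estimated_params, tau):
    """theta' <- tau * theta + (1 - tau) * theta', applied per parameter array."""
    return [tau * est + (1.0 - tau) * tgt
            for tgt, est in zip(target_params, estimated_params)]

target = [np.zeros(3)]          # target network starts at 0 (illustrative)
estimated = [np.ones(3)]        # trained network sits at 1 (illustrative)
target = soft_update(target, estimated, tau=0.01)
print(target[0])                # each entry moves 1% of the way: 0.01
```

With a small $\tau$ the target network trails the estimated network slowly, which stabilizes the bootstrapped targets.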
The action selected by the target network of the actor, together with the observed environmental state, is used as the input of the target network of the critic, which determines the update direction of the critic’s target network parameters. The update formulas of the critic network parameters are:

$$y_i = r_i + \gamma Q'\big(S_{i+1}, \mu'(S_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big)$$
$$L = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - Q(S_i, a_i \mid \theta^{Q})\big)^2$$

where $y_i$ represents the real evaluation value calculated by the target network, $S_i$ indicates the environment state, $r_i$ represents the real reward, $a_i$ indicates the action selected under $S_i$, $\mu$ represents the deterministic policy function, and $\gamma$ denotes the reward decay rate, which controls the influence of the reward value of future steps on the evaluation value of the current step. A larger $\gamma$ means the critic pays more attention to future rewards; a smaller $\gamma$ means it pays more attention to current rewards. $L$ denotes the loss function, which is the mean of the squared errors between the real value $y_i$ and the estimated value.
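A minimal sketch of the target and loss computation, assuming NumPy arrays over a minibatch and an illustrative $\gamma = 0.99$ (the Q-values below are placeholders, not outputs of the paper's networks):

```python
import numpy as np

gamma = 0.99  # reward decay rate (assumed value for illustration)

def critic_targets(r, q_next):
    """y_i = r_i + gamma * Q'(S_{i+1}, mu'(S_{i+1})) over a minibatch."""
    return r + gamma * q_next

def critic_loss(y, q):
    """Mean squared error between targets y_i and estimates Q(S_i, a_i)."""
    return float(np.mean((y - q) ** 2))

r = np.array([1.0, 0.0])        # rewards from the replay buffer
q_next = np.array([2.0, 1.0])   # target-network values (placeholders)
y = critic_targets(r, q_next)
print(y)                        # targets r_i + gamma * Q'
print(critic_loss(y, y))        # 0.0 when estimates match targets exactly
```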
The update of the actor network parameters follows the deterministic policy gradient, whose formula is:

$$\nabla_{\theta^{\mu}} J = \frac{1}{n}\sum_{i=1}^{n} \nabla_{a} Q(S, a \mid \theta^{Q})\big|_{S=S_i,\, a=\mu(S_i)}\, \nabla_{\theta^{\mu}} \mu(S \mid \theta^{\mu})\big|_{S_i}$$

where $\nabla_{a} Q$ comes from the critic and gives the update direction of the actor’s network parameters, so that the actor with updated parameters can choose actions that obtain a higher evaluation value from the critic, and $\nabla_{\theta^{\mu}} \mu$ comes from the actor and indicates the update direction of its own parameters, so that after the update the actor is more likely to select such actions.
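The chain rule above can be illustrated with a toy setup in which both gradients are known in closed form (everything here is an assumed illustration: a linear actor $a = w s$ and an analytic critic $Q(s,a) = -(a - a^*)^2$; a real implementation would obtain these gradients by automatic differentiation):

```python
import numpy as np

a_star = 0.5   # action the toy critic rates highest (illustrative)
lr = 0.1       # actor learning rate (illustrative)

def actor_update(w, states):
    a = w * states                   # mu(S_i | theta^mu) for a linear actor
    grad_a_q = -2.0 * (a - a_star)   # critic's gradient w.r.t. the action
    grad_w_mu = states               # d mu / d w for the linear actor
    # Sampled deterministic policy gradient, averaged over the minibatch:
    return w + lr * np.mean(grad_a_q * grad_w_mu)

w = 0.0
states = np.ones(4)                  # minibatch of identical unit states
for _ in range(200):
    w = actor_update(w, states)
print(round(w, 3))                   # the actor's output converges toward a_star
```

The ascent direction pushes the actor's output toward whatever action the critic currently scores highest, which is exactly the role of $\nabla_a Q \, \nabla_{\theta^\mu}\mu$ in the formula.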
3.2. Improved Twin Delayed Deep Deterministic Policy Gradient Algorithm
Since DDPG is an off-policy method based on the DQN algorithm, it always selects the highest-value action in the current state instead of using the actual action of the next interaction, so overestimation may occur. In the Actor–Critic framework for continuous action control, if every step is estimated in this way, the error accumulates step by step, the optimal policy cannot be found, and ultimately the algorithm fails to converge. The twin delayed deep deterministic policy gradient (TD3) algorithm is designed to mitigate the overestimation error of the DDPG algorithm.
The actor has two networks: an estimation network and a target network. The critic has two estimation networks and two target networks, as schematically illustrated in
Figure 5. Thus, the critic has four networks with the same structure. The state quantity and action are the input of the critic network, and the output value is the value generated by the action executed in the current environment state. Regarding the optimization algorithm, TD3 adopts the Actor–Critic architecture similar to DDPG and is used to solve the problems in continuous action space. The improvement of the TD3 algorithm relative to the DDPG is mainly reflected in the following three aspects:
The first is the double critic network structure. In TD3, the critic has two estimated networks and two target networks. The smaller of the two target-network values is selected as the update target for Estimated Critic1 and Estimated Critic2, which alleviates overestimation. TD3 constructs the loss function by the same method as DDPG:

$$y_i = r_i + \gamma \min_{j=1,2} Q'\big(S_{i+1}, a'_i \mid \theta^{Q'j}\big)$$

$S_i$ and $S_{i+1}$ are state quantities, used as the input of the actor, whose outputs are the actions $a_i$ and $a'_i$ generated in the current environment.
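The clipped double-Q target can be sketched as follows, with NumPy and placeholder target-critic values (the paper's networks are not reproduced; $\gamma = 0.99$ is an assumed value):

```python
import numpy as np

gamma = 0.99  # reward decay rate (assumed for illustration)

def td3_targets(r, q1_next, q2_next):
    """y_i = r_i + gamma * min_j Q'_j(S_{i+1}, a'_i): taking the smaller of
    the two target-critic values curbs overestimation."""
    return r + gamma * np.minimum(q1_next, q2_next)

r = np.array([0.0, 1.0])
q1 = np.array([2.0, 3.0])   # target critic 1 (placeholder values)
q2 = np.array([1.5, 4.0])   # target critic 2 (placeholder values)
print(td3_targets(r, q1, q2))  # elementwise min enters each target
```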
The second is delayed updating of the actor. In the TD3 algorithm, the critic network is updated at every step, while the parameters of the actor are updated at a lower frequency; that is, the actor is updated once only after the critic has been updated multiple times. On the one hand, delaying the actor update reduces unnecessary repeated updates; on the other hand, it reduces the errors accumulated over multiple updates.
The third is target policy smoothing regularization. By adding noise based on the normal distribution to the action selected by the target network, as in (11):

$$a'_i = \mu'(S_{i+1} \mid \theta^{\mu'}) + \mathrm{clip}\big(\mathcal{N}(0, \sigma), -c, c\big)$$

the value function is updated more smoothly and the network, and hence the algorithm, becomes more robust.
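This smoothing step differs from exploration noise in that the noise itself is clipped to $[-c, c]$ before being added. A NumPy sketch under the same scalar-action assumption (names and seed are ours):

```python
import numpy as np

def smoothed_target_action(mu_target, sigma, c, a_low, a_high, rng=None):
    """Target policy smoothing: add clipped Gaussian noise to the target
    actor's action, then clip the result to the valid action range."""
    rng = rng or np.random.default_rng(0)
    eps = np.clip(rng.normal(0.0, sigma), -c, c)   # clip(N(0, sigma), -c, c)
    return float(np.clip(mu_target + eps, a_low, a_high))

# With sigma = 0 no smoothing noise is added.
print(smoothed_target_action(0.3, 0.0, 0.5, -1.0, 1.0))  # 0.3
```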
TD3 solves the overestimation problem of DDPG and facilitates the exploration of better policies, improving the success rate and landing accuracy. However, when the TD3 algorithm is applied to the trajectory planning of the parafoil delivery system, our existing simulation results and experimental data show that TD3 still has a larger landing error than the traditional trajectory optimization algorithm. This is difficult to solve merely by increasing the amount of training, because the parafoil does not necessarily explore a better policy each time; it may even explore a policy worse than the existing one and store it in the experience pool. The cause is the uncertainty introduced by adding noise to the action. To increase the randomness of the algorithm and the coverage of learning, both DDPG and TD3 add noise to the action so that it fluctuates, hoping to explore more policies. However, the action after adding noise is not necessarily better; it may obtain a lower reward value, storing a poor experience in the experience pool, which is not conducive to learning a better policy.
To solve this problem, we propose an improved twin delayed deep deterministic policy gradient algorithm, which dynamically changes the scale of noise to be added by evaluating the reward value of the selected action in advance. This method can effectively reduce the negative impact of noise uncertainty on strategy exploration, and make full use of excellent strategies.
The action $\mu(S_t)$ selected in state $S_t$ first obtains the reward value $r_{predict}$ from environmental feedback without adding any noise, so as to pre-evaluate the value of the action. The purpose of pre-evaluation is to judge whether the action belongs to an excellent policy. If the action has a high value, the scale of the noise is reduced to maintain this policy as much as possible; if the action has a low value, the scale of the noise is increased to explore better policies. The variance $\sigma$ of the Gaussian noise is determined by $r_{predict}$: the higher the reward $r_{predict}$, the smaller the variance $\sigma$ of the noise; the lower the reward $r_{predict}$, the larger the variance $\sigma$. In this work, the maximum value of $\sigma$ is limited to 2 and the minimum value to 0.4. If the real-time reward value is less than 0, the exploration intensity should be increased, so $\sigma$ is set to 2 for actions with negative reward. When the real-time reward value is greater than 0, $\sigma$ decreases along the positive direction of the x-axis with a gradually decreasing variation amplitude, but $\sigma$ is never reduced to 0; a small amount of noise is retained to maintain exploration. The agent then explores with the action $a_t$ obtained after adding noise. This makes better use of the good policies explored by the agent and, to some extent, avoids storing worse exploration steps in the experience pool. In (12), the scale of the variance $\sigma$ is selected according to the pre-evaluated reward value. The improved TD3 algorithm is described in Algorithm 1, where the additional step 6 reflects our proposed modification:
Algorithm 1 Improved TD3
1: Initialize estimated critic network parameters $\theta^{Q1}$, $\theta^{Q2}$, and estimated actor network parameter $\theta^{\mu}$
2: Initialize target network parameters $\theta^{Q1'} \leftarrow \theta^{Q1}$, $\theta^{Q2'} \leftarrow \theta^{Q2}$, $\theta^{\mu'} \leftarrow \theta^{\mu}$
3: Set initial values of hyperparameters according to the task requirements: experience replay buffer B, minibatch size n, actor network learning rate $l_a$, critic network learning rate $l_c$, maximum episode E, soft update rate $\tau$
4: for $t = 1$ to $T$ do
5: Select action $a_t = \mu(S_t \mid \theta^{\mu})$
6: According to the pre-evaluated reward value of action $a_t$, select the noise variance $\sigma$, obtain the new action $a_t \sim \mathrm{clip}\big(\mathcal{N}(\mu(S_t \mid \theta^{\mu}), \sigma^2), a_{low}, a_{high}\big)$ with noise, and observe the reward $r_t$ in the current state and the new state $S_{t+1}$
7: Store the transition tuple $(S_t, a_t, r_t, S_{t+1})$ of this step in B
8: Sample a minibatch of n transitions $(S_i, a_i, r_i, S_{i+1})$ from B
9: Compute target actions $a'_i = \mu'(S_{i+1} \mid \theta^{\mu'}) + \mathrm{clip}\big(\mathcal{N}(0, \sigma), -c, c\big)$
10: Compute Q-targets $y_i = r_i + \gamma \min_{j=1,2} Q'\big(S_{i+1}, a'_i \mid \theta^{Q'j}\big)$
11: Update the estimated critic network parameters by minimizing the loss: $L = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - Q(S_i, a_i \mid \theta^{Qj})\big)^2$ for $j = 1, 2$
12: if $t \bmod d = 0$ then
13: Update the actor policy using the sampled policy gradient:
14: $\nabla_{\theta^{\mu}} J = \frac{1}{n}\sum_{i=1}^{n} \nabla_{a} Q(S, a \mid \theta^{Q})\big|_{S=S_i,\, a=\mu(S_i)}\, \nabla_{\theta^{\mu}} \mu(S \mid \theta^{\mu})\big|_{S_i}$
15: Update the target network parameters of the critic and the actor:
16: $\theta^{Qj'} \leftarrow \tau \theta^{Qj} + (1 - \tau)\theta^{Qj'}$ for $j = 1, 2$
17: $\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau)\theta^{\mu'}$
18: end if
19: end for

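The pre-evaluation step needs a concrete mapping from $r_{predict}$ to $\sigma$. Since the exact form of (12) is not reproduced here, the sketch below uses one plausible shape (an exponential decay with an illustrative rate k, which is our assumption) that matches the stated behaviour: $\sigma = 2$ for negative rewards, then a decrease with shrinking amplitude toward, but never reaching, the 0.4 floor:

```python
import math

SIGMA_MAX, SIGMA_MIN = 2.0, 0.4   # bounds stated in the text

def noise_scale(r_predict, k=1.0):
    """Map the pre-evaluated reward to a noise standard deviation.

    Assumed shape: the paper's (12) is not reproduced here, so this
    exponential decay only mirrors the described qualitative behaviour.
    """
    if r_predict < 0:
        return SIGMA_MAX                       # explore aggressively
    # Decay toward SIGMA_MIN with ever-smaller steps; never reaches 0.
    return SIGMA_MIN + (SIGMA_MAX - SIGMA_MIN) * math.exp(-k * r_predict)

print(noise_scale(-1.0))            # 2.0 for any negative reward
print(round(noise_scale(5.0), 3))   # close to, but above, the 0.4 floor
```

Any monotonically decreasing, floor-limited function of $r_{predict}$ would serve the same purpose; the essential design choice is that well-rated actions keep most of their value while poorly rated ones trigger wider exploration.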