A Fault Detection Method Based on an Oil Temperature Forecasting Model Using an Improved Deep Deterministic Policy Gradient Algorithm in the Helicopter Gearbox

The main gearbox is critical to the operational safety of helicopters, and its oil temperature reflects the health of the gearbox; establishing an accurate oil temperature forecasting model is therefore an important step toward reliable fault detection. Firstly, to achieve accurate gearbox oil temperature forecasting, an improved deep deterministic policy gradient algorithm with a CNN-LSTM basic learner is proposed, which can mine the complex relationship between oil temperature and working condition. Secondly, a reward incentive function is designed to reduce training time and stabilize the model. Further, a variable variance exploration strategy is proposed that enables the agent to fully explore the state space in the early training stage and to converge gradually in the later training stage. Thirdly, a multi-critics network structure is adopted to address inaccurate Q-value estimation, which is the key to improving the prediction accuracy of the model. Finally, KDE is introduced to determine the fault threshold used to judge whether the residual error is abnormal after EWMA processing. The experimental results show that the proposed model achieves higher prediction accuracy and shorter fault detection time.


Introduction
A helicopter is a kind of aircraft that can hover in the air, so it is widely used in the military, rescue, and transportation fields. For convenience, Table 1 gives detailed definitions of the main acronyms. The HMGB is the core component of the HTS; during its operation, clearance, external force, friction, and other factors interact with its dynamic behavior [1], making it deviate from the ideal operating state and leading to faults, and a gearbox failure is a direct threat to the flight safety of the helicopter. Therefore, the effective detection of gearbox failures that may cause catastrophic accidents is of great significance for ensuring the flight safety of helicopters [2]. To satisfy the maintenance needs of helicopters, HUMS, represented by those on the EC135 [3], SA330 [4], AH-64, and UH-60 [5], are designed for health monitoring and fault diagnosis of key components such as the HMGB, and they have achieved obvious results in reducing the flight accident rate and maintenance costs. Generally, a HUMS is composed of sensors for collecting vibration, sound, temperature, and other signals and a central computer for data processing and fault diagnosis [6].
Most researchers study the PHM of the HTS based on vibration signals: the vibration signals are processed, and the fault types are recognized using expert experience or machine learning methods [7][8][9]. Although the vibration signal of a helicopter carries rich running state information and can be used to detect early weak faults of mechanical components in time, HUMS cannot effectively and reliably diagnose some parts, such as the bearings of the HMGB, from the vibration signal [10], because HMGB status monitoring based on vibration signal analysis encounters the following challenges: (1) The early fault features contained in the vibration signal are usually weak and are corrupted by various noises in the transmission process. (2) The design of the HMGB is trending toward greater structural complexity. (3) The HMGB often switches quickly between different working states when performing multiple tasks, and the variable speed and load make it difficult to distinguish healthy and faulty samples. (4) Because a helicopter is high-reliability equipment, fault samples are extremely scarce; therefore, the distribution of the relevant historical data is highly skewed toward healthy samples.
In many cases, it is far more important to detect HMGB anomalies effectively and reliably than to identify specific fault types. Rashid et al. pointed out that any failure of the HMGB would be reflected in the oil of the lubrication system, and that the mechanical parts would not obtain ideal lubrication under abnormal oil temperature, which would in turn accelerate the occurrence of micropitting, wear, scuffing, pitting, and other typical failures [11]. If the thermodynamic equation of the HMGB is known and can accurately calculate the oil temperature from the current working states, the ground crew can judge whether the actual oil temperature deviates from the normal predicted value and maintain the HMGB in time before a fault occurs. At present, the methods for establishing an oil temperature forecasting model can be mainly categorized as physics-based [12] and data-driven [13]. Using the physics-based method, Feng et al. derived relationships between oil temperature, transmission efficiency, rotational rate, and power output and argued that the oil temperature of the gearbox increases linearly with power output and that, when a fault occurs, the transmission efficiency inevitably decreases, resulting in the actual oil temperature being higher than the theoretical oil temperature [14]. However, as equipment becomes more complex, physics-based methods involve a growing number of parameters, making it difficult to establish an accurate oil temperature forecasting model.
In recent years, artificial intelligence technology has developed rapidly; therefore, data-driven methods can rely on many algorithms, such as the shallow perceptron, deep neural network, and statistical methods, to mine the internal laws of historical data, and they have gradually become a way to describe the thermodynamic behavior of a gearbox quantitatively. Data-driven methods use a large amount of healthy data to build the implicit mapping between oil temperature and working state, and the residual errors between the actual and predicted values of oil temperature are used to evaluate the health degree of the gearbox through statistical methods. Liu et al. used the eXtreme gradient boosting (XGBoost) algorithm to establish an oil temperature regression forecasting model for condition monitoring of a wind turbine (WT) gearbox [15]. Zeng et al. proposed sparse Bayesian learning to estimate the oil temperature of a WT gearbox [16]. Dhiman et al. utilized a twin support vector machine (TSVM) to predict the WT gearbox oil temperature, with an adaptive threshold used to judge whether the gearbox is abnormal [17]. Wang et al. adopted a DNN-based framework to detect the health states of a WT gearbox [18]. Guo et al. utilized an Adam-trained LSTM as an oil temperature forecasting model to calculate the failure threshold [19]. Yang et al. combined the LSTM with a generalized regression neural network (GRNN) to form a weighted-combination oil temperature prediction model [20]. Jia et al. presented a robust denoising autoencoder (DAE) model to predict the reconstruction error of the raw temperature signal [21]. However, due to the increasingly complex structure of the HMGB and the growing number of sensors used for condition monitoring, DL algorithms based on a single model or a hybrid model cannot meet the requirements.
Therefore, intelligent state-of-the-art technology should be explored to improve the accuracy of HMGB oil temperature prediction under complex working states.
A model-free DRL framework integrates the perception ability of DL with the decision-making ability of RL and can automatically find the optimal strategy through the reward fed back by the environment; hence, DRL has been successfully applied in the computer game [22], autonomous driving [23], machine vision [24], and fault diagnosis fields [25]. Although DRL is a promising technology, there are few studies on DRL in gearbox condition monitoring based on the oil temperature prediction principle. Up to now, only Liu et al. have used the SARSA algorithm to select the features of each sub-series in gearbox condition monitoring, concluding that selecting appropriate features is conducive to improving the accuracy of oil temperature prediction [26]. However, Q-learning-type algorithms, such as SARSA and DQN, cannot solve the time-series-prediction problem in a high-dimensional and continuous state space [27]. Among several commonly used DRL frameworks, e.g., SARSA, DQN, advantage actor-critic (A2C), and DDPG, DDPG is the only framework suitable for mapping the continuous state space to a corresponding continuous output action value, so it can directly output the oil temperature value under the current working condition. In order to fill the research gap in the application of DRL to HMGB oil temperature forecasting and condition monitoring, an improved DDPG model for building a more accurate oil temperature forecasting model is proposed in this paper. Compared with the original DDPG algorithm, the main contributions of this paper to the condition monitoring and health evaluation of the HMGB can be summarized as follows:

1. An improved DDPG framework, namely multi-CRDPG, is proposed, in which CNN-LSTM is used as the basic learner to sense the input working condition information of the HMGB, thereby combining the strong feature extraction ability of CNN with the advantage of LSTM in time series prediction. DDPG is introduced as the RL framework for training the basic learner, which enhances the basic learner's ability to predict complex oil temperature series.

2. A novel reward function is designed to train the agent to output the predicted action as accurately as possible.

3. An exploration strategy is presented, in which the agent is encouraged to actively explore the unknown space in the early stage of training and to use the learned experience to gradually converge on the ideal output action in the later stage of training.

4. To avoid inaccurate estimation of the current state by a single critic network and the resulting inability to find the optimal strategy, a multi-critics network structure is proposed. A minimum and truncated mean processing method for the multi-critics network is conducive to reducing the bias and variance of the estimated Q-value.

5. KDE is used to calculate the probability density function of the prediction residual errors in the healthy state of the HMGB to determine the failure threshold, and the trend of the residual errors produced by an EWMA control chart is used to judge the HMGB health degree during monitoring.
The rest of this paper is organized as follows. Section 2 introduces the basic theories of the fault mechanism, RL, DDPG, and the CNN-LSTM algorithm. Section 3 presents the proposed multi-CRDPG algorithm. Section 4 describes the implementation details and experimental results of multi-CRDPG in actual testing. Section 5 outlines conclusions and future work.

Basic Theories
HMGBs obey the law of energy conservation during operation, and the energy exchange is shown in Figure 1. Because the lubricating oil is the carrier of heat exchange between the HMGB and the environment, the input energy loss of the HMGB can be related to the increase in oil temperature. The thermal energy balance equation of the HMGB is expressed by Equation (1), where E represents the input rotational kinetic energy of the HMGB, Q_HMGB represents the heat loss power of the HMGB, P_HMGB represents the output rotational kinetic power of the HMGB, and t represents running time.
The extra heat that the HMGB generates during operation is governed by its transmission efficiency, which is defined in Equation (2). Substituting Equation (2) into Equation (1) gives Equation (3). Furthermore, E can be expressed in terms of angular velocity and moment of inertia, as in Equation (4), where I_HMGB represents the moment of inertia of the HMGB and ω_HMGB represents the angular velocity of the HMGB.
Supposing that the compound heat transfer coefficient of the HMGB is U_GB and the oil temperature rise is ∆T, the expression for ∆T is given by Equation (5). From Equation (5), when the angular velocity is fixed, ∆T decreases as the transmission efficiency increases per unit time. When the HMGB gradually deteriorates, its moment of inertia and heat transfer coefficient are almost unchanged, while the transmission efficiency is significantly reduced [1]; the oil temperature rise of a failing HMGB is therefore significantly higher than that of a healthy HMGB.
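One form of the energy balance that is consistent with the symbol definitions above can be written as follows. It is a sketch under stated assumptions and not necessarily the exact form of Equations (1) to (5): the symbol η for the transmission efficiency is an assumed notation, and the last line assumes a steady-state relation between heat loss power and temperature rise.

```latex
\begin{align}
\frac{dE}{dt} &= Q_{HMGB} + P_{HMGB}
  && \text{(input power = heat loss + output power)} \\
\eta &= \frac{P_{HMGB}}{dE/dt}
  && \text{(assumed definition of transmission efficiency)} \\
Q_{HMGB} &= (1-\eta)\,\frac{dE}{dt} \\
E &= \tfrac{1}{2}\, I_{HMGB}\,\omega_{HMGB}^{2} \\
\Delta T &= \frac{Q_{HMGB}}{U_{GB}} = \frac{1-\eta}{U_{GB}}\,\frac{dE}{dt}
  && \text{(assumed steady-state heat transfer)}
\end{align}
```

Under these assumptions, for a given input power the temperature rise is proportional to (1 − η), which matches the qualitative statement that a loss of transmission efficiency raises the oil temperature.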

The Concept of the Deep Deterministic Policy Gradient Algorithm
RL is an important branch of machine learning. It can be represented as a closed-loop system composed of an agent, an environment, a state s_t ∈ S, an action a_t ∈ A, and a reward function r_t = R(s_t, a_t, s_{t+1}), as shown in Figure 2, in which the state space S describes the set of information received by the agent and the action space A describes the set of decisions the agent can make in a state s_t ∈ S [28].
In RL, the interaction between the agent and the environment can be described by a quadruple (s_t, a_t, r_t, s_{t+1}); the next state s_{t+1} depends only on the current state s_t, so the transition from the current state s_t to the next state s_{t+1} can be regarded as an MDP. A complete sequence of time steps of the MDP is defined as τ = (s_0, a_0, r_0), ..., (s_t, a_t, r_t), namely a trajectory. The return of a trajectory is the discounted sum of rewards, which is calculated by Equation (6), where γ is the discount factor, which reflects the impact of the current action a_t on the future, and T is the maximum time step in a trajectory. Agents are trained to find an optimal strategy π*(a_t|s_t) over the state space S; the objective function of a strategy π(a_t|s_t) is the expected return over trajectories, as shown in Equation (7).
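The discounted return described above can be illustrated with a short sketch. It assumes the standard form, the sum over t = 0..T of γ^t · r_t; the function name `discounted_return` and the toy rewards are illustrative only.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_{t=0}^{T} gamma**t * r_t for one trajectory."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Work backwards through the trajectory: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rewards = [1.0, 0.0, 2.0]
# 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
print(discounted_return(rewards, gamma=0.5))
```

A smaller γ makes the agent weigh near-term rewards more heavily, while γ close to 1 makes distant rewards count almost as much as immediate ones.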
Entropy 2022, 24, 1394
Assuming that the agent acts according to a strategy π(a_t|s_t), the state value function V^π(s) is used to evaluate how each current state influences the future states, and the state-action value function Q^π(s, a) is used to evaluate how each action in each state influences the future states. The expressions of V^π(s) and Q^π(s, a) are shown in Equations (8) and (9), respectively.
In order to solve for the optimal strategy π*(a_t|s_t) in a continuous state space, i.e., to maximize V^π(s) and Q^π(s, a), DDPG was presented by Google's DeepMind team in 2016 [29]. DDPG is one of the best-known DRL algorithms; it combines the deep neural network from DL with the Q-learning algorithm and the actor-critic structure from RL and includes four neural networks, namely an online actor network μ(s|θ^μ), a target actor network μ'(s|θ^μ'), an online critic network Q(s, a|θ^Q), and a target critic network Q'(s, a|θ^Q'), where θ^μ, θ^μ', θ^Q, and θ^Q' denote the weight parameters of the four networks. The basic framework of DDPG is shown in Figure 3.
The current state s_t is input to the online actor network μ(s|θ^μ) to obtain a deterministic output action a_t, and the online critic network Q(s, a|θ^Q) calculates Q(s, a). However, one of the challenges of DDPG is that it is often difficult to make the online actor network μ(s|θ^μ) and the online critic network Q(s, a|θ^Q) converge; therefore, it is necessary to introduce the target networks μ'(s|θ^μ') and Q'(s, a|θ^Q') as copies of the online networks μ(s|θ^μ) and Q(s, a|θ^Q), respectively. The target networks temporarily fix the actor-critic parameters to provide a reference for updating the online networks, so as to avoid divergence of the online networks after updating.
In addition, DDPG is developed from the deep Q-network (DQN) and retains its replay buffer, an important improvement that stores historical transition tuples (s_t, a_t, r_t, s_{t+1}). The target value for Q(s, a|θ^Q) estimated by the target critic network is shown in Equation (10).
where t represents the time index of the samples taken from the replay buffer. Once the replay buffer is full, batches of samples can be drawn to update the critic network by minimizing Equation (11) and the actor network by executing Equation (12). After that, the oldest samples are pushed out of the replay buffer by new samples. Figure 4 shows that the actor network uses gradient ascent to find the best output action a_t for the state s_t ∈ S, i.e., it searches for the action with the largest Q value in a given state. In Equations (11) and (12), H represents the number of samples in a batch taken from the replay buffer. Finally, a soft update is used to steadily update the two target networks, as shown in Equations (13) and (14).
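The update steps above can be sketched in NumPy as follows. This is a minimal illustration that assumes the standard DDPG forms (a bootstrapped target from the target networks, a mean squared error critic loss over a batch of H samples, and a soft update with rate τ); the `done` flag, the symbol `tau`, and all function names are assumptions introduced for the example.

```python
import numpy as np


def td_target(reward, next_q, gamma, done):
    """y = r + gamma * Q'(s', mu'(s')); bootstrapping stops at terminal steps."""
    return reward + gamma * (1.0 - done) * next_q


def critic_loss(y, q):
    """Mean squared error between targets y and online critic outputs q."""
    return float(np.mean((y - q) ** 2))


def soft_update(target_params, online_params, tau):
    """theta' <- tau * theta + (1 - tau) * theta', applied in place."""
    for tp, op in zip(target_params, online_params):
        tp *= (1.0 - tau)
        tp += tau * op


# Toy batch with H = 3 transitions (all numbers are illustrative).
reward = np.array([1.0, 0.5, 0.0])
next_q = np.array([2.0, 1.0, 4.0])   # Q'(s_{t+1}, mu'(s_{t+1})) from the target nets
done = np.array([0.0, 0.0, 1.0])     # 1.0 marks the last step of a trajectory
q = np.array([2.5, 1.4, 0.2])        # Q(s_t, a_t) from the online critic

y = td_target(reward, next_q, gamma=0.9, done=done)
loss = critic_loss(y, q)

# Soft update of one parameter tensor: the target moves 1% toward the online weights.
target = [np.zeros(2)]
online = [np.ones(2)]
soft_update(target, online, tau=0.01)
print(y, loss, target[0])
```

A small `tau` keeps the target networks changing slowly, which is what lets them serve as a stable reference while the online networks are trained.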

The Convolutional Long Short-Term Memory Neural Network
CNN-LSTM, as its name suggests, is a hybrid model of CNN and LSTM, which integrates the local feature extraction ability of CNN and the long- and short-term prediction ability of LSTM [30]. A 1-D CNN is applied to feature extraction from the oil temperature signal; Figure 5a shows its network structure, and 1-D convolution and pooling are its main operations. When data are sent to the convolution layer, a 1-D convolution kernel of preset length slides over the data in order and performs the convolution calculation. The output of the i-th 1-D convolution layer can be expressed as Equation (15). In the convolution layer, each data segment with the same length as the convolution kernel is dot-multiplied by the kernel vector, and a bias term is then added to produce the output. The outputs calculated in each window form a vector in the order of the convolution operation, which essentially filters the signal. All signal segments share the same convolution kernel, which means that the signal is mapped and its local features are extracted.
where f_i(t) represents the convolution kernel, whose size needs to be preset and whose parameters are learned from the input data; h(t) represents the input data; b_i represents the bias; and σ(·) represents the activation function. Commonly used activation functions include Sigmoid, ReLU, and Tanh.
In the convolution process of signals, only a small amount of the information is useful, and most of it is redundant. Pooling, such as max pooling or average pooling, can speed up the calculation and help prevent overfitting.
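The sliding-window convolution and pooling described above can be sketched in NumPy. This is a minimal illustration, assuming a single kernel, no padding, a stride of 1, ReLU as the activation, and non-overlapping max pooling; the function names and the toy signal are hypothetical.

```python
import numpy as np


def conv1d_relu(h, kernel, bias=0.0):
    """Slide the kernel over h, take a dot product per window, add bias, apply ReLU."""
    h = np.asarray(h, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    n_windows = len(h) - len(kernel) + 1
    out = np.array([np.dot(h[j:j + len(kernel)], kernel) for j in range(n_windows)])
    return np.maximum(out + bias, 0.0)


def max_pool1d(x, size):
    """Non-overlapping max pooling; trailing samples that do not fill a window are dropped."""
    x = np.asarray(x, dtype=float)
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)


signal = [1.0, 3.0, 2.0, 6.0, 4.0]
features = conv1d_relu(signal, kernel=[0.5, 0.5], bias=-3.0)
pooled = max_pool1d(features, size=2)
print(features)  # [0. 0. 1. 2.]
print(pooled)    # [0. 2.]
```

Because every window is multiplied by the same kernel, the layer has only as many weights as the kernel length plus one bias, regardless of the signal length.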
For CNN-LSTM, the data are input into the LSTM layer after the pooling operation. An LSTM network, a variant of the recurrent neural network (RNN), can reflect the dependence of the MDP through its hidden memory state, which is used to extract the medium- and long-term correlation characteristics of the time series from the features produced by the CNN and to reveal the essence of the time series. LSTM is composed of four neural network layers connected in a special way, and the vanishing gradient problem can be effectively alleviated through these four interacting layers. Its structure (forget gate, input gate, update gate, and output gate) is shown in Figure 5b [31]. The calculation process of a typical LSTM unit is shown in Equation (16).
where i_t, f_t, o_t, and g_t represent the input gate, forget gate, output gate, and update gate, respectively; the activation function σ(·) of the input, forget, and output gates is the sigmoid. W_i, W_f, W_o and U_i, U_f, U_o represent the weight matrices corresponding to the hidden state and the input state, respectively. b_i, b_f, b_o, and b_g represent the bias terms. C_t represents the memory cell.
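A single LSTM step can be sketched as follows. This is a minimal illustration assuming the standard LSTM cell, with a tanh activation for g_t and weights W_g and U_g for the update gate, which the text above does not list explicitly; following the text, W acts on the hidden state and U on the input. The function names and toy dimensions are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b are dicts keyed by gate name 'i', 'f', 'o', 'g'."""
    i_t = sigmoid(W["i"] @ h_prev + U["i"] @ x_t + b["i"])   # input gate
    f_t = sigmoid(W["f"] @ h_prev + U["f"] @ x_t + b["f"])   # forget gate
    o_t = sigmoid(W["o"] @ h_prev + U["o"] @ x_t + b["o"])   # output gate
    g_t = np.tanh(W["g"] @ h_prev + U["g"] @ x_t + b["g"])   # candidate update
    c_t = f_t * c_prev + i_t * g_t                           # new memory cell C_t
    h_t = o_t * np.tanh(c_t)                                 # new hidden state h_t
    return h_t, c_t


# Toy sizes: 3 input features (e.g., pooled CNN features), 2 hidden units.
rng = np.random.default_rng(0)
n_in, n_hid = 3, 2
W = {k: rng.normal(scale=0.1, size=(n_hid, n_hid)) for k in "ifog"}
U = {k: rng.normal(scale=0.1, size=(n_hid, n_in)) for k in "ifog"}
b = {k: np.zeros(n_hid) for k in "ifog"}

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(4, n_in)):   # a sequence of 4 time steps
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)  # (2,) (2,)
```

The additive form of the memory cell update is what allows gradients to flow across many time steps without vanishing as quickly as in a plain RNN.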

The Design of the Reward Incentive Function
Herein, the action of the agent is taken to be the predicted oil temperature. When DDPG is applied to oil temperature prediction, it is very important to design an appropriate reward function that guides the agent to output the oil temperature accurately from the input working conditions. According to previous research and experience, the oil temperature signal responds to changes in working state with a certain delay; therefore, the oil temperature signal changes slowly. Within a short time interval, the oil temperature at one point will not differ much from that at the next point for a healthy HMGB. In the literature on DDPG applied to time series forecasting, the residual error between the output action a_t and the actual load value is regarded as the reward function [32,33]. However, using the residual error as a linear incentive makes the reward insufficiently sensitive at the initial stage of training, which slows training, and too sensitive at the later stage of training, which makes convergence difficult. Based on this consideration, a new reward incentive function (RIF) is designed in Equation (17).
where r_t is the RIF at time step t; T is the number of samples that have been input into the model; k_p is the proportional coefficient, k_i is the integral coefficient, and k_d is the derivative coefficient; OT_t and OT_{t−1} denote the actual oil temperature at times t and t − 1, respectively; and a_t and a_{t−1} denote the predicted oil temperature at times t and t − 1, respectively.
The first term detects changes in the residual error more sensitively and gives a greater punishment when the predicted value is larger than the actual value. The second term gives positive rewards to output actions a_t whose residual error is less than the average residual error and penalties to output actions a_t whose residual error is greater than the average residual error, making all the output actions a_t more correlated, which mirrors the inherently slow change of the oil temperature signal. The third term acts as an incentive: when the error (a_t − OT_t) at time t is smaller than the error (a_{t−1} − OT_{t−1}) at time t − 1, this trend is rewarded [34]. Figure 6 shows how the designed RIF works. The gradient is used to measure the change of the reward function. For the same change in error, the gradient change of the RIF between time steps t and t + 1 is greater than that of the traditional reward function, which shows that the RIF is very sensitive to changes in error. The RIF can therefore stimulate the agent to output high-precision predictions better than traditional reward functions.
Figure 6. Schematic diagram of the RIF (oil temperature, reward without RIF, and reward with RIF versus time step).
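The three terms described above can be illustrated with a hypothetical PID-style reward. This sketch is not Equation (17): the exact functional form, signs, and gains are assumptions chosen only to reproduce the qualitative behavior in the text (a heavier penalty for over-prediction, a bonus when the current error is below the running average, and a bonus when the error shrinks). The function name `pid_style_reward` and the parameter `over_penalty` are invented for the example.

```python
import numpy as np


def pid_style_reward(pred, actual, kp=1.0, ki=0.5, kd=0.5, over_penalty=2.0):
    """Hypothetical PID-style reward for the latest step of a prediction history."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    err = pred - actual                 # signed residuals a_t - OT_t
    abs_err = np.abs(err)

    # Proportional term: penalize the current error, more heavily when over-predicting.
    weight = over_penalty if err[-1] > 0 else 1.0
    p_term = -kp * weight * abs_err[-1]

    # Integral term: positive when the current error is below the mean error so far.
    i_term = ki * (abs_err.mean() - abs_err[-1])

    # Derivative term: positive when the error shrank relative to the previous step.
    d_term = kd * (abs_err[-2] - abs_err[-1]) if len(err) > 1 else 0.0

    return p_term + i_term + d_term


# Error shrinks from 1.0 to 0.5 (under-prediction): -0.5 + 0.125 + 0.25 = -0.125
print(pid_style_reward(pred=[50.0, 52.0], actual=[51.0, 52.5]))
```

With this shape, two predictions with the same absolute error can receive different rewards depending on the recent error history, which is the property the text attributes to the RIF.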

Variable Exploration Variance
Selecting the deterministic action a_t corresponding to max Q^π(s_t, a_t) in each state means that many state-action pairs (s, a) are never selected, so the agent cannot fully explore the entire continuous state space. Because Q^π(s_t, a_t) is initialized randomly, some Q^π(s_t, a_t) values will be estimated inaccurately without real experience. To address this, an exploration strategy is adopted that adds a random number from a noise process N to the output action, as shown in Equation (18),
where N can be any form of noise; Gaussian noise is selected in this paper, denoted by X ~ N(x, σ²), as shown in Equation (19).
where x is the mean, and σ is the variance. Herein, x is set to 0. Unfortunately, this kind of exploration is unstable. Simply adding noise to the output action may not be as effective as each output deterministic strategy. The key to finding the optimal strategy is to maintain the balance between exploration and output deterministic actions. A good solution is to add noise with high variance σ at the initial stage of training, so that the agent can quickly explore the state space. When the agent gradually learns a good strategy, it gradually reduces the variance σ of noise at a later stage of training so that the agent can output deterministic actions using previous experience. Therefore, an exploration strategy that the variance σ decreases with the increase of epoch is presented, as shown in Equation (20). where n epo represents the current epoch, σ epo represents the current variance, and n epo_total represents the total epoch. For instance, when n epo_total = 1000, the variance σ epo and the noise N(u, σ 2 epo ) decays as n epo increases, as shown in Figure 7. where nepo represents the current epoch, σepo represents the current variance, and nepo_total represents the total epoch. For instance, when nepo_total = 1000, the variance σepo and the noise N(u, σ 2 epo ) decays as nepo increases, as shown in Figure 7.
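The decaying-noise idea can be illustrated with a short sketch. Equation (20) is not reproduced in this text, so the linear decay used below, the function names `sigma_schedule` and `explore`, and the start and end values are assumptions chosen only for illustration; the paper's actual schedule may have a different shape.

```python
import random


def sigma_schedule(n_epo, n_epo_total, sigma_start=1.0, sigma_end=0.0):
    """Noise scale for epoch n_epo (1-based).

    Linear decay is an ASSUMED stand-in for Equation (20).
    """
    if not 1 <= n_epo <= n_epo_total:
        raise ValueError("n_epo must lie in [1, n_epo_total]")
    frac = (n_epo - 1) / max(n_epo_total - 1, 1)
    return sigma_start + (sigma_end - sigma_start) * frac


def explore(action, n_epo, n_epo_total, rng=random):
    """Add zero-mean Gaussian noise whose scale shrinks as training proceeds."""
    sigma = sigma_schedule(n_epo, n_epo_total)
    return action + rng.gauss(0.0, sigma)
```

Early epochs perturb the actor output strongly, while the last epoch returns the deterministic action unchanged because the noise scale has reached zero.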

Multi-Critics Networks Structure
In the training process, the DDPG algorithm updates the critic network with gradient-based updates, and the performance of the actor network depends on the critic network. Similar to the maximization operation of DQN, the critic network tends to overestimate $Q^{\pi}(s_t, a_t)$. In value-function-based RL algorithms, any small change in the value estimate may lead to a suboptimal strategy. To solve this problem, Fujimoto et al. proposed the twin delayed deep deterministic policy gradient (TD3) algorithm [35], which adopts two critic networks and updates the critic network parameters by taking the smaller of the two state-action values as the target Q value, which alleviates the overestimation problem of DDPG to a certain degree. Suppose that $\hat{Q}_1$ and $\hat{Q}_2$ represent the estimated action-state values from two independent critic networks and that $Q_{real}$ is the real value. There are then two deviation terms, $Z_1 = \hat{Q}_1 - Q_{real}$ and $Z_2 = \hat{Q}_2 - Q_{real}$, which are assumed to obey a uniform distribution, i.e., $Z_{i=1,2} \sim U(-u, u)$. In TD3, after the minimization operation, the expectation is $E[\min Z_{i=1,2}] = -u/3$ and the variance is $\mathrm{Var}[\min Z_{i=1,2}] = 2u^2/9$. A negative expectation is therefore introduced in each update of the critic network, so the Q value is underestimated. This underestimation makes the Q value of the critic networks lower than the real value, and the accumulated underestimation generates suboptimal strategies, which degrades the performance of the algorithm. In order to solve the problem of underestimation in TD3, an improved network structure based on the truncated mean of multi-critic networks is proposed.
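The two moments quoted for TD3 can be checked directly. The short derivation below is a sketch that assumes, as the text does, that $Z_1$ and $Z_2$ are independent and uniformly distributed on $[-u, u]$; the symbol $M$ for the minimum is introduced here only for convenience.

```latex
\begin{align*}
M &= \min(Z_1, Z_2), \qquad
P(M > z) = \left(\frac{u - z}{2u}\right)^{2}, \qquad
f_M(z) = \frac{u - z}{2u^{2}}, \quad z \in [-u, u], \\
E[M] &= \int_{-u}^{u} z\,\frac{u - z}{2u^{2}}\,dz
      = \frac{1}{2u^{2}}\left[\frac{u z^{2}}{2} - \frac{z^{3}}{3}\right]_{-u}^{u}
      = -\frac{u}{3}, \\
E[M^{2}] &= \int_{-u}^{u} z^{2}\,\frac{u - z}{2u^{2}}\,dz
      = \frac{1}{2u^{2}}\left[\frac{u z^{3}}{3} - \frac{z^{4}}{4}\right]_{-u}^{u}
      = \frac{u^{2}}{3}, \\
\mathrm{Var}[M] &= E[M^{2}] - E[M]^{2} = \frac{u^{2}}{3} - \frac{u^{2}}{9} = \frac{2u^{2}}{9}.
\end{align*}
```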
The algorithm adds multiple critic networks on top of TD3 to reduce both the underestimation and the estimation variance, which improves the performance and stability of the algorithm to a certain extent. Assume that there are $K$ critic networks ($K > 5$) and that the deviations $Z_{i=1,2,\ldots,K}$ between their estimated action-state values $\hat{Q}_{i=1,\ldots,K}$ and the real action-state value $Q_{real}$ follow the uniform distribution $U(-u, u)$. First, the highest action-state value $Q_{max}$ and the lowest action-state value $Q_{min}$ of the multi-critic networks are removed to reduce the impact of extreme values on the estimate of the real action-state value $Q_{real}$. The truncated mean better reflects the central tendency of the deviation and narrows its upper and lower limits, while the remaining $K - 2$ deviations $Z_{i=1,2,\ldots,K-2}$ follow a more concentrated uniform distribution $U(-u', u')$, where $u' < u$. Second, any two critic networks are combined by minimization, i.e., $\min \hat{Q}_{i=1,2}$, and this minimum is averaged with the values $\hat{Q}_{i=1,2,\ldots,K-4}$ from the remaining $K - 4$ critic networks, a process described in Equation (21),
where $\hat{Q}$ is the final estimated action-state value. The error between the estimated action-state value $\hat{Q}$ and the actual action-state value $Q_{real}$ is calculated in Equation (22). The estimation deviation of the multi-critic networks is lower than that of the TD3 algorithm, and a lower estimation deviation helps to improve both the stability and the performance of the algorithm.
Furthermore, the variance of the error $\hat{Q} - Q_{real}$ can be expressed as Equation (23). The results show that, compared with the TD3 algorithm, the estimate obtained by the multi-critic networks is more stable. If the computer hardware allows it, increasing the number of critic networks further reduces the variance of the estimated action-state value $\hat{Q}$.
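A small Monte Carlo experiment can make the comparison with TD3 concrete. Equations (21) to (23) are not reproduced in this text, so the estimator below is only one reading of the verbal description: drop the largest and smallest of the $K$ estimates, take the minimum of an arbitrary pair among the rest, and average it with the other $K - 4$ values over $K - 3$ terms. The averaging denominator, the function names and the simulation settings are all assumptions.

```python
import random
import statistics


def td3_estimate(q_values):
    """TD3-style target: the smaller of the first two critic estimates."""
    return min(q_values[0], q_values[1])


def multi_critic_estimate(q_values):
    """Truncated-mean estimate as read from the text (ASSUMED form of Eq. 21)."""
    if len(q_values) <= 5:
        raise ValueError("the text assumes K > 5 critic networks")
    remaining = list(q_values)
    remaining.remove(max(remaining))  # drop Q_max
    remaining.remove(min(remaining))  # drop Q_min, leaving K - 2 values
    clipped = min(remaining[0], remaining[1])  # minimum of an arbitrary pair
    others = remaining[2:]  # the other K - 4 values
    return (clipped + sum(others)) / (len(others) + 1)


def simulate(k=6, u=1.0, q_real=0.0, trials=20000, seed=0):
    """Compare bias and variance of both estimators under U(-u, u) errors."""
    rng = random.Random(seed)
    td3_err, multi_err = [], []
    for _ in range(trials):
        q = [q_real + rng.uniform(-u, u) for _ in range(k)]
        td3_err.append(td3_estimate(q) - q_real)
        multi_err.append(multi_critic_estimate(q) - q_real)
    return {
        "td3_bias": statistics.fmean(td3_err),
        "td3_var": statistics.pvariance(td3_err),
        "multi_bias": statistics.fmean(multi_err),
        "multi_var": statistics.pvariance(multi_err),
    }
```

Under these assumptions the simulated TD3 bias should sit near $-u/3$, and the truncated-mean estimate should show both a smaller bias magnitude and a smaller variance, which is the qualitative behaviour the section claims.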
Finally, in order to save computer memory, the multi-critic network structure shares some parameters. In addition, the actor network has a structure similar to that of the critic networks. After the actor network outputs the action $a_t$, the action is passed, in the form of a sum, to the fully connected layer of the critic networks, and finally the state-action value $Q^{\pi}(s, a)$ is obtained, as shown in Figure 8. To take into account both the differences between the individual critic networks and the overall performance of the multi-critic networks, a loss function with a weight and penalty mechanism is introduced [36], which is written in Equation (24),
where $a$, $b$ and $c$ represent the weight coefficients. When $K = 1$, the loss function in this paper degenerates into the loss function of the original DDPG algorithm. Table 2 shows the main process of the multi-CRDPG algorithm.

Table 2. Main process of the multi-CRDPG algorithm.
Multi-CRDPG
1: Initialize the parameters $\theta^{\mu}$, $\theta^{Q}$ of the online actor network $\mu(s|\theta^{\mu})$ and online critic networks $Q(s, a|\theta^{Q})$, and set their target networks $\mu'(s|\theta^{\mu'})$, $Q'(s, a|\theta^{Q'})$ to $\theta^{\mu'} = \theta^{\mu}$, $\theta^{Q'} = \theta^{Q}$.
2: Initialize the experience replay buffer $B$, the batch size $H$ and the number of critic outputs $K$
3: Set the learning parameters $\alpha$, $\beta$, $\tau$, $n_{epo\_total}$ and the weight coefficients $a$, $b$, $c$.
4: for episode $n_{epo} = 1, \ldots, n_{epo\_total}$ do:
5:   for $t = 1, 2, \ldots,$ training size do:
6:     Receive initial state $s_1$
7:     Calculate the current noise variance $\sigma_{epo}$ according to Equation (20)
8:     Select action $a_t = \mu(s|\theta^{\mu}) + N(0, \sigma_{epo})$ from the actor network $\mu(s|\theta^{\mu})$
9:     Execute action $a_t$, get reward $r_t$ and receive the next state $s_{t+1}$
10:    Store transition $(s_t, a_t, r_t, s_{t+1})$ in $B$
11:    Sample a random batch of $H$ transitions from $B$
12:    Calculate the loss function $L_{critic}$ and the estimated Q value:
         Calculate the estimated state-action value $\hat{Q}(s_t, a_t|\theta^{Q})$ of the target critic network using Equation (21)
         Minimize the loss $L_{critic}$ using Equation (24)
         Update the parameters of the online critic network: $\theta^{Q_k} = \theta^{Q_k} - \alpha \nabla L_{critic}$
13:    Calculate the gradient of $\hat{Q}(s_t, a_t|\theta^{Q})$ for the online actor network by using Equation (12)
         Update the parameters of the online actor network: $\theta^{\mu} = \theta^{\mu} - \beta \nabla L_{actor}$
14:    Update the parameters of the target actor network and target critic network using soft updating

A condition monitoring and fault detection method for the HMGB based on multi-CRDPG and EWMA using SCADA data is proposed in this section, and an overall diagram is shown in Figure 9. The diagram mainly consists of four steps: feature selection, data processing, offline data training and online fault detection. Their detailed explanations are as follows.
Feature selection: The first task in constructing an oil temperature forecasting model is to select high-quality input features. Although all features in the SCADA data correlate to some degree with the rise in oil temperature, some correlations are strong and others are weak. If weakly correlated features are kept, the agent may be forced to learn the relationship between them and the oil temperature, which yields an unstable model that forecasts new data poorly. Keeping only input features that carry useful information and correlate strongly with the oil temperature effectively prevents overfitting of the basic learner. Therefore, it is necessary to eliminate the weakly correlated features. In this paper, a cross-correlation function (CCF) is used to measure the correlation between the input features and the oil temperature at different time lags, as written in Equation (25),
where $f_{OT}(t + \tau)$ represents the oil temperature series, and $f_i(t)$ represents the series of the different input features.
Data processing: In most data-driven methods for oil temperature forecasting, the actual data may contain missing and abnormal values, and the scales of the parameters also differ. Hence, outlier detection, missing value interpolation and data normalization are indispensable operations in data processing. Firstly, for slowly changing data such as oil temperature, abnormal points are those with sudden changes. In practice, the difference between the data values at time $t$ and time $t + 1$ cannot be large, so checking this difference is a simple and effective outlier detection method. In order to restore the authenticity and objectivity of the information, if $OT_t$ is considered an abnormal point, the mean of $OT_{t-1}$ and $OT_{t+1}$ is substituted for $OT_t$. Secondly, to deal with a small number of missing values, in addition to optimizing the acquisition system as much as possible to avoid missing data, Bézier interpolation is used in this study to obtain high-precision, continuous data. Finally, normalizing data of different scales benefits the training of the agent, because it adjusts the values of the input features to a similar range and facilitates the selection of a uniform learning rate.
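The outlier replacement and normalization steps can be sketched as follows. The text does not state the jump threshold or the normalization formula, so the `max_step` parameter, the rule that a point must jump away from both neighbours, and the use of min-max scaling are assumptions; Bézier interpolation of missing values is not shown.

```python
def replace_sudden_jumps(series, max_step):
    """Replace interior spikes with the mean of their two neighbours.

    A point counts as abnormal when it differs from BOTH neighbours by more
    than max_step (an assumed rule; the paper gives no threshold).
    """
    cleaned = list(series)
    for t in range(1, len(cleaned) - 1):
        prev_val, next_val = cleaned[t - 1], series[t + 1]
        if abs(series[t] - prev_val) > max_step and abs(series[t] - next_val) > max_step:
            cleaned[t] = (prev_val + next_val) / 2.0
    return cleaned


def min_max_normalize(series):
    """Scale a series to [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]
    return [(x - lo) / (hi - lo) for x in series]
```

For example, `replace_sudden_jumps([60.0, 60.1, 75.0, 60.3, 60.4], max_step=1.0)` replaces only the isolated 75.0 reading with the mean of 60.1 and 60.3.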
Offline data training: The multi-CRDPG algorithm is used to establish the oil temperature forecasting model in this step. In the feature-extraction stage, an autocorrelation function (ACF) and a partial autocorrelation function (PACF) are adopted to determine the optimal lag period. Before training the agent, the SCADA data should be divided into a training set and a testing set by a series of time windows, each window containing several input series with a length equal to the lag period and a prediction series with a time step of n. After the parameters are initialized, the training task can be executed.
Fault threshold calculation: Because the accuracy of the prediction model is limited, the residual error between the predicted oil temperature and the oil temperature in the test set can be used to determine the fault threshold. The residual error processed by EWMA not only reflects the trend of the residual value but also effectively eliminates false alarm points, so that the fault threshold can be set on a sounder basis. The EWMA expression is shown in Equation (26),
where $e_t$ and $e_{t-1}$ represent the residual errors at times $t$ and $t - 1$, and $\bar{e}_t$ represents the sliding average of the residual error at time $t$. $\lambda$ represents the weight of historical data; it reflects the impact of the previous data on the next data, and this impact gradually weakens with the passage of time.
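The smoothing step can be sketched in a few lines. Equation (26) is not reproduced in this text, so the recursion below is the common EWMA form with $\lambda$ placed on the previous smoothed value, following the statement that $\lambda$ is the weight of historical data; this placement, the initialization with the first residual, and the name `ewma` are assumptions.

```python
def ewma(residuals, lam):
    """Exponentially weighted moving average of a residual series.

    ASSUMED recursion: smoothed_t = lam * smoothed_{t-1} + (1 - lam) * e_t,
    started from the first residual.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    smoothed = []
    prev = None
    for e in residuals:
        prev = e if prev is None else lam * prev + (1.0 - lam) * e
        smoothed.append(prev)
    return smoothed
```

With a large $\lambda$, a single spike in the residual is strongly damped, which is how the smoothing suppresses isolated false alarm points while still following a sustained drift.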
A thorny problem here is that the distribution of the generated residual error is unknown, and the residual errors generated by different models are considered to have different distributions, e.g., t-distribution, Gaussian distribution, etc. Unlike parametric estimation, nonparametric estimation does not add any prior knowledge but fits the distribution according to the characteristics and properties of the data itself, which can yield a better model than parametric estimation. KDE is a nonparametric estimation method. Without knowing the distribution of the residual error, the fault threshold can be expressed as Equation (27),
where $S$ represents the fault threshold, and $N$ represents the number of data points in the testing set. $S_{th}$ represents an interval upper limit. According to the interval estimation theory in statistics, the distribution characteristics of residual errors can be analyzed by KDE.
Assuming that the residual error is distributed in the interval $[0, S_{th}]$ with a probability value of $1 - p$, $1 - p$ is called the confidence level, which represents the cumulative probability. The smaller the value of $p$, the less likely the occurrence of $S > S_{th}$. By setting different probability values for $p$, multiple thresholds of $S$ can be obtained to judge the health degree of an HMGB.
Online fault detection: After the oil temperature forecasting model is established and the fault threshold $S$ is set, the real-time working condition data can be input into the model. Whether the HMGB is degraded can then be detected by checking whether the residual error between the predicted and the actual oil temperature exceeds the fault threshold $S$.
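A threshold of this kind can be sketched with a Gaussian-kernel density estimate and a bisection search on its cumulative distribution. Equation (27) is not reproduced in this text, so the Gaussian kernel, the rule-of-thumb bandwidth, the bisection search, and the names `kde_cdf`, `kde_threshold` and `is_degraded` are all assumptions made for illustration.

```python
import math
import statistics


def kde_cdf(x, samples, bandwidth):
    """CDF of a Gaussian-kernel density estimate evaluated at x."""
    scale = bandwidth * math.sqrt(2.0)
    total = sum(0.5 * (1.0 + math.erf((x - s) / scale)) for s in samples)
    return total / len(samples)


def kde_threshold(samples, p=0.01, bandwidth=None):
    """Find S_th such that the estimated P(S <= S_th) equals 1 - p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie in (0, 1)")
    if bandwidth is None:
        # Rule-of-thumb bandwidth (an assumption, not taken from the paper).
        bandwidth = 1.06 * statistics.stdev(samples) * len(samples) ** -0.2
    if bandwidth <= 0.0:
        raise ValueError("bandwidth must be positive")
    lo = min(samples) - 10.0 * bandwidth
    hi = max(samples) + 10.0 * bandwidth
    for _ in range(100):  # bisection on the monotone CDF
        mid = 0.5 * (lo + hi)
        if kde_cdf(mid, samples, bandwidth) < 1.0 - p:
            lo = mid
        else:
            hi = mid
    return hi


def is_degraded(smoothed_residual, threshold):
    """Online check: flag a point whose smoothed residual exceeds the threshold."""
    return abs(smoothed_residual) > threshold
```

Calling `kde_threshold` with several values of `p` gives the graded thresholds mentioned above, and `is_degraded` applies one of them to each new EWMA-smoothed residual.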

Experimental Verification and Results Analysis
In this section, to verify the effectiveness of the proposed fault detection method for an HMGB, a simulated helicopter transmission system was built to collect data generated under a healthy state, train the oil temperature forecasting model and carry out a series of fault-seeded experiments. The test rig mainly includes a drive motor, HPGB, spur gearbox, load motor, and data acquisition system, as shown in Figure 10. In this study, the HPGB is the monitored object, and the collected variables include load torque, driving motor speed, gearbox oil pressure, gearbox inlet oil temperature, ambient temperature, and gearbox oil temperature; the gearbox oil temperature reflects the health status of the HPGB. The 1# sensor for collecting the inlet oil temperature and the 2# sensor for collecting the gearbox oil temperature are shown in Figure 11.

Generation of Datasets
In actual flight, the helicopter cannot change the engine power by adjusting the throttle. The output power is approximately constant, and the lift is controlled by changing the angle of the rotor hub, but the HMGB speed is variable. Therefore, the condition of constant motor output power is simulated, with a motor output power of 53 kW. The relevant variables were collected at a sampling interval of 1 s from 17 June 2022 to 25 June 2022, totaling 180 h and 648,000 points. Figure 12 shows the results after data preprocessing; it can be observed that the gearbox oil temperature has a certain delayed correlation with the other variables but little relationship with the ambient temperature. The dataset details and the maximum cross-correlation coefficient between each variable and the gearbox oil temperature are shown in Table 3. The motor speed, load, gearbox oil pressure and gearbox inlet oil temperature have the strongest CCF values with the gearbox oil temperature. Therefore, these four variables are selected as inputs for the oil temperature forecasting model.
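The maximum cross-correlation used for this selection can be sketched as a search over non-negative lags. Equation (25) is not reproduced in this text, so the Pearson-style normalization over the overlapping window and the names `ccf` and `max_ccf` are assumptions; the paper's normalization may differ.

```python
import math


def ccf(feature, target, lag):
    """Normalized cross-correlation between feature[t] and target[t + lag]."""
    n = len(feature) - lag
    if len(feature) != len(target) or lag < 0 or n < 2:
        raise ValueError("series must have equal length and leave >= 2 overlapping points")
    x = feature[:n]
    y = target[lag:lag + n]
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # a constant series carries no correlation information
    return cov / (norm_x * norm_y)


def max_ccf(feature, target, max_lag):
    """Return (lag, value) of the largest absolute cross-correlation."""
    best_lag = max(range(max_lag + 1), key=lambda k: abs(ccf(feature, target, k)))
    return best_lag, ccf(feature, target, best_lag)
```

A feature whose best value stays small at every lag, as reported here for the ambient temperature, would be dropped from the model inputs.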

In this study, after the collected data are assembled into a dataset, the first 70% is used as the training set and the remaining 30% as the test set. The lag time is selected by analyzing the ACF and PACF. Figure 13 shows the ACF and PACF diagrams of the oil temperature. The red line represents the significance threshold, which is set to 5% in this case. The ACF diagram tails off to zero, while the PACF diagram shows a truncation trend. In the PACF, the time steps before 35 exceed this threshold and are regarded as strongly relevant to the oil temperature, so the optimal lag period is set to 35. It is worth noting that, although single-step prediction is selected in this paper, the first 35 time steps are still used to predict the oil temperature at the next time step in order to ensure prediction accuracy.

Applying the proposed multi-CRDPG to the oil temperature prediction model is an important step in this paper, and the prediction of oil temperature must be recast as a continuous control problem of DRL. Before training an agent, the specific state $s_t$ and action $a_t$ at each time should be clearly defined. Feature samples are generated by using two data windows, in which state $s_t$ is composed of four time series with 35 time steps, and action $a_t$ is composed only of the output oil temperature $OT_t$ at time $t$, as shown in Figure 14. After windowing, the training set contained 12,960 samples, and the testing set contained 5554 samples. Once the oil temperature prediction has been converted into a decision-making process on the current gearbox working condition, the multi-CRDPG algorithm can be used.
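The construction of state and action samples can be sketched as a windowing routine. The stride between windows is not stated in the text; a stride equal to the lag of 35 is assumed here because 648,000 points then give 18,514 windows, which equals 12,960 + 5554, but this is an inference rather than a confirmed detail. The function names are hypothetical.

```python
def count_windows(n_points, lag=35, stride=35):
    """Number of (state, target) pairs a series of n_points yields."""
    return len(range(0, n_points - lag, stride))


def make_windows(features, oil_temp, lag=35, stride=35):
    """Build (state, target) pairs.

    state  : one slice of length `lag` per input feature series
    target : the oil temperature at the step right after the window
    """
    n = len(oil_temp)
    if any(len(f) != n for f in features):
        raise ValueError("all series must have the same length")
    samples = []
    for start in range(0, n - lag, stride):
        state = [f[start:start + lag] for f in features]
        samples.append((state, oil_temp[start + lag]))
    return samples
```

Here each state holds four slices of 35 values, one per selected input variable, and the target plays the role of the action $a_t$ that the agent is rewarded for matching.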

Parameter optimization is an indispensable step to ensure the performance of the oil temperature prediction model, and the random grid search method is applied as the parameter tuning method in this paper because the model involves too many hyperparameters. The random grid search method abandons the global hyperparameter space and instead selects some parameter combinations to construct a hyperparameter subspace. Compared with the enumeration grid search method, the random grid search method requires less computation. For example, assume that there are parameters A and B in a 2-D search space, that the values of A and B both lie in [1, 7], and that the search step is set to 1. The enumeration grid search method must then search all 49 parameter combinations, whereas the random grid search method only needs to search some of them. Although the results of the random grid search method are uncertain, its minimum loss is very close to the minimum loss obtained by the enumeration grid search method. The random grid search method is used together with cross-validation, mainly K-fold cross-validation. The main idea is to divide the original dataset into K groups, use each group once as the validation set and the remaining K − 1 groups as the training set, so that K trained models are obtained, and take the average error over the K runs as the final evaluation index. K is set to 10 in this paper. In addition, within the limitation of each sample length (35 data points), increasing the number of convolution and pooling layers of the convolutional neural network as much as possible enhances the feature extraction ability. Detailed experimental conditions and parameter settings are shown in Table 4.

Table 4. Column headers: Modules, Layers, Types, Parameters, Input/Output Channel.
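The tuning procedure can be sketched as a random search over a grid combined with K-fold index splits. The `evaluate` callable below stands in for "train the model with these hyperparameters and return the average K-fold validation error"; it, the function names, the contiguous fold layout and the sampling details are assumptions, not the authors' implementation.

```python
import itertools
import random


def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for K contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


def random_grid_search(grid, evaluate, n_trials, seed=0):
    """Evaluate a random subset of the grid and return (best_params, best_loss)."""
    names = list(grid)
    combos = list(itertools.product(*(grid[name] for name in names)))
    rng = random.Random(seed)
    chosen = rng.sample(combos, min(n_trials, len(combos)))
    scored = [(evaluate(dict(zip(names, combo))), combo) for combo in chosen]
    best_loss, best_combo = min(scored, key=lambda pair: pair[0])
    return dict(zip(names, best_combo)), best_loss
```

For the 7 × 7 example in the text, `random_grid_search({"A": range(1, 8), "B": range(1, 8)}, evaluate, n_trials=10)` would evaluate 10 of the 49 combinations instead of all of them.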

Performance Evaluation of the Model
A model can be evaluated correctly and persuasively only with reasonable indices; therefore, four classical evaluation indices are used in this paper, i.e., mean absolute error (MAE), R-square ($R^2$), mean absolute percentage error (MAPE) and root mean squared error (RMSE). Different indicators reflect the performance of the model from different perspectives, and together the four indicators evaluate the model comprehensively. The expressions are written in Equations (28) to (31),
where $OT_t^{actual}$ and $OT_t^{predicted}$ indicate the actual and predicted values at time step $t$, respectively. It is worth noting that the smaller the values of MAE, MAPE and RMSE and the larger the value of $R^2$, the higher the prediction accuracy of the model.
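The four indices can be computed as in the sketch below, which uses their standard textbook definitions. Equations (28) to (31) are not reproduced in this text, so details such as reporting MAPE as a fraction rather than a percentage are assumptions, and the function name is hypothetical.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, RMSE, MAPE (as a fraction) and R^2 for two equal-length series."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # MAPE needs non-zero actual values, which holds for oil temperatures in deg C.
    mape = sum(abs(e / a) for e, a in zip(errors, actual)) / n
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "R2": r2}
```

MAE and RMSE carry the unit of the oil temperature, while MAPE and $R^2$ are dimensionless.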

Performance Comparison of Different Models
This subsection compares multi-CRDPG, CRDPG (only using CNN-LSTM as a basic learner of the DDPG algorithm, without RIF and multi-critic), RDPG, DDPG and other baseline models in terms of prediction accuracy, generalization and robustness. Because DRL algorithms typically take longer to converge than DL algorithms, multi-CRDPG is compared only with RDPG and DDPG in terms of time cost. The other baseline models include the classical models LS-SVM, NARX and ARIMA and the state-of-the-art models CNN, GRU and CNN-LSTM, all of which have been proven effective in time series prediction. In order to analyze the advantages of the proposed method step by step, the parameters of GRU, CNN, CNN-LSTM, DDPG and CRDPG are the same as those of the corresponding modules in multi-CRDPG. The parameter settings of the remaining models, LS-SVM, NARX and ARIMA, are described below, and detailed parameters are shown in Table 5. Next, 10 oil temperature prediction experiments were carried out, and the average of the relevant results was taken. Figures 15 and 16 display the oil temperature prediction results of the ten models under different working conditions in the testing sets, and the comparison results can be summarized in detail as follows:
(a) Some conventional time series prediction models, including LS-SVM, NARX, ARIMA and GRU, can predict the change trend in oil temperature to a certain extent, but their prediction accuracy is very limited in the face of complex working condition data. As a classical feature extraction algorithm, CNN is good at data classification but not at mining the temporal patterns of time series, so its prediction performance is unsatisfactory. Of the three DRL models, DDPG performs the worst, even worse than conventional DL, because the structure of the BP neural network is too simple.
(b) The key to the successful application of the DRL algorithm is to choose a model with excellent performance as the basic learner. After the features of the time series are extracted, the prediction accuracy is higher than when forecasting directly from the original data, and the training time of the model is shorter because the sequence information is simplified. Across the four evaluation indices of LS-SVM, NARX, ARIMA, CNN, GRU and CNN-LSTM, CNN-LSTM performed better than the other models, which implies that it has more potential as a basic learner of DRL and can reach the best prediction accuracy.
(c) The comparison of GRU and CNN-LSTM with their corresponding RL algorithms, RDPG and CRDPG, shows that the performance of the basic learner improves greatly when it is guided by the reinforcement learning framework. A possible reason is that the decision-making ability of the reinforcement learning framework is stronger than that of the traditional deep learning approach of directly fitting data.
formation rules of time series, so its prediction performance is unsatisfactory. Of three DRL models, DDPG performs the worst among these models, even worse than conventional DL, because the structure of the BP neural network is too simple. (b) The key to the successful application of the DRL algorithm is to choose a model with excellent performance as the basic learner. After extracting the feature of the time series, the prediction accuracy is higher than directly forecasting from the original data, and the training time of the model is shorter due to the simplification of the sequence information. By observing four evaluation indicies of LSSVM, NARX, ARIMA, CNN, GRU and CNN-LSTM, CNN-LSTM performed better than other models, which implies it has more potential as a basic learner of DRL and obtains optimal predicted accuracy. (c) When comparing GRU and CNN-LSTM with their corresponding RL algorithms RDPG and CRDPG, it was fully proven that the performance of the basic learner guided by the reinforcement learning framework has been greatly improved. The possible reason is that the decision-making ability of the reinforcement learning framework is stronger than that of the traditional deep learning method for directly fitting data.  In order to further verify the effectiveness of the proposed improved method in terms of time cost and prediction accuracy, it is necessary to compare multi-CRDPG with DDPG, CRPG and CRDPG. Figure 17 shows the forecasting results (only showing the results of the last 2592 s) of the above four DRL algorithms. It can be observed that CRDPG obviously outperform the RDPG and DDPG models, which indicates that using CNN-LSTM as a basic learner can effectively perceive time series information compared with GRU and BP. RDPG cannot correctly predict the trend in oil temperature in the time period when the working conditions change sharply, and the prediction error is unacceptable for the actual fault detection of HMGBs. 
However, the CRDPG prediction is always higher than the actual oil temperature, and the evaluation indicators of CRDPG measured by MAE, RMSE, R² and RMSE are 1.09 °C, 0.94 °C, 1.27 °C and 0.04 °C, respectively. This bias is caused by the overestimation of the state value of the working condition by a single critic network. When the multi-critic structure is introduced in multi-CRDPG, the MAE, RMSE, R² and RMSE are 0.66 °C, 0.98 °C, 0.008 °C and 0.4 °C. Except for RMSE, the indicators are greatly improved, which indicates that the multi-critic network estimates the state value of the working condition more accurately.
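As a reference for how three of the indicators named above are usually defined, the following minimal sketch computes MAE, RMSE and R² in plain Python. The temperature values are hypothetical and only illustrate how a constant positive bias, like the one described for CRDPG, shows up in each indicator; this is not the authors' evaluation code.

```python
import math

def mae(actual, predicted):
    """Mean absolute error, in the same unit as the inputs (here °C)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error, in the same unit as the inputs (here °C)."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

def r_squared(actual, predicted):
    """Coefficient of determination (dimensionless)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical oil temperatures (°C); the prediction is biased high by 1 °C.
actual = [60.0, 61.0, 62.0, 63.0, 64.0]
predicted = [a + 1.0 for a in actual]

print(mae(actual, predicted), rmse(actual, predicted), r_squared(actual, predicted))
```

With a constant 1 °C overestimate, MAE and RMSE both equal the bias, while R² drops well below 1 even though the trend is followed perfectly.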
Figure 18 presents the loss values of CRDPG and multi-CRDPG during training. The loss of multi-CRDPG decreases rapidly near the 100th update and has converged before the 500th update, whereas the loss of CRDPG decreases slowly. This indicates that the designed RIF has a strong incentive effect on the training of the basic learner.

The residual error of the HPGB oil temperature in a healthy state is shown in Figure 19a. Although some residual error values exceed 3 °C, the overall distribution is symmetrical, with an average value of 0.198 °C and a standard deviation of 0.483 °C. The residual error processed by the EWMA control principle is shown by the red line in Figure 19a. Figure 19b shows the distribution of the residual error. With the confidence level set to 99.5%, the probability density function (PDF) and cumulative distribution function (CDF) give a residual error of 2.61 °C as the fault threshold.
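The threshold procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the smoothing weight `lam`, the Gaussian kernel, the Silverman-style bandwidth rule and the choice to fit the KDE on the EWMA-smoothed residuals are all assumptions, and the residuals are synthetic values drawn with the mean and standard deviation quoted in the text.

```python
import math
import random
import statistics

def ewma(values, lam=0.2):
    """Exponentially weighted moving average; lam is an assumed smoothing weight."""
    smoothed, z = [], values[0]
    for x in values:
        z = lam * x + (1.0 - lam) * z
        smoothed.append(z)
    return smoothed

def kde_cdf(x, samples, bandwidth):
    """CDF of a Gaussian kernel density estimate, evaluated at x."""
    scale = bandwidth * math.sqrt(2.0)
    return sum(0.5 * (1.0 + math.erf((x - s) / scale)) for s in samples) / len(samples)

def kde_threshold(samples, confidence=0.995):
    """Value at which the KDE CDF reaches the confidence level (found by bisection)."""
    sigma = statistics.stdev(samples)
    bandwidth = 1.06 * sigma * len(samples) ** (-0.2)  # Silverman-style rule (assumption)
    lo = min(samples) - 10.0 * bandwidth
    hi = max(samples) + 10.0 * bandwidth
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if kde_cdf(mid, samples, bandwidth) < confidence:
            lo = mid
        else:
            hi = mid
    return hi

# Synthetic healthy-state residuals (°C), not the measured data.
random.seed(0)
residuals = [random.gauss(0.198, 0.483) for _ in range(2000)]
smoothed = ewma(residuals)
threshold = kde_threshold(smoothed, confidence=0.995)
print(f"fault threshold on smoothed residuals: {threshold:.3f} °C")
```

Because the synthetic residuals are Gaussian and the smoothing weight is assumed, the printed value is not expected to reproduce the 2.61 °C threshold reported above.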


Fault Detection Analysis
To explore the actual performance of the proposed fault detection method, which is based on a multi-CRDPG oil temperature forecasting model and an EWMA control chart, a series of damage-seeded experiments was performed with different fault types in the transmission system and lubricating oil system of the HPGB. The occurrence of these faults causes the oil temperature to rise. The variable composition of the data set is the same as that of the data set used to establish the oil temperature prediction model. Three damage-seeded experiments are tested in this paper: planet gear broken teeth (Fault 1), a damaged bearing cage and rolling elements (Fault 2) and a clogged oil filter element (Fault 3). The detailed experimental environment is shown in Table 6.

As shown in Figure 20, planetary gear tooth breakage is a typical fault that threatens the operation safety of an HMGB, and it must be detected as soon as possible. When this fault is seeded, the predicted and actual oil temperature values are shown in Figure 21a, and the EWMA control chart of the residual errors between the predicted and actual oil temperature values is shown in Figure 21b. The green and pink lines represent residual errors below and above the fault threshold, respectively. Since the broken tooth has little influence on the operation state of the HPGB in the early stage, it hardly causes the oil temperature to rise, so the fault cannot be detected at first. As time goes on, the transmission system of the HPGB gradually deteriorates due to the broken planet gear teeth, and the oil temperature tends to rise. About 18,000 s after the fault was seeded, the actual oil temperature exceeded the predicted oil temperature by more than the fault threshold of 2.61 °C, and the HPGB was deemed to have a serious fault.
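The detection rule described above (raise an alarm once the EWMA-processed residual between actual and predicted oil temperature rises above the fault threshold) can be sketched as below. The function name `first_alarm_time`, the smoothing weight `lam` and the slowly drifting temperature series are hypothetical; only the 2.61 °C threshold comes from the text.

```python
def first_alarm_time(times, actual, predicted, threshold, lam=0.2):
    """Return the first time at which the EWMA of (actual - predicted)
    exceeds the threshold, or None if it never does."""
    z = actual[0] - predicted[0]
    for t, a, p in zip(times, actual, predicted):
        z = lam * (a - p) + (1.0 - lam) * z  # EWMA update of the residual
        if z > threshold:
            return t
    return None

# Hypothetical run: the model predicts a steady 70 °C, while the actual
# oil temperature drifts upward by 0.001 °C per second after seeding.
times = list(range(0, 6000, 10))
predicted = [70.0 for _ in times]
actual = [70.0 + 0.001 * t for t in times]

alarm = first_alarm_time(times, actual, predicted, threshold=2.61)
print("alarm raised at t =", alarm, "s")
```

Because the EWMA lags a rising residual, the alarm is raised slightly after the raw residual first crosses the threshold, which trades a short delay for fewer alarms caused by isolated spikes.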

Damaged Bearing Cage and Rolling Elements
The bearing is also an important part of the transmission system; breakage of its cage and rolling elements is a common failure that can lead to other failures. Figure 22 shows the predicted and actual oil temperature values after a bearing with a broken cage and rolling elements was seeded in the HPGB. In this experiment, the driving motor was kept in the high-speed range, which accelerated the degradation of the bearing and resulted in a rapid rise in oil temperature. The oil temperature exceeded the fault threshold 900 s after the fault was seeded.


Clogged Oil Filter Element
In addition to the transmission system, the lubrication oil system is also an important system in the gearbox, as it ensures the lubrication and heat dissipation of the mechanical parts. To simulate the clogging of the oil filter element, impurities were artificially added to the lubricating oil until the filter element became clogged. Figure 23 shows that, at the beginning of the HPGB operation, the actual oil temperature is lower than the predicted oil temperature due to the influence of the ambient temperature. Once impurities have accumulated on the oil filter element to a certain extent, the filter element gradually becomes blocked, and the poor oil supply to the gearbox makes heat dissipation difficult and raises the oil temperature. When the residual error exceeds the threshold line at about 162 s, the HPGB is judged to be abnormal.
Finally, Table 7 compares the time that the above models need to detect the faults, with each experiment repeated 30 times. The proposed multi-CRDPG model requires the shortest time in all three fault detection cases, which suggests that the better the oil temperature forecasting model performs, the earlier a seeded fault can be found. Table 8 shows the missing alarm rate of the above models, that is, the share of experiments in which the fault was not detected within the duration of the experiment. Multi-CRDPG and CRDPG show high reliability: all seeded faults were identified in the repeated experiments. The other models have a certain number of missing alarms for the different faults, and the probability of a missing alarm increases as the fault severity decreases.
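The two summary quantities reported in Tables 7 and 8 can be computed from repeated runs as in the following sketch. The helper name `summarize_runs` and the run outcomes are hypothetical, and the sketch assumes that the mean detection time is taken over the runs in which the fault was detected.

```python
def summarize_runs(detection_times):
    """detection_times holds one entry per repeated run: the detection time in
    seconds, or None when the fault was not flagged within the run."""
    detected = [t for t in detection_times if t is not None]
    missing_rate = 1.0 - len(detected) / len(detection_times)
    mean_time = sum(detected) / len(detected) if detected else None
    return mean_time, missing_rate

# Hypothetical outcome of 30 repeated runs: 27 detections and 3 missed alarms.
runs = [900.0] * 27 + [None] * 3
mean_time, missing_rate = summarize_runs(runs)
print(mean_time, missing_rate)
```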


Conclusions and Future Works
Accurate oil temperature forecasting is of great significance for the fault detection of HMGBs, because the detection relies on a forecasting model that describes the relationship between the HMGB working condition and the oil temperature in the healthy state. The method then checks whether the residual error between the oil temperature predicted for the current working condition and the actual oil temperature exceeds the fault threshold. An inaccurately predicted oil temperature can therefore result in false alarms and missed alarms.
In this paper, a novel fault detection method is proposed for oil temperature forecasting and fault detection. It combines an improved deep deterministic policy gradient algorithm, which uses a CNN-LSTM basic learner, a reward incentive function and multi-critic networks, with an EWMA control chart. The actual HPGB datasets include healthy samples and three failure cases; nine baseline models and four evaluation indicators are used to verify the performance of the proposed model. Based on the results of the comparison experiments, five conclusions are summarized as follows:

(1) The proposed model has higher prediction accuracy and more stable convergence than the baseline models. The results of the comparison experiments on the datasets of each working condition demonstrate that an accurate oil temperature prediction model is successfully established. Meanwhile, the robustness of the model is verified, which supports the reliability of the prediction and detection results.
(2) The proposed deep deterministic policy gradient method is based on a CNN-LSTM network, which can extract complex time series features, eliminate redundant information, reduce the influence of noise and uncover the change rules of time series. Moreover, a CNN-LSTM trained within the deep deterministic policy gradient framework obtains better performance than the original CNN-LSTM.
(3) The proposed reward incentive function accelerates and stabilizes the convergence of model training by rewarding the agent at the time steps where a reward is deserved.
(4) The proposed variable exploration variance helps the agent to fully explore the state space and correctly evaluate each state value in the initial training stage. In the later training stage, the noise variance is reduced so that the model converges gradually.
(5) The proposed multi-critic network structure and state-action value estimation strategy reduce the overestimation and underestimation of the state-action values of the agent, which is a key step in further improving the forecasting accuracy of the basic learner.

Although the proposed model has the above innovations and advantages, it has some limitations that require further improvement. Firstly, the model involves many hyperparameters, and good initial values must be selected to ensure its prediction accuracy. Therefore, an adaptive parameter selection method needs to be designed in the future. Secondly, there is a risk that the proposed model overfits as the number of iterations increases. In the future, an update method should be developed to ensure that the model is updated in the direction of better performance.
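As a rough illustration of conclusions (4) and (5), the sketch below shows one way an exploration noise schedule with shrinking variance and a multi-critic value target could look. The linear decay, the parameter values and the use of the mean over critics are assumptions made for this example; the formulas actually used by the proposed method are those given in the method section.

```python
import random

def exploration_sigma(step, sigma_start=1.0, sigma_end=0.05, decay_steps=500):
    """Noise standard deviation that shrinks linearly from sigma_start to
    sigma_end over decay_steps updates (assumed schedule)."""
    frac = min(step / decay_steps, 1.0)
    return sigma_start + frac * (sigma_end - sigma_start)

def noisy_action(action, step, rng=random):
    """Add zero-mean Gaussian exploration noise whose spread depends on the step."""
    return action + rng.gauss(0.0, exploration_sigma(step))

def multi_critic_target(reward, next_q_values, gamma=0.99):
    """Bootstrapped target that averages the next-state estimates of several
    critics (assumed aggregation) to damp the error of any single critic."""
    return reward + gamma * sum(next_q_values) / len(next_q_values)

# Early training explores widely; late training stays close to the policy output.
print(exploration_sigma(0), exploration_sigma(250), exploration_sigma(1000))
# Three hypothetical critics disagree about the next-state value.
print(multi_critic_target(reward=1.0, next_q_values=[2.0, 4.0, 3.0], gamma=0.5))
```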

Conflicts of Interest:
The authors declare no conflict of interest.