A Novel Fault Detection with Minimizing the Noise-Signal Ratio Using Reinforcement Learning

In this paper, a reinforcement learning approach is proposed to detect unexpected faults, where the noise-signal ratio of the data series is minimized to achieve robustness. Based on information from fault-free data series, fault detection is promptly implemented by comparing the model forecast with the real-time process. The fault severity degree is also discussed by measuring the distance between the healthy parameters and the faulty parameters. The effectiveness of the algorithm is demonstrated by an example of a DC-motor system.


Introduction
With the increasing expense and complexity of modern industrial systems, there is a growing demand for higher reliability and security. Measurement instrument faults may result in performance degradation or even malfunction due to incorrect conclusions drawn by the process fault detection and diagnosis system. Therefore, the problem of fault detection and diagnosis (FDD) has become a popular research topic [1][2][3].
Generally, fault diagnosis methods can be categorized into model-based methods, signal-based methods and knowledge-based methods [1,2]. In model-based methods, the models of the industrial processes or the practical systems are obtained by using either physical principles or system identification techniques. Based on the model, fault diagnosis algorithms are developed to monitor the consistency between the measured outputs of the practical systems and the model-predicted outputs. Signal-based methods utilize measured signals rather than explicit input-output models for fault diagnosis. The feature signals to be extracted for symptom (or pattern) analysis can be in either the time domain (e.g., mean, trends, standard deviation, phases, slope and magnitudes such as peak and root mean square) or the frequency domain (e.g., spectrum). These issues have been studied with various signal processing methods, such as wavelet transform (WT) [4], empirical mode decomposition (EMD) [5,6], intrinsic mode functions (IMF) [7] and local mean decomposition (LMD) [8]. Large volumes of data have become more accessible with the development of modern electronic and measurement technologies such as SCADA and smart sensors [9][10][11][12][13], which stimulates knowledge-based fault diagnosis methods. By applying a variety of artificial intelligence techniques (either symbolic intelligence or computing intelligence) to the available historic data of the industrial processes, the underlying knowledge, which implicitly represents the dependence of the system variables, can be extracted. Interesting results on knowledge-based fault diagnosis and applications were reported during the last few decades [14][15][16][17][18].
Unexpected faults may cause performance degradation or even malfunction, and it is thus desirable to detect, isolate and identify the faulty components as early as possible. However, it is difficult to reveal the fault feature in a short time because of the influence of heavy background noise. Based on statistical theory, traditional data-driven methods can be implemented with sliding-window technology, in which the data in the window are regarded as a condensed representation of the system's characteristics and are renewed as the window slides. The features of the system can be extracted by analysing the data series in a sliding window after a filtering process and further emphasized by techniques such as PCA [19], SVM [20], information theory [21], and so forth. These traditional approaches have two flaws for fault detection. The first is that more data samples need to be collected before a fault changes the statistical character of the window, because a few new data can only have a small impact on the statistical character of the whole window. More data samples require more time to collect. Therefore, it is difficult for traditional sliding-window-based technology to carry out swift fault detection. The second is the lack of effective data in the case of an early unexpected fault. Due to the complexity, uncertainty and unpredictability of faults, it is challenging to obtain enough valid fault data within a short period, except for some special cases such as batch processes. There is a trade-off between collecting more faulty data and allowing less admissible time.
It is well known that the model parameters are more reliable than the state variables, especially in noisy conditions. However, the model parameters also face two problems similar to the aforementioned ones. The traditional approaches struggle to provide quick detection due to the lack of early information on sudden and unexpected faults.
Reinforcement learning (RL) is a powerful tool, which is motivated by statistics, psychology, neuroscience and computer science [22][23][24]. An agent learns through experience, without a teacher. In each training session, named an episode, the agent explores the environment and receives the reward, if any, until it reaches the desired goal. The purpose of the training is to enhance the 'brain' of the agent. The goal of an agent is to maximize the reward that is received in the long run. One can obtain the optimal action using only the current states [25][26][27][28].
Motivated by the idea of "obtaining the optimal action using only the current states", an original idea based on RL is proposed to solve the swift fault detection problem. The minimization of the noise-signal ratio (NSR) is taken as the goal for the expected series, and the policy iteration of RL is used as a tool to obtain the parameters by treating the parameters as the actions of RL. Then, one can get the model parameters corresponding to the current noisy states. By comparing with the noise information (which is easier to obtain offline from the healthy data series), one can implement prompt fault detection and diagnosis with the next sample. There are two main contributions in this paper.
(1) The unexpected faults will be detected promptly within a sampling period by using the measured data only.
(2) The estimated model is always kept consistent with the real-time process under noisy conditions by adjusting the parameters at every sampling instant, with the goal of minimizing the NSR using RL.

Problem Description
Suppose a discrete-time system with noises is controlled by a pre-controller, as depicted in Figure 1. Here, D is the order of the system; u(k) ∈ R^m and y(k) ∈ R^p are the control input and measured output, respectively; ω(k) ∈ R^n is a white Gaussian signal with zero mean and covariance matrix Σ_ω. We suppose the system states are observable. With the state and control series collected in a regressor φ(k) ∈ R^(Dn+m), the system can be rewritten in the vector form of Equation (1), where θ is a parameter matrix and T represents a transpose.
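To make the vector form concrete, the following minimal sketch simulates a model of the form x(k + 1) = θ^T φ(k) + ω(k). The ordering of the entries in φ(k) (most recent state first, control input last), the storage layout of θ and all numerical values are assumptions made for illustration; they are not taken from the paper.

```python
import random

def regressor(state_history, control):
    """Stack the D most recent state vectors (newest first) and the control
    input into one regressor vector phi(k) of length D*n + m (assumed layout)."""
    phi = []
    for state in state_history:
        phi.extend(state)
    phi.extend(control)
    return phi

def next_state(theta, phi, noise_std, rng):
    """Evaluate x(k+1) = theta^T phi(k) + w(k) with zero-mean Gaussian w(k).
    theta is stored as (D*n + m) rows by n columns."""
    n = len(theta[0])
    return [
        sum(theta[j][i] * phi[j] for j in range(len(phi)))
        + rng.gauss(0.0, noise_std)
        for i in range(n)
    ]

# Hypothetical first-order example: n = 2 states, D = 1, m = 1 input.
theta = [[0.9, 0.0],
         [0.1, 0.8],
         [0.0, 0.5]]
rng = random.Random(0)
x = [0.0, 0.0]
trajectory = []
for k in range(3):
    # noise_std=0.0 keeps the example deterministic
    x = next_state(theta, regressor([x], [1.0]), noise_std=0.0, rng=rng)
    trajectory.append(x)
```

Setting `noise_std` to a positive value adds the additive noise ω(k) discussed in the next section.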

Noise-Signal Ratio
The noise is categorized into multiplicative noise and additive noise. Here, we only take additive noise into consideration, which is consistent with the nature of many processes. This means x(k) = x*(k) + ω(k) for any time k, where x(k) is the observed system state, x*(k) is the real data without noise and ω(k) is the noise.
Define a noise-signal ratio δ_i as in Equation (2), where x_i(k) and x*_i(k) are the i-th components of the measured data and the real data at sampling time k, respectively, and l is the length of the data series. Further, an integer noise-signal ratio δ of the data series {x(k)} is defined as in Equation (3). There are three factors that affect the noise-signal ratio δ_i for a given n-dimensional data series: the measured data {x_i(k)}, the real data {x*_i(k)} and the length l. From a statistical viewpoint, l must be long enough to reveal the features of the data series. This means that a long time is spent collecting the sample data. If a short detection time is pursued, the length l should be shorter. It is evident that when l becomes shorter, the noise has a greater effect on the statistical character of the measured data series. There is a compromise between accuracy and speed.
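As a rough illustration of how such a ratio behaves, the sketch below uses one plausible definition, the noise energy divided by the signal energy over a series of length l, and averages the per-component ratios to obtain an overall value. Both choices are assumptions made here for illustration; the exact forms of Equations (2) and (3) are not reproduced in this text.

```python
def nsr_component(measured, clean):
    """Assumed NSR of one component: sum of squared noise over sum of squared signal."""
    noise_energy = sum((x - s) ** 2 for x, s in zip(measured, clean))
    signal_energy = sum(s ** 2 for s in clean)
    if signal_energy == 0.0:
        raise ValueError("signal energy is zero; the ratio is undefined")
    return noise_energy / signal_energy

def nsr_overall(measured_components, clean_components):
    """Assumed overall NSR: mean of the per-component ratios."""
    ratios = [nsr_component(m, c)
              for m, c in zip(measured_components, clean_components)]
    return sum(ratios) / len(ratios)

# Hypothetical series of length l = 3 for a single component.
clean = [1.0, 2.0, 2.0]
measured = [1.1, 1.9, 2.2]
delta_1 = nsr_component(measured, clean)
```

With a shorter series, a single noisy sample changes the ratio much more strongly, which is the accuracy-versus-speed compromise described above.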

Reinforcement Learning Method
Reinforcement learning, which is motivated by statistics, psychology, neuroscience and computer science, is a powerful tool for dealing with uncertain surroundings by interacting with the environment. Following [22,24,25], the basic theory and methods of reinforcement learning are briefly introduced here. The basic frame of reinforcement learning is shown in Figure 2 [24]. An agent receives an evaluation of its good or bad behaviour from the environment and learns through experience, without a teacher who shows it how to perform the task. In every single training session, named an episode, the agent explores the environment by changing the action u_i and receives the state x_i and the reward R_i. The purpose of the training is to enhance the 'brain' of the agent. The goal of an agent is to maximize the reward ∑ R_i that is received in the long run.
Consider a Markov decision process MDP(X, U, P, R), where X is a set of states and U is a set of actions or controls. The transition probabilities P : X × U × X → [0, 1] represent, for each state x ∈ X and action u ∈ U, the conditional probability P(x(k + 1), x(k), u(k)) = Pr{x(k + 1) | x(k), u(k)} of transitioning to state x(k + 1) ∈ X when the MDP is in state x(k) and takes action u(k). The cost function R : X × U × X → ℝ is the expected immediate cost R_k(x(k + 1), x(k), u(k)) paid after the transition to state x(k + 1) ∈ X, given that the MDP starts from state x(k) ∈ X and takes action u(k) ∈ U. The value of a policy V^π_k(x(k)) is defined as the conditional expected value of the future cost E_π{∑_{i=k}^{k+T} γ^(i−k) R_i}, with R_i ∈ ℝ, when starting in state x(k) at time k and following policy π(x, u). One can further obtain Equation (4), where T = ∞. It is noted that T = ∞ means that the Markov decision process is long enough to show its essential characteristics according to the statistical law. If it is too short, V^π_k(x) is prone to inaccuracy because of the few data. In practical applications, a sufficiently large length l is usually used instead of ∞. Equation (4) shows that the value function V^π_k(x) for the policy π(x, u) satisfies the Bellman equation [29] (Equation (5)). Therefore, the optimal actions can be obtained by alternating the policy evaluation and policy improvement according to Equations (6) and (7), where γ is a discount factor with 0 ≤ γ < 1 to ensure convergence. For a deterministic system, Equations (6) and (7) are rewritten as Equations (8) and (9). It is stressed that x(k + 1) is only a temporary expected state in the process of alternating the policy evaluation and policy improvement, which is used to evaluate the cost R_k(x(k + 1), x(k), u(k)). The policy improvement (9) is usually obtained by using the greedy method [24], which pursues a better policy at each iteration.

Remark 1.
There is only state information in Equations (8) and (9). One can obtain the optimal action using only the two states x(k) and x(k + 1) in the process of minimizing the goal R_k. No additional time is needed to collect more data, and the past information does not need to be known.
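The alternation of policy evaluation and greedy policy improvement for a deterministic system can be sketched on a toy problem. The example below is a generic textbook-style policy iteration with costs (hence the minimum), not the authors' implementation; the five-state line world, the action set and the cost |x(k + 1) − 2| are hypothetical.

```python
def policy_iteration(states, actions, next_state, cost, gamma=0.9, sweeps=200):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    policy = {s: actions[0] for s in states}
    while True:
        # Policy evaluation: iterate V(s) = cost(s, u) + gamma * V(next state).
        V = {s: 0.0 for s in states}
        for _ in range(sweeps):
            V = {s: cost(s, policy[s]) + gamma * V[next_state(s, policy[s])]
                 for s in states}
        # Policy improvement: pick the action with the smallest one-step lookahead cost.
        improved = {
            s: min(actions,
                   key=lambda u, s=s: cost(s, u) + gamma * V[next_state(s, u)])
            for s in states
        }
        if improved == policy:
            return policy, V
        policy = improved

# Hypothetical deterministic system: positions 0..4, target position 2.
states = [0, 1, 2, 3, 4]
actions = [-1, 0, 1]

def move(s, u):
    return min(max(s + u, 0), 4)

def cost(s, u):
    return abs(move(s, u) - 2)   # cost depends on the state reached

policy, V = policy_iteration(states, actions, move, cost)
```

The resulting policy moves every state toward the target and stays there, using only the current state and the state it leads to.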

Fault-Free Scenario
One can obtain the estimated equation of System (1) as Equation (10), where x̂(k + 1) is an estimated value of x(k + 1), and θ̂_1, ..., θ̂_n are the vector components of θ̂. If there are enough data in a data series of length l, the parameter θ̂ can be obtained by using the least squares method (LSM) [30] according to Equation (11), where the subscripts k and k + 1 are the sampling time instants and l is the length of the data series. The accuracy of θ̂ is further improved online by the recursion of Equation (12) with new data x_{k+1,l}, where P is an auxiliary matrix with P_0 = βI for some large positive constant β, and θ̂_{k+1} is the estimated parameter improved by adding the new data. Goodwin and Sin [30] showed that LSM converges asymptotically to the true parameters if θ is fixed and φ(k) satisfies the persistent excitation condition of Equation (13) for all N ≥ N_0, where 0 ≤ ε_0 and N_0 is a positive number. This indicates x* = x̂ in the sense of the LSM. Here, x* is the real data without noise, and x̂ is the value estimated by using LSM.
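The recursive refinement can be illustrated with the standard recursive least squares update, initialised with P_0 = βI as described above. This is the textbook form and may differ in detail from Equation (12); the scalar output, the true parameters and the input sequence are hypothetical.

```python
def rls_update(theta, P, phi, y):
    """One standard recursive least-squares step for y ~ theta . phi."""
    d = len(phi)
    P_phi = [sum(P[i][j] * phi[j] for j in range(d)) for i in range(d)]
    denom = 1.0 + sum(phi[i] * P_phi[i] for i in range(d))
    gain = [v / denom for v in P_phi]
    error = y - sum(theta[i] * phi[i] for i in range(d))
    new_theta = [theta[i] + gain[i] * error for i in range(d)]
    # P <- P - gain * phi^T P (P is symmetric, so phi^T P equals P_phi transposed)
    new_P = [[P[i][j] - gain[i] * P_phi[j] for j in range(d)] for i in range(d)]
    return new_theta, new_P

beta = 1e6                               # large positive constant for P_0 = beta * I
theta_hat = [0.0, 0.0]
P = [[beta, 0.0], [0.0, beta]]
true_theta = [0.8, 0.5]                  # hypothetical parameters
x = 0.0
for k in range(50):
    u = 1.0 if k % 3 == 0 else -0.5      # varying input keeps phi exciting
    phi = [x, u]
    y = true_theta[0] * x + true_theta[1] * u   # noise-free for clarity
    theta_hat, P = rls_update(theta_hat, P, phi, y)
    x = y
```

A constant input would violate the persistent excitation condition, and the two parameters could then not be separated.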

Fault Scenario
It is assumed that the change from normal to faulty operation does not affect the noise distribution and intensity. A model of the data series subjected to a fault ω_f is described by Equation (14), where θ_f ∈ R^(Dn+m) is the coefficient vector after the fault, ω(k) is the noise, which is the same as in the fault-free case, and ω_f is an unexpected fault. One can obtain θ̂_f by applying the least squares method again if there are enough valid data. The estimated model subjected to faults is given by Equation (15). Substituting (10) and (15) into (2), the noise-signal ratios of the fault-free case, δ_i, and of the faulty case, δ_f,i, are given by Equations (16) and (17). The integer noise-signal ratios of the fault-free case, δ, and of the faulty case, δ_f, are obtained by substituting (10) and (15) into (3), which gives Equations (18) and (19).
Remark 2. The noise-signal ratios δ_f,i and δ_f subjected to fault have a form similar to the fault-free noise-signal ratios δ_i and δ. One can get θ̂_i by the LSM because there are enough valid fault-free data. However, this is impracticable for θ̂_f,i in the early stage of a fault due to the lack of effective data within the limited time.
The noise-signal ratio of a data series with given dimension n and length l is related to three factors: the current measured data {x(k)}, the parameter θ̂_i and the historical inputs φ(k − 1), in either the faulty or the fault-free condition. When l = 1, Equation (19) becomes Equation (20). The noise-signal ratio δ_f(k) of a single sample x_f,i(k) is determined by the input φ_f(k − 1) and the corresponding parameter θ̂^T_f,i at sample k. Conversely, one can get θ̂^T_f,i at sample k from δ_f(k) when x_f,i(k) and φ_f(k − 1) are known.

The Relation between Noise-Signal Ratio and Parameter
Theorem 1. For a data series {x(k), k = 0, ..., l}, the following conclusions are obtained if it is written in the form of Equation (1):
1. Different ω_f induce different θ_f;
2. The same θ_f causes the same noise-signal ratio δ_f;
3. Different θ_f incur different noise-signal ratios δ_f,i.
Proof. 1. For a measured data series {x(0), ..., x(l)} subjected to fault and noise, it can be described by: where θ_f is the parameter obtained by LSM. For a fault denoted by ω_f1(k), the data series can be written as: For a fault denoted by ω_f2(k), we are not sure whether the fault will change the parameter θ_f. Therefore, the data series can be written as: where the subscripts f1 and f2 are used to distinguish the data and parameters under different faults.
It is noted that we discuss the data properties of a measured data series. As a result, we can have a relation leading to ω_f1 = ω_f2, which is a contradiction. As a result, we can have θ_f1 ≠ θ_f2.
2. According to the definition of Equation (2), we have: For a measured data series {x(0), ..., x(l)}, it is obvious that δ_f1,i = 0 and δ_f2,i = 0. Therefore, the same θ_f causes the same noise-signal ratio δ_f,i. Further, it results in the same integer noise-signal ratio δ_f.
3. Hypothesis 1. Different θ_f have the same noise-signal ratio δ_f, which means δ_f1 = δ_f2. Observe that: which is equivalent to: Rearranging the equation above, we have: by the method of completing the square. Let: Further, let: Note h3 = 0 besides x_1(k) = 0 or ω(j − 1) = 0 (scarcely): (x_1(j) = x_2(j)) for the same series data. If h = 0, there is h1 = 0 and h4 = 0. Further: Notice that θ̂^T_f2 and θ̂^T_f1 are fixed by LSM and φ(k − 1) is a vector from measured data, but it is uncertain for all k. There is no other vector that satisfies this equation except θ̂^T_f1 = θ̂^T_f2, which is contrary to the hypothesis.
Remark 3. The above analysis reveals the relationship between the parameters θ_f and the noise-signal ratio δ_f,i. One can eliminate the influence of noise to the greatest extent by adjusting the parameter θ_f with the target of minimizing the NSR of the data series. Once the parameter θ_f is determined, the model is used to forecast the next state x̂_{k+1} without noise. Therefore, the measured state x_{k+1} can be judged immediately according to the noise law, based on the model prediction x̂_{k+1}.
The parameters θ_f can be estimated by traditional methods such as LSM and maximum likelihood estimation (MLE) based on the historical numerical data. Window technology is used to reduce the computational load, and a sliding window is employed to capture the time-varying parameters of the dynamic system. The statistical characteristics depend on the data in the window. A longer window, which includes more data, gives higher accuracy, but needs more time to reach a decision. A shorter window, which contains fewer data, gives a quicker decision, but it still needs enough data to satisfy the statistical law.
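The delay introduced by a window can be seen in a small sketch. It assumes a scalar parameter, a constant regressor and noise-free data in which the parameter jumps from 0.5 to 0.8 at sample 100; all of these values are hypothetical and chosen only to show the effect.

```python
def sliding_window_estimates(phis, xs, window):
    """Least-squares estimate of a scalar parameter over each full window."""
    estimates = []
    for end in range(window, len(xs) + 1):
        p = phis[end - window:end]
        x = xs[end - window:end]
        estimates.append(sum(a * b for a, b in zip(p, x)) / sum(a * a for a in p))
    return estimates

# Hypothetical noise-free data: the parameter jumps from 0.5 to 0.8 at sample 100.
phis = [1.0] * 200
xs = [0.5 if k < 100 else 0.8 for k in range(200)]
estimates = sliding_window_estimates(phis, xs, window=50)

lagging = estimates[105 - 50]   # window covers samples 55..104: only 5 post-change samples
settled = estimates[-1]         # window covers samples 150..199: all post-change
```

Five samples after the change, the windowed estimate has moved only a tenth of the way toward the new value; it needs a full window length to settle.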

Seeking θ f by the Reinforcement Learning Method
Engineering systems are subject to faults or malfunctions due to unexpected events, which degrade the operating performance and may even lead to operation failure. As a result, the fault should be detected quickly, and measures should be taken as early as possible. The greatest difficulty is the lack of enough valid data for an early fault. Reinforcement learning provides a way to estimate the parameters directly by driving the noise-signal ratio δ_f of the faulty case toward the noise-signal ratio δ_h of the healthy (fault-free) case.
To apply reinforcement learning, the first step is to determine the cost function R_k(δ_f(k)) at time k. Here, the cost function R_k(δ_f(k)) at time k is defined as the absolute value of the error between the current integer noise-signal ratio δ_f(k) and the integer noise-signal ratio δ_h of the fault-free case:
$$R_k(\delta_f(k)) = |\delta_f(k) - \delta_h| \tag{38}$$
where δ_h is the integer noise-signal ratio of the fault-free case, which is obtained offline according to Equation (18), | · | is the absolute value and the meanings of the other parameters are the same as before.
The function V_k(δ_f(k)) after time k is defined by Equation (39). As a result, one has Equation (40). Following the Bellman optimality principle, the optimal value function is obtained according to Equation (41), where V*(δ_f(k)) and θ_f(k) are the optimal value function and the parameter at time k, respectively, and γ is a discount factor, 0 ≤ γ < 1.
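For orientation, one plausible form of these three relations, written to be consistent with the discounted cost E_π{∑ γ^(i−k) R_i} introduced earlier and with the cost R_k(δ_f(k)) defined above, is sketched below. These are standard discounted-cost expressions assumed here for illustration; they are not copied from Equations (39)-(41) and may differ from them in detail.

```latex
% Assumed standard forms, not the paper's exact equations
\begin{align}
V_k(\delta_f(k)) &= \sum_{i=k}^{\infty} \gamma^{\,i-k} R_i(\delta_f(i)), \\
V_k(\delta_f(k)) &= R_k(\delta_f(k)) + \gamma\, V_{k+1}(\delta_f(k+1)), \\
V^{*}(\delta_f(k)) &= \min_{\theta_f(k)} \Big[ R_k(\delta_f(k)) + \gamma\, V^{*}(\delta_f(k+1)) \Big].
\end{align}
```

The second line follows from the first by splitting off the i = k term, and the third takes the minimum over the parameter θ_f(k) because R is a cost.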
It is noted that (41) cannot be used online because one cannot know the information of the future time instant, that is, δ_f(k + 1). The Q-algorithm proposed by Watkins [23] provides an effective solution by substituting the Q-function. In imitation of the Q-algorithm, the evaluation function Q(δ_f(k), θ_f(k)) is defined as the minimum discounted cumulative reward that can be achieved starting from δ_f(k) with θ_f(k) as the first action (Equation (42)), where ϕ(δ_f(k), θ_f(k)) expresses the state δ_f(k + 1) that results from δ_f(k) and θ_f(k), that is, δ_f(k + 1) = ϕ(δ_f(k), θ_f(k)). The notation ϕ(δ_f(k), θ_f(k)) is used to stress the relation between δ_f(k + 1) and δ_f(k), θ_f(k). If Q achieves its optimum under some parameter θ_f(k), the function V also achieves its optimum with the same parameter. As a result, V may be replaced by Q. This implies that the optimal parameter can be obtained from the reward alone, without using the value function V.
Denote the optimum of Q as Q*; therefore, one has Equation (43), where the superscript * expresses the optimal values. It is seen from Equation (43) that Q and V achieve their optima with the same parameter. Therefore, the optimal parameter θ_f(k) can be obtained by policy iteration, which alternates two processes, policy evaluation and policy improvement, following Equations (44) and (45), where π_k is called a policy in reinforcement learning. Policy iteration finally converges to the steady state, which yields the corresponding parameter.
It is important for policy iteration to be convergent. Fortunately, this is guaranteed by Lemma 1.

Lemma 1 ([21]). Consider a Q learning agent in a deterministic Markov decision process (MDP) with bounded reward for all δ_f(k) and θ_f(k). The Q learning agent uses the training rule of Equation (46), initializes its table Q̂(δ_f(k), θ_f(k)) to arbitrary finite values and uses a discount factor γ such that 0 ≤ γ < 1. Let Q̂_n(δ_f(k), θ_f(k)) denote the agent's hypothesis following the n-th update. If each state-action pair is visited infinitely often, then Q̂_n(δ_f(k), θ_f(k)) converges to Q(δ_f(k), θ_f(k)) as n → ∞ for all δ_f(k), θ_f(k).
Remark 4. Lemma 1 provides a guarantee on the convergence of Q learning. By using policy iteration, the Q learning agent will finally converge to the steady state, and the optimal control π*(δ_f(k), θ_f(k)) can be obtained readily.

Procedure 1:
The RL algorithm can be summarized as follows:
Step 2: Select a parameter θ_f(k) randomly.
Step 4: Get the new state δ_f(k + 1) and compute the value function according to Equation (40).
Step 6: Set the next state δ_f(k + 1) as the current state δ_f(k).
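A minimal tabular sketch of such a learning loop is given below. It uses the standard deterministic Q-learning update with a minimum (because R is a cost) and a deliberately simplified toy setting: the parameter is scalar and restricted to a small candidate grid, the state reached is identified with the candidate just chosen, and the single-sample NSR is assumed to be the squared residual over the squared prediction. The grid, the data pair and δ_h are hypothetical, and none of this is the authors' implementation.

```python
import random

def single_sample_nsr(theta, phi, x):
    """Assumed single-sample NSR: squared residual over squared prediction."""
    prediction = theta * phi
    return (x - prediction) ** 2 / prediction ** 2

def q_learning(costs, gamma=0.9, episodes=2000, steps=20, seed=0):
    """Tabular Q-learning for a deterministic toy MDP in which choosing
    candidate a leads to state a and incurs costs[a]."""
    rng = random.Random(seed)
    n = len(costs)
    Q = [[0.0] * n for _ in range(n)]
    for _ in range(episodes):
        state = rng.randrange(n)              # random initial state
        for _ in range(steps):
            action = rng.randrange(n)         # select a parameter randomly
            next_state = action               # deterministic transition
            # deterministic training rule; min instead of max because R is a cost
            Q[state][action] = costs[action] + gamma * min(Q[next_state])
            state = next_state                # next state becomes current state
    return Q

candidates = [0.40, 0.45, 0.50, 0.55, 0.60]   # hypothetical grid for theta_f(k)
phi_k, x_k = 2.0, 1.0                         # hypothetical regressor and measurement
delta_h = 0.01                                # hypothetical healthy NSR
costs = [abs(single_sample_nsr(t, phi_k, x_k) - delta_h) for t in candidates]

Q = q_learning(costs)
best_theta = candidates[min(range(len(candidates)), key=lambda a: Q[0][a])]
```

In this toy setting the learned table selects the candidate whose single-sample NSR lies closest to δ_h, rather than the one with zero residual, which mirrors the idea of matching the healthy noise level instead of fitting the noise.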

Detection of Fault
Based on the parameters θ*_f(k), we can get the next estimated state x̂_f(k + 1) according to Equation (15). Therefore, we have a chance to judge the new measured data x_f(k + 1) immediately by taking x̂_f(k + 1) as a criterion.
The state x_f(k + 1) with fault is made up of three parts: the real state x*(k + 1) that is fault free, the component from the fault ω_f and the component from the noise ω. We treat the first two items as a whole and denote them as the real data x*_f(k + 1) of x_f(k + 1). Considering that the parameter θ*_f is obtained by pursuing the goal of minimizing the noise-signal ratio, Equation (15) implies noise minimization when forecasting the state at the next time k + 1. Therefore, x*_f(k + 1) is obtained from θ*_f according to Equation (15). We get the estimated state x̂_f(k + 1) at time k + 1 in the case of fault according to Equation (47), where θ*_f(k) is the parameter at time k obtained from the RL algorithm, T is the transpose and e = [e_1, e_2, ..., e_n]^T is the confidence interval of the noise ω at confidence level α, given by Equation (48), where D_i is the variance of the i-th component of the samples, which is obtained offline from the fault-free data series. The above analysis shows that one can forecast x̂_f(k + 1) in a noisy condition using only φ(k) during one sampling period. This is valuable for detecting faults promptly.
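One common way to turn an offline variance D_i into an interval half-width is sketched below. It assumes Gaussian noise and a two-sided interval e_i = z_{1−α/2}·√D_i, with α interpreted as the significance level; this is an assumption made for illustration and may not match Equation (48). The residual values are hypothetical.

```python
from statistics import NormalDist, variance

def noise_half_width(residuals, alpha=0.05):
    """Half-width of a two-sided Gaussian interval from fault-free residuals.
    Assumed form: z_{1 - alpha/2} * sqrt(sample variance)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z * variance(residuals) ** 0.5

# Hypothetical fault-free residuals of one state component.
residuals = [-1.0, 0.0, 1.0]
e_i = noise_half_width(residuals)
```

A smaller α widens the interval, which lowers the false-alarm rate at the price of sensitivity to small faults.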
Define the Euclidean distance (ED) between the measurement x_f(k + 1) and the estimate x̂_f(k + 1) as:
$$ED(k+1) = \sqrt{\sum_{i=1}^{n} \left(x_{f,i}(k+1) - \hat{x}_{f,i}(k+1)\right)^2} \tag{49}$$
where x_f(k + 1) ∈ R^n and x̂_f(k + 1) ∈ R^n are the measured data and the estimated data at time k + 1 under the fault, respectively. The threshold of ED, ED_sh, is selected as the maximum error between the measured data and the estimated data in the fault-free case (Equation (50)). One can detect a fault if the ED exceeds this threshold, as expressed by Equation (51). Once a fault is detected, the fault-free parameters are kept unchanged in order to build a virtual healthy model. Meanwhile, the parameters subject to fault continue to be renewed by the proposed RL method and forecast the next state under fault. In this condition, the ED becomes an indicator of fault degree (IFD). Therefore, we get Equation (52) by replacing x_f(k + 1) for the fault with x(k + 1) for the fault-free case in Equation (49). Here, the estimate x̂_f,i(k + 1) obtained by minimizing the NSR is used instead of the measurement x_f,i(k + 1) in order to reduce the effect of noise. We use IFD(k + 1) to express the severity of the fault at k + 1, so that the fault degree can be evaluated in time and measures can be taken to balance the safety and efficiency of the plant.
Remark 5. One can detect a fault and evaluate the fault degree promptly during one sampling period according to Equations (51) and (52).
The forecast of the states at k + 1 is valid under both faulty and fault-free conditions because the parameters of the reference model are essentially obtained by minimizing the noise-signal ratio.
This method only makes use of the residual and the noise-signal ratio, so the fault-free condition is easy to identify. Meanwhile, it has the ability to track an unexpected fault by adjusting the parameters online.
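The threshold test described in this section can be sketched as follows. The threshold is taken as the largest distance between fault-free measurements and their estimates, and a new sample is flagged when its distance from the reference exceeds that threshold. Comparing the fault estimate against a healthy reference to obtain a severity value follows the description above, but the function names and all numbers are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two state vectors of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def ed_threshold(measured, estimated):
    """Largest distance between fault-free measurements and their estimates."""
    return max(euclidean_distance(m, e) for m, e in zip(measured, estimated))

def detect(reference, estimate, threshold):
    """Return (fault flag, distance); the distance serves as a severity indicator."""
    distance = euclidean_distance(reference, estimate)
    return distance > threshold, distance

# Hypothetical fault-free data used to set the threshold offline.
healthy_measured = [[1.00, 0.30], [1.01, 0.31], [0.99, 0.29]]
healthy_estimated = [[1.00, 0.30], [1.00, 0.30], [1.00, 0.30]]
threshold = ed_threshold(healthy_measured, healthy_estimated)

# Hypothetical new sample: the second state has drifted by 0.2.
is_fault, ifd = detect([1.00, 0.30], [1.00, 0.50], threshold)
```

Because the decision uses a single new sample, it is available within one sampling period, at the cost of relying on a threshold that is only as good as the fault-free data behind it.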

Procedure 2:
The fault detection and fault severity evaluation procedure is given as follows:
Step 1. Get the next real state x*_f(k + 1) without noise based on the parameters θ*_f(k) from Procedure 1 according to Equation (15).
Step 2. Compute the variance of the i-th component of the samples from the fault-free data series according to Equation (48).
Step 3. Get the estimated state x̂_f(k + 1) according to Equation (47).
Step 4. Get the measured data x_f(k + 1).
Step 6. Compute the threshold of ED according to Equation (50).
Step 7. Perform fault detection and get the fault severity degree according to (51) and (52).
Step 8. Go to Step 1 to check the next state.

Examples and Simulations
In this section, simulation results based on a DC-motor are presented to verify the efficacy of the proposed scheme. Figure 3 shows the topology of the DC-motor test bed. The DC-motor is a Model 57BL90-210 rated at 24 V, 1000 rpm and 60 W. The rotary encoder is an LPD3806-600BM. The integrated driver is an improved ZD-6405 that provides forward and reverse rotation with a toggle switch and speed governing with a 0-5 V control voltage. It also provides armature current detection and protection against short circuit, undervoltage and overload. The DC-motor is driven by the integrated driver under the control of an STM32 single-chip microcomputer. The STM32 controller receives the DC-motor speed collected by the rotary encoder and the armature current from the integrated driver and, meanwhile, outputs the driver control voltage according to the control approach. The controller is programmed on the Keil3.0 platform through the JTAG (Joint Test Action Group) interface, and the data are transmitted to the computer online in order to save memory. The computer has an i5-2320 CPU at 3.0 GHz and 32 GB of RAM. MATLAB 2011 is used to run the method and to share the data from the controller by data/file exchange technology. We add white noise to the data from the sensor before they are transmitted to the computer in order to strengthen the noise's effects. The test bed of the DC-motor is shown in Figure 4.
A fault-free time series is produced from the DC-motor system. The estimated first-order model of the system is obtained by LSM and has passed the statistical test at the significance level of 0.05 in the healthy condition (Equation (53)).
In the fault experiment whose results are shown in Figure 5, the blue curve, the red curve and the green curve are the fault-free data, the measured data subject to fault and the data estimated by the proposed method, respectively. When the fault occurs, the system responds to the fault after two sampling periods due to inertia. State x1 conforms to the healthy state (blue curve) because this fault has little influence on it. State x2 begins to deviate from the blue curve at Sample 203 and rises to 0.5 after seven sampling periods. A new steady state, with a stable bias relative to the healthy state (blue curve), is reached once the system response to the fault settles. The estimated data (green curve) of the RL method are obtained by immediately adjusting the model parameters while minimizing the NSR. One can see that the green curve coincides with the red curve both before and after the fault occurs.
In order to compare with the sliding window method (SLW), we determine a sliding-window estimate of θ in place of θ_LSM, with a window width of l = 50. The result is shown in Figure 5 as the black curve. The black curve shows that State x2 has a tendency similar to the green curve, except with a delay. During the healthy stage, both the SLW and RL methods perform well in tracking the measured data (red curve), and the SLW fluctuates less than the RL. When a fault appears, the SLW experiences a transient process similar to the green curve, rising from 0.3 to 0.5 after about 25 sampling periods rather than immediately. This means the SLW has a longer delay in responding to the fault. The SLW method is good for a healthy process that has stable statistical indicators. When a fault occurs, the statistical indicators of the data series move to a new stable state that fits the fault after undergoing a transition. This process depends on the fault type and intensity. Therefore, the SLW method cannot avoid the delay, because it must collect enough data to change the statistical indicators over its window length. The judgement can be sped up by shortening the window length. However, if the window length is too small, the statistical indicators become unstable because the data in the window cannot express the features of the data series. Our proposed RL method makes up for this shortcoming.
We also show a training process of minimizing the noise-signal ratio by reinforcement learning in Figure 6. The horizontal and vertical coordinates represent the episodes and the corresponding NSR, respectively. The discount factor γ is 0.95. Beginning with a randomly selected parameter θ_f(k) (as in Procedure 1), the NSR converges after training for 8300 episodes, and one gets the required parameter θ_f(k) at convergence.

Fault Detection
A comprehensive fault signal ω_f combining a step, a sine and a slope is added to State x2 in order to verify the fault diagnosis and detection ability of the proposed RL method. The fault signal is generated according to Equation (54) and shown in Figure 7.
The state x̂_f(k + 1) at time k + 1 is estimated based on the observation φ_f(k) at time k according to Equation (15), in which θ_f is obtained by the proposed RL approach. The evolution of the states from k = 100 to k = 1000 is shown in Figure 8. The blue curve, the red curve and the green curve are the fault-free data, the measured data and the estimated data, respectively. It is seen from Figure 8 that the estimated data (green curve) coincide with the measured data (red curve) throughout the different faults. In fact, the green curve is an estimate based on the measured data at the previous moment by using the proposed RL approach. It is produced one sampling period earlier than the red curve. We also compute the errors between measurement and estimation according to Equation (49) in order to show the accuracy. The mean errors of x1 and x2 between the measured data and the estimated data are 0.05 and 0.02, respectively, and the maximum errors are 0.25 and 0.15. The result is seen in Figure 9. If the fault-free data are taken as a reference and the fault degree is expressed with the IFD according to Equation (52), the threshold of ED is obtained in the fault-free condition from the healthy data of Samples 1-200 by Equation (50), giving ED_sh = 0.0286. Then, we compute the IFD at every sampling time according to Equation (52). The results are shown in Figure 10. The blue curve and the red curve are the indicator of fault degree (IFD) and the threshold of ED, respectively. Figure 10 shows that the IFD in the fault-free stage is below the threshold. During the fault process, the IFD fluctuates within a limited range and is above the threshold, except for some samples that are close to the healthy data.
We can also read the fault severity at every sample from the magnitude of the IFD. For example, the IFD from Sample 200 to Sample 300 stays between 0.05 and 0.15, which means the fault is comparatively stable. At each of Samples 320, 380, 440 and 510, a peak appears, indicating a heavy fault with an IFD over 0.3.

Influence of Disturbance
We apply a step disturbance to State x2 by raising the control voltage at Sample 20. The evolution of the states is shown in Figure 11. The blue curve, the red curve and the green curve are the data without disturbance, the measured data and the data estimated by the proposed method, respectively. It is seen that the armature current almost keeps its initial state because there is no load change. The angular velocity (red curve) rises to 0.4 rad in response to this disturbance after a short transition. The proposed method gives an ample estimation (green curve) because the data with disturbance have a larger NSR than those without disturbance over a long enough process. Viewed inversely, an ample estimation is made to recover the NSR of the disturbance-free case according to the proposed method. This shows the robustness of the RL method under disturbance. The proposed method cannot distinguish between faults and disturbances because it makes a decision according to the NSR only. In fact, the disturbance is eliminated by the closed loop of the control system. If the disturbance cannot be removed by the control system due to the fault, it is necessary to handle this disturbance as a special fault in order to keep the plant safe and effective.

Conclusions
Comparing a single sample datum with healthy data is the fastest way to detect a fault. However, this can hardly be achieved because the noise in the sample data disturbs the normal data. From a single collected datum alone, no one can tell whether the discrepancy between the sample data and the healthy data comes from a fault or from noise. The statistical method needs a quantity of valid data; however, these are difficult to obtain in the early stage of an unexpected fault, which leads to a dilemma for prompt FDD. In order to overcome these shortcomings, a reinforcement learning method has been proposed to estimate the model parameter by treating the parameter as a special action. Taking the minimization of the NSR as the goal for the data series, the model parameter can be obtained by applying policy evaluation and policy improvement. This method is able to suppress the influence of noise and to stay consistent with the current situation. Furthermore, FDD has been implemented by evaluating the residual between the real-time process data and the pre-obtained healthy time-series data. The fault can be promptly detected with the help of the threshold from the healthy data series by using only the information within one sampling period.
In the future, further work will distinguish the slight fault signal from healthy data as quickly as possible and apply this method to an engineering-oriented real-time process.

Figure 2. The basic frame of reinforcement learning.

Figure 3. The topology of the DC-motor test bed.

Figure 4. The test bed of the DC-motor.
We do an experiment to test the speediness of fault judgement. The fault signal ω_f with a step of amplitude 0.2 is added to State x2 from Sample 200. The results from Sample 195 to Sample 235 are shown in Figure 5.

Figure 9. The error between measurement and estimation.

Figure 10. Results of fault detection. IFD, indicator of fault degree.

Figure 11. The evolution of states under disturbance.