Hierarchical Deep Recurrent Neural Network based Method for Fault Detection and Diagnosis

A Deep Neural Network (DNN) based algorithm is proposed for the detection and classification of faults in industrial plants. The proposed algorithm has the ability to classify faults, especially incipient faults that are difficult to detect and diagnose with traditional threshold based statistical methods or by conventional Artificial Neural Networks (ANNs). The algorithm is based on a Supervised Deep Recurrent Autoencoder Neural Network (Supervised DRAE-NN) that uses dynamic information of the process along the time horizon. Based on this network a hierarchical structure is formulated by grouping faults based on their similarity into subsets of faults for detection and diagnosis. Further, an external pseudo-random binary signal (PRBS) is designed and injected into the system to identify incipient faults. The hierarchical structure based strategy improves the detection and classification accuracy significantly for both incipient and non-incipient faults. The proposed approach is tested on the benchmark Tennessee Eastman Process resulting in significant improvements in classification as compared to both multivariate linear model-based strategies and non-hierarchical nonlinear model-based strategies.


Introduction
Process faults significantly impact the profit of chemical plants. A fault in a dynamic system is an anomalous variation that results in the deviation of process state variables from their acceptable range of operation [24]. Since the effect of faults often propagates along the process, it is imperative to detect them soon after their occurrence. To mitigate the economic losses resulting from faults, industrial plants are often operated with multiple sensors and control loops that employ these sensors for feedback corrective action. However, in the presence of large process disturbances and manipulated variable constraints, these control schemes are not sufficiently resilient to avoid abnormal operation [10].
There are two major approaches to fault detection and diagnosis (FDD) for industrial process systems, namely active and passive. Most of the work in the area of process systems engineering for FDD is based on passive approaches, where the system outputs are monitored for detecting observable statistical changes. The active approach for FDD involves injecting a persistently exciting input signal of specific bandwidth into the system and using the resulting input-output data for incipient fault detection and diagnosis [19,13,6]. In this work, a blend of both passive and active approaches is used, where the passive approach is shown to be effective for identifying most faults but an active approach is required for detecting incipient faults. A fault is generally referred to in the literature as observable when its occurrence can be observed from a set of measured variables [44]. Observability/diagnosibility is an important aspect of any fault detection and diagnosis problem since the lack of it leads to incorrect detection and misclassification.
Lack of observability often arises due to a low signal to noise ratio in the measurements used for FDD and the presence of feedback control [24]. The purpose of feedback controllers is partly to compensate for anomalous system variations, which can mask the effects of certain faults. In addition, lack of distinguishability arises when different faults have similar effects on the measured variables.
A typical process monitoring system is composed of two parts: a detection algorithm and a classification algorithm.
The objective of detection is to make a binary decision on whether the process is in normal or faulty operation. After detecting abnormal operation, a fault classification algorithm is used to infer the type of fault and to determine which associated process variables are affected by the fault. In the current study, we simultaneously perform detection and classification with a single algorithm by considering the normal operating condition as an additional fault class to be identified in the classification step.
Process monitoring schemes rely on process models that are trained using historical data and are then used to infer faults. Based on the type of model used, the schemes can be classified into two main approaches: mechanistic model-based (e.g. using first principles models) and data-driven model-based approaches [10]. Data-driven models, such as the one used in the current study, compare combinations of different sensor measurements corresponding to normal behaviour with the values of these variables corresponding to faulty operation [55].
Within the class of data-driven approaches, several algorithms have been proposed that are predominantly based on multivariate statistical methods such as PCA (Principal Component Analysis) [57,54,31,43] or their dynamic variants such as DPCA (Dynamic Principal Component Analysis) [10,54,28,39,37]. Since these methods are based on assumptions of process linearity, nonlinear modelling techniques such as ANN (Artificial Neural Network) based methods have been investigated to deal with nonlinear process behavior. Some of the key challenges with the earlier versions of ANN algorithms were related to the difficulty of training large networks and performing complex calculations given the computational limitations that existed when these algorithms were first proposed. In the last decade, a new generation of Deep Neural Network (DNN) algorithms has emerged that capitalizes both on the significant increase in computational power and on novel algorithmic developments that facilitate the training and calibration of these networks. The use of these algorithms for fault detection in the process industry has recently received increased attention. However, despite the improvements in detection accuracy obtained with these techniques for nonlinear problems, the classification of incipient faults remains a challenge. Based on the above facts, the current study focuses on deep learning techniques for the detection of faults with emphasis on incipient faults. Towards this goal, a hierarchical classification strategy based on DNNs is proposed that involves identifying separate models for different subsets of faults with different signal to noise ratio characteristics. The addition of a test signal is also investigated to enhance fault diagnosibility for faults that are particularly difficult to identify. The proposed approach is based on dynamic network models that explicitly exploit the dynamic correlations in the data, i.e. auto-correlations and cross-correlations. In addition, the effect of the data horizon, or time length, is also addressed.
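As a concrete illustration of how dynamic models consume process data over a time horizon, the sliding-window construction below turns a multivariate measurement record into overlapping sequences. This is a minimal numpy sketch; the function name, window length and data shapes are illustrative, not from the paper.

```python
import numpy as np

def make_sequences(X, horizon):
    """Stack a multivariate time series X (T x n_vars) into overlapping
    windows of length `horizon`, the input format used by dynamic
    (recurrent) models to capture auto- and cross-correlations."""
    T, n = X.shape
    return np.stack([X[t:t + horizon] for t in range(T - horizon + 1)])

X = np.arange(20, dtype=float).reshape(10, 2)   # 10 samples, 2 sensors
seqs = make_sequences(X, horizon=4)
print(seqs.shape)  # (7, 4, 2)
```

Each window then serves as one input sequence to the recurrent model, so increasing the horizon lengthens the temporal context available to the network at the cost of fewer, more correlated training samples.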
All studies are conducted with the standard simulated data from the Tennessee Eastman Process (TEP). Since its introduction in process systems research, the TEP has served as a benchmark problem for testing control and fault detection algorithms and it is thus ideal for comparing existing approaches to our proposed algorithm. The following are the main contributions of the current study: 1. Analysis of the effect of the data horizon in the dynamic deep learning model to improve fault classification ability. 2. Development of a hierarchical structure combined with the design of external excitation signals to enhance the detection and classification accuracy for both incipient and non-incipient faults. 3. Comparison of the proposed algorithm to both linear multivariate statistical techniques and other deep learning (DL) based methods previously reported. The manuscript is organized as follows. Different fault detection and classification algorithms that are relevant to the TEP problem are briefly reviewed in Section 2. Section 3 presents the proposed methodology. Section 4 describes the case study. The results and comparisons with previously reported approaches are presented in Section 5, followed by conclusions.

Review of FDD Techniques relevant to the Tennessee Eastman Process (TEP)
The Tennessee Eastman plant has been used widely for testing several process monitoring and fault detection algorithms [10,31,39,52,40,4,29,30]. Thus, the current brief review of detection and diagnosis methods mainly focuses on the TEP, which is also used as the case study in the current work. Some recently reported applications of deep learning algorithms that are relevant to the current study are also included.
Figure 1 shows the flow-sheet of the TEP process, which consists of different interconnected unit operations including a reactor, a vapor-liquid separator, a stripper, a recycle compressor and a condenser. The simulation contains 20 preprogrammed fault scenarios, which are shown in Table 1. Additional details about the process model can be found in the original paper [14], and descriptions of the different control schemes that have been applied to the simulator can be found in [40] and its revised version [4]. Several data-driven statistical process monitoring approaches have been reported for the detection and diagnosis of disturbances in the Tennessee Eastman simulation. Each of these methods has shown a different level of success in detecting and diagnosing the 20 faults considered in the simulations (Table 1). Several statistical studies have reported faults 3, 9 and 15 as unobservable or difficult to diagnose due to the close similarity in the responses of the noisy measurements used to detect these faults [31,43,10,15]. These 3 difficult-to-observe faults are referred to hereafter as incipient faults. It should be emphasized that the responses associated with incipient faults are similar but not identical to each other. Ideally, in a noise-free case with a perfect model available, these three faults could be correctly diagnosed. However, that is not the case in the presence of the noise levels used in the TEP studies, and thus these incipient faults are incorrectly diagnosed most of the time.
Among the techniques used for detection, Principal Component Analysis (PCA) and its variants are widely employed [57,54,31]. PCA is an unsupervised data-driven learning technique based on orthogonal transformations that compresses a multivariate dataset into a lower-dimensional space while conserving the most relevant information [38,22]. Extending the PCA algorithm to enhance FDD involves the use of dynamic information, since PCA captures static correlations only. For this purpose, Dynamic Principal Component Analysis (DPCA) was proposed, which uses a dynamic data matrix to learn dynamic correlations [28]. Although DPCA improved the diagnosis accuracy over the results obtained with PCA on many TEP faults, it did not significantly improve the detection of the incipient faults [10,28,39,37]. Another variant of the PCA approach, which combines the results from the PCA algorithm with a Cumulative Sum (CUSUM) operation, has been shown to be a viable option for the detection of the incipient faults, but a relatively long time after the occurrence of the fault is needed for detection. The reason is that the cumulative sum of PCA score values over a sufficient amount of time can reveal minor changes in the process variables that cannot be detected without the CUSUM operation [43,44].
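The passive PCA-based monitoring idea can be sketched in a few lines of numpy. This is a generic illustration, not the implementation used in the studies cited above: a PCA model is fitted on data from normal operation, and the Hotelling T² statistic of a new sample flags deviations from normal behavior.

```python
import numpy as np

rng = np.random.default_rng(0)
X_normal = rng.normal(size=(500, 6))            # stand-in for normal operating data
mu, sd = X_normal.mean(0), X_normal.std(0)
Xs = (X_normal - mu) / sd                       # standardize before PCA

# PCA via SVD; keep k principal components
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
k = 3
P = Vt[:k].T                                    # loading vectors (6 x 3)
lam = (S[:k] ** 2) / (len(Xs) - 1)              # variances of retained components

def t2(x):
    """Hotelling T^2 statistic of a single sample in the PCA subspace."""
    t = ((x - mu) / sd) @ P                     # project onto the loadings
    return float(np.sum(t ** 2 / lam))

x_fault = mu + sd * (6 * P[:, 0])               # synthetic faulty sample
print(t2(x_fault) > t2(mu))                     # True
```

In practice a control limit for T² is derived from an F-distribution and a companion Q (squared prediction error) statistic monitors the residual subspace; both are omitted here for brevity.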
Typically, using the multivariate statistical algorithms reviewed above, it is possible to differentiate between normal and faulty operation by comparing the values of the normal state with the values of the state in the presence of the fault. Then, by combining the results of these detection algorithms with supervised classification techniques, it is also possible to identify the particular fault occurring in the process. For example, Support Vector Machines (SVM) is a supervised-learning technique based on transforming the input data into a high dimensional space such that the distance between two different classes is maximized [48]. SVM has been applied to the TEP for both detection and diagnosis of faults [29,9,36].
Deep learning fault detection and classification techniques have been widely researched for applications in several engineering fields [2,3]. In chemical engineering, machine learning techniques have been applied for the detection and classification of faults in the Syschem plant, which contains 19 different faults [21], and for the TEP problem. Outside the process industries, several studies on deep learning approaches have been conducted for the prevention of mechanical failures. For example, deep learning models have been used for detecting and diagnosing faults in rotating machinery [25,26], motors [47], wind turbines [58], rolling element bearings [16,18] and gearboxes [27,8]. A few deep learning studies have recently been conducted on the TEP. Similar to linear multivariate statistical approaches, deep learning methods have not significantly improved the diagnosis accuracy for incipient faults as compared to previous studies [52,35,59].
Xie and Bai, 2015 proposed a neural network based methodology as a solution for the diagnosis problem in the Tennessee Eastman simulation that combines the network model with a clustering approach. The classification results obtained by this method were satisfactory for most non-incipient faults but were not satisfactory for incipient faults. Both Wang et al., 2018 [49] and Spyridon et al., 2018 [46] proposed the use of Generative Adversarial Networks (GANs) as a fault detection scheme for the TEP. GANs are an unsupervised technique composed of a generator and a discriminator trained with an adversarial learning mechanism, where the generator replicates the normal process behavior and the discriminator decides whether abnormal behavior is present in the data. This unsupervised technique can detect changes in the normal behavior, achieving good detection rates for non-incipient faults. Lv et al., 2016 [35] proposed a stacked sparse autoencoder (SSAE) structure with a deep neural network to extract important features from the input to improve the diagnosis problem in the Tennessee Eastman simulation. The diagnosis results of this deep learning technique showed improvements compared to other linear and non-linear methods for non-incipient faults. To account for dynamic correlations in the data, Long Short Term Memory (LSTM) units have recently been applied to the TEP for the diagnosis of faults [59]. A model with LSTM units was used to learn the dynamical behaviour from sequences, and batch normalization was applied to enhance convergence. An alternative way to capture dynamic correlations in the data is to apply a Deep Convolutional Neural Network (DCNN) composed of convolutional layers and pooling layers [50]. A DCNN model was constructed to learn the dynamic behaviour of different faults by taking advantage of the spatial (i.e. feature space) and temporal domains. While this deep learning algorithm was shown to achieve high classification rates for non-incipient faults, it was not accurate for incipient faults.
Following the above, the current study investigates a different model structure for detecting and diagnosing faults, with particular focus on the detection and classification of incipient faults, by the following means: (i) extension of the time horizon used in the LSTM network, (ii) the use of a hierarchical structure with separate models for incipient and non-incipient faults, and (iii) the design and injection of an external PRBS excitation signal.

Recurrent Neural Networks (RNNs)
The current study uses a Recurrent Neural Network (RNN) model that was originally developed for handling dynamic data by using time sequences of data x_t, t = 1, 2, …, T, with x_t ∈ ℝ^{n_x×1}, as inputs to the network [42]. Parameters associated with the RNN are shared along the time horizon to capture temporal correlations in the data. This enhances the generalization capability of the model to time sequences that were not used for model calibration. Figure 2 shows a schematic description of a simple recurrent neural network. As shown in Figure 2, recurrent models are composed of feed-forward connections, which represent the flow of information from one neuron to another, and recurrent connections, which capture important information stored from previous time steps.
Figure 2 shows a fully connected recurrent model that produces an output at each time step and contains a recurrent connection in its hidden layer. In this case, the equation of the hidden layer is formulated to account for the recurrent relation as follows:

h_t = σ(W x_t + U h_{t−1} + b)

where σ(·) is the activation function, W and U are the feed-forward and recurrent weight matrices respectively and b is the bias vector. A well-known challenge for training RNNs is the vanishing or exploding gradient problem arising from the use of gradient descent algorithms in combination with sigmoid activation functions [5]. To deal with this problem, the best practice is to use gated unit structures within RNN models, such as Long-Short Term Memory (LSTM) units [20] and Gated Recurrent Units (GRU) [11]. LSTM units are reviewed in the following section since they serve as the basis for the models used in the current study for FDD.
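The recurrent hidden-layer update can be written as a short numpy sketch. The weight shapes and random values below are illustrative assumptions, not the paper's trained model; the point is only that the same W, U and b are reused at every time step.

```python
import numpy as np

def rnn_forward(x_seq, W, U, b):
    """Simple (Elman) RNN: h_t = tanh(W x_t + U h_{t-1} + b).
    The same weights are shared across all time steps."""
    h = np.zeros(U.shape[0])
    hs = []
    for x in x_seq:
        h = np.tanh(W @ x + U @ h + b)   # recurrent state update
        hs.append(h)
    return np.array(hs)

rng = np.random.default_rng(1)
hs = rnn_forward(rng.normal(size=(5, 3)),        # 5 time steps, 3 inputs
                 0.1 * rng.normal(size=(4, 3)),  # feed-forward weights W
                 0.1 * rng.normal(size=(4, 4)),  # recurrent weights U
                 np.zeros(4))                    # bias b
print(hs.shape)  # (5, 4)
```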

Long-Short Term Memory (LSTM) Units
The LSTM unit is composed of three gated units and a memory cell [20]. Figure 2 shows a single LSTM unit that includes four major gating operations: the forget gate (f_t), the input gate (i_t), the output gate (o_t) and the update (candidate) gate (g_t). The key component of the LSTM unit is the memory cell (c_t ∈ ℝ^{n_h×1}), which is responsible for storing critical long term dependencies learned over time. The input gate (i_t) is responsible for evaluating which part, if any, of the past historical data should be kept. Thus the function of the input gate is to allow the network to keep only relevant information from the previous time steps and discard the rest for a sample at time t.
Subsequently, the information that is worth recording is determined by the memory cell (c_t). The process of identifying and storing information in the memory cell consists of two parts: new information that is recorded and information that is discarded. The information that should be discarded from the previous cell state c_{t−1} is determined by the forget gate (f_t), which is responsible for forgetting previously stored cell state values that have lost their relevance.
Then new relevant information is added and existing cell-state values are updated by first selecting which values to update using the input gate i_t; the output from the input gate is then multiplied by the new information generated by the update gate g_t. Ultimately, the output h_t is computed at every time step from the information contained in the memory cell and is further gated by the output gate according to its importance or relevance. The mathematical equations describing these gating operations are as follows:

f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
i_t = σ(W_i x_t + U_i h_{t−1} + b_i)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
g_t = tanh(W_g x_t + U_g h_{t−1} + b_g)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)

where σ(·) and tanh(·) are the element-wise sigmoid and hyperbolic tangent functions respectively, ⊙ denotes element-wise multiplication, W and U are the input and recurrent weight matrices and b are the bias parameters.
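The gating operations translate almost line-for-line into code. The sketch below implements a single LSTM step in numpy; the dimensions, random weights and parameter stacking are hypothetical choices for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the stacked parameters of the
    forget, input, output and update gates (in that order)."""
    z = W @ x + U @ h_prev + b
    nh = h_prev.size
    f = sigmoid(z[0*nh:1*nh])          # forget gate
    i = sigmoid(z[1*nh:2*nh])          # input gate
    o = sigmoid(z[2*nh:3*nh])          # output gate
    g = np.tanh(z[3*nh:4*nh])          # update (candidate) gate
    c = f * c_prev + i * g             # memory-cell update
    h = o * np.tanh(c)                 # gated output
    return h, c

rng = np.random.default_rng(2)
nx, nh = 3, 4                          # illustrative input/hidden sizes
h, c = lstm_step(rng.normal(size=nx), np.zeros(nh), np.zeros(nh),
                 rng.normal(size=(4*nh, nx)), rng.normal(size=(4*nh, nh)),
                 np.zeros(4*nh))
print(h.shape, c.shape)  # (4,) (4,)
```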

Deep LSTM Supervised Autoencoder Neural Network (LSTM-SAE NN)
The encoder compresses the input sequences into a set of latent feature vectors and the decoder function reconstructs the input using the extracted feature vectors. The operation performed by the encoder for a single LSTM layer, mapping the input variables to the latent variables z ∈ ℝ^{n_z×1}, can be mathematically described as:

z = f_enc(x)

where f_enc is the LSTM encoder function. The latent variables are used both to predict the class labels and to reconstruct back the inputs x as follows:

ŷ = softmax(W_y z + b_y)
x̂ = g(W_x z + b_x)

where g is a non-linear activation function applied for the output layer and W ∈ ℝ^{n×n_z} and b ∈ ℝ^n are the output weight matrix and bias vector respectively. For training the SAE, the following loss function is minimized:

L = λ_1 ‖x − x̂‖² − Σ_{c=1}^{M} y_{o,c} log(p_{o,c})

where λ_1 is the weight multiplying the reconstruction loss in the cost to be minimized, M is the number of classes, y_{o,c} is a binary indicator (0 or 1) equal to 1 if class label c is the correct one for observation o and 0 otherwise, ŷ_{o,c} are the non-normalized log probabilities and p_{o,c} is the predicted probability that sample o belongs to class c. Moreover, to avoid over-fitting, a regularization term is added to the objective function in Equation 8. Accordingly, the objective function for the Deep Supervised LSTM NNs used for FDD is as follows:

L = λ_1 ‖x − x̂‖² − λ_2 Σ_{c=1}^{M} y_{o,c} log(p_{o,c}) + λ_3 Σ_l ‖W^{[l]}‖²

where W^{[l]} are the weight matrices for each layer in the network and the weights on the individual objective functions λ_1, λ_2, λ_3 are chosen using validation data.
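The combined objective (reconstruction loss plus classification cross-entropy plus L2 regularization) can be computed as in the numpy sketch below. The weight values lam1–lam3 and all array shapes are illustrative assumptions; in the paper these weights are tuned on validation data.

```python
import numpy as np

def supervised_sae_loss(logits, y_true, x, x_hat, weights,
                        lam1=1.0, lam2=1.0, lam3=1e-4):
    """Weighted sum of reconstruction error, softmax cross-entropy
    and an L2 penalty on the weight matrices."""
    # numerically stable softmax cross-entropy from raw logits
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    xent = -np.mean(log_p[np.arange(len(y_true)), y_true])
    recon = np.mean((x - x_hat) ** 2)            # reconstruction loss
    reg = sum(np.sum(W ** 2) for W in weights)   # L2 regularization
    return lam1 * recon + lam2 * xent + lam3 * reg

# two samples, two classes, perfect reconstruction, no weights to penalize
logits = np.array([[2.0, 0.0], [0.0, 2.0]])
loss = supervised_sae_loss(logits, np.array([0, 1]),
                           np.zeros((2, 3)), np.zeros((2, 3)), [])
print(round(loss, 4))  # 0.1269
```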

Model Structure and Specifications
All models are trained and tested on a 64-bit Windows 10 operating system in a Python environment. The models are developed using Keras [12] (an open-source deep learning library) on the TensorFlow platform [1]. All hyper-parameters, such as the number of LSTM encoder layers, the number of LSTM units in each layer, the loss weights and the learning rate, are optimized using Keras Tuner, which was developed by the Google team and is included in the Keras library.

Hierarchical Structure
The sensitivity of nonlinear models such as deep neural networks is highly dependent on the variability of the data used for calibration. Accordingly, a key data pre-processing step towards model calibration involves data standardization, which is required to account for the different ranges of values and engineering units of the inputs. It is hypothesized that by building separate models for different groups of faults with similar characteristics, it is possible to increase the sensitivity of the models and the diagnosibility between faults because of the specific data re-normalization step conducted within each group. Accordingly, a hierarchical structure is proposed in the current study as shown in Figure 5. In the first level of the hierarchical structure, the normal state condition is grouped along with the incipient faults as one class and is classified against all the other non-incipient faults. Subsequently, in a second step, PRBS signals are injected into the system to distinguish between the different incipient faults and the normal state. This hierarchical structure can be summarized as follows: 1. A first-level hierarchical model is used to identify the non-incipient faults. 2. A second-level hierarchical model focuses on the normal state and the 3 difficult-to-observe faults, i.e. faults 3, 9 and 15. If the incipient faults cannot be properly identified in this second step, the PRBS is injected. It should be noted that the incipient faults are characterized by responses that are very similar to the normal state, and thus a model that is trained to predict all the faults together will be shown to be unable to accurately discern between these responses. It should also be noted that incipient faults that are grouped along with the normal state in the first level may also be misclassified as other faults. Hence, the overall classification accuracy for the incipient faults after the second level is executed has to be re-calculated accordingly.
In the hierarchical structure described in Figure 5, the normalized data is fed to a first-level model where the softmax layer of the LSTM-SAE NN uses 18 units instead of the 21 units used in the non-hierarchical model (the incipient faults and the normal state are grouped as one class). The structure of the model in the second level of the hierarchical structure is similar to that of the first level, with the difference that the softmax layer involves only 4 units, one each for the incipient faults (3, 9, 15) and the normal state (fault 0). The PRBS is injected only when the incipient fault cannot be properly identified with the hierarchical deep RNN based model.
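The two-level routing logic can be sketched independently of the network details. In the toy example below, a nearest-centroid rule stands in for each LSTM-SAE NN model, and all centroids, labels and data points are made up for illustration; only the control flow (level 1 merges normal + incipient, level 2 resolves that merged class) mirrors the structure described above.

```python
import numpy as np

def nearest_centroid(x, centroids):
    """Stand-in classifier: return the label of the closest centroid."""
    labels = list(centroids)
    d = [np.linalg.norm(x - centroids[k]) for k in labels]
    return labels[int(np.argmin(d))]

def hierarchical_predict(x, level1, level2):
    """Level 1 separates non-incipient faults from a merged
    'normal + incipient' class (label -1); level 2 resolves
    that merged class into fault 0 (normal), 3, 9 or 15."""
    y = nearest_centroid(x, level1)
    return nearest_centroid(x, level2) if y == -1 else y

# hypothetical centroids: -1 = merged class, 1 and 2 = non-incipient faults
level1 = {-1: np.zeros(2), 1: np.array([5.0, 5.0]), 2: np.array([-5.0, 5.0])}
level2 = {0: np.zeros(2), 3: np.array([1.0, 0.0]),
          9: np.array([0.0, 1.0]), 15: np.array([-1.0, 0.0])}

print(hierarchical_predict(np.array([0.9, 0.1]), level1, level2))  # 3
print(hierarchical_predict(np.array([4.8, 5.2]), level1, level2))  # 1
```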

Design: Pseudo-random Binary Signal (PRBS)
Although the hierarchical structure proposed in the previous section enhances the diagnosibility of a few faults, detection of incipient faults is still challenging due to the lack of excitation needed to detect these faults in the presence of noise.
This problem is particularly acute in the TEP since the data-set contains variables that are used in closed-loop control and thus exhibit small variation with respect to their set-point values, making it difficult to infer the occurrence of faults from such variables. To increase the diagnosibility of incipient faults, the use of active fault detection, as reviewed in the Introduction, is proposed for the TEP process. The lack of diagnosibility/distinguishability of the incipient faults can be viewed as a problem of inaccurate identification of a model relating variability in measured values to faults.
To improve the identification accuracy it is necessary to use inputs that sufficiently excite the system dynamics in the presence of noise [33], which results in larger changes in the measured quantities and larger sensitivity to fault changes. Thus, it is necessary to introduce additional excitation beyond that available during regular operation of the system.
Accordingly, external forcing signals are injected at particular points of the control loops, e.g. an excitation signal to the set-points of the loops that involve variables related to the difficult to detect faults.The addition of such excitation signals in combination with a separate deep neural network model (second level) in the hierarchical structure described in the previous section is investigated in the current study for detecting and diagnosing incipient faults that cannot be accurately identified with the regular operating data collected from the process.
To avoid a large negative impact of the external signals on the profitability of the plant, the input signals should meet the following constraints:
1. Small input move sizes (to reduce wear and tear on actuators).
2. Small input and output amplitudes, power, or variance.
3. Short experimental time (to prevent losses).
In a practical implementation, the added excitation signal should result in variations in the measured quantities that are large in magnitude relative to the noise. Towards this goal it is necessary to include frequencies lower than the crossover frequency of the closed-loop transfer function [41]. PRBS signals are used as excitation signals in this study since they have a finite length, can be synthesized repeatedly with simple generators and present favorable spectra: the spectrum is flat and approximately constant at low frequencies, while at high frequencies it drops off.
Thus, the PRBS can be designed to have a specific bandwidth, which can be utilized for exciting the process within the required range of frequencies [17]. The analytical expression for the power spectrum of a PRBS is given by:

Φ_u(ω) = (a²(N + 1)T_cl / N) · [sin(ωT_cl/2) / (ωT_cl/2)]²

where ω is the frequency, T_cl is the clock period (the minimum time between a change in levels), which is a multiple of the sampling time T_s, N is the sequence length and a is the amplitude of the signal. Thus, for designing the PRBS signal it is necessary to estimate the amplitude and the frequency range.
Rivera and Gaikwad, 1995, Lee and Rivera, 2005 and Garcia-Gabin and Lundh provided practical guidelines for estimating the range of frequencies needed for closed-loop process identification using time domain information. The primary frequency band of interest for excitation is determined by the dominant time constants of the system and takes the form:

1/τ_dom^{ol} ≤ ω ≤ α/τ_dom^{cl}

where α is a safety factor used to augment the bandwidth of the excitation signal, and τ_dom^{ol} and τ_dom^{cl} are the dominant time constants of the open-loop and closed-loop process respectively. Also, the upper value of the frequency must be lower than the Nyquist frequency to avoid aliasing. Although the magnitude of the signal has not been optimized in the current work, it could be further optimized by taking a profit function of the plant into consideration for minimal losses and using the validation data used for the FDD model.
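A maximum-length PRBS is easy to generate with a linear-feedback shift register. In the numpy sketch below, the register length, feedback taps, amplitude and clock multiple are illustrative choices, not the values used in this work; the clock multiple simply holds each level for several sampling intervals, which sets the clock period T_cl and hence the bandwidth of the signal.

```python
import numpy as np

def prbs(n_bits, amplitude=1.0, clock_mult=1, taps=(7, 6), seed=0b1111111):
    """Maximum-length PRBS from a 7-bit Fibonacci LFSR
    (taps (7, 6) correspond to the primitive polynomial x^7 + x^6 + 1,
    giving a period of 2^7 - 1 = 127 bits). Each level is repeated
    `clock_mult` times so the clock period is a multiple of the
    sampling time."""
    reg = seed
    bits = []
    for _ in range(n_bits):
        bits.append(reg & 1)                       # output bit
        fb = ((reg >> (taps[0] - 1)) ^ (reg >> (taps[1] - 1))) & 1
        reg = (reg >> 1) | (fb << (taps[0] - 1))   # shift in feedback
    u = amplitude * (2 * np.array(bits) - 1)       # map {0,1} -> {-a,+a}
    return np.repeat(u, clock_mult)

u = prbs(127, amplitude=0.5, clock_mult=3)
print(u.shape)  # (381,)
```

Such a signal would then be added intermittently to the selected controller set-point, as described in the next section.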

Results and discussion
In this section, the industrial benchmark TEP is used to validate and demonstrate the effectiveness of the proposed method. We investigated the multi-class classification performance using a total of 20 fault modes, which involve all of the composition, manipulated and measurement variables in the TE process. For an individual class IDV(i), the performance is evaluated by a confusion matrix which consists of true positives (TP_i), false positives (FP_i), true negatives (TN_i) and false negatives (FN_i). The notation used in the confusion matrix is shown in Table 3. Two important metrics for quantifying the performance of the proposed process monitoring methodology are as follows:

Table 3
Confusion matrix for each fault (IDV(i))

                                      Counts of predicted label i    Counts of predicted label other than i
Counts of real label i                TP_i                           FN_i
Counts of real label other than i     FP_i                           TN_i

• Fault Detection Rate (FDR):

FDR_i = TP_i / (TP_i + FN_i)   (fraction of fault data that have been detected as fault)

FDR represents the probability that the abnormal conditions are correctly detected, which is an important criterion to compare different methods in terms of their detection efficiency. Evidently, a very high FDR is desirable.
• False Alarm Rate (FAR):

FAR_i = FP_i / (FP_i + TN_i)   (fraction of normal data that have been detected as fault)

where the class corresponding to faulty operation is considered the positive class. FAR represents the probability that normal operation is wrongly identified as abnormal, and thus a very low FAR is desired and necessary.
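Both metrics can be computed directly from predicted and true class labels by treating every non-normal label as "fault". The numpy sketch below uses made-up labels purely for illustration.

```python
import numpy as np

def fdr_far(y_true, y_pred, normal_label=0):
    """FDR = fraction of faulty samples detected as faulty (TP/(TP+FN));
    FAR = fraction of normal samples flagged as faulty (FP/(FP+TN))."""
    faulty = y_true != normal_label
    flagged = y_pred != normal_label
    fdr = float(np.mean(flagged[faulty]))
    far = float(np.mean(flagged[~faulty]))
    return fdr, far

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])   # 4 normal, 5 faulty samples
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1])
print(fdr_far(y_true, y_pred))  # (0.8, 0.25)
```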
The fault detection results obtained with the RNN based model are compared with both linear multivariate statistical methods and deep learning methods reported in previous studies. For a fair comparison, for studies where only non-incipient faults were considered the results are compared to the fault detection results obtained from the first level of the hierarchical structure model, whereas for studies where all the faults were considered, the comparisons are made with the results obtained from the second level of the hierarchical structure model. The fault detection rate (FDR) for all the faults is compared in Table 5 for the proposed method, PCA [35], DPCA [35], ICA [23], Convolutional NN (CNN) [45], Deep Stacked Network (DSN) [7], Stacked Autoencoder (SAE) [7], Generative Adversarial Network (GAN) [46] and One-Class SVM (OCSVM) [46]. The fault detection rates for all non-incipient faults and for all faults are shown in Tables 4 and 5 respectively for the different methodologies, along with the results from the proposed method. It can be seen from Table 4 that the proposed method outperformed the linear multivariate methods and the other DL based methods for most fault modes. For example, for PCA with 15 principal components, the average fault detection rates are 61.77% and 74.72% using the T² and Q statistics respectively. Since the principal components extracted using PCA capture static correlations between variables, DPCA is used to account for temporal correlations (both auto-correlations and cross-correlations) in the data. The effect of increasing the number of time samples in the Tennessee Eastman simulation is also investigated, following the hypothesis that increasing the time horizon will enhance classification accuracy. In the case of DPCA, the number of lags used in the observation matrix is a key parameter. Since DPCA is only a data compression technique, it must be combined with a classification model for the purpose of fault detection. Accordingly, the output features from the DPCA model are fed into an SVM model that is used for final classification. Different time horizons were tried for training the DPCA model. Based on validation results, the best DPCA model was obtained with 15 lags and thus this model is compared with an RNN also based on 15 lags. The average detection rate obtained was 72.35%. The ICA [23] based monitoring scheme performs better than both the PCA and DPCA based methods with an average accuracy of approximately 90%. It should be noted that all these methods (PCA, DPCA and ICA) perform poorly for detecting incipient faults. In addition to the comparison to linear methods, the proposed methodology was also compared with different DNN architectures reported previously, such as CNN [7], DSN [7], SAE-NN (results reported in Chadha and Schwung, 2017), GAN [46] and OCSVM (results reported in Spyridon and Boutalis, 2018). It can be seen that the proposed method also outperforms these DNN based methodologies. The relative advantage of our method versus these other DNN architectures (Table 4) is mostly due to the inclusion of the incipient faults within the normal class, which reduces the confusion between the normal samples and the incipient faults. The time horizon giving the best performance on validation data was chosen as the optimal time horizon. The confusion matrix for the level 1 model is presented in Figure 7.
The next important design parameter for the second-level hierarchical model is the location in the process at which the external excitation signal should be introduced to maximize information about the occurring incipient fault. In this work, this choice is based on the flow-sheet and on identifying which variables are most correlated to the incipient faults under consideration. Specifically, the excitation signals were added to process set-points in control loops that are most correlated to the incipient faults. When the selection of the variable to be excited by a PRBS is not obvious from the process flow-sheet, a more systematic approach is to use sensitivity analysis, e.g. the sensitivity of all process variables to changes in the variable connected to the fault. Since it may be detrimental to perturb the set-point continuously, the PRBS signal can be introduced intermittently into the process. In the current work, an excitation signal of length 40 time-steps was introduced intermittently every 4 hours into the process (for test data), assuming that such an event will not significantly impact the profitability of the process. Changes in the separator temperature set-point force changes in the condenser temperature. Since the fault to be identified is stiction in the valve that affects the condenser temperature, the imposed PRBS in the separator set-point indirectly helps in identifying fault 15. A snapshot of the PRBS and the output signal is shown in Figure 8. For fault 9, i.e. random variation in the D feed temperature, a similar PRBS signal was designed and injected. Figure 9 (a) shows the confusion matrix after introducing the PRBS signal that was designed for identifying fault 15, and Figure 9 (b) shows the confusion matrix after introducing both PRBS signals designed for identifying fault 15 and fault 9. The total FAR calculated using Equation 17 was 2.41%.
The averaged fault classification rates for all non-incipient faults and for all faults (including incipient faults) are shown in Figures 11 and 12 respectively. Figure 11 shows a bar-chart comparison of the proposed method with several nonlinear methods such as sparse representation [51], SVM [56], a hierarchical model based method [53], Random Forest and structural SVM. It can be seen that the hierarchical Deep RNN based method outperforms the other methods by a significant margin. It should be noted that the comparisons made in Figure 11 do not consider incipient faults. In Figure 12, the averaged test accuracy over all faults (both incipient and non-incipient) is compared with other DL based methods [34]. It can be seen that the second level hierarchical model, combined with the introduction of the designed PRBS signals, significantly improves the classification of the incipient faults and thus the averaged test accuracy for fault diagnosis.

Conclusions
This work studied the application of a deep learning model within a hierarchical structure as a way to improve the detection and classification of faults in the Tennessee Eastman Process (TEP). The TEP simulation contains 20 different faults that were used during this study to formulate the classification problem. As previously reported by other researchers, a subset of these faults, referred to in this study as incipient, are particularly difficult to diagnose due to their low signal to noise ratio and the similarities in the dynamic responses corresponding to different faults.

Acknowledgement
This work is the result of a research project supported by MITACS grant IT10393 through the MITACS-Accelerate Program.

2. Development of a hierarchical structure combined with the design of external excitation signals to enhance the detection and classification accuracy for both incipient and non-incipient faults.
3. Comparison of the proposed algorithm to both linear multivariate statistical techniques and other deep learning (DL) based methods previously reported.

IDV(16): Deviations of heat transfer within stripper (random variation)
IDV(17): Deviations of heat transfer within reactor (random variation)
IDV(18): Deviations of heat transfer within condenser (random variation)
IDV(19): Recycle valve of compressor, underflow stripper and steam valve stripper (stiction)
IDV(20): Unknown (random variation)

The training of the Supervised DRAE-NN, shown in Figure 3, is based on the minimization of a weighted sum of the reconstruction loss function and the supervised classification loss, corresponding to the first and second terms in Equation (8) respectively. The minimization of the reconstruction loss function in Equation (8) ensures that the estimated latent variables are able to capture the variance of the process data.
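A minimal numerical sketch of such a weighted objective is given below. It mirrors the two-term structure of Equation (8) but is not the authors' exact loss: the mean-squared reconstruction error, the cross-entropy form of the classification term and the scalar weight `w` are our assumptions.

```python
import numpy as np

def supervised_drae_loss(x, x_hat, y, y_hat, w=0.5):
    """Weighted sum of a reconstruction term and a supervised classification
    term, mirroring the two terms of Equation (8). `w` trades them off."""
    recon = np.mean((x - x_hat) ** 2)                    # reconstruction MSE
    eps = 1e-12                                          # numerical safeguard
    xent = -np.mean(np.sum(y * np.log(y_hat + eps), axis=1))  # cross-entropy
    return w * recon + (1.0 - w) * xent
```

With `w` close to 1 the latent variables are driven mainly to reconstruct the inputs; with `w` close to 0 they are shaped mainly for class separation, which is the trade-off the weighted sum controls.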

Figure 5: Hierarchical structure used for fault detection and diagnosis

Figure 6: Training and validation averaged classification accuracy for different values of hyper-parameters. Different colours represent different runs with a specific combination of hyper-parameters

Figure 7: Confusion matrix for the first level model of the hierarchical structure (i.e. classification of non-incipient faults, with incipient faults considered as part of the normal class)

A comparison of the proposed method against multivariate linear fault detection techniques such as PCA, DPCA and ICA, as well as against other deep learning methods, is also presented. It is observed that the hierarchical RNN based model is superior to traditional linear and other deep learning based methods for fault classification due to its ability to capture nonlinear dynamic behaviour. It was also shown that the classification averages can be enhanced by extending the length of the time horizon of past data fed to the RNN based model. However, most of these improvements in classification occurred for the non-incipient faults. Therefore, an active fault detection approach was pursued, in which a hierarchical model structure combined with external PRBS signals was proposed; this approach proved to be particularly effective for classifying incipient faults.

Figure 9: Confusion matrix on test data for the second level model of the hierarchical structure: a) after adding the designed PRBS signal w.r.t. fault 15; b) after adding the designed PRBS signals w.r.t. fault 9 and fault 15

Table 1: Process Faults for classification in TE Process

Table 2: Measured and manipulated variables (from Downs and Vogel, 1993)

The RNN based model with LSTM units used in the current study was developed with training and testing data sets generated from the Tennessee Eastman Process (TEP) simulation. The data are extracted from simulations of the system conducted either at the normal state or while each of the 20 different faults is occurring in the process. It is assumed that at each sampling interval 52 different variables are measured and organized into a vector; each such vector of measurements is acquired every 3 minutes. It should be noted that during testing of the methods proposed in this study the normal state is considered as a separate class, and hence a total of 21 different classes, i.e. 20 faulty plus one normal operation, are considered for classification. The standard dataset can be downloaded from http://depts.washington.edu/control/LARRY/TE/download.html. The simulator is run for 72 hours (training: 24 hours; testing: 48 hours) for each fault, generating 1440 samples for each fault class and the normal class. The data are then divided between calibration and validation data sets, where the first 480 samples of each class are used as training data and the rest are used for testing. This results in a total of 10,080 training samples and 19,200 testing samples. A small fraction of the training dataset is used as a validation dataset for selecting the optimal hyper-parameters. It is important to note that the number of training, validation and testing samples varies depending on the time horizon used in the dynamical RNN based model. The results reported in the following section are based on the classification accuracy on the test dataset, i.e. on data that was not used for model calibration. The experiments in this paper have been implemented on an Intel Core i7-7700HQ PC (2.80GHz, 16GB RAM) and an NVIDIA GeForce GTX 1060 (6GB).
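The dependence of the sample counts on the time horizon can be illustrated with a small windowing sketch. This is our own illustration, not the authors' preprocessing code: the function name `make_windows` and the overlapping-window scheme are assumptions, while the 480-sample training run, the 52 variables and the 150-step horizon (used for the second level model) come from the text.

```python
import numpy as np

def make_windows(X, y, horizon):
    """Slice a (n_samples x n_vars) run into overlapping windows of length
    `horizon` for a dynamic RNN model; each window inherits the run's label."""
    n = X.shape[0]
    windows = np.stack([X[t:t + horizon] for t in range(n - horizon + 1)])
    labels = np.full(len(windows), y)
    return windows, labels

run = np.random.randn(480, 52)   # one 24 h training run: 480 samples, 52 variables
Xw, yw = make_windows(run, y=3, horizon=150)
print(Xw.shape)                  # (331, 150, 52): 480 - 150 + 1 windows
```

A longer horizon gives the RNN more dynamic context per sample but yields fewer samples per run, which is why the effective training, validation and test set sizes change with the chosen horizon.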

For fault 9 (refer to Table 1), the PRBS excitation signal (ω ∈ [ω_min, ω_max], where ω_min = 0.0087 rad/s and ω_max = 1.74 rad/s) is introduced to the D feed ratio in order to create a suitable excitation. After developing this PRBS signal, we added both signals to the process at different times during the simulation. For fault 15, the PRBS signal is designed with a frequency range of ω ∈ [ω_min, ω_max], where ω_min = 0.005 rad/s and ω_max = 1.74 rad/s. For the second level model, there are 1,796 training samples and 4,196 testing samples in total, with a time horizon of 150 time-steps. The model consists of 284 encoder LSTM units in the first hidden layer, a second layer of 100 LSTM units, followed by 278 LSTM units for processing the output of the encoding layer. Thereafter, the output of the third LSTM layer is passed through a dense layer for classification. Hyper-parameters such as the number of layers, the number of LSTM units in each layer, the classification weights, the learning rate, the time horizon, the weights in the loss function, etc. are selected using the validation data, which is part of the training dataset. The hyper-parameter search is again implemented using keras-tuner. For the second level model, the samples corresponding to fault 0 (normal) and the incipient faults are considered.
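As a rough check of the size of the stacked LSTM model described above, the stated unit counts imply the following trainable-parameter budget. The per-layer formula assumes standard LSTM cells (four gates, each with an input kernel, a recurrent kernel and a bias, as in Keras), and the 52-variable input dimension is our assumption from the TEP measurement vector; the unit counts (284, 100, 278) come from the text.

```python
def lstm_params(n_in, n_units):
    """Trainable parameters of a standard LSTM layer:
    4 gates, each with an input kernel (n_in x n_units),
    a recurrent kernel (n_units x n_units) and a bias (n_units)."""
    return 4 * (n_in * n_units + n_units * n_units + n_units)

# second level model: 52 inputs -> 284 -> 100 -> 278 LSTM units
layers = [(52, 284), (284, 100), (100, 278)]
total = sum(lstm_params(n_in, n_units) for n_in, n_units in layers)
print(total)   # 958280 parameters before the final dense classification layer
```

Such a count is useful when judging whether the roughly 1,800 training windows available to the second level model justify the chosen layer widths during the keras-tuner search.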

Table 4: Comparison of fault detection rate with different methods (non-incipient faults only)

Table 5: Comparison of fault detection rate with different methods (all faults)