Data-Driven Fault Diagnosis Techniques: Non-Linear Directional Residual vs. Machine-Learning-Based Methods

Linear dependence of variables is a commonly used assumption in most diagnostic systems for which many robust methodologies have been developed over the years. In case the system nonlinearities are relevant, fault diagnosis methods, relying on the assumption of linearity, might potentially provide unsatisfactory results in terms of false alarms and missed detections. In recent years, many authors have proposed machine learning (ML) techniques to improve fault diagnosis performance to mitigate this problem. Although very powerful, these techniques require faulty data samples that are representative of any fault scenario. Additionally, ML techniques suffer from issues related to overfitting and unpredictable performance in regions which are not fully explored in the training phase. This paper proposes a non-linear additive model to characterize the non-linear redundancy relationships among the system signals. Using the multivariate adaptive regression splines (MARS) algorithm, these relationships are identified directly from the data. Next, the non-linear redundancy relationships are linearized to derive a local time-dependent fault signature matrix. The faulty sensor can then be isolated by measuring the angular distance between the column vectors of the fault signature matrix and the primary residual vector. A quantitative analysis of fault isolation and fault estimation performance is performed by exploiting real data from multiple flights of a semi-autonomous aircraft, thus allowing a detailed quantitative comparison with state-of-the-art machine-learning-based fault diagnosis algorithms.


Introduction
Fault diagnosis (FDi) systems are essential components of many engineering systems. These systems play a very important role for accident prevention, service continuity, and cost minimization, ultimately leading to increased human safety in transportation systems. Due to the increasing availability of multi-source sensorial data along with the significant computational power and storage capacity of modern computers, today, there is an increasing research interest in developing data-driven (DD) algorithms to tackle complex monitoring and control problems [1]. DD approaches are widely used in applications where a detailed physical knowledge of the system is unavailable or not readily available or in cases where system input-output relations are too complex or too uncertain. Typically, DD approaches applied to FD problems derive fault-sensitive signals (also known as diagnostic signals) directly from experimental models identified from sample datasets acquired from the monitored system during normal and faulty operations. The pioneers in this field are Isermann [2][3][4], Basseville [5][6][7], and Gertler [8][9][10].
Today, multivariate statistical process monitoring (SPM) methods, such as principal component analysis (PCA) [11] (and its variants) and parity space approaches [12][13][14][15], are widespread techniques used for system monitoring and fault diagnosis purposes. The widespread use of PCA-based monitoring techniques is due to its simplicity and its capability to efficiently manage large quantities of multivariate data. In this context, widespread approaches used for fault isolation (FI) are the so-called contribution plots methods, such as the reconstruction-based contributions method [11,13]. Although powerful and effective, this method might produce incorrect fault isolations due to the so-called smearing effect, i.e., the influence of a faulty sensor measurement on non-faulty sensors contributions. In a system characterized by a limited number of monitored variables, directional residuals methods have shown to be a valid alternative to SPM methods. Directional residual methods work by relating possible faulty sensors with the characteristic direction of the faults, isolating the faulty sensor as the one having the smallest distance between the residuals signal and the monitored sensor fault directions [12,13,16]. An important variant of the directional residual method is the structured residual approach [9].
A fundamental assumption underlying most SPM techniques is the linear dependence between modeling variables. While this assumption may be reasonable in some applications, in the case of strongly non-linear systems or significant variations in the operating range of the signals, these techniques could lead to inaccurate results. The direct consequence on fault diagnosis is that the linearity in covariate FD models may produce many false positives and false negatives in practical applications when applied to non-linear systems.
A typical approach to allow non-linear dependencies between the covariates is to augment their number by introducing several new terms derived by non-linear transformations of the original signals while retaining the linear dependence between the extended set of covariates. Although this approach is straightforward, it suffers from the problem that the set of extended non-linear functions cannot be defined 'a priori'. Instead, a complex and time-consuming tuning phase is needed to select a suitable set of non-linear transformations. These problems have motivated the scientific community to go beyond the linear dependence assumptions of covariates, promoting the development of non-parametric models based on generic non-linear smooth functions, such as splines, neural networks, and kernel models, and implement specific data-driven tuning algorithms to reduce the mean and variance of the mapping error to identify best-fitting models [17].
Another direction to deal with the problem of fault diagnosis in non-linear systems is related to the use of machine learning and deep learning techniques. The issue of Fault Isolation, in fact, can be easily set as a direct application of a classification or clustering problem. At the same time, ML regression techniques can also be employed for estimating the shape and amplitude of the fault (Fault Estimation (FE)) [18][19][20][21].
An interesting comparison of FI and FE performance of three popular classification algorithms, namely the support vector machine (SVM), K-nearest neighbor (KNN), and decision tree (DT), can be found in [17]. The classification algorithms are trained using data from optimized and non-optimized sensor subsets and then validated with new data characterized by varying degrees of fault severity. In [22], Erfani et al. present a hybrid model in which an unsupervised deep belief network (DBN) is trained to extract latent features. Then, a one-class support vector machine (SVM) based on the DBN features is trained to learn decision surfaces. In [23], Revathi and Kumar proposed a deep-learningbased anomaly detection and classification system in video sequences, where the final module classifies the detected events as usual or suspect. In [24], Pashazadeh et al. propose a data-driven fault detection and isolation (FDI) scheme based on the fusion of different classifiers for a wind turbine challenge system. Multi-layer perceptron (MLP), radial basis function (RBF), decision tree (DT), and K-nearest neighbor (KNN) classifiers are implemented in parallel, and the faulty sensor is identified using a majority voting method. Next, discrete-time up-down counters (UDCs) are used for each fault to reduce false alarms (FAs) and missed detections (MDs). In [25], an efficient strategy for fault detection and isolation (FDI) for an industrial gas turbine based on ensemble learning methods is introduced; specifically, a fault isolation scheme based on ensemble bagged trees is developed to isolate faults in a steady-state runtime.
In this study, we considered a natural non-linear extension of linear regression models of the form ∑ p 1 w i x i , i.e., the class of the so-called generalized additive models (GAM) [26] ∑ p 1 w i g i (x i ), where g i (x i ) are generic smooth functions to be identified from data. Due to the simple additive structure, GAMs are sufficiently versatile for capturing linear or non-linear relationships between response functions and covariates. The additive form is of particular interest for FDi problems because it allows the easy calculation of the fault sensitivity of the individual monitored variables. Although GAMs have been used extensively in many application fields, their employment in a data-driven FDi system has not been fully explored. In fact, to date, only a few studies have been presented, such as [16,27]. Motivated by the mentioned issues, in this paper, we propose GAMs to identify non-linear parity relationships in the monitored variables using cubic spline basis functions to characterize the non-linear functions g i (x i ) by exploiting the MARS modeling and estimation algorithm proposed in [28].
Next, to exploit the consolidated tools available for the fault diagnosis of linear parity space models, a local linearization of the identified GAM parity relations is performed to achieve a time-dependent fault sensitivity matrix.
Indeed, unlike standard linear parity methods, the so-called fault signature matrix is not constant but, rather, it depends on the operating point, implying that the resulting fault directions are not constant but time-varying. A state-dependent fault sensitivity model can better capture the effects of faults in a non-linear system compared to a standard linear and fixed fault sensitivity matrix. The proposed GAM plus linearization approach can immediately fit the directional residual FI method developed in the linear contest and applied in [14,15]. In addition to this first key innovative aspect, this study aims to show the effectiveness of the proposed technique compared to machine learning techniques applied to the problems of fault isolation and fault estimation. In particular, the second main contribution of this research is related to the comparison of the proposed directional residual-based technique with ML fault diagnosis methods. This is performed by investigating the advantages and disadvantages of each approach, highlighting the benefits in terms of performance, memory occupancy and robustness to unexpected fault amplitudes.
The proposed novel method is applied to design a complete fault isolation and estimation (FIE) system based on real sensor data taken from a semi-autonomous aircraft where single additive faults are artificially injected on eight primary sensors.

Non-Linear Additive Models for Fault Diagnosis
The set of the monitored (potentially faulty) sensors measurements is concatenated in the vector x(k) ∈ R n x , while the set of control signals and other no monitored sensors (assumed not faulty) are included in the vector u(k) ∈ R n u . The integer k is the discrete-time index at sample time t = k · ∆t (where ∆t is the sampling interval). Occasionally in the article, the dependence on k is omitted to simplify the notation. The proposed FD technique is based on analytical redundancy (AR) concepts [12,13,16]. It is assumed that a sensor measurement x i (k) is approximated by a non-linear additive model consisting of the linear combination of non-linear functions (g i,j and h i,j ) defined as followŝ where w xi,j and w ui,j are constant coefficients (to be estimated from data) and g i,j and h i,j are non-linear functions of the variables x j (k) and u j (k), respectively, typically representing a rectified linear unit (ReLU) or Gaussians or polynomial splines [26]. The functions h i,n u +1 are assumed to be constant and equal to one in order to take into account possible constant offsets in the models. The actual signal x i (k) is where ∆ i (k) characterizes the modeling error and sensor noise associated with the i-th sensor. The primary residual associated to each sensor is defined as At fault-free conditions, it results in the following: i.e., the residual is equal to the modeling uncertainty. Typically, |∆ i (k)| is a small amplitude signal, and in the ideal perfect modeling noise-free case, this is equal to 0. In the present study, we considered the occurrence of single additive sensor faults f j (k) ∈ R on the generic j-th sensor. In the presence of a sensor fault, the fault-free measurement x j (k) should be substituted by the faulty signal where f j (k) is a generic fault modelling function that is zero in fault-free conditions and different from zero in the presence of the sensor fault. When a fault is present on the j-th sensor, the impact on the residuals can be evaluated by substituting (5) in (1) and (3). It is immediate to verify that In the above Equation (6), the term g i,j (x j (k)) is not directly computable because the sensor reading is equal to x j (k) + f j (k) and not to x j (k) in the presence of a fault. For this reason, the Taylor approximation of g i,j (x j + f j ) around x j + f j is computed, that is where δ f j is a fault increment and ∆g ij contains higher-order terms of the Taylor expansion. In this study, we exploited model (7) to compute an approximation of g i,j (x j ). This can be easily achieved by taking Substituting Expression (8) for g i,j (x j ) in the residuals in (6) results in the following: that is Define: then (10) becomes the above expressions can be arranged in matrix form resulting in: The matrix in (13) (W(k) ∈ R n x ×n x ) is known as the fault sensitivity matrix and ∆ ∈ R n x ×n x contains all the uncertain terms in (13). It is observed that the matrix W(k) is time-dependent; in other words, it depends on the current measurements at time k.
The occurrence of a single fault at a time on a generic j-th sensor is assumed here; therefore, in (13), only the component f j is different from zero (that is In vector form where r(k) ∈ R n x is the primary residual vector and w j (k) ∈ R n x (the j-th column vectors of the matrix W(k)) defines the so-called fault direction. Assuming a sufficiently large fault f j compared to ∆ j , the residual vector direction tends to be alienated to the known direction of the vector w j (k). This directional information will be later exploited for sensor FI purposes. Unlike our previous papers in [12,13]-where the faults signature matrix is constant by construction-the fault signature matrix is time-varying in the present study.

Linear Model Case
In case the functions g i,j and h i,j are approximated by simple linear in the variables models, then Equation (1) simplifies to: where w xi,j and w ui,j are constant weights, implying that the time-dependent matrix W(k) becomes a constant matrix W [12,13] and the associated fault directions w j j = 1 . . . n x are fixed and constant vectors.

Fault Isolation (FI)
Starting from the considerations in Section 2 and considering Equation (15), the faulty sensor is inferred (isolated) by exploiting the fault directional properties of matrix W(k). Specifically, the faulty sensor is isolated by evaluating the angular distances between the direction of the residual vector r(k) and the n x fault directions, i.e., the columns of the matrix W(k). The sensor fault direction with the lowest angular distance from the residual direction flags the faulty sensor, that is where I F (k) is the fault index function that takes values from 1 to n x and indicates, at time k, the index of the isolated faulty sensor. This FI technique is previously applied in [12][13][14][15]. The same FI method is also unchanged when the simple linear models in (16) are used, with the only difference that the fault directions are now constant in time because, in this case, the matrix W is constant.

Fault Estimation (FE)
An advantage of operating with primary residuals, as defined in (3), is that the fault amplitude can be estimated directly. In fact, starting from Equation (15), assuming a negligible effect of ∆(k), the fault amplitude can be directly calculated as The same FE method also applies in the case of the linear models (16). NOTE-1: Before the FI and FE phases there is usually a fault detection (FD) phase dedicated to the detection of the occurrence of a generic anomaly condition. Since the primary purpose of this study is focused on the evaluation of FI and FE algorithms, we assumed here an "ideal" fault FD, i.e., the additive fault is detected as soon as it is injected into the sensors. Clearly, in practice, an FD delay and missed detection might be possible. The issue of FD is addressed in many studies; a detailed overview of FD techniques can be found in [29,30]. In addition, the authors have also discussed the issue of data-driven FD in previous works, e.g., in [12,14,15].

Multivariate Adaptive Regression Splines (MARS)
In this study, the data-driven identification of the non-linear functions (g i,j and h i,j ) defined in (1) is performed using Friedman's multivariate adaptive regression Splines algorithm [28] which is a well-known procedure used to identify non-parametric additive models from data. The MARS algorithm can be easily set to fit the structure of the nonlinear additive models in (1). In practice, the identification of the primary residuals is performed by exploiting the adaptive regression splines toolbox [31] (ARESLab). MARS is a non-parametric regression technique and can be viewed as a non-linear extension of linear regression models that can be used to model non-linear dependencies in high-dimensional data. Technically, MARS models consist of the linear combination of spline basis functions; in ARESLab, the number of basis functions and the parameters characterizing their shape is inferred directly from data through a forward-backward iterative approach [28,31]. Starting from (1), the considered MARS functions have the following form: where M x,i,j is the number of basis functions that are selected by the MARS forwardbackward iterative approach to identify the g i,j function. The α i,j,m x are constant coefficients and B i,j,m x is the m x -th basis function that depends only on the x j variable. Similar definitions can be attributed to M u,i,j , β i,j,m u , and B i,j,m u .
The ARESLab allows the selection of different classes of basis functions, such as piecewise ReLU and piecewise cubic splines. We used piecewise continuous cubic splines with continuous first derivatives to estimate the non-linear functions. An in-depth discussion and comparison between piecewise cubic models and piecewise linear models can be found in [28].
The considered piecewise cubic spline basis functions consist of one or two "complementary" basis functions B(x|s, κ − , κ, κ + ) . These are defined as follows: for x is a scalar, κ − < κ < κ + and The shape parameters κ − , κ + , and κ represent the lower side knot, the upper side knot, and the central knot, respectively. The first two knots define the change point between the functions, while the last influences the cubic and the linear functions. The ARESLab procedure automatically estimates the slope parameter and the number of Basis Functions to build the model. In particular, the design does not necessarily use both "complementary" basis functions; only the positive one or the negative one could be used in the estimation model.
In this study, the MARS algorithm for each sensor model in (1) is applied separately using the same data segment. Once the MARS spline basis functions are identified for all the monitored sensors, the local fault sensitivity matrix W(k) is analytically computed following the linearization procedure described in Section 2. NOTE-2: Since there are no iterations (multiplications) between the sensor measurements, the fault diagnosis method based on the MARS algorithm could be implemented exclusively through the matrix product between suitably defined matrices and the values measured by the sensors. In particular, combining Equations (1), (3), and (19)-(22), it is possible to calculate the residual r Similarly, the matrix W(k) can also be calculated through the matrix product between some matrices appropriately defined and the values measured by the sensors. In particular, the column vector and Ω( · · · ) depend on the shape parameters κ − , κ + of each basis function of the model and the values of the sensor measurements. Therefore, the implementation complexity of the proposed approach is related to the matrix multiplication algorithm that results, as known, in the worst case O(n 3 ).

Machine Learning-Based Fault Isolation and Estimation
The proposed directional residual-based FI and FE scheme can be compared with machine learning (ML) solutions. ML techniques are extensively applied to FD problems; an extensive literature exists on this issue [32,33]. In this study, ML structures with different complexity are built and compared using the same data used to identify and test the directional residual-based technique. Specifically, FI and FE are addressed separately. First, a ML classifier is used to estimate the faulty sensor index. Then, a second ML approximator is used to estimate the fault amplitude. The first is a typical classification problem, while the second is a typical regression problem. Classification and regression structures are built by exploiting the dedicated MATLAB toolboxes in more detail. For example, support vector machines (SVMs) [17,22], neural networks (NNs) [23,24], decision trees [17,24], and ensemble of decision trees [25] structures have been considered.

Dataset Preparation for ML Algorithms
In contrast to the primary residual-based FI techniques that are based only on fault-free data and on fault directions, the ML technique requires faulty data samples which are representative of each possible sensor fault for a wide range of fault amplitudes.
Since real sensor flight data with sensor faults are not easily available, additive faults are artificially injected on the fault-free data to simulate the occurrence of a sensor fault (this approach of generating artificial faulty data is widely used in the FDi community, see for instance [11][12][13][14][15][16]).
For this reason, a new 'ad-hoc' data set is produced based on the fault-free dataset used to identify the MARS models. This is performed by adding a random amplitude fault at each sampling time on a randomly selected sensor. Random amplitude fault on a random sensor is clearly not a realistic fault scenario; this approach is used only to generate a rich set of training data to promote generalization capacity and robustness in the ML schemes.
It is observed that this simple fault generation method is possible thanks to the fact that, in this study, our approach is "memoryless". Indeed, the estimation at the time k depends only on other signals at the same time k; therefore, serial temporal correlation of data has not been considered in the model. If the estimations at time k depend on signals at previous time instants (k − j), the above faulty data generation procedure can be easily extended by considering data segments of appropriate length. Next, the fault-free mean and standard deviation of the data are normalized, and two new labels (signals) are added to the data. The first is the index identifying the sensor where the fault is injected, while the second is the normalized amplitude of the fault.

ML Classifier for FI
The FI classifiers input is the vector z FI (k) = [x(k), u(k)] ∈ R n x +n u of the current sensors and inputs measurements, and the corresponding output is the label of the faulty sensors S(k) = l ∈ [1, 2 . . . n x ]. The data set described in Section 6 is used for the training. The set of classifiers that are evaluated and the main design parameters are reported in Table 1.

ML Estimator for FE
The FE estimator input is the vector z FE (k) = [x(k), u(k), S(k)] ∈ R n x +n u +1 of the current sensors and inputs measurements plus the index of the faulty sensor. The corresponding output is the amplitude of the normalized fault f (k) ∈ R injected on the fault-free measurements. The data set described in Section 6 is used for the training. The set of estimators and the main design parameters are shown in Table 1.

Online Operation of ML Algorithms
The previously trained ML structures are then used for FI and FE purposes in the online operation phase according to the scheme shown in Figure 1. Following the failure detection, the FI block processes the current input z FI (k) = [x(k), u(k)] and provides the estimation of the sensor fault indexŜ(k) in the output. This information, as well as the measured current signals z FE (k) = [x(k), u(k),Ŝ(k)], is then processed by the FE block that provides the estimationf (k) of fault amplitude at time k. NOTE-3: The ML classification and regression schemes introduced in Sections 5.2 and 5.3 have the same hyperparameters but differ in the inner architecture. For example, in the case of neural networks, the structure of the inner layers is the same, except the output layer, specifically in the case of a neural network classification, and the SoftMax activation function is applied. In contrast, linear activation functions are used in the regression neural network.

Semi-Autonomous Aircraft Flight Data
The FI and FE algorithms are designed and tested using flight data of a Tecnam P92 aircraft [34]. The data are acquired in a semi-autonomous mode; specifically, a pilot manually flew the plane during takeoff and approach/landing and flew autonomously at cruise conditions. A batch of six-flight datasets is considered in this study. The set of the 12 signals listed in Table 2 is considered [35], of which the first eight are the monitored sensors x(k), while the last four are the actuation (input) signals u(k). All the signals are normalized to zero mean and unitary variance.
Data from five flights (1 h and 21 min) are used to train models for a total of N = 48,661 data samples (with a data sampling period is 0.1 s). In contrast, the remaining batch of flight data (for a total of 11 min) is used for validation purposes for a total of N = 6659 samples.

Redundancy Relation Identification
As described in Section 4, the MARS algorithm is used to identify the non-linear additive models introduced in Section 2. The MARS algorithm automatically selects the number of piecewise cubic splines and weights for each model in (1), exploiting a forwardbackward iterative approach. Table 3 reports the model identification results for each of the additive models. The rows represent the identified model structure for each of the n x monitored sensors; the columns indicate the number of cubic splines selected by the algorithm to characterize the functions g i,j (x j ) and h i,j (u j ) in (1). It is observed that the algorithm does not choose all the available signals for modeling, thus also providing a feature selection.
Next, the same data are used to identify the linear redundancy relations parameters in (16) for each of the eight sensors. In this case, the model parameters are determined using standard least squares.

Accuracy of the Identified Additive Models
The first step to evaluate the effectiveness of the non-linear models is to compare their modeling accuracy with those of the linear models to estimate the sensor signals.
For each sensor, Table 4 reports the mean and standard deviation of the approximation error e(k) = x(k) −x(k) for the training and the test datasets in the form [mean value ± standard deviation]. The analysis of the table reveals that the non-linear models perform better than the linear ones. In fact, although the mean value of the estimation error for non-linear models is larger in some cases, this is still very close to zero. In contrast, the standard deviation of the estimation error is lower for non-linear models than linear models for all monitored sensors for both "Training" and "Test" data. The higher accuracy provided by the non-linear additive models is considered important from an FI perspective because more accurate models can detect smaller amplitude faults by limiting the number of false alarms caused by modeling errors that could be equivocated with the occurrence of faults.

Fault Sensitivity Matrix
This section reviews in detail the time-dependent fault sensitivity matrix W(k) to highlight the relevant difference between linear and non-linear FI schemes. For this purpose, the non-linear models generated by the MARS algorithm are linearized following the approach proposed in Section 2 to build the matrix of fault signatures (i.e., the fault sensitivity matrix).
In Figure 2

Design of Machine-Learning-Based FI Schemes
This section describes the design of ML-based FI schemes. First, the FI classifiers are trained. The input of the classifiers is the z FI (k) = [x(k), u(k)] ∈ R 12 vector of the eight monitored sensors and four inputs signals reported in Table 1. For each of the eight monitored sensors, single random amplitude faults are added to the fault-free signals at random time instants sampled from the training data. Positive and negative fault amplitudes are generated in the range −A max A max , whose values are reported in Table 5. The corresponding out at time k is the label of the faulty sensors S(k) = l ∈ [1, 2 . . . 8]. A total of N = 400, 000 random samples are generated for the training. The training of the 16 FI classifiers (the first column in Table 6) is then performed using MATLAB. Table 6 shows the accuracy [36] of the models obtained from cross-validation of the training data with the n-fold validation taking n = 5. The neural network FI classifiers achieved the highest accuracies compared to the other family of models; in particular, the wide neural network scores provided a 78.1%. Next, the 16 FI estimators are trained. The input vector z FE (k) = [x(k), u(k), S(k)] ∈ R 13 coincides with z FI (k) augmented with the index S(k) indicating the faulty sensor, and the output is the amplitude of the normalized fault amplitude f (k) ∈ R. A total of N = 400,000 training samples are generated using the same procedure used for the training of the FI classifiers. The second column of Table 6 reports the root mean square error (RMSE), , of the models achieved, cross-validating the training data with the n-fold method (n = 5). Once again, the wide neural network provides the lowest RMSE compared to the other family of models.

Metrics for Validating Fault Diagnosis Schemes
For validation and comparison purposes, additive constant bias faults of amplitude A( f (k) = A) are considered. The constant fault is applied at time k = 1 and is maintained for the entire duration of the validation flight. The following FI and FE metrics are used.

Fault Isolation Percentage (FIP)
The FI performance is measured in terms of the fault isolation percentage (FIP), defined as: The index FIP X (A) is calculated for each considered technique, for different fault amplitude A and for each monitored sensor X. The rest of the paper FIP X will refer to the average of FIP X (A) for the considered fault amplitudes injected on the monitored sensor X, i.e., FIP X = mean[FIP X (A)]. Similarly, FIP will refer to the average of the FIP X values evaluated over the eight monitored sensors, that is: FIP = mean[FIP X ].

Fault Estimation Percentage (FEP)
The primary residuals in models (1) and (16) for the non-linear and linear models, respectively, allow direct estimation (see Equation (15)) of the fault amplitudeÂ that is computed as the difference between the measured and predicted signal. In contrast, the ML models estimate the fault amplitudeÂ as a regression problem (see Section 5.3). The fault estimation percentage (FEP) is defined as: FEP X (A): Given a fault amplitude A, the fault estimation percentage ratio is the absolute value of the percent ratio between the fault amplitude reconstruction. The FEP X (A) is calculated as the difference between the actual fault amplitude A and the mean of the reconstructed fault amplitude throughout the validation flight.
Furthermore, the index FEP X (A) is calculated for each technique, a different fault amplitude A, and a monitored sensor X. In addition, FEP X refers to the average of FEP X (A) for the different fault amplitudes injected on the monitored sensor X, while FEP refers to the average of FEP X evaluate over the eight monitored sensors.

Complementary Fault Estimation Percentage (cFEP)
In order to achieve a performance metric that is 100% when the FE performance is perfect and 0% when it is completely unsatisfactory, the complementary FEP (cFEP) is defined as: From the cFEP index, the cFEP X (A), cFEP X , and cFEP indices are derived. In addition, to produce an overall performance ranking that takes into account both the FI and FE performance, the overall performance index J tot is defined: Perfect performance is archived when J tot = 100, i.e., in the case of perfect fault isolation and perfect fault reconstruction.

Comparison between Directional Residual and Machine Learning Techniques
This section compares the fault diagnosis performance provided by the directional residuals and machine-learning-based methods. This study is performed using the data of the validation flight. Positive and negative constant faults are added to the fault-free data, considering, for each sensor, fault amplitudes A equal to ±17%, ±33%, ±50%, ±67%, ±83%, and ±100% of the maximum fault amplitude taken from Table 5. The faults are added at time k = 1 to the faulty sensor and maintained constant for the entire flight duration. This procedure is repeated for all eight sensors and all the considered fault amplitudes, resulting in a total of about N = 640, 000 validation samples. The mean performance for all the sensors and fault amplitudes is evaluated using FIP, FEP, and cFEP indices already in Section 9. The results are reported in the first two columns of Table 7.
It is observed that the non-Linear technique (NL-DR) provides 71% in terms of FIP. Although satisfactory, it is also observed that medium neural network (M-NN) and the wide neural network (W-NN) methods perform slightly better. On the other side, considering the FEP performance, the NL-DR achieves an excellent 18% while M-NN and the W-NN provides a significant performance degradation equal to 35%. These facts are relevant because the NL-DR provides a high-level FI performance while maintaining an excellent capability for fault reconstruction. This fact does not apply to any one of the 16 ML techniques that provide a cFEP performance lower than 72%.
In summary, the best resulting method in terms of FI performance is the W-NN, and the worst is the F-SVM. Considering the FE performance, the best cFEP is provided by our proposed NL-DR method, and the worst is the L-SVM.

Overall Performance Comparison
The fourth column of Table 7 reports the index J tot for all the techniques. It is now evident that the resulting method with the best overall performance index is given by the proposed NL-DR (77%), followed by 3-NN (71%) and M-NN (69%) and by 2-NN (69%).
The last column of Table 7 also shows the combined memory occupancy of the isolation and estimation models. The SVM models have the highest memory occupation, up to about 50 MB, while the most parsimonious architectures are those based on directional residuals and neural networks.

In-Depth Performance Comparison of the Best Techniques
This section shows the FIP X and cFEP X indices for the best performing techniques explicitly evaluated for the eight monitored sensors. Figure 3 shows that the F-DT approach is biased toward the α sensor at the expense of the others. It perfectly isolates the faults on the α sensor, but it cannot correctly isolate any fault occurring on β, TaS, P, and θ sensors. It is also observed that the NL-DR technique performs better than the other techniques for four of the eight monitored sensors. In Figure 4, it can be deduced that ML techniques have much lower FE performance than those provided by the NL-DR method for all the sensors. In fact, the proposed NL-DR scheme performs significantly better than the others for six of the eight monitored sensors.

Performance Comparison Evaluated over a Wider Fault Range
The performance of any machine learning technique is strongly influenced by the data used for training. This implies that the available set training data strongly influences ML-based FIE methods in our specific case. In contrast, the proposed residual-based approach is virtually independent from the fault amplitude. In fact, in the primary residual modelling phase, no assumption is made about the magnitude of the faults, which leads to significant benefits in case the range of potential faults is under-or over-estimated. A clear example of this potential problem can be observed in Figure 5, where the response of FIP X (A) and FEP X (A) indices for TaS and φ sensors are compared for NL-DR and neural-network-based methods. In the upper part of Figure 5, it can be observed that by injecting faults with amplitudes twice than the nominal ranges used for training (nominal range: [ −2 2 ] m/s for the TaS sensor and [ −6 6 ] • for the φ sensor), the FI performance of the 3-NN deteriorates quickly with the increase in the fault amplitude.
Considering a failure on TaS equal to 2 m/s, the FIP TaS (2) index for the 3-NN is 90% (Point A), while a failure of 4 m/s the FIP TaS (4) descends to 2% (Point B). On the other hand, using the NL-DR technique, the FIP TaS (A) performance constantly increases with the fault amplitude (Points C and D), avoiding the paradox response provided by the ML methods. Similar considerations apply to the FEP φ (A) index trend for faults on sensor φ.
In terms of FE performance, it is observed that both the TaS and φ the NL-DR approaches provide a suitable monotone decreasing trend for the FEP X (A) index with an increase in the failure amplitude. At the same time, there is a rapid degradation of the FEP X (A) performance outside the nominal training range in the NN case. Specifically, a constant 5 • failure on φ produces a FEP φ (5) of 3% (Point E), while a failure of 10 • produces a FEP φ (10) of 35% (Point F). Using the NL-DR approach, the FEP φ (A) index increases from 6% for a failure of 5 • (Point G) to 2% in the case of a failure of 10 • (Point H). The above issue is a typical effect known as "Unseen Data Problem" or "generalization problem". It refers to the inability to make reliable predictions in regions outside those explored in the training data.
In our study, a simple method to limit this important problem is to provide the ML algorithms with a broader range of fault amplitudes to cover the unexplored regions in the training phase. However, on the other side, the excessive widening of the fault ranges used in the training data can lead to the inability to discriminate accurately small faults.
A direct example of this can be observed in Figure 6, where the tri-layer neural network is retrained (3-NN 2 ), considering a range of fault amplitudes which are twice the size of the nominal range, as shown in Table 5. Using the retrained network, a relevant increase in the FI performance is achieved for large fault amplitudes, but at the expense of performance degradation for medium and small amplitude faults.  Table 6.
In more detail, consider the case of a failure of 4 m/s on TaS. In this case, the FIP TaS (4) provided by the retrained neural network (3-NN 2 ) is now 95% (Point I), which is 93% more than the previous network (3-NN). Vice versa, in the case of a fault of −1 m/s , the 3-NN isolates the fault with an accuracy of 77% (Point J) while the 3-NN 2 of 25% (Point K) confirms what is previously conjectured. Moreover, by analyzing the FEP φ (A) index for the sensor φ, a similar conclusion can be drawn: that the FEP φ (A) for 10 • goes from 35% for 3-NN (Point F) to 9% for 3-NN 2 (Point L), while a failure of −2 • goes from 8% for 3-NN (Point M) to 60% for 3-NN 2 (Point N), as shown in Figures 5 and 6.

Overall Performance Evaluation for All the Monitored Sensors
To better compare the overall performance, Tables 8-15 report, for each monitored sensor, the percentage ratio between the actual area below the FIP X (A) function (as those in Figures 5 and 6) and the perfect performance area (4A max · 100%). The ideal area ratio is obviously 100%. The same process is also applied to the cFEP X (A) functions. The last column of the tables reports the mean between the FIP X (A) area ratio and the cFEP X (A) area ratio.  For almost all sensors, the area under the curve generated by 3-NN 2 is larger than that under the curve generated by 3-NN, indicating a generalized performance improvement. However, in most cases, the performance improvement is only achieved for large amplitude faults, at the expense of performance degradation for small amplitude faults. This fact indicates the need to find a compromise on the magnitude of the faults used to train the neural network. This issue substantially limits the applicability of ML techniques, especially in a real-world context where an 'a priori' knowledge of the fault amplitude range cannot be established. Moreover, analyzing the results obtained for all the monitored sensors, the performance of the proposed NL-DR technique is almost always better than both the 3-NN and 3-NN 2 techniques.

Time-Domain Performance Comparison with Same FIP/cFEP
This section evaluates and compares time domain FI and FE responses for the NL-DR and 3-NN 2 techniques. In order to achieve a meaningful comparison, the tests are performed by selecting faults whose amplitudes are such that the two methods provide the same value for the FIP X (A) or the cFEP X (A) indices in Figures 5 and 6, resulting in the selection of a fault of the amplitude of 2.5 m/s on TaS, and of 4 • on φ, respectively (as expected, the faults are injected at k = 1 and the constants for the whole flight are maintained). Figure 7 shows, for a failure on TaS, the evolution of this signal. The green portions indicate the instants in which the FI is correct, while the red portions indicate when the failure is incorrectly attributed to a 'wrong' sensor. The upper plot refers to the NL-DR technique, while the lower plot refers to the 3-NN 2 technique. For hypothesis, both methods isolate the fault with the same percentage (82%, Point O in Figure 6); the remarkable aspect is that the zones of wrong isolation are practically the same for the two techniques.
On the other side, there is a marked difference in FEP X (A) performance between the two techniques, see Figure 8. The fault amplitude estimated by the NL-DR method is much closer to the true value than the estimate provided by the 3-NN 2 technique, as deduced from Figure 6 in Points P and Q, respectively.  A similar analysis is then performed for the fault on the sensor φ. In this case, however, the fault amplitude is selected to be 4 • , i.e., the point R of Figure 6 where the FEP X (A) index is equal to 6% for the two techniques. Figure 9 shows the evolution of the estimation of the fault amplitude for the two techniques, where it is confirmed that the performances are equivalent for all practical purposes. On the other side, from Figure 10, it can be observed that the FIP X (A) performance of the NL-DR technique is significantly better than that of the 3-NN 2 technique (Points S and T in Figure 6).

Conclusions
The main purpose of the research effort described in this paper was to compare data-driven non-linear directional residual and machine-learning-based fault diagnosis techniques. The experimental study showed that the method based on primary residuals is virtually independent of the fault amplitude. Additionally, it was demonstrated that the FI and FE performance increases monotonically with increasing fault amplitude. In contrast, the performance of ML-based techniques depends heavily on the fault amplitudes used during training, producing potentially unpredictable results in regions not covered in the training phase. A partial solution to this problem was obtained by retraining the ML models using larger ranges for the faults injected in the training phase. The overall effect is that the FI and FE performance increases for large faults but, unfortunately, at the expense of a decrease in the estimation accuracy of small amplitude faults. In summary, it can be concluded that, while from the perspective of FI, the performance of residual-based and ML techniques is essentially equivalent, the residual-based approach results are more accurate and reliable than the ML-based approaches from the perspective of FE.

Abbreviations
The following abbreviations are used in this manuscript: