
Toward Explainable AutoEncoder-Based Diagnosis of Dynamical Systems

School of Computer Science and IT, University College Cork (UCC), T12 R229 Cork, Ireland
Algorithms 2023, 16(4), 178;
Submission received: 15 September 2022 / Revised: 3 January 2023 / Accepted: 9 January 2023 / Published: 24 March 2023


Autoencoders have been used widely for diagnosing devices, for example, faults in rotating machinery. However, autoencoder-based approaches lack explainability for their results and can be hard to tune. In this article, we propose an explainable method for applying autoencoders for diagnosis, where we use a metric that maximizes the diagnostics accuracy. Since an autoencoder projects the input into a reduced subspace (the code), we use a theoretically well-understood quantity, the subspace principal angle, to define a metric over the possible fault labels. We show how this approach can be used for both single-device diagnostics (e.g., faults in rotating machinery) and complex (multi-device) dynamical systems. We empirically validate the theoretical claims using multiple autoencoder architectures.

1. Introduction

1.1. Autoencoders in Diagnosis

Data-driven diagnosis using deep learning is starting to have a big impact in diagnostics. Applications have been developed for domains such as rotating machinery [1,2], as well as for other areas, such as medicine [3]. The common theme in most of these applications is the use of image data (e.g., X-rays in medicine) or time-series data. Hence, few applications that apply to complex systems, such as large process-control systems, have been developed.
One of the most common deep learning architectures for diagnosis is the autoencoder [2]. The autoencoder (AE) has been used for fault detection and for fault isolation. AE performance is comparable to that of traditional signal-processing methods, but AEs have been supplanting these methods since AE-based approaches need no data pre-processing, whereas pre-processing is a manual and time-consuming requirement for traditional signal-processing methods. The major drawbacks of AE-based approaches are the need for large volumes of (labeled) data and the lack of explainability for the results.
Ideally, we use an AE as both a residual generator [4] and decision-logic tool to isolate faults. The question is how to train the AE to generate good diagnostic results, i.e., to best distinguish among the different failure conditions (or modes).
An AE has two steps: (1) the encoding step transforms the input $\xi \in X$ into a representation $\zeta = f(\xi)$ that resides in a subspace $W$ of $X$; (2) the decoding step performs the inverse transform, performing the mapping $\xi = f^{-1}(\zeta)$. The standard metric (or loss function) used in an AE is the $L_2$, or mean-squared error (MSE), metric defined over the input $\xi$ and reconstructed input $\hat{\xi}$, i.e., $(\xi - \hat{\xi})^2$. For diagnostics purposes, the goal is that the subspace $W$ will enable us to classify different fault modes, i.e., it disentangles $\xi$ [5,6]. The key to successfully using an AE for diagnosis is to identify the subspace $W$ that best distinguishes the different nominal and failure modes of the system under analysis. Unfortunately, this $L_2$ loss function does not have clear semantics for diagnostics isolation in comparison to a metric based on an encoded subspace whose principal angle [7] best distinguishes the nominal and failure modes of the system. An example of such a metric is the gap metric [8]. In comparing a nominal input $\xi$ and new input $\hat{\xi}$, the gap metric is $[0, 1]$-bounded, and has well-defined interpretations for 0 (behaviors $\xi$ and $\hat{\xi}$ are identical) and 1 (behaviors $\xi$ and $\hat{\xi}$ are maximally different) [8]. In contrast, two behaviors can be similar even though their $L_2$ metric can be arbitrarily large ([9], p. 349).

1.2. Contributions

In this article, we use an autoencoder with a gap metric to provide clear semantics for diagnostics applications. The core idea that we exploit is the use of deep kernel architectures for the projection of fault-based information from data, and a representation of diagnosis metrics using principal angles. We show that the success of AE-based approaches to diagnosis depends on identifying an encoded subspace whose principal angle [7] best distinguishes the nominal and failure modes of the system. The principal angle between two vectors in a vector space $X$ specifies a notion of canonical correlation between those vectors in a subspace $W \subseteq X$, and enables the generation of a distance metric between vectors. As a consequence, we can use this distance metric to distinguish vectors representing nominal and failure modes: vectors separated by a small distance $\delta$ tend to be in the same mode, whereas vectors that are far apart under the metric tend to be in different modes. On the experiments conducted, this new metric provides more accurate diagnostics performance than the traditional $L_2$ metric.
Our contributions are as follows.
  • We define an explainable autoencoder in terms of a gap metric, whose induced subspace generates a principal angle [7] to best distinguish the nominal and failure modes of the system.
  • We show how an autoencoder can diagnose linear time-invariant (LTI) dynamical systems. We extend the application of AE-based approaches to time-series data and dynamical systems, and we provide a clear semantics (and hence explainability) for such approaches.
  • We illustrate our approach using examples from a time-series benchmark, the Numenta Anomaly Benchmark (NAB) dataset [10,11].
This article is organized as follows. Section 2 compares and contrasts related work of relevance. Section 3 introduces the technical material for this article, overviewing dynamical systems, autoencoders and the temporal sequence that we use for diagnostics input. Section 4 defines the metrics that we adopt in this article. Section 5 describes our methodology that we use for computing AE-based diagnosis via metrics over subspaces. Section 6 presents our empirical analysis, where we compare standard and SPA metrics in terms of diagnostics accuracy.

2. Related Work

There is significant work related to this paper. We look at work on the theoretical basis we adopt, gap metrics, and on the various applications of AE-based methods.

2.1. Gap Metrics and AE-Based Diagnosis

Some recent papers have applied autoencoders and gap metrics for diagnostics. Ref. [12] shows how gap metric techniques can be applied to fault detection performance analysis and fault isolation schemes. Ref. [13] presents an integrated model-based and data-driven gap metric method for fault detection and isolation. Ref. [14] describes a related fault detection approach, which is based on a data-driven K-gap metric [12]. This has been applied to ship propulsion systems. Ref. [15] details a comparative study of K-gap metric-based techniques, based on different data-driven stable kernel representation methods. Ref. [16] empirically evaluates gap-metric based, multi-model control schemes for nonlinear systems.

2.2. Diagnosis Applications of AEs

AEs have been applied to a number of different diagnosis applications. The typical types of input data are time-series and image, although some categorical/table-based inputs have also been studied.
In typical diagnosis applications, an AE is augmented with a threshold to assess whether data are faulty or not. Ref. [17] extends this notion and proposes a latent reconstruction error metric as a health indicator for machine condition monitoring, which is built in the latent space of a deep autoencoder. Ref. [17] defines the latent reconstruction error of a sample $\xi_i$ as the Euclidean distance between its latent representation $\zeta_i$ and the latent representation of its reconstruction $\hat{\zeta}_i$: $d(\xi) = \|\zeta_i - \hat{\zeta}_i\|_2$.

2.2.1. Diagnosis of Individual Machines

The most common time-series application domain is that of rotating machinery. Ref. [18] presents a survey on deep learning-based bearing fault diagnosis, in which AEs play a prominent role. Many papers have been published on applying autoencoders to rotating machinery fault diagnosis, such as [1,2,19,20].
We can contrast our proposed AE-based method with a closely related method [21], which uses a similar Hankel matrix-based input for the condition monitoring and diagnosis of rolling element bearings. That approach computes anomalies using the eigenvalues of the input Hankel matrix and uses the Frobenius norm as a metric.
In comparison to these other papers, our approach should have higher isolation accuracy than any approach that does not use a similar metric space.

2.2.2. Data-Driven Diagnosis of Complex Dynamical Systems

A significant body of work has been generated on the data-driven analysis of LTI systems, based around the seminal work of [22]. This area is referred to as the behavioral approach to systems theory; see [23] for a survey and tutorial introduction. The behavioral approach considers a dynamical system not in state–space form but as a set of trajectories, based on a generator defined by LTI properties.
Recent work has applied subspace methods to the analysis of sets of trajectories, e.g., [23]; however, no AE-based solutions have been proposed yet within this behavioral framework.

2.3. Other Applications

Image Inputs: Ref. [24] surveys the use of deep autoencoders in pattern recognition. The most common application domain is that of medical diagnosis, e.g., X-ray or other medical imaging data. Ref. [25] surveys research on the application of autoencoder algorithms to diagnose rare diseases.
Prognosis: Autoencoders have recently found use in prognostics. Ref. [26] surveyed the use of variational autoencoders for prognosis and health management of industrial systems. Ref. [27] propose a prognostic health indicator, defined using Kullback–Leibler divergence, and applied to predict the remaining useful life of concrete structures.

3. Preliminaries

3.1. LTI System

We focus on discrete-time linear time-invariant (LTI) dynamical systems in this article. We can define an LTI system [28] in state space form as
$$x_{k+1} = A x_k + B u_k + w_k; \quad x(0) = x_0; \tag{1}$$
$$y_k = C x_k + D u_k, \tag{2}$$
where $x \in \mathbb{R}^n$ is the state vector, $x_0$ is the initial condition of the system, $w_k \sim \mathcal{N}(0, Q)$ is uncorrelated zero-mean Gaussian noise with covariance $Q$, and $u$ and $y$ are the plant input and output vectors, respectively. Matrices $A, B, C, D$ are real constant matrices of appropriate dimensions.
We can extend the nominal system (Equations (1) and (2)) to incorporate faults. Faults may be additive or multiplicative, and we consider faults to both actuators and sensors. If we define multiplicative fault parameter matrices $\Gamma_a$ and $\Gamma_s$ for actuators and sensors, respectively, we obtain
$$x_{k+1} = A x_k + B \Gamma_a u_k + w_k; \quad x(0) = x_0; \tag{3}$$
$$y_k = \Gamma_s (C x_k + D u_k). \tag{4}$$
The fault matrices are diagonal matrices with $[0, 1]$ parameters along the diagonal denoting the fault extent, with 0 denoting totally faulty and 1 normal. For example, we may have actuator loss of effectiveness encoded by this approach. We may define additive faults in an analogous manner.
We refer to a mode as an operating condition of an LTI system, as characterized by the system parameters $(A, B, C, D, \Gamma_a, \Gamma_s)$.
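The fault model in Equations (3) and (4) can be illustrated with a short simulation. The following is a minimal sketch, not the authors' code: the function name `simulate_lti`, the example matrices, and the scalar `noise_std` parameter are assumptions made for illustration.

```python
import numpy as np

def simulate_lti(A, B, C, D, u, x0, gamma_a=None, gamma_s=None,
                 noise_std=0.0, seed=0):
    """Simulate Equations (3)-(4); identity fault matrices give the nominal mode."""
    rng = np.random.default_rng(seed)
    n, m = B.shape
    p = C.shape[0]
    gamma_a = np.eye(m) if gamma_a is None else gamma_a  # actuator fault matrix
    gamma_s = np.eye(p) if gamma_s is None else gamma_s  # sensor fault matrix
    x = np.asarray(x0, dtype=float)
    outputs = []
    for u_k in u:
        # Output equation: y_k = Gamma_s (C x_k + D u_k)
        outputs.append(gamma_s @ (C @ x + D @ u_k))
        # State equation: x_{k+1} = A x_k + B Gamma_a u_k + w_k
        w_k = noise_std * rng.standard_normal(n)
        x = A @ x + B @ gamma_a @ u_k + w_k
    return np.array(outputs)

# Hypothetical two-state system with one input and one output.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])
u = np.ones((20, 1))
x0 = np.zeros(2)

y_nominal = simulate_lti(A, B, C, D, u, x0)
# 50% actuator loss of effectiveness.
y_actuator = simulate_lti(A, B, C, D, u, x0, gamma_a=np.array([[0.5]]))
# 50% sensor loss of effectiveness.
y_sensor = simulate_lti(A, B, C, D, u, x0, gamma_s=np.array([[0.5]]))
```

In this noise-free example with zero initial state and $D = 0$, both faults scale the output by the fault parameter, which shows why output magnitude alone may not isolate the faulty component.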

3.2. AutoEncoders

This section introduces the autoencoder and reviews how it has been used for diagnosis. We first describe the vanilla (or traditional) AE, and then outline the standard variants, namely the denoising autoencoder (DAE), sparse AE, contractive autoencoder (CAE), and variational autoencoder (VAE). An introduction to AEs is presented in [29,30].
Autoencoders have widely been used for anomaly detection and fault detection [1]. Given data $\xi$, a vanilla autoencoder $\mathcal{A}$ consists of two parts: (1) an encoder $f_{\theta_E}(\xi) = h$ that creates the latent code $h$ in terms of parameters $\theta_E$; (2) a decoder $g_{\theta_D}(h) = \hat{\xi}$ that re-creates the input from the latent code $h$ in terms of decoder parameters $\theta_D$.
Figure 1 depicts an autoencoder for diagnosing faults in time-series systems. An autoencoder projects the input data $\xi$ into a smaller subspace, called the code $z$, such that linear classifiers can be used on the structure of the code. Autoencoders can thus perform nonlinear dimensionality reduction, as well as classification. Mathematically, learning an autoencoder can be expressed as the following optimization task:
$$\min_{f, g} \; \|\xi - g(f(\xi))\|_2^2 \tag{5}$$
When we train $\mathcal{A}$ with nominal data, an input $\xi$ that is anomalous will produce a higher than average anomaly score. However, in many AE-based methods, the loss function $L$ is typically defined in terms of MSE, or of a regularized balancing of (a) accuracy-based MSE and (b) some sparsity metric over the code, $L = L_{MSE} + \lambda L_{sparsity}$, where $\lambda$ is the regularization parameter.
We can use $\mathcal{A}$ as a decision tool to identify faults, where an input $\xi$ is anomalous if some measure (an anomaly score) over the reconstruction $\hat{\xi}$ and input $\xi$ exceeds a threshold $\epsilon$. Several definitions of the anomaly score have been proposed, such as reconstruction error ($\epsilon_R = |\xi - \hat{\xi}|$), or Mahalanobis distance, $\epsilon_M = (\hat{\xi} - \xi)^{\top} \Sigma^{-1} (\hat{\xi} - \xi)$, where $\Sigma$ is the covariance matrix of the training input.
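The two anomaly scores can be computed in a few lines. This is a minimal sketch under stated assumptions: the reconstruction is supplied directly rather than produced by a trained autoencoder, the reconstruction error is taken as a Euclidean norm, and the function names and the threshold value are hypothetical.

```python
import numpy as np

def reconstruction_error(x, x_hat):
    """Euclidean norm of the residual between input and reconstruction."""
    return float(np.linalg.norm(x - x_hat))

def mahalanobis_score(x, x_hat, cov):
    """Residual weighted by the inverse covariance of the training input."""
    residual = x_hat - x
    # Solve cov @ v = residual instead of forming the explicit inverse.
    return float(residual @ np.linalg.solve(cov, residual))

def is_anomalous(score, threshold):
    """Flag an input whose anomaly score exceeds the threshold."""
    return score > threshold

x = np.array([1.0, 2.0])
x_hat = np.array([1.0, 4.0])   # stand-in for the autoencoder output
cov = np.diag([1.0, 4.0])      # stand-in for the training covariance

score_r = reconstruction_error(x, x_hat)
score_m = mahalanobis_score(x, x_hat, cov)
flag = is_anomalous(score_r, threshold=1.5)  # hypothetical threshold
```

The Mahalanobis score discounts the residual along directions where the training data already vary strongly, so the same residual can be anomalous under one score and unremarkable under the other.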
Other AE approaches add regularization to reduce and/or minimize the size of the latent code. Common approaches include the sparse autoencoder (which minimizes the number of active neurons in the code, i.e., $\min \sum_{z \in \zeta} |z|$), or the contractive autoencoder (which limits the sensitivity to small changes in the input, using the Frobenius norm $\|\cdot\|_F$ of the Jacobian matrix of the encoder).
A problem with using $\mathcal{A}$ is the lack of semantics for the approach, and the challenge of scaling to complex systems, e.g., dynamical systems such as process-control factories. Further, the loss function associated with Equation (5) best addresses single-fault diagnosis (fault/no-fault) cases: it aims to match the input $x$ with the output $\hat{x}$. This representation focuses on identifying an anomalous sequence; hence, it does not address fault isolation.

3.3. Temporal Sequence Representation

As LTI systems are universal approximators of temporal sequences [31], a temporal sequence can be regarded as the output of an LTI system of unknown parameters. Hence, we can use a temporal sequence to diagnose an LTI system.
Assume that we have a system $S$ that generates a language $\mathcal{B}$ consisting of valid temporal sequences, or behaviors. A behavior is a temporally indexed sequence $\xi = \{\xi_1, \ldots, \xi_T\}$; a behavior is valid if $\xi$ is consistent with an underlying system model, e.g., as described by Equations (1) and (2). In the following, we consider that each behavior is valid, i.e., $\xi \in \mathcal{B}$. In this article, we assume that each behavior is time invariant, linear, and has bounded zero-mean noise. We define a vector $\xi|_{\lambda}$ as a $\lambda$-length trajectory of a signal.
We can represent a signal of length $\lambda \le T$ in terms of a data matrix. We adopt a Hankel matrix of depth $\lambda \le T$ for signal $\xi \in \mathbb{R}^{qT}$, as follows:
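Definition 1
(Hankel matrix).
$$H_{\lambda}(\xi) = \begin{bmatrix} \xi_1 & \xi_2 & \cdots & \xi_{T-\lambda+1} \\ \xi_2 & \xi_3 & \cdots & \xi_{T-\lambda+2} \\ \vdots & \vdots & \ddots & \vdots \\ \xi_{\lambda} & \xi_{\lambda+1} & \cdots & \xi_T \end{bmatrix} \tag{6}$$
where $\xi = [\xi_1, \xi_2, \ldots, \xi_T]$ denotes an acquired signal sequence, $\lambda$ is the embedding dimension, and $T - \lambda + 1$ is the length of each sub-sequence.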
With this input encoding, the size of the associated Hankel matrix $H$ is $q\lambda \times (T - \lambda + 1)$. As can be seen in Definition 1, if the Hankel matrix $H$ is square, $H$ is also a symmetric matrix. The Hankel matrix encodes a time-series signal with minimal pre-processing, and hence places limited computational burden on the overall inference process.
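The construction in Definition 1 can be sketched as follows for a scalar signal ($q = 1$). The helper name `hankel_matrix` is an assumption for illustration; it is not taken from the article.

```python
import numpy as np

def hankel_matrix(signal, depth):
    """Build the depth x (T - depth + 1) Hankel matrix of a scalar signal."""
    signal = np.asarray(signal, dtype=float)
    T = signal.shape[0]
    if not 1 <= depth <= T:
        raise ValueError("depth must satisfy 1 <= depth <= T")
    cols = T - depth + 1
    # Row i holds the sub-sequence starting at sample i, so every
    # anti-diagonal of the matrix is constant.
    return np.stack([signal[i:i + cols] for i in range(depth)])

H = hankel_matrix([1, 2, 3, 4, 5], depth=3)
# H has rows [1, 2, 3], [2, 3, 4], [3, 4, 5]
```

With $T = 5$ and $\lambda = 3$ the matrix is square, and it equals its own transpose, as the text notes.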
The Hankel matrix $H$ embeds the observability matrix $\Gamma$ of the system, since $H = \Gamma X$, where $X$ is the sequence of hidden states of the LTI system, and $\Gamma$ is defined as follows:
$$\Gamma = \begin{bmatrix} C \\ CA \\ \vdots \\ CA^{\lambda} \end{bmatrix} \tag{7}$$
Therefore, H provides information about the dynamics of the temporal sequence. Given an input–output measurement ( u , y ) , we can estimate A , C , and the initial state x 0 to identify the corresponding LTI system for classification purposes. The identification of the triple is, however, a non-convex problem and thus, given a finite measurement ( u , y ) T , the triple ( A , C , x 0 ) used to generate such a measurement is not guaranteed to be unique [32]. Consequently, system identification is computationally expensive and not robust, which makes it unsuitable for diagnosis purposes.
These problems can be avoided by using subspace identification on Hankel matrices associated with the measurement signals [33]. Here, given a dynamical system, all output measurements lie on a single subspace, assuming a noiseless output. This means that the subspace spanned by the columns of a Hankel matrix is equivalent to the subspace of the associated LTI system. Therefore, the subspace spanning an LTI system can be computed and used for diagnosis, without identifying the underlying LTI system.
Equation (7) shows that the columns of $H$ and $\Gamma$ span the same subspace regardless of the initial values of the LTI system. This means that given two measurements $\xi$ and $\tilde{\xi}$ from the same LTI system operating in the same mode, the smallest principal angle between the subspaces of the Hankel matrices $H(\xi)$ and $H(\tilde{\xi})$ is zero [32]. In other words, the subspace angles between the Hankel matrices of two output measures can be used to identify whether these outputs could be produced by the same LTI system. Even though the relation between output signals and LTI systems is non-unique, for diagnosis purposes, we assume that two Hankel matrices which share the same subspace belong to the same LTI system, and thus belong to the same diagnosis mode (or class label in learning terms).
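The claim that the column space does not depend on the initial state can be checked numerically. The sketch below assumes a noise-free autonomous system (no input) whose state matrix is a planar rotation; the helper names and the chosen frequencies are hypothetical. It compares column spaces through the rank of stacked Hankel matrices rather than through principal angles.

```python
import numpy as np

def hankel_matrix(signal, depth):
    cols = len(signal) - depth + 1
    return np.stack([signal[i:i + cols] for i in range(depth)])

def autonomous_output(A, C, x0, steps):
    """Outputs y_k = C A^k x0 of a noise-free system without input."""
    x = np.asarray(x0, dtype=float)
    outputs = []
    for _ in range(steps):
        outputs.append((C @ x).item())
        x = A @ x
    return np.array(outputs)

def rotation(omega):
    return np.array([[np.cos(omega), -np.sin(omega)],
                     [np.sin(omega), np.cos(omega)]])

def numerical_rank(M):
    return int(np.linalg.matrix_rank(M, tol=1e-8))

C = np.array([[1.0, 0.0]])
depth, steps = 6, 60

# Same system, two different initial states.
H_a = hankel_matrix(autonomous_output(rotation(0.3), C, [1.0, 0.0], steps), depth)
H_b = hankel_matrix(autonomous_output(rotation(0.3), C, [-0.5, 2.0], steps), depth)
# A different system (different oscillation frequency).
H_c = hankel_matrix(autonomous_output(rotation(0.7), C, [1.0, 0.0], steps), depth)

rank_single = numerical_rank(H_a)
rank_same_system = numerical_rank(np.hstack([H_a, H_b]))
rank_other_system = numerical_rank(np.hstack([H_a, H_c]))
```

Stacking the two trajectories from the same system does not raise the rank, so they share one column space; stacking a trajectory from the other system does raise it.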

4. Gap Distance Metric

4.1. Diagnosis via Sub-Space Analysis

This section describes how we compute diagnoses via sub-space projections. We first introduce our diagnosis task over a Hilbert space. A Hilbert space $\mathcal{H}$ is a vector space with an inner product $\langle \cdot, \cdot \rangle$. Given a vector $\mu \in \mathcal{H}$, we compute a metric using the norm of $\mu$, which is induced by the inner product, $\|\mu\| = \sqrt{\langle \mu, \mu \rangle}$.
Definition 2
(Diagnosis Task). We define a classification (e.g., diagnosis) problem in Hilbert space as follows: given a subspace $J \subseteq \mathcal{H}$, check if $\mu \in \mathcal{H}$ belongs to $J$.
We solve this task using two steps: (i) project μ onto J ; and then (ii) calculate the distance between μ and its projection using the induced norm.
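In finite dimensions the two steps reduce to an orthogonal projection followed by a norm. The following is a minimal sketch, assuming the subspace is given by a matrix whose columns are linearly independent; the function name `distance_to_subspace` is hypothetical.

```python
import numpy as np

def distance_to_subspace(mu, basis):
    """Distance from mu to the span of the columns of `basis`."""
    # Step (i): project mu onto the subspace via an orthonormal basis Q.
    Q, _ = np.linalg.qr(basis)
    projection = Q @ (Q.T @ mu)
    # Step (ii): measure the residual with the induced (Euclidean) norm.
    return float(np.linalg.norm(mu - projection))

# The subspace J is the x-y plane in R^3.
J = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])

d_outside = distance_to_subspace(np.array([1.0, 2.0, 3.0]), J)
d_inside = distance_to_subspace(np.array([1.0, 2.0, 0.0]), J)
```

A distance of zero (up to numerical tolerance) means the vector belongs to the subspace; in practice a small threshold replaces the exact test.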
This notion of inner products is generalized through the notion of kernels. Kernel methods [34,35] have been used extensively in machine learning for classification, i.e., to capture the nonlinear complex patterns underlying data. Kernel methods include support vector machines (SVMs), kernel Fisher discriminant analysis (KFDA), and Gaussian process models. They have been successfully applied to a wide variety of machine learning problems [34,35]. These methods map data points from the input space to the feature space, i.e., higher dimensional reproducing kernel Hilbert space (RKHS), such that relatively simple linear algorithms can deliver impressive performance.
Definition 3
(Kernels via feature maps). Let $\Omega$ be a nonempty set. A feature map $\phi$ is any function $\phi : \Omega \to \mathcal{H}$, where $(\mathcal{H}, \langle \cdot, \cdot \rangle_{\mathcal{H}})$ is any Hilbert space (the feature space). The function
$$K(x, y) := \langle \phi(x), \phi(y) \rangle_{\mathcal{H}}, \quad x, y \in \Omega,$$
is a positive definite kernel on $\Omega$.
Denoting the $d$-dimensional input space $X \subseteq \mathbb{R}^d$ ($\mathbb{R}$ denotes the set of real numbers), the kernel function $K : X \times X \to \mathbb{R}$ induces an implicit mapping $\phi$ into a higher dimensional RKHS $\mathcal{H}$ in which even simple linear models inferred are highly effective compared to their nonlinear counterparts learned directly in the input space $X$.
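A small worked instance of Definition 3 is sketched below. The quadratic feature map is a textbook choice used here as an assumption for illustration; the article does not state which kernel it uses. For this map, the kernel coincides with the squared dot product, so the feature space never has to be built explicitly.

```python
import numpy as np

def phi(x):
    """Explicit feature map from R^2 into R^3."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def kernel(x, y):
    """K(x, y) = <phi(x), phi(y)>, the inner product in feature space."""
    return float(phi(x) @ phi(y))

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])
k_xy = kernel(x, y)
k_direct = float(x @ y) ** 2   # same value without computing phi

# The Gram matrix of a kernel has no negative eigenvalues
# (up to floating-point rounding).
points = [np.array(p) for p in [(1.0, 0.0), (0.5, -1.0), (2.0, 1.0), (-1.0, 3.0)]]
gram = np.array([[kernel(p, q) for q in points] for p in points])
min_eigenvalue = float(np.linalg.eigvalsh(gram).min())
```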
Instead of formulating an optimization criterion with a fixed kernel K , one can leave the kernel K as a combination of a set of predefined kernels, which results in the problem of multiple kernel learning (MKL) [36]. MKL maps each sample to a multiple-kernel-induced feature space, and a linear classifier is learned in this space.
So far, we do not know which kernel will produce the most discriminative AE in terms of diagnostics isolation. We now show how the notion of principal angle provides a metric for the necessary kernel function.

4.2. Metrics for Temporal Sequences

We can extend this notion to temporal sequences of vectors as follows. Assume that we have a system $S$ that generates a language $\mathcal{B}$ consisting of valid behaviors. We define length-$\lambda$ behaviors as subspaces of equal dimension, which may be represented directly by data matrices. Thus, length-$\lambda$ behaviors may be identified with points on the Grassmannian $\mathrm{Gr}(k, N)$, i.e., the set of all subspaces of dimension $k$ in $\mathbb{R}^{N}$, endowed with the structure of a (quotient) manifold.
Proposition 1.
The function $d$ is a metric on the set of all restricted behaviors $\xi|_{\lambda}$, with $\lambda > l$, whenever $d$ is a metric on $\mathrm{Gr}(m\lambda + n, q\lambda)$.
Definition 4
(Gap Function Projection Operator P). Let $\mathcal{H} = X \times Z$, with $X$ and $Z$ Hilbert spaces. Let $P : \mathrm{dom}(P) \to Z$ and $\tilde{P} : \mathrm{dom}(\tilde{P}) \to Z$ be closed operators, with $\mathrm{dom}(P)$ and $\mathrm{dom}(\tilde{P})$ being subspaces of $X$. The gap between $P$ and $\tilde{P}$ is defined as
$$\mathrm{gap}_{\mathcal{H}}(P, \tilde{P}) = \mathrm{gap}_{\mathcal{H}}(G(P), G(\tilde{P})).$$
Definition 5.
The set of all behaviors $\xi|_{\lambda}$, with $\lambda > l$, equipped with $\mathrm{gap}_{\lambda}$ is a metric space.
The underlying Grassmannian structure enables us to induce metrics between behaviors. We represent the distance between behaviors $\xi$ and $\tilde{\xi}$ using the function $\mathrm{gap}_{\lambda}(\xi, \tilde{\xi})$.
Dynamical systems projections are viewed as a graph, which is defined as follows:
Definition 6
(Graph). Let $T : X \to Z$ be a map between sets. Then the graph $G(T) \subseteq X \times Z$ is given by $G(T) = \{(\xi, T\xi) : \xi \in X\}$. If $T$ is linear then $G(T)$ is a linear subspace of $X \times Z$.
The gap metric has a well-known geometric interpretation in terms of the sine of the largest principal angle between two subspaces. We define the principal angle as follows.
Definition 7
(Principal Angle). Consider two subspaces $X$ and $Z$ of $\mathbb{R}^n$ with dimensions $r$ and $s$ respectively, where $r \le s$. The principal angles between $X$ and $Z$, denoted as $\theta_1, \ldots, \theta_r$, are defined recursively as follows, for $j = 2, \ldots, r$:
$$\theta_1 = \min_{x_1 \in X,\; z_1 \in Z} \arccos \frac{\langle x_1, z_1 \rangle}{\|x_1\| \, \|z_1\|}, \qquad \theta_j = \min_{\substack{x_j \in X,\; z_j \in Z \\ x_j \perp x_1, \ldots, x_{j-1} \\ z_j \perp z_1, \ldots, z_{j-1}}} \arccos \frac{\langle x_j, z_j \rangle}{\|x_j\| \, \|z_j\|}.$$
The vectors $x_1, \ldots, x_r$ and $z_1, \ldots, z_r$ are called principal vectors. The dimension of $X \cap Z$ is the multiplicity of zero as a principal angle. It is straightforward to compute the principal angles by calculating the singular values of $X^{\top} Z$, where $X$ and $Z$ are orthonormal bases for the subspaces $X$ and $Z$, respectively. The singular values of $X^{\top} Z$ are then $\cos\theta_1, \ldots, \cos\theta_r$.
A principal angle induces several distance metrics on the Grassmann manifold. One example is the (squared) chordal distance $D_c^2(X, Z)$, given by
$$D_c^2(X, Z) = \sum_{i=1}^{s} \sin^2 \theta_i.$$
Hence, when $\lambda > l$, the $\lambda$-gap between $\xi|_{\lambda}$ and $\tilde{\xi}|_{\lambda}$ is $\mathrm{gap}_{\lambda}(\xi, \tilde{\xi}) = \sin\theta_{max}$, where $\theta_{max}$ is the largest principal angle between the sub-spaces $\xi|_{\lambda}$ and $\tilde{\xi}|_{\lambda}$.
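The SVD-based recipe for principal angles, together with the two distances derived from them, can be sketched as follows. The function names are assumptions for illustration, and the chordal sum here runs over all computed angles. The example uses two planes in $\mathbb{R}^3$ that share one direction and are tilted by 30 degrees in the other.

```python
import numpy as np

def principal_angles(X, Z):
    """Principal angles (ascending) between the column spaces of X and Z."""
    Qx, _ = np.linalg.qr(X)   # orthonormal basis of span(X)
    Qz, _ = np.linalg.qr(Z)   # orthonormal basis of span(Z)
    # Singular values of Qx^T Qz are the cosines of the principal angles.
    cosines = np.linalg.svd(Qx.T @ Qz, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))

def chordal_distance_sq(angles):
    """Squared chordal distance: sum of squared sines of the angles."""
    return float(np.sum(np.sin(angles) ** 2))

def gap(angles):
    """Gap: sine of the largest principal angle."""
    return float(np.sin(np.max(angles)))

tilt = np.pi / 6
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
Z = np.array([[1.0, 0.0], [0.0, np.cos(tilt)], [0.0, np.sin(tilt)]])

angles = principal_angles(X, Z)
d_chordal_sq = chordal_distance_sq(angles)
d_gap = gap(angles)
```

The shared direction contributes a zero angle, so both distances are determined by the 30 degree tilt. SciPy also provides `scipy.linalg.subspace_angles` for the same computation.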
One key property of such a metric is that it generates admissible systems of neighborhoods; this enables the classification of different modes by neighborhood. We shall call such metrics robust [8].
There are many variants of the gap metric in the literature, e.g., [37,38,39]. All of these metrics are equivalent, and induce the same graph topology.
Theorem 1
([8], Theorem 7). Every robust metric creates the same topology as the gap metric.
The induced metrics differ in terms of advantages and disadvantages in particular applications and analyses. The standard autoencoder uses an $L_2$ (norm) metric; this metric is less explainable than the gap metric since the gap metric is $[0, 1]$-bounded with clear interpretations for 0 (behaviors $\xi$ and $\tilde{\xi}$ are identical) and 1 (behaviors $\xi$ and $\tilde{\xi}$ are maximally different) [8]. In contrast, two behaviors can be similar even though the $L_2$ metric between them can be arbitrarily large ([9], p. 349).
We will focus on a principal angle metric in the following.

4.3. SVD of Hankel Matrix

Once we encode a signal of length $\lambda \le T$ in terms of a Hankel matrix of depth $\lambda \le T$ for signal $x \in \mathbb{R}^{qT}$, the size of the associated Hankel matrix $H$ is $q\lambda \times (T - \lambda + 1)$. As can be seen in Definition 1, if the Hankel matrix $H$ is square, $H$ is also a symmetric matrix.
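The $\lambda$-gap between behaviors can be directly computed from the knowledge of trajectories. Let $\xi$ and $\tilde{\xi}$ be $T$-length trajectories of order $\lambda$, with $\lambda > l$. Let
$$H_{\lambda}(\xi) = \begin{bmatrix} U_1 & U_2 \end{bmatrix} \begin{bmatrix} S & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} V_1 & V_2 \end{bmatrix}^{\top}, \qquad H_{\lambda}(\tilde{\xi}) = \begin{bmatrix} \tilde{U}_1 & \tilde{U}_2 \end{bmatrix} \begin{bmatrix} \tilde{S} & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} \tilde{V}_1 & \tilde{V}_2 \end{bmatrix}^{\top}$$
be the singular value decomposition (SVD) of the Hankel matrices $H_{\lambda}(\xi)$ and $H_{\lambda}(\tilde{\xi})$ with $U_1 \in \mathbb{R}^{q\lambda \times (m\lambda + n)}$ and $\tilde{U}_1 \in \mathbb{R}^{q\lambda \times (m\lambda + n)}$, respectively. Then
$$\mathrm{gap}_{\lambda}(\xi, \tilde{\xi}) = \|U_1 U_1^{\top} - \tilde{U}_1 \tilde{U}_1^{\top}\|_2 = \|\tilde{U}_2^{\top} U_1\|_2$$
where the first equality comes from the fact that $\mathrm{gap}_{\lambda}(\xi, \tilde{\xi}) = \|P_{\xi|_{\lambda}} - P_{\tilde{\xi}|_{\lambda}}\|_2$ and we substitute $P_{\xi|_{\lambda}} = U_1 U_1^{\top}$ and $P_{\tilde{\xi}|_{\lambda}} = \tilde{U}_1 \tilde{U}_1^{\top}$ [40].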
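The gap between two trajectories can be evaluated directly from the SVDs of their Hankel matrices. The sketch below is illustrative only: it assumes noise-free scalar sinusoids, for which the Hankel matrix has rank 2, and the helper names are hypothetical. It computes the gap both as the norm of the projector difference and as the norm of the complement basis of one subspace applied to the basis of the other.

```python
import numpy as np

def hankel_matrix(signal, depth):
    cols = len(signal) - depth + 1
    return np.stack([signal[i:i + cols] for i in range(depth)])

def split_left_singular_vectors(H, rank):
    """Return (U1, U2): bases of the column space and of its complement."""
    U, _, _ = np.linalg.svd(H, full_matrices=True)
    return U[:, :rank], U[:, rank:]

def gap_from_projectors(U1, U1_tilde):
    """Spectral norm of the difference of the orthogonal projectors."""
    return float(np.linalg.norm(U1 @ U1.T - U1_tilde @ U1_tilde.T, 2))

def gap_from_complement(U1, U2_tilde):
    """Spectral norm of the complement basis of one subspace applied to U1."""
    return float(np.linalg.norm(U2_tilde.T @ U1, 2))

t = np.arange(200)
depth, rank = 6, 2                       # a noise-free sinusoid has rank 2
xi = np.sin(0.3 * t)
xi_same = 2.0 * np.sin(0.3 * t + 1.0)    # same dynamics, other amplitude/phase
xi_other = np.sin(0.7 * t)               # different dynamics

U1, _ = split_left_singular_vectors(hankel_matrix(xi, depth), rank)
U1_same, _ = split_left_singular_vectors(hankel_matrix(xi_same, depth), rank)
U1_other, U2_other = split_left_singular_vectors(hankel_matrix(xi_other, depth), rank)

gap_same = gap_from_projectors(U1, U1_same)
gap_other = gap_from_projectors(U1, U1_other)
gap_other_alt = gap_from_complement(U1, U2_other)
```

With noisy data the rank is not known exactly, so the split between $U_1$ and $U_2$ would need a singular-value threshold; that choice is outside the scope of this sketch.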

5. Approach

This section describes our approach: we outline our input data, and how we transform those data for diagnostic purposes.
We assume that we measure data $\xi$, where each $\xi$ is a $q$-vector in a space $X \subseteq \mathbb{R}^q$. A discrete-step time-series signal is denoted by a time set $\mathbb{Z}_+$ and signal $\xi \in \mathbb{R}^{\mathbb{Z}_+}$.
We assume that the data-generator (LTI system, such as a pump or motor) generates data with multiple different distributions; we call each of these a distinguished mode. In diagnostic terms, the modes correspond to nominal or faulty conditions of the motor. We define the set of modes as $\Gamma = \{\gamma_1, \ldots, \gamma_J\}$.
Given a data set $D$, we want to train an autoencoder (AE) $\mathcal{A}_{\theta}$ with parameters $\theta$ in parameter set $\Theta$ to distinguish the data according to their mode distribution. Assume that we have a loss function that scores correctly diagnosing faults in inputs $\xi$ in a test set $D_{test}$.
Task 1
(Autoencoder Learning). Our learning task is to generate the autoencoder parameters $\theta^*$ that minimize diagnostics loss over a test set while maximizing mode distance, given true fault label $\hat{y}_i$ for test instance $i$.
$$\theta^* = \arg\min_{\theta \in \Theta} \; \max_{\gamma_j \in \Gamma} \sum_{i \in D_{test}} L(\hat{y}_i, \mathcal{A}_{\theta}(\xi_i)),$$
where $L(\hat{y}_i, \mathcal{A}_{\theta}(\xi_i))$ measures the loss between the true fault label $\hat{y}_i$ and the autoencoder output $\mathcal{A}_{\theta}(\xi_i)$.
We outline our approach in Figure 1. We train an AE and use a loss function $L$ that aims to optimize diagnostic accuracy. Specifically, we encode a loss function $L_{SPA}$ that generates a code based on the subspace principal angle (SPA). Here, we encode the input in a specific form (Hankel matrix), and perform AE-based diagnosis by defining the code as a representation that aims to maximize the distance of the SPA, $\zeta = f_{\rho}(h)$, as follows:
  • Encode the input data $\xi$ as a Hankel matrix.
  • Compute, from the SVD of the corresponding Hankel matrix $h = f_{SVD}(\xi)$, the principal angles $\rho$.
  • Compute subspace distance in terms of SPA.
  • Output the loss as the minimum SPA.
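The steps above can be sketched as a nearest-mode decision rule. This is an illustration under strong assumptions, not the trained autoencoder described in the article: each mode is represented by one noise-free reference signal, the subspace rank is known in advance, and the names `signal_subspace`, `spa_gap`, and `diagnose` are hypothetical.

```python
import numpy as np

def signal_subspace(signal, depth, rank):
    """Steps 1-2: Hankel encoding, then the leading left singular vectors."""
    signal = np.asarray(signal, dtype=float)
    cols = len(signal) - depth + 1
    H = np.stack([signal[i:i + cols] for i in range(depth)])
    U, _, _ = np.linalg.svd(H, full_matrices=False)
    return U[:, :rank]

def spa_gap(U, V):
    """Step 3: sine of the largest principal angle between two subspaces."""
    cosines = np.linalg.svd(U.T @ V, compute_uv=False)
    smallest_cosine = np.clip(cosines.min(), 0.0, 1.0)
    return float(np.sqrt(1.0 - smallest_cosine ** 2))

def diagnose(signal, references, depth, rank):
    """Step 4: report the mode whose reference subspace is closest."""
    U = signal_subspace(signal, depth, rank)
    gaps = {mode: spa_gap(U, signal_subspace(ref, depth, rank))
            for mode, ref in references.items()}
    return min(gaps, key=gaps.get), gaps

t = np.arange(200)
references = {
    "nominal": np.sin(0.3 * t),   # hypothetical nominal behavior
    "fault": np.sin(0.7 * t),     # hypothetical faulty behavior
}
observed = 2.0 * np.sin(0.3 * t + 1.0)

mode, gaps = diagnose(observed, references, depth=6, rank=2)
```

Because the gap depends only on the subspace, the change in amplitude and phase of the observed signal does not affect the decision.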
Previous work, e.g., [12,41,42], addressed system identification of an arbitrary LTI dynamical system $S$, based on data consisting of input/output pairs $(u, y)$ such that $S$ is controllable. In this case, we require that (a) our language $\mathcal{B}$ properly encodes the input/output relations of an LTI system $S$, and (b) our inputs $u$ are persistently exciting (a signal is persistently exciting if its spectrum contains a sufficiently large number of harmonics [43]; the use of persistently exciting signals ensures that a system identification experiment produces informative data). In this article, we restrict the data to just outputs $y$.

6. Empirical Analysis

This section describes our empirical analysis of the impact of principal angle methods.

6.1. Experimental Design

We train the autoencoder using an optimization framework as follows. The primary objective is to minimize the difference between the input matrix $H$ and the reconstructed matrix $\hat{H}$. We then apply regularization to shape the code $\zeta$ so that it maximizes the metric distance between different modes. If we define $\xi_i$ and $\xi_j$ as behaviors from modes indexed by $i, j \in I$, respectively, we can define this constraint as
$$\max_{i \neq j,\; i, j \in I} \; \sum_{k=1}^{N} \mathrm{gap}_{\lambda}(\xi_i, \tilde{\xi}_j)$$
We can also aim to restrict the size of the code ζ , as is typically done in AE training.
We designed experiments to study how the AE architecture and loss function affect diagnostic performance. Table 1 summarizes the experiments that we conducted. We developed two main types of architecture, based on fully connected and long short-term memory (LSTM) architectures. We also compare the impact of different loss functions.
We use the architecture to encode the time-series model. A fully connected architecture is the simplest approach and gives a transparent model, although it has no specific structure for encoding temporal representations.
The LSTM is designed to encode temporal features, and hence should be better than a fully connected architecture for time-series models.

6.2. Data

We use the Numenta Anomaly Benchmark (NAB) dataset [10,11] for evaluating the prediction performance of the proposed framework. It provides artificial and real-world time-series data containing labeled anomalous periods of behavior. Data are ordered and timestamped, with single-valued metrics. The timestamp is of the format “yyyy-mm-dd hh:mm:ss”, representing the year, month, day, hour, minute, and second. The difference between adjacent timestamps is fixed within a dataset, but it varies from one dataset to another. An entry is labeled as anomalous if its timestamp is specified explicitly in a separate file.
Our AE-based approach conducts semi-supervised anomaly detection, so it does not rely on labels to train anomaly detection models; instead, we use the labels for performance evaluation. We use the art_daily_small_noise.csv file for training and the art_daily_jumpsup.csv file for testing. The data consist of 288 timesteps/day, i.e., a value every 5 min for 14 days. We use a batch size of 3745 sequences.
Pre-processing: We normalize the data prior to inference.
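The figures quoted above are consistent with one particular windowing scheme, sketched below. It is an assumption, not something the article states, that each sequence is a one-day sliding window with a stride of one sample and that normalization means standardizing to zero mean and unit variance. A synthetic daily pattern stands in for the CSV file.

```python
import math

STEPS_PER_DAY = 24 * 60 // 5   # one value every 5 minutes -> 288
DAYS = 14

def standardize(values):
    """Rescale to zero mean and unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = variance ** 0.5
    return [(v - mean) / std for v in values]

def sliding_windows(values, window):
    """All contiguous windows of the given length, with stride 1."""
    return [values[i:i + window] for i in range(len(values) - window + 1)]

# Synthetic stand-in for 14 days of data with a daily period.
series = [math.sin(2.0 * math.pi * k / STEPS_PER_DAY)
          for k in range(DAYS * STEPS_PER_DAY)]

normalized = standardize(series)
sequences = sliding_windows(normalized, STEPS_PER_DAY)
num_sequences = len(sequences)   # 4032 - 288 + 1
```

Under these assumptions, 14 days of 288 samples give 4032 points and 4032 - 288 + 1 = 3745 windows, which matches the number of sequences quoted above.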

6.3. Architecture

In this article, we compare two different deep network architectures, which are summarized in Figure 2 and Figure 3: a fully connected (dense) and an LSTM architecture. The dense network contains 10 times more trainable parameters than the LSTM network.

6.4. Loss Function

As a baseline, we study the “traditional” AE loss function, MAE. This loss function tries to get the input $\xi$ and output $\hat{\xi}$ to agree, by minimizing $L = (\xi - \hat{\xi})^2$.
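The baseline is named MAE, while the displayed expression is a squared error. The two losses differ in how strongly they weight large residuals, which the following sketch illustrates with both computed on the same reconstruction. The function names and the sample values are assumptions for illustration.

```python
def mean_absolute_error(x, x_hat):
    """MAE: average absolute residual."""
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)

def mean_squared_error(x, x_hat):
    """MSE: average squared residual."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

x = [0.0, 0.0, 0.0, 0.0]
x_hat = [0.1, 0.1, 0.1, 2.0]   # three small residuals and one large one

mae = mean_absolute_error(x, x_hat)
mse = mean_squared_error(x, x_hat)
```

The single large residual accounts for most of the MAE and for almost all of the MSE, so a threshold tuned for one loss does not transfer directly to the other.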
We encode the subspace approach in terms of a PA-based loss function.
  • Input ξ : a T-length time-series vector of data.
  • Transform ξ into a Hankel matrix H .
  • Apply an SVD projection of H ([44] describes a kernel SVD algorithm for such a task).
  • Compute the SPA metric $P_{SPA}(z)$ to maximize distance between the modes in this subspace.

6.5. Results

We now summarize our results. We rank the performance of the classification models using mean classification accuracy:
$$\mathrm{Accuracy} = \frac{\#\,\text{Correctly classified observations}}{\#\,\text{Total observations}} \times 100.$$
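The accuracy measure is a direct count. A minimal sketch, with a hypothetical function name and made-up labels:

```python
def accuracy_percent(true_labels, predicted_labels):
    """Percentage of observations whose predicted label matches the true one."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)

# 1 marks an anomalous observation, 0 a nominal one.
true_labels = [0, 0, 1, 1]
predicted_labels = [0, 1, 1, 1]
score = accuracy_percent(true_labels, predicted_labels)
```

Three of the four predictions match, which gives 75%.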
We average the results over 50 different experiments, as shown in Table 2. First, this table shows that the LSTM architecture outperforms the fully connected architecture. Second, the SPA metric outperforms the MAE metric for both architectures.
Figure 4 and Figure 5 show some details of the results for fully connected and LSTM-based AE architectures, respectively.
Fully connected AE architecture: Figure 4a plots the true data against the predicted values and shows that the model trained using MAE has significant discrepancies. In contrast, Figure 4b shows that the model trained using SPA has fewer discrepancies. In terms of test loss, the model trained using MAE is worse than the model trained using SPA; for example, the MAE loss (Figure 4c) spans 0.5–1.5, whereas the loss for the SPA model (Figure 4d) spans 0.4–0.9.
LSTM-based AE architecture: Figure 5 shows that the LSTM-based architecture performs much better than the fully connected AE. Figure 5a plots the true data against the predicted values and shows that the model trained using MAE has greater discrepancies than the model trained using SPA (Figure 5b). In terms of test loss, the model trained using MAE is again worse than the model trained using SPA; for example, the MAE loss (Figure 5c) spans 0.13–0.6, whereas the loss for the SPA model (Figure 5d) spans 0.08–0.33.

7. Conclusions

We showed the theoretical underpinnings for a semantically well-defined AE for diagnostics inference on time-series data. The key element is the introduction of a well-defined metric space over the AE, based on the SPA. Traditional AE architectures are not trained with such a metric space, and hence can be shown to be sub-optimal with regard to diagnostics isolation accuracy.
We showed how the data-driven AE with SPA metrics can be applied to time-series isolation of machinery faults and of LTI systems. We empirically validated this theoretical position with experiments over two AE architectures.
This work provides the basis for improving the performance of AEs in diagnostics applications. Future work aims to further empirically justify this approach with additional real-world data sets.


This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289 and 13/RC/2094.

Data Availability Statement

All data used in this study are publicly available at (accessed on 15 September 2022).

Conflicts of Interest

The authors declare no conflict of interest.


The following abbreviations are used in this manuscript:
SVD            Singular Value Decomposition
SPA            Smallest Principal Angle
LTI            Linear Time Invariant
LSTM           Long Short-Term Memory
Symbol         Meaning
x              state variable
y              output variable
u              input variable
ξ              time-series vector
H              Hankel matrix
θ              subspace angle


  1. Liu, R.; Yang, B.; Zio, E.; Chen, X. Artificial intelligence for fault diagnosis of rotating machinery: A review. Mech. Syst. Signal Process. 2018, 108, 33–47.
  2. Shao, H.; Jiang, H.; Zhao, H.; Wang, F. A novel deep autoencoder feature learning method for rotating machinery fault diagnosis. Mech. Syst. Signal Process. 2017, 95, 187–204.
  3. Khan, S.; Yairi, T. A review on the application of deep learning in system health management. Mech. Syst. Signal Process. 2018, 107, 241–265.
  4. Chen, J. Robust Residual Generation for Model-Based Fault Diagnosis of Dynamic Systems. Ph.D. Thesis, University of York, York, UK, 1995.
  5. Chen, R.T.; Li, X.; Grosse, R.B.; Duvenaud, D.K. Isolating sources of disentanglement in variational autoencoders. arXiv 2018, arXiv:1802.04942.
  6. Li, J.; Wang, Y.; Zi, Y.; Zhang, H.; Wan, Z. Causal Disentanglement: A Generalized Bearing Fault Diagnostic Framework in Continuous Degradation Mode. IEEE Trans. Neural Netw. Learn. Syst. 2021.
  7. Knyazev, A.V.; Zhu, P. Principal angles between subspaces and their tangents. arXiv 2012, arXiv:1209.0523.
  8. El-Sakkary, A. The gap metric: Robustness of stabilization of feedback systems. IEEE Trans. Autom. Control 1985, 30, 240–247.
  9. Zhou, K.; Doyle, J.C. Essentials of Robust Control; Prentice Hall: Upper Saddle River, NJ, USA, 1998; Volume 104.
  10. Lavin, A.; Ahmad, S. Evaluating real-time anomaly detection algorithms—The Numenta anomaly benchmark. In Proceedings of the 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), Miami, FL, USA, 9–11 December 2015; IEEE: New York, NY, USA, 2015; pp. 38–44.
  11. Singh, N.; Olinsky, C. Demystifying Numenta anomaly benchmark. In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; IEEE: New York, NY, USA, 2017; pp. 1570–1577.
  12. Li, L.; Ding, S.X. Gap metric techniques and their application to fault detection performance analysis and fault isolation schemes. Automatica 2020, 118, 109029.
  13. Jin, H.; Zuo, Z.; Wang, Y.; Cui, L.; Li, L. An integrated model-based and data-driven gap metric method for fault detection and isolation. IEEE Trans. Cybern. 2021, 52, 12687–12697.
  14. Li, H.; Yang, Y.; Zhao, Z.; Zhou, J.; Liu, R. Fault detection via data-driven K-gap metric with application to ship propulsion systems. In Proceedings of the 2018 37th Chinese Control Conference (CCC), Wuhan, China, 25–27 July 2018; pp. 6023–6027.
  15. Li, H.; Yang, Y.; Zhang, Y.; Qiao, L.; Zhao, Z.; He, Z. A Comparison Study of K-gap Metric Calculation Based on Different Data-driven Stable Kernel Representation Methods. In Proceedings of the 2019 IEEE 8th Data Driven Control and Learning Systems Conference (DDCLS), Dali, China, 24–27 May 2019; pp. 1335–1340.
  16. Prasad, G.M.; Rao, A.S. Evaluation of gap-metric based multi-model control schemes for nonlinear systems: An experimental study. ISA Trans. 2019, 94, 246–254.
  17. González-Muñiz, A.; Díaz, I.; Cuadrado, A.A.; García-Pérez, D. Health indicator for machine condition monitoring built in the latent space of a deep autoencoder. Reliab. Eng. Syst. Saf. 2022, 224, 108482.
  18. Hoang, D.T.; Kang, H.J. A survey on deep learning based bearing fault diagnosis. Neurocomputing 2019, 335, 327–335.
  19. Yang, D.; Karimi, H.R.; Sun, K. Residual wide-kernel deep convolutional auto-encoder for intelligent rotating machinery fault diagnosis with limited samples. Neural Netw. 2021, 141, 133–144.
  20. Shen, C.; Qi, Y.; Wang, J.; Cai, G.; Zhu, Z. An automatic and robust features learning method for rotating machinery fault diagnosis based on contractive autoencoder. Eng. Appl. Artif. Intell. 2018, 76, 170–184.
  21. Sun, W.; Zhou, Y.; Xiang, J.; Chen, B.; Feng, W. Hankel Matrix-Based Condition Monitoring of Rolling Element Bearings: An Enhanced Framework for Time-Series Analysis. IEEE Trans. Instrum. Meas. 2021, 70, 1–10.
  22. Willems, J.C.; Rapisarda, P.; Markovsky, I.; De Moor, B.L. A note on persistency of excitation. Syst. Control. Lett. 2005, 54, 325–329.
  23. Markovsky, I.; Dörfler, F. Behavioral systems theory in data-driven analysis, signal processing, and control. Annu. Rev. Control 2021, 52, 42–64.
  24. Chen, J.; Xie, B.; Zhang, H.; Zhai, J. Deep autoencoders in pattern recognition: A survey. In Bio-Inspired Computing Models and Algorithms; World Scientific: Singapore, 2019; pp. 229–255.
  25. Pratella, D.; Ait-El-Mkadem Saadi, S.; Bannwarth, S.; Paquis-Fluckinger, V.; Bottini, S. A Survey of Autoencoder Algorithms to Pave the Diagnosis of Rare Diseases. Int. J. Mol. Sci. 2021, 22, 10891.
  26. Zemouri, R.; Lévesque, M.; Boucher, É.; Kirouac, M.; Lafleur, F.; Bernier, S.; Merkhouf, A. Recent Research and Applications in Variational Autoencoders for Industrial Prognosis and Health Management: A Survey. In Proceedings of the 2022 Prognostics and Health Management Conference (PHM-2022 London), London, UK, 27–29 May 2022; pp. 193–203.
  27. Nguyen, T.K.; Ahmad, Z.; Kim, J.M. A Deep-Learning-Based Health Indicator Constructor Using Kullback–Leibler Divergence for Predicting the Remaining Useful Life of Concrete Structures. Sensors 2022, 22, 3687.
  28. Glad, T.; Ljung, L. Control Theory; CRC Press: Boca Raton, FL, USA, 2018.
  29. Zhai, J.; Zhang, S.; Chen, J.; He, Q. Autoencoder and its various variants. In Proceedings of the 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Miyazaki, Japan, 7–10 October 2018; pp. 415–419.
  30. Michelucci, U. Autoencoders. In Applied Deep Learning with TensorFlow 2; Springer: Berlin/Heidelberg, Germany, 2022; pp. 257–283.
  31. Sontag, E. Nonlinear regulation: The piecewise linear approach. IEEE Trans. Autom. Control 1981, 26, 346–358.
  32. Presti, L.L.; La Cascia, M.; Sclaroff, S.; Camps, O. Hankelet-based dynamical systems modeling for 3D action recognition. Image Vis. Comput. 2015, 44, 29–43.
  33. Van Overschee, P.; De Moor, B. Subspace algorithms for the stochastic identification problem. Automatica 1993, 29, 649–660.
  34. Hofmann, T.; Schölkopf, B.; Smola, A.J. Kernel methods in machine learning. Ann. Stat. 2008, 36, 1171–1220.
  35. Pillonetto, G.; Dinuzzo, F.; Chen, T.; De Nicolao, G.; Ljung, L. Kernel methods in system identification, machine learning and function estimation: A survey. Automatica 2014, 50, 657–682.
  36. Wang, T.; Zhang, L.; Hu, W. Bridging deep and multiple kernel learning: A review. Inf. Fusion 2021, 67, 3–13.
  37. Ding, S.X.; Li, L.; Zhao, D.; Louen, C.; Liu, T. Application of the unified control and detection framework to detecting stealthy integrity cyber-attacks on feedback control systems. arXiv 2021, arXiv:2103.00210.
  38. Zames, G. Unstable systems and feedback: The gap metric. In Proceedings of the Allerton Conference, Monticello, IL, USA, 8–10 October 1980; pp. 380–385.
  39. Vinnicombe, G. Uncertainty and Feedback: H∞ Loop-Shaping and the ν-Gap Metric; World Scientific: Singapore, 2001.
  40. Golub, G.H.; Van Loan, C.F. Matrix Computations; JHU Press: Baltimore, MD, USA, 2013.
  41. Zhao, H.; Luo, H.; Wu, Y. A Data-Driven Scheme for Fault Detection of Discrete-Time Switched Systems. Sensors 2021, 21, 4138.
  42. Padoan, A.; Coulson, J.; van Waarde, H.J.; Lygeros, J.; Dörfler, F. Behavioral uncertainty quantification for data-driven control. arXiv 2022, arXiv:2204.02671.
  43. Ljung, L. Characterization of the Concept of ‘Persistently Exciting’ in the Frequency Domain; Department of Automatic Control, Lund Institute of Technology (LTH): Lund, Sweden, 1971.
  44. Park, C.H.; Park, H. Nonlinear discriminant analysis using kernel functions and the generalized singular value decomposition. SIAM J. Matrix Anal. Appl. 2005, 27, 87–102.
Figure 1. The autoencoder approach we adopt, where the input $\xi$ is encoded to a code $z$ and then decoded to create the reconstructed input $\hat{\xi}$. We use the loss function $L(\xi, \hat{\xi})$ to control the quality of the reconstruction.
Figure 2. Specification for AE with dense architecture.
Figure 3. Specification for AE with LSTM-based architecture.
Figure 4. Results for the fully connected AE. Plots (a,b) show the true vs. predicted values for MAE and SPA loss functions, respectively. Plots (c,d) show the test loss for MAE and SPA loss functions, respectively.
Figure 5. Results for the LSTM-based AE. Plots (a,b) show the true vs. predicted values for MAE and SPA loss functions, respectively. Plots (c,d) show the test loss for MAE and SPA loss functions, respectively.
Table 1. Experiments to compare model types and loss functions. MAE stands for mean absolute error and PA for principal angle.
Model              Loss Function    Remarks
fully connected    MAE              deep network with 2 hidden dense layers in encoder/decoder
fully connected    PA               deep network with 2 hidden dense layers in encoder/decoder
LSTM               MAE              deep network with 2 LSTM layers per encoder/decoder
LSTM               PA               deep network with 2 LSTM layers per encoder/decoder
Table 2. Summary of performance of model types and loss functions. MAE stands for mean absolute error and SPA for smallest principal angle.
Model              Loss Function    Accuracy (%)
fully connected    MAE              73
fully connected    SPA              79
Share and Cite

MDPI and ACS Style

Provan, G. Toward Explainable AutoEncoder-Based Diagnosis of Dynamical Systems. Algorithms 2023, 16, 178.

AMA Style

Provan G. Toward Explainable AutoEncoder-Based Diagnosis of Dynamical Systems. Algorithms. 2023; 16(4):178.

Chicago/Turabian Style

Provan, Gregory. 2023. "Toward Explainable AutoEncoder-Based Diagnosis of Dynamical Systems" Algorithms 16, no. 4: 178.
