Memory-Enhanced and Prediction-Assisted Conditional Variational Autoencoder for Unsupervised Fault Detection in Industrial Processes

Wei, Lingli; Wang, Xinyuan; Liu, Hongbin

doi:10.3390/app16125941

by Lingli Wei, Xinyuan Wang and Hongbin Liu *

Jiangsu Co-Innovation Center of Efficient Processing and Utilization of Forest Resources, Nanjing Forestry University, Nanjing 210037, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(12), 5941; https://doi.org/10.3390/app16125941

Submission received: 19 May 2026 / Revised: 5 June 2026 / Accepted: 11 June 2026 / Published: 12 June 2026

(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Autoencoders (AEs) have been widely used for industrial process fault detection owing to their ability to learn nonlinear representations from normal operating data. However, conventional AE methods rely heavily on reconstruction errors and may miss weak faults due to overgeneralization. In addition, insufficient modeling of temporal evolution and operating condition variations may reduce their sensitivity to dynamic faults. To address these issues, this study proposes a memory-enhanced and prediction-assisted conditional variational autoencoder named MI-CVAE for unsupervised fault detection. In the proposed framework, statistical features extracted from sliding windows are used as condition information to describe variable operating states. A memory module stores representative normal prototypes to constrain reconstruction and reduce overgeneralization to faulty samples. Meanwhile, an Informer branch captures temporal dependencies and provides complementary prediction residuals. Reconstruction and prediction residuals are fused to construct squared prediction error and squared Mahalanobis distance statistics, with control limits determined by kernel density estimation. The proposed method is validated on the Benchmark Simulation Model No. 1 wastewater treatment benchmark and a real papermaking process dataset. The results show that MI-CVAE outperforms the evaluated comparison methods, particularly in detecting weak and dynamic faults, while maintaining a low false alarm rate.

Keywords: conditional variational autoencoder; deep learning; fault detection; Informer; memory module

1. Introduction

Modern process industries are increasingly required to sustain continuous production under stringent and variable conditions. As system structures become more integrated and working environments more complex, faults are more likely to arise during production. If such events remain undetected, they may compromise production stability, degrade product quality, and increase safety risks. Therefore, accurate fault detection is of great significance for identifying abnormal states at an early stage and ensuring the safe and reliable operation of industrial systems [1].

Over the past decades, various fault detection methods have been developed for industrial process monitoring. Traditional model-based methods usually rely on accurate mathematical descriptions or mechanistic knowledge of the monitored system [2]. Although these methods have clear physical meaning and strong interpretability, their application to complex industrial processes is often limited, because accurate mechanistic models are difficult to establish for systems with strong nonlinearity, time-varying behavior, and multivariable coupling. In contrast, data-driven fault detection methods do not require explicit mechanistic models and can directly extract useful information from historical process data [3]. Classical data-driven methods mainly include principal component analysis (PCA) [4], partial least squares [5], independent component analysis [6], canonical variate analysis [7], support vector machine, Gaussian mixture model, and k-nearest neighbor methods [8,9]. These methods have been widely applied in process monitoring and have achieved satisfactory performance in many industrial scenarios. Nevertheless, most of them are based on shallow feature representations or statistical assumptions, and their ability to describe complex nonlinear and dynamic characteristics remains limited. When the monitored process contains strong temporal dependence and hidden nonlinear relationships, traditional methods may fail to extract discriminative fault features effectively.

Deep learning has provided an effective modeling framework for fault detection in industrial processes [10]. Unlike conventional models, deep learning approaches can automatically extract hierarchical representations from process data, thereby improving the characterization of complex nonlinear relationships and dynamic behaviors. For example, a cascaded monitoring network named MoniNet was proposed to simultaneously capture temporal dynamic correlations and local spatial correlations, enabling effective anomaly detection in real industrial processes [11]. Recurrent and convolutional architectures were systematically evaluated for early fault detection in the Tennessee Eastman process, showing that deep learning models can improve detection performance while reducing the dependence on manual feature engineering [12]. Bayesian recurrent neural networks were used for chemical process fault detection, enabling nonlinear dynamic modeling while providing uncertainty information for monitoring decisions [13]. Deep recurrent neural networks have also been incorporated into residual control charts for autocorrelated process monitoring and verified using papermaking process data [14]. In addition, recurrent neural networks have been used for sensor fault detection and isolation in nonlinear systems [15]. Compared with conventional recurrent neural networks, long short-term memory (LSTM) networks can better preserve long-term dependencies through gating mechanisms, making them suitable for fault detection in dynamic processes. By combining an LSTM-based attention model with the sequential probability ratio test, early fault warning can be achieved by evaluating the statistical deviation of prediction residuals [16].

In practical industrial scenarios, normal operating data are typically abundant, whereas fault samples remain scarce [17]. Moreover, fault conditions are often diverse and difficult to collect exhaustively, while accurate data labeling requires substantial labor and time. Consequently, modeling approaches that rely heavily on labeled fault data often cannot meet the practical requirements of industrial process monitoring. In contrast, unsupervised fault detection methods characterize regular process behavior using normal operating data and detect anomalies by measuring the deviation of new observations from the learned reference. Therefore, they are more suitable for industrial applications with limited labeled fault samples. In this context, the autoencoder (AE) has been widely applied to industrial process fault detection because of its clear structure, relatively stable training process, and ability to learn nonlinear feature representations from normal operating data [18,19].

A typical autoencoder consists of an encoder and a decoder. The encoder maps input samples into a low-dimensional latent space, while the decoder reconstructs the original inputs from the latent representations. When trained only on normal operating data, an AE can capture the main features and distribution characteristics of regular process behavior and accurately reconstruct samples within this range. Faulty samples that deviate from this reference usually produce reconstructed outputs that differ significantly from the original inputs. The resulting reconstruction errors are used to construct anomaly scores for fault detection [20]. Various AE-based models have been developed to improve fault detection performance in industrial processes. Deep AE-based feature learning has shown its ability to extract representative process features for process pattern recognition [21]. To capture coexisting linear and nonlinear characteristics, PCA was combined with a stacked autoencoder to enhance fault detection in complex industrial processes [22]. In addition, a sparse autoencoder combined with adaptive slow feature analysis has been applied to fault detection in time-varying processes [23]. For wastewater treatment applications, a stacked denoising autoencoder was employed for sensor validation in real plants and achieved fault detection rates of up to 98% [24]. Moreover, a multistage variational autoencoder (VAE) was designed for wastewater treatment process monitoring by combining stage division with probabilistic latent modeling [25].

However, AE-based fault detection methods generally assume that faulty samples cannot be well reconstructed by a model trained only with normal operating data. This assumption does not always hold in practice. Previous studies have reported that deep autoencoders may generalize well to samples outside the normal distribution and reconstruct some faulty samples with small errors, especially when the fault magnitude is weak or the faulty pattern is close to the normal operating distribution [10,26]. As a result, faulty samples may be incorrectly identified as normal, leading to missed detections. This phenomenon is commonly associated with the overgeneralization problem of AE models [27].

Memory-augmented autoencoders provide a promising strategy for alleviating this problem. Instead of directly using latent features for reconstruction, memory-augmented AE models store representative normal prototypes in an external memory bank. During reconstruction, the latent representation of the current sample is used as a query to retrieve the most relevant memory items, and the retrieved normal prototypes are then used to guide the decoding process. In this way, the model tends to reconstruct samples according to stored normal patterns, thereby limiting its ability to recover faulty samples and increasing the reconstruction discrepancy between normal and faulty conditions [27,28].

Beyond reconstruction constraints, effective fault detection in industrial processes requires explicit modeling of temporal evolution. Since industrial process data usually exhibit temporal dependence and dynamic correlations, insufficient dynamic modeling may reduce the sensitivity to weak or slowly evolving faults [29,30]. Another practical challenge is that industrial processes often operate under variable conditions caused by load fluctuations, setpoint changes, and process adjustments [31,32]. Such variations may shift the statistical characteristics of normal samples and make the boundary of normal operating patterns more difficult to describe accurately [33].

Motivated by these considerations, this paper proposes MI-CVAE, a memory-enhanced and prediction-assisted conditional variational autoencoder for unsupervised fault detection in industrial processes. In the proposed framework, local statistical information is incorporated into the VAE as an auxiliary condition input to better characterize normal operating behavior under varying process conditions. A memory module is embedded in the latent space to store representative normal prototypes and constrain the reconstruction process, thereby alleviating the overgeneralization problem of AE models. To further capture process dynamics, an Informer prediction branch is introduced to learn the temporal evolution of process variables. Reconstruction and prediction errors are then jointly used to construct monitoring statistics for fault detection. The main contributions of this paper are summarized as follows.

(1) A memory-enhanced conditional VAE framework is proposed for unsupervised industrial fault detection. Local statistical information is used to characterize variations in normal operating states, while memory prototypes constrain reconstruction and suppress the excessive recovery of abnormal samples.

(2) An Informer prediction branch is introduced into the reconstruction model to jointly use reconstruction and prediction errors for fault detection. The reconstruction branch measures deviations from normal patterns, while the prediction branch captures abnormal dynamic evolution, thereby improving the detection of weak and dynamic faults.

(3) The effectiveness of the proposed MI-CVAE method is validated on the Benchmark Simulation Model No. 1 (BSM1) wastewater treatment benchmark and a real papermaking process dataset. Experimental results demonstrate that MI-CVAE outperforms the comparison methods while maintaining a low false alarm rate.

2. Dataset Description

2.1. Case 1: BSM1

BSM1 is a standardized simulation platform widely used in the field of wastewater treatment [34], and its system configuration is shown in Figure 1. This platform simulates a typical activated sludge treatment process, consisting of five biological reactors and one secondary clarifier. Internal and external recirculation streams are also incorporated to enable the effective removal of nitrogen and carbon pollutants. To construct process monitoring data, BSM1 was simulated under dry-weather conditions. The influent profile covered 14 consecutive days with a sampling interval of 15 min, yielding a total of 1345 data points. Considering their relevance to effluent quality and operational regulation, 15 key process variables were selected for analysis, including influent flow rate, dissolved oxygen concentration, suspended solids, and various nitrogen-containing component concentrations. Detailed information on these variables is provided in Table 1.

Eight typical fault conditions were constructed on the BSM1 simulation platform. Faults 1–4 are process faults, which were introduced by changing biochemical reaction parameters, settling performance parameters, or actuator output signals, causing the system dynamics to deviate from normal operation. Faults 5–8 are sensor faults, mainly involving abnormal variations in control setpoints or measurement signals, such as bias, drift, and complete failure. These faults are used to assess the model’s detection performance for both process disturbances and measurement abnormalities.

For the process faults, Faults 1 and 2 simulate reduced microbial activity by decreasing the maximum specific growth rates of autotrophic and heterotrophic microorganisms, respectively. Fault 3 represents deterioration of settling performance by reducing the settling velocity in the secondary clarifier. Fault 4 is introduced by increasing the nitrate actuator output signal, resulting in abnormal changes in internal recirculation and nitrogen-related variables. These faults reflect typical abnormalities in biochemical reactions, settling separation, and operational regulation.

Sensor faults are used to simulate measurement abnormalities in monitoring and control loops. Fault 5 corresponds to a shift in the dissolved oxygen controller setpoint, Faults 6 and 7 represent fixed bias and linear drift of the dissolved oxygen sensor, respectively, and Fault 8 denotes complete sensor failure. Since the BSM1 system involves feedback control, sensor faults may not only affect state observation but also propagate to related process variables through the control loop.

The detailed settings and parameter descriptions of the eight faults are summarized in Table 2. To illustrate the dynamic influence of process disturbances, Figure 2 shows the temporal responses of all variables under Fault 1 and compares them with those under normal operating conditions.

2.2. Case 2: Papermaking Process Monitoring Dataset

To validate the applicability and robustness of the proposed fault detection model in practical industrial processes, production data collected from a papermaking enterprise from January to December 2024 were used in this study. The dataset covers four key sections of the papermaking process: the approach flow, wire, press, and drying sections. These sections are sequentially connected and exhibit strong coupling and dynamic transmission among process variables, making them representative for process monitoring. The raw field data were first screened to exclude invalid records associated with production shutdown, operating condition switching, and abnormal missing values. Consequently, 442 valid samples were retained for each process section. The approach flow, wire, press, and drying sections include 21, 10, 27, and 13 process variables, respectively.

In practical industrial processes, severe faults generally occur with low frequency, and field monitoring data often lack sufficient and accurate fault annotations. Industrial process monitoring studies therefore commonly construct representative abnormal patterns from normal operating data to evaluate how well fault detection methods identify different dynamic disturbances. Following commonly observed abnormal evolution patterns of process variables, four types of faults were constructed in this study: drift, cycle, scale-up, and scale-down faults. In combination with the actual operating characteristics of the papermaking process, disturbances were introduced into selected key variables at specified time points to simulate different variation trends that may occur under abnormal operating conditions.

The detailed fault settings, including the fault number, corresponding process section, fault type, and affected variables, are summarized in Table 3. All faults were introduced at the 293rd sample. The model was trained using data collected under normal operating conditions, while the fault data were used for testing and performance evaluation. Figure 3 compares the variable trajectories under normal and fault-injection conditions for Fault 1, showing the characteristic changes in process variables after fault occurrence.
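The four injected fault patterns can be sketched as follows. This is an illustration only: the function name `inject_fault`, the magnitudes (`slope`, `amplitude`, `period`, `factor`), and the constant toy signal are assumptions, since the actual settings are those listed in Table 3. Only the fault onset at the 293rd sample and the 442 samples per section are taken from the text.

```python
import math

def inject_fault(series, kind, start=292, slope=0.01, amplitude=1.0,
                 period=20, factor=1.2):
    """Return a copy of `series` with a fault injected from index `start` on.

    start=292 is the zero-based index of the 293rd sample (from the text).
    All magnitudes are hypothetical placeholders, not the paper's settings.
    """
    out = list(series)
    for t in range(start, len(out)):
        k = t - start  # samples elapsed since fault onset
        if kind == "drift":          # linearly growing offset
            out[t] += slope * k
        elif kind == "cycle":        # additive periodic disturbance
            out[t] += amplitude * math.sin(2 * math.pi * k / period)
        elif kind == "scale_up":     # multiplicative increase
            out[t] *= factor
        elif kind == "scale_down":   # multiplicative decrease
            out[t] /= factor
        else:
            raise ValueError(f"unknown fault kind: {kind}")
    return out

normal = [10.0] * 442            # toy constant signal, 442 samples per section
faulty = inject_fault(normal, "drift")
```

The samples before the onset are left untouched, so the same series can serve as both the normal period and the faulty period of a test run.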

3. Materials and Methods

3.1. Data Preprocessing

The original process data are first standardized, and time-series samples are then constructed using a sliding window, with the next observation used as the prediction target.

For the i-th input window, the condition vector is constructed by concatenating the mean and standard deviation vectors of all variables within the window:

c_{i} = [μ_{i}^{c}, σ_{i}^{c}]

(1)

where μ_{i}^{c} \in ℝ^{D} and σ_{i}^{c} \in ℝ^{D} represent the mean and standard deviation vectors of the variables in the window, respectively. Hence, c_{i} \in ℝ^{2D}.

For the p-th variable, they are calculated as:

μ_{i, p}^{c} = \frac{1}{L} \sum_{t = 1}^{L} x_{t, p}^{(i)}

(2)

σ_{i, p}^{c} = \sqrt{\frac{1}{L} \sum_{t = 1}^{L} {(x_{t, p}^{(i)} - μ_{i, p}^{c})}^{2}}

(3)

where L denotes the window length, and x_{t, p}^{(i)} denotes the value of the p-th variable at the t-th step within the i-th window. In this way, the condition vector retains the local level and fluctuation information of each window, thereby providing auxiliary constraints for latent distribution learning.
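The preprocessing in Equations (1)–(3) can be sketched in plain Python. The helper names `make_windows` and `condition_vector` and the toy data are assumptions for illustration, and the input is assumed to be standardized already. The standard deviation uses the 1/L normalization of Equation (3).

```python
import math

def make_windows(data, L):
    """Split a (T x D) list of rows into windows of length L and next-step targets."""
    windows, targets = [], []
    for i in range(len(data) - L):
        windows.append(data[i:i + L])   # rows i .. i+L-1
        targets.append(data[i + L])     # observation right after the window
    return windows, targets

def condition_vector(window):
    """Eqs. (1)-(3): concatenate per-variable mean and population std (1/L)."""
    L, D = len(window), len(window[0])
    mu = [sum(row[p] for row in window) / L for p in range(D)]
    sigma = [math.sqrt(sum((row[p] - mu[p]) ** 2 for row in window) / L)
             for p in range(D)]
    return mu + sigma                   # length 2D

data = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0], [4.0, 10.0]]  # toy T=4, D=2
windows, targets = make_windows(data, L=3)
c0 = condition_vector(windows[0])
```

For this toy series the first variable has a window mean of 2 and a nonzero spread, while the constant second variable has zero standard deviation, so the condition vector separates level from fluctuation.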

3.2. Conditional Variational Autoencoder

In this work, conditional information is introduced into the VAE framework [35] to construct a CVAE, whose structure is shown in Figure 4. Compared with the conventional VAE, the CVAE incorporates a condition vector derived from window-based statistical features into both the encoder and decoder. This enables the latent distribution to be learned under local process constraints, thereby improving the representation of normal operating modes.

For an input sample x_{i} and its corresponding condition vector c_{i}, the encoder maps them into the latent distribution:

q_{ϕ} (z_{i} | x_{i}, c_{i}) = N (z_{i}; μ_{i}, σ_{i}^{2})

(4)

In Equation (4), z_{i} is the latent variable, μ_{i} and σ_{i}^{2} represent the mean and variance of the latent variable distribution, and ϕ denotes the encoder parameters. The reparameterization trick is then used to sample the latent variable:

z_{i} = μ_{i} + σ_{i} ⊙ ε, ε ~ N (0, I)

(5)

During decoding, z_{i} is concatenated with c_{i} and fed into the decoder to reconstruct the input sample:

p_{θ} (x_{i} | z_{i}, c_{i})

(6)

where θ denotes the decoder parameters. By introducing c_{i} into both the encoder and decoder, the CVAE can learn latent representations related to the current local process state.

The CVAE training objective consists of a reconstruction loss and a Kullback–Leibler divergence term:

L_{CVAE} = L_{rec} + β L_{KL}

(7)

The parameter β is the weight factor of the KL divergence term. The reconstruction loss L_{rec} is defined as:

L_{rec} = \frac{1}{N L D} \sum_{i = 1}^{N} ‖ x_{i} - {\hat{x}}_{i} ‖^{2}

(8)

Here, L is the window length, D is the number of variables, and N is the number of samples constructed from sliding windows. The KL divergence term is given by:

L_{KL} = - \frac{1}{2 N} \sum_{i = 1}^{N} \sum_{j = 1}^{d_{z}} (1 + \log σ_{i, j}^{2} - μ_{i, j}^{2} - σ_{i, j}^{2})

(9)

The symbol d_{z} denotes the dimensionality of the latent variable. By jointly optimizing these two terms, the model learns a smooth latent representation while retaining the ability to reconstruct normal samples under local process constraints.
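A minimal sketch of Equations (5) and (7)–(9) is given below. It assumes, as is common practice but not stated in the text, that the encoder outputs the log-variance \log σ^{2} rather than σ itself. The function names are illustrative, and the encoder and decoder networks are omitted.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Eq. (5): z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def reconstruction_loss(x_batch, x_hat_batch):
    """Eq. (8): squared error averaged over N windows, L steps and D variables."""
    N = len(x_batch)
    L, D = len(x_batch[0]), len(x_batch[0][0])
    total = sum((x - xh) ** 2
                for win, win_hat in zip(x_batch, x_hat_batch)
                for row, row_hat in zip(win, win_hat)
                for x, xh in zip(row, row_hat))
    return total / (N * L * D)

def kl_loss(mu_batch, log_var_batch):
    """Eq. (9): KL divergence to N(0, I), averaged over the N samples."""
    N = len(mu_batch)
    total = sum(1.0 + lv - m ** 2 - math.exp(lv)
                for mu, log_var in zip(mu_batch, log_var_batch)
                for m, lv in zip(mu, log_var))
    return -0.5 * total / N

def cvae_loss(l_rec, l_kl, beta):
    """Eq. (7): reconstruction loss plus beta-weighted KL term."""
    return l_rec + beta * l_kl

rng = random.Random(0)                       # seeded for repeatability
z = reparameterize([0.0, 1.0], [0.0, 0.0], rng)
```

The KL term vanishes when the encoder output equals the standard normal prior (zero mean, unit variance) and grows as the latent mean moves away from zero, which is what keeps the latent space smooth.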

3.3. Memory Module

The memory module stores representative normal patterns in the latent space and retrieves the most relevant memory information for the current sample [27]. The memory matrix is defined as:

M = [m_{1}, m_{2}, \dots, m_{K}] \in ℝ^{K \times d_{z}}

(10)

where K denotes the number of memory units, and m_{k} represents the latent prototype vector of the k-th memory unit. For the i-th input sample, the encoder first outputs the parameters of the latent variable distribution, μ_{i} and σ_{i}. Here, μ_{i} serves as the query vector for the memory module, which is used to compute the similarity between the input and each memory unit. To eliminate the influence of differences in vector norm on the matching results, both the query vector and the memory vectors are L_{2}-normalized, and the cosine similarity between them is computed as:

sim (μ_{i}, m_{k}) = \frac{μ_{i}^{⊤} m_{k}}{‖ μ_{i} ‖_{2} ‖ m_{k} ‖_{2}}

(11)

This similarity measures the directional consistency between μ_{i} and each memory prototype. Subsequently, a temperature coefficient τ is introduced to scale the similarity, and a softmax function is applied to obtain the attention weight for each memory unit:

a_{i, k} = \frac{\exp (sim (μ_{i}, m_{k}) / τ)}{\sum_{j = 1}^{K} \exp (sim (μ_{i}, m_{j}) / τ)}

(12)

The parameter τ adjusts the sharpness of the attention distribution. When τ is small, the model focuses more on a few memory units with high similarity. When τ is large, the attention distribution becomes smoother. After obtaining the attention weights, a weighted sum over the memory units is computed to obtain the memory read vector:

z_{i}^{m} = \sum_{k = 1}^{K} a_{i, k} m_{k}

(13)

This vector represents the information most relevant to the current input in terms of normal operating patterns. It is fused with the latent variable z_{i} obtained from the reparameterization trick using a fusion coefficient α to form the final enhanced latent representation:

z_{i}^{f} = (1 - α) z_{i} + α z_{i}^{m}

(14)

Here, α weights the memory read vector against the sampled latent variable. The fused latent vector preserves both the characteristics of the current input sample and the memory-enhanced information, serving as input for subsequent decoder reconstruction.
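The memory read in Equations (11)–(14) can be sketched as follows. The toy memory matrix, the values of τ and α, and the function names are assumptions. For brevity the example passes the latent mean in place of a sampled z_{i} when fusing.

```python
import math

def cosine(u, v):
    """Eq. (11): cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)   # small constant avoids division by zero

def memory_read(mu, memory, tau):
    """Eqs. (12)-(13): temperature-scaled softmax attention and weighted read."""
    scores = [cosine(mu, m) / tau for m in memory]
    top = max(scores)                        # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    d_z = len(mu)
    z_m = [sum(a * m[j] for a, m in zip(attn, memory)) for j in range(d_z)]
    return attn, z_m

def fuse(z, z_m, alpha):
    """Eq. (14): convex combination of sampled latent and memory read vector."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z, z_m)]

memory = [[1.0, 0.0], [0.0, 1.0]]            # K = 2 toy prototypes, d_z = 2
attn, z_m = memory_read([2.0, 0.1], memory, tau=0.1)
z_f = fuse([2.0, 0.1], z_m, alpha=0.5)       # latent mean used as a stand-in for z
```

With a small τ almost all attention goes to the first prototype, which points in nearly the same direction as the query. Raising τ moves the weights toward a uniform distribution, matching the behavior described above.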

3.4. Informer-Based Prediction Module

The Informer-based prediction module is used to model the temporal evolution immediately following the input window. By introducing a prediction branch, the model complements the reconstruction branch in capturing dynamic dependencies and improves its sensitivity to abnormal temporal variations.

Let the input window sequence be:

X_{i} = [x_{i}^{1}, x_{i}^{2}, \dots, x_{i}^{L}] \in ℝ^{L \times D}

(15)

where L is the window length and D is the number of variables. The Informer module takes the window sequence X_{i} as input and outputs the prediction {\hat{y}}_{i} \in ℝ^{D} at the next time step. First, a linear mapping projects the original input into a high-dimensional feature space, and positional encoding is added to retain temporal order information, yielding the initial embedding representation:

H_{i}^{0} = X_{i} W_{e} + P

(16)

with W_{e} and P denoting the input projection matrix and positional encoding matrix, respectively. H_{i}^{0} \in ℝ^{L \times d_{model}} represents the initial feature embedding of the input sequence, and d_{model} is the embedding dimension.

During the encoding phase, Informer employs probabilistic sparse (ProbSparse) self-attention to model the long-range temporal dependencies in the input sequence [36]. Figure 5 shows the structure of ProbSparse self-attention. Instead of computing attention for all queries, ProbSparse self-attention selects the top-u queries with the highest sparsity scores for attention calculation. This strategy reduces computational complexity while preserving the dominant dependency relationships in the sequence.

For the l-th encoder layer, the input features H_{i}^{l - 1} are linearly projected to obtain the query, key, and value matrices:

Q = H_{i}^{l - 1} W_{Q}, K = H_{i}^{l - 1} W_{K}, V = H_{i}^{l - 1} W_{V}

(17)

The matrices W_{Q}, W_{K}, and W_{V} correspond to the query, key, and value projections, respectively. For each query vector q_{t}, Informer calculates a sparsity measure using its dot product with all key vectors to evaluate its contribution to overall attention:

M (q_{t}, K) = \max_{j} \{\frac{q_{t} k_{j}^{T}}{\sqrt{d_{k}}}\} - \frac{1}{L_{K}} \sum_{j = 1}^{L_{K}} \frac{q_{t} k_{j}^{T}}{\sqrt{d_{k}}}

(18)

Here, d_{k} is the key dimension and L_{K} is the sequence length. This metric reflects the sparsity of attention for query q_{t}; queries that have stronger correlations with a few keys receive higher sparsity scores and are retained for subsequent attention calculation.

Based on the sparsity measure, the most important u queries are selected from all queries to form the sparse query set \bar{Q}, and the ProbSparse attention is then computed accordingly:

\bar{Q} = {Top}_{u} (Q, M (Q, K))

(19)

ProbSparse (Q, K, V) = Softmax (\frac{\bar{Q} K^{T}}{\sqrt{d_{k}}}) V

(20)
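The query selection in Equations (18)–(20) can be illustrated with a small single-head sketch. The toy matrices and function names are assumptions. The sketch evaluates the sparsity measure against all keys and returns outputs only for the selected queries, since the handling of the remaining queries is not described in the text.

```python
import math

def sparsity_scores(Q, K):
    """Eq. (18): max minus mean of scaled dot products, one score per query."""
    d_k = len(K[0])
    scores = []
    for q in Q:
        dots = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        scores.append(max(dots) - sum(dots) / len(dots))
    return scores

def softmax(row):
    """Numerically stable softmax over one row of attention logits."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def probsparse_attention(Q, K, V, u):
    """Eqs. (19)-(20) for the top-u queries only.

    Returns {query_index: attended_vector}; the non-selected queries are
    left out because their treatment is not specified in the text.
    """
    d_k = len(K[0])
    m = sparsity_scores(Q, K)
    top_u = sorted(range(len(Q)), key=lambda t: m[t], reverse=True)[:u]
    out = {}
    for t in top_u:
        weights = softmax([sum(a * b for a, b in zip(Q[t], k)) / math.sqrt(d_k)
                           for k in K])
        out[t] = [sum(w * v[j] for w, v in zip(weights, V))
                  for j in range(len(V[0]))]
    return out

Q = [[0.0, 0.0], [4.0, 0.0], [1.0, 1.0]]     # toy queries
K = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]    # toy keys
V = [[1.0], [2.0], [3.0]]                    # toy values
selected = probsparse_attention(Q, K, V, u=1)
```

The all-zero query attends uniformly and receives a score of zero, whereas the second query aligns strongly with a single key and is therefore the one retained when u = 1.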

In the multi-head mechanism, the features are projected into multiple subspaces, and the outputs of all heads are concatenated and linearly transformed:

MultiHead (Q, K, V) = Concat ({head}_{1}, {head}_{2}, \dots, {head}_{h}) W_{O}

(21)

where the output of the r-th attention head is:

{head}_{r} = ProbSparse (Q_{r}, K_{r}, V_{r})

(22)

After ProbSparse attention, the features are fed into a feed-forward network, with residual connections and layer normalization applied for stable training. Thus, the output of the l-th layer of the encoder can be expressed as:

{\tilde{H}}_{i}^{l} = LayerNorm (H_{i}^{l - 1} + MultiHead (Q, K, V))

(23)

H_{i}^{l} = LayerNorm ({\tilde{H}}_{i}^{l} + FFN ({\tilde{H}}_{i}^{l}))

(24)

Here, FFN (\cdot) denotes the position-wise feed-forward network.

To improve sequence modeling efficiency, Informer incorporates a distilling mechanism between encoder layers. Specifically, after partial encoding, a 1D convolution, activation, and pooling compress the temporal dimension, reducing redundant information while highlighting dominant dynamic features:

H_{i}^{l, dist} = MaxPool (ELU (Conv1D (H_{i}^{l})))

(25)

After multi-layer ProbSparse attention and distilling, the encoder outputs high-level temporal features, which are pooled over the window and projected by the prediction layer to obtain the multivariate forecast for the next time step:

h_{i} = Pool (H_{i}^{enc})

(26)

{\hat{y}}_{i} = f_{pred} (h_{i})

(27)

where h_{i} represents the aggregated window-level temporal features from the encoder, f_{pred} (\cdot) is the prediction projection function, and {\hat{y}}_{i} \in ℝ^{D} is the predicted multivariate value at the next time step.

3.5. Joint Loss Function

To jointly optimize reconstruction, prediction, latent distribution regularization, and memory representation learning, a joint loss function is adopted. The overall loss function is defined as:

L = λ_{rec} L_{rec} + λ_{pred} L_{pred} + β L_{KL} + λ_{pull} L_{pull} + λ_{ent} L_{ent} + λ_{decay} L_{decay}

(28)

In Equation (28), L_{rec}, L_{pred}, and L_{KL} denote the reconstruction loss, prediction loss, and KL divergence loss, respectively. The terms L_{pull}, L_{ent}, and L_{decay} are introduced to constrain memory representation learning. The coefficient β is the same KL divergence weight as defined in Equation (7). The parameters λ_{rec}, λ_{pred}, λ_{pull}, λ_{ent}, and λ_{decay} are the weights assigned to the reconstruction loss, prediction loss, pull loss, entropy regularization term, and memory weight decay term, respectively.

The reconstruction loss and KL divergence loss have been introduced in Section 3.2 and are not elaborated here. The prediction loss measures the deviation between the predicted and actual next-step states:

L_{pred} = \frac{1}{N H D} \sum_{i = 1}^{N} ‖ y_{i} - {\hat{y}}_{i} ‖^{2}

(29)

where D is the number of variables and H is the prediction horizon (H = 1 for this model, since single-step prediction is used). Additionally, the memory module introduces corresponding constraints. The pull loss reduces the discrepancy between the mean latent variable and the memory read vector:

L_{pull} = \frac{1}{N} \sum_{i = 1}^{N} ‖ μ_{i} - z_{i}^{m} ‖^{2}

(30)

The entropy regularization term is used to control the sharpness of memory attention allocation, formulated as:

L_{ent} = - \frac{1}{N} \sum_{i = 1}^{N} \sum_{k = 1}^{K} a_{i, k} \log (a_{i, k})

(31)

The memory weight decay term is expressed as:

L_{decay} = \frac{1}{K} \sum_{k = 1}^{K} ‖ m_{k} ‖^{2}

(32)

By optimizing these loss terms together, the model learns normal reconstruction patterns, temporal prediction relationships, and memory-enhanced latent representations, providing a basis for subsequent monitoring statistic construction and fault detection.
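The memory-related terms and the weighted sum in Equations (28) and (30)–(32) can be sketched as below. The numerical values of the loss terms and all weights are hypothetical placeholders, since the text does not report them here, and the function names are illustrative.

```python
import math

def pull_loss(mu_batch, z_m_batch):
    """Eq. (30): mean squared distance between latent mean and memory read vector."""
    N = len(mu_batch)
    return sum((a - b) ** 2
               for mu, z_m in zip(mu_batch, z_m_batch)
               for a, b in zip(mu, z_m)) / N

def entropy_loss(attn_batch, eps=1e-12):
    """Eq. (31): mean Shannon entropy of the memory attention weights."""
    N = len(attn_batch)
    return -sum(a * math.log(a + eps)     # eps guards against log(0)
                for attn in attn_batch for a in attn) / N

def decay_loss(memory):
    """Eq. (32): mean squared L2 norm of the memory prototypes."""
    return sum(v * v for m in memory for v in m) / len(memory)

def joint_loss(terms, weights, beta):
    """Eq. (28): weighted sum of all loss terms; `terms` and `weights` are dicts."""
    total = beta * terms["kl"]
    for name in ("rec", "pred", "pull", "ent", "decay"):
        total += weights[name] * terms[name]
    return total

terms = {
    "rec": 0.20, "pred": 0.10, "kl": 0.50,            # placeholder values
    "pull": pull_loss([[1.0, 0.0]], [[0.0, 0.0]]),
    "ent": entropy_loss([[0.5, 0.5]]),
    "decay": decay_loss([[1.0, 0.0], [0.0, 2.0]]),
}
weights = {"rec": 1.0, "pred": 1.0, "pull": 0.1, "ent": 0.01, "decay": 0.001}  # hypothetical
total = joint_loss(terms, weights, beta=0.1)
```

Because Equation (31) is the entropy of the attention weights, minimizing it with a positive λ_{ent} pushes each sample toward a few memory units, while the decay term keeps the prototype norms bounded.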

4. Process Monitoring Framework Based on MI-CVAE

4.1. The Proposed MI-CVAE Model

The MI-CVAE model consists of a conditional variational autoencoder, a memory module, and an Informer-based prediction branch, forming a dual-branch monitoring framework that combines reconstruction and prediction. As shown in Figure 6, the model takes the time-series window sample x_{i} and its condition vector c_{i} as inputs. The condition vector is constructed from the mean and standard deviation of each variable within the window to provide local operating-condition information.

In the reconstruction branch, the encoder maps x_{i} and c_{i} into a latent distribution, and the latent representation is obtained through the reparameterization trick. The memory module retrieves representative normal patterns from the memory bank using the latent mean as the query, and then fuses the retrieved memory information with the original latent variable for reconstruction. This memory-enhanced mechanism constrains the reconstruction process with normal operating patterns, thereby reducing the overgeneralization of the autoencoder to abnormal samples and improving the sensitivity of reconstruction errors to faults. In the prediction branch, the Informer module learns temporal dependencies from the input window and predicts the process state at the next time step. This branch complements the reconstruction branch by capturing the normal temporal evolution of process variables. When faults occur, the deviation from the learned evolution pattern leads to increased prediction errors.

Overall, MI-CVAE integrates normal-pattern reconstruction and temporal prediction within a unified framework. The reconstruction branch captures latent distribution characteristics, while the prediction branch models dynamic evolution patterns, providing a more comprehensive feature basis for fault detection.

Figure 7 illustrates the implementation procedure of the MI-CVAE-based fault detection model. The detailed steps are as follows.

Step 1: The raw process data are standardized, and the input samples and one-step-ahead prediction targets are constructed using a sliding-window strategy. Meanwhile, the mean and standard deviation of each variable within the window are extracted to form the condition vector.

Step 2: The MI-CVAE model is trained using samples collected under normal operating conditions. The model is jointly optimized by the reconstruction loss, prediction loss, KL divergence loss, and constraint terms associated with the memory module.

Step 3: The reconstruction and prediction errors are calculated using the normal samples in the training set. The fused squared prediction error (SPE) and squared Mahalanobis distance (MD²) statistics are then constructed, and the corresponding control limits are determined by kernel density estimation.

Step 4: The test samples are fed into the trained MI-CVAE model to calculate the corresponding SPE and MD² statistics. These statistics are compared with the control limits to discriminate between normal and abnormal operating states.

Step 5: Based on the detection results in the normal and faulty periods, the fault detection rate and false alarm rate are calculated to evaluate the fault detection performance of the model.

4.2. Monitoring Statistics

Monitoring statistics are used to quantify the deviation of process samples from normal operating conditions. In this work, SPE and MD² monitoring statistics are constructed from both the reconstruction and prediction branches.

For an arbitrary input window sample, the reconstruction error and prediction error are defined as:

e^{rec} = x - \hat{x}

(33)

e^{pred} = y - \hat{y}

(34)

where x denotes the input window sample, and \hat{x} is the reconstructed output of the model; y denotes the true prediction target, and \hat{y} is the output of the prediction branch. e^{rec} and e^{pred} represent the reconstruction error and prediction error, respectively.

According to the definition of SPE, the SPE statistics corresponding to the reconstruction and prediction branches are expressed as:

{SPE}^{rec} = {(e^{rec})}^{T} e^{rec}

(35)

{SPE}^{pred} = {(e^{pred})}^{T} e^{pred}

(36)

To avoid scale differences between the two branches, the SPE statistics are standardized using the mean and standard deviation calculated from the training set:

{SPE}_{z}^{rec} = \frac{{SPE}^{rec} - μ_{SPE}^{rec}}{σ_{SPE}^{rec}}

(37)

{SPE}_{z}^{pred} = \frac{{SPE}^{pred} - μ_{SPE}^{pred}}{σ_{SPE}^{pred}}

(38)

where μ_{SPE}^{rec} and σ_{SPE}^{rec} denote the mean and standard deviation of the SPE statistic of the reconstruction branch in the training set, respectively. Similarly, μ_{SPE}^{pred} and σ_{SPE}^{pred} denote the mean and standard deviation of the SPE statistic of the prediction branch. {SPE}_{z}^{rec} and {SPE}_{z}^{pred} are the standardized SPE statistics of the reconstruction and prediction branches, respectively.

The standardized SPE statistics from the two branches are then fused in a weighted manner to obtain the final SPE monitoring statistic:

SPE = ω_{rec} {SPE}_{z}^{rec} + ω_{pred} {SPE}_{z}^{pred}

(39)

The coefficients ω_{rec} and ω_{pred} are assigned to the reconstruction and prediction branches, respectively, and satisfy:

ω_{rec} + ω_{pred} = 1

(40)
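Equations (35) to (40) can be traced numerically with the short sketch below. The helper names (`spe`, `fit_scaler`, `fused_statistic`), the flattening of each window residual into a single vector, and the synthetic Gaussian residuals are assumptions of this illustration; the fusion weights 0.6 and 0.4 follow Table 4.

```python
import numpy as np

def spe(residuals):
    """Eqs. (35)/(36): squared norm of each sample's residual (windows are flattened)."""
    r = np.asarray(residuals, dtype=float)
    return np.sum(r.reshape(r.shape[0], -1) ** 2, axis=1)

def fit_scaler(train_stat):
    """Mean and standard deviation of a statistic on normal training data."""
    return float(np.mean(train_stat)), float(np.std(train_stat))

def fused_statistic(stat_rec, stat_pred, scaler_rec, scaler_pred, w_rec=0.6, w_pred=0.4):
    """Eqs. (37)-(40): z-score each branch with training statistics, then weight and add."""
    assert abs(w_rec + w_pred - 1.0) < 1e-12
    z_rec = (stat_rec - scaler_rec[0]) / scaler_rec[1]
    z_pred = (stat_pred - scaler_pred[0]) / scaler_pred[1]
    return w_rec * z_rec + w_pred * z_pred

# Synthetic stand-ins for normal-operation residuals (hypothetical shapes).
rng = np.random.default_rng(0)
e_rec_train = rng.normal(size=(200, 48, 3))   # N windows, L = 48 steps, 3 variables
e_pred_train = rng.normal(size=(200, 3))      # one-step-ahead residuals

spe_rec_train, spe_pred_train = spe(e_rec_train), spe(e_pred_train)
scaler_rec, scaler_pred = fit_scaler(spe_rec_train), fit_scaler(spe_pred_train)
spe_fused_train = fused_statistic(spe_rec_train, spe_pred_train, scaler_rec, scaler_pred)
```

The scalers are fitted once on normal training data and reused unchanged for test samples, so the fused statistic of a test sample measures its deviation in units of the training variability of each branch.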

To further consider the covariance structure among residual variables, MD² is introduced as a complementary monitoring statistic. For the reconstruction and prediction branches, the instantaneous MD² statistics are defined as:

{MD}_{t}^{2, rec} = {(e_{t}^{rec} - μ_{t}^{rec})}^{T} {(Σ_{t}^{rec})}^{- 1} (e_{t}^{rec} - μ_{t}^{rec})

(41)

{MD}_{t}^{2, pred} = {(e_{t}^{pred} - μ_{t}^{pred})}^{T} {(Σ_{t}^{pred})}^{- 1} (e_{t}^{pred} - μ_{t}^{pred})

(42)

where e_{t}^{rec} denotes the reconstruction residual at the t-th time step within the input window, and e_{t}^{pred} denotes the prediction residual at the t-th prediction step. μ_{t}^{rec} and μ_{t}^{pred} are the mean vectors of the corresponding residuals in the training set, while Σ_{t}^{rec} and Σ_{t}^{pred} are the corresponding residual covariance matrices.

Since local anomalies may be diluted by averaging over the window, the maximum instantaneous MD² is used as the window-level statistic:

{MD}^{2, rec} = \max \{{MD}_{1}^{2, rec}, {MD}_{2}^{2, rec}, \dots, {MD}_{L}^{2, rec}\}

(43)

The MD² statistic of the prediction branch is defined as:

{MD}^{2, pred} = \max \{{MD}_{1}^{2, pred}, {MD}_{2}^{2, pred}, \dots, {MD}_{H}^{2, pred}\}

(44)

where L denotes the length of the input window, and H denotes the prediction horizon. When the prediction horizon is equal to 1, the MD² statistic of the prediction branch is calculated from the residual of a single prediction step.

Similar to SPE, the branch-specific MD² statistics are standardized using the corresponding training statistics:

{MD}_{z}^{2, rec} = \frac{{MD}^{2, rec} - μ_{MD}^{rec}}{σ_{MD}^{rec}}

(45)

{MD}_{z}^{2, pred} = \frac{{MD}^{2, pred} - μ_{MD}^{pred}}{σ_{MD}^{pred}}

(46)

Here, μ_{MD}^{rec} and σ_{MD}^{rec} denote the mean and standard deviation of the MD² statistic of the reconstruction branch in the training set, respectively. μ_{MD}^{pred} and σ_{MD}^{pred} denote the mean and standard deviation of the MD² statistic of the prediction branch, respectively. {MD}_{z}^{2, rec} and {MD}_{z}^{2, pred} are the standardized MD² statistics of the two branches.

The final MD² monitoring statistic is obtained as:

{MD}^{2} = ω_{rec} {MD}_{z}^{2, rec} + ω_{pred} {MD}_{z}^{2, pred}

(47)
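The instantaneous and window-level MD² statistics of Equations (41) to (44) can be sketched as follows. The function names, the use of a pseudo-inverse in place of a plain matrix inverse (to tolerate near-singular covariance estimates), and the synthetic residuals are assumptions of this illustration. For the prediction branch with H = 1, the residual array would simply have a single step along the second axis.

```python
import numpy as np

def fit_residual_model(train_res):
    """train_res: (N, T, V) residuals on normal data.

    Returns per-step mean vectors (T, V) and inverse covariance matrices (T, V, V).
    """
    train_res = np.asarray(train_res, dtype=float)
    mu = train_res.mean(axis=0)
    inv_cov = np.stack([
        # Pseudo-inverse is an assumption of this sketch (robust to ill-conditioning).
        np.linalg.pinv(np.atleast_2d(np.cov(train_res[:, t, :], rowvar=False)))
        for t in range(train_res.shape[1])
    ])
    return mu, inv_cov

def window_md2(res, mu, inv_cov):
    """Eqs. (41)-(44): per-step squared Mahalanobis distance, then the maximum over steps."""
    d = np.asarray(res, dtype=float) - mu                     # (N, T, V)
    md2_t = np.einsum("ntv,tvw,ntw->nt", d, inv_cov, d)       # quadratic form per step
    return md2_t.max(axis=1)

rng = np.random.default_rng(1)
train_res = rng.normal(size=(500, 5, 3))      # hypothetical: 500 windows, 5 steps, 3 variables
mu, inv_cov = fit_residual_model(train_res)
md2_train = window_md2(train_res, mu, inv_cov)

# A single shifted time step is enough to raise the window-level statistic.
probe = train_res[:1].copy()
probe[0, 2, :] += 10.0
md2_probe = window_md2(probe, mu, inv_cov)
```

Taking the maximum rather than the mean keeps a short-lived deviation at one time step visible at the window level, which matches the motivation given for Equation (43). The branch-specific values would then be standardized and fused as in Equations (45) to (47).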

After obtaining the monitoring statistics from the training set, kernel density estimation (KDE) is employed to model the probability distribution of the training statistics in a nonparametric manner, thereby avoiding prior assumptions about their distribution forms. Let \{J_{k}\}_{k = 1}^{n} denote a set of monitoring statistic values calculated from the training samples. The corresponding probability density function can be estimated as:

\hat{f} (J) = \frac{1}{n b} \sum_{k = 1}^{n} K (\frac{J - J_{k}}{b})

(48)

In Equation (48), n is the number of training samples, b is the bandwidth, and K(⋅) represents the kernel function. A Gaussian kernel is used in this work. Given a confidence level α, the control limit δ is determined by:

P (J \leq δ) = α

(49)

Here, J denotes the monitoring statistic to be modeled, which can be either the fused SPE or MD², and δ represents the corresponding control limit under the confidence level α. The confidence level of the KDE-based control limit was set to α = 0.99, following the common practice in process monitoring studies where a 99% confidence limit is used to determine monitoring thresholds [37,38]. This setting corresponds to an approximate nominal false alarm probability of 1% under normal operating conditions, thereby helping to suppress unnecessary false alarms while preserving sufficient sensitivity to faults.
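One possible way to obtain δ from Equations (48) and (49) is sketched below. For a Gaussian kernel, the integral of the estimated density up to δ equals the average of the Gaussian cumulative distribution functions centered at the training values, so δ can be found by bisection. The bandwidth selection is not specified in the text; the normal-reference rule of thumb used as the default here, the function name `kde_control_limit`, and the synthetic statistics are assumptions of this sketch.

```python
import math

def kde_control_limit(train_stats, alpha=0.99, bandwidth=None, tol=1e-8):
    """Return delta such that the Gaussian-KDE cumulative probability at delta equals alpha."""
    xs = sorted(float(v) for v in train_stats)
    n = len(xs)
    if bandwidth is None:
        mean = sum(xs) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in xs) / (n - 1))
        bandwidth = 1.06 * std * n ** (-0.2)   # rule-of-thumb bandwidth (assumption)

    def cdf(j):
        # Integral of Eq. (48) up to j: average of the Gaussian kernel CDFs.
        scale = bandwidth * math.sqrt(2.0)
        return sum(0.5 * (1.0 + math.erf((j - v) / scale)) for v in xs) / n

    # Bisection on a bracket that safely contains the alpha-quantile.
    lo, hi = xs[0] - 10.0 * bandwidth, xs[-1] + 10.0 * bandwidth
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical training statistics, evenly spread on [0, 10).
train_stats = [i / 100.0 for i in range(1000)]
delta_99 = kde_control_limit(train_stats, alpha=0.99)
delta_95 = kde_control_limit(train_stats, alpha=0.95)
```

The same routine would be applied separately to the fused SPE and the fused MD² values of the training set, yielding one control limit per statistic.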

During testing, each sample is fed into the trained MI-CVAE model, and its fused SPE and MD² statistics are compared with the control limits. A sample is identified as faulty if either SPE > δ_{SPE} or {MD}^{2} > δ_{MD}; otherwise, it is regarded as normal.

4.3. Evaluation Metrics

The fault detection rate (FDR) and false alarm rate (FAR) were adopted as evaluation metrics. Based on the confusion matrix, TP denotes the number of fault samples correctly identified as abnormal, FN denotes the number of fault samples incorrectly classified as normal, FP denotes the number of normal samples incorrectly identified as abnormal, and TN denotes the number of normal samples correctly classified as normal. The two metrics are defined as follows:

FDR = \frac{TP}{TP + FN} \times 100\%

(50)

FAR = \frac{FP}{TN + FP} \times 100\%

(51)

A higher FDR indicates stronger capability in identifying fault samples, whereas a lower FAR indicates fewer false alarms under normal operating conditions. Therefore, a desirable fault detection method should achieve a high FDR while maintaining a low FAR.
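The two metrics in Equations (50) and (51) can be computed from a sequence of alarms and ground-truth labels as in the sketch below. The function names and the toy statistic values are hypothetical and serve only to make the arithmetic concrete.

```python
def alarms_from_statistic(stats, limit):
    """Flag a sample when its monitoring statistic exceeds the control limit."""
    return [s > limit for s in stats]

def fdr_far(alarms, is_fault):
    """Return (FDR, FAR) in percent following Eqs. (50) and (51)."""
    tp = sum(a and f for a, f in zip(alarms, is_fault))            # faults that raised an alarm
    fn = sum((not a) and f for a, f in zip(alarms, is_fault))      # faults that were missed
    fp = sum(a and (not f) for a, f in zip(alarms, is_fault))      # normal samples flagged
    tn = sum((not a) and (not f) for a, f in zip(alarms, is_fault))
    fdr = 100.0 * tp / (tp + fn)
    far = 100.0 * fp / (tn + fp)
    return fdr, far

# Toy example: three normal samples followed by three faulty samples.
stats = [0.2, 1.5, 0.4, 2.0, 3.0, 0.1]
is_fault = [False, False, False, True, True, True]
alarms = alarms_from_statistic(stats, limit=1.0)
fdr, far = fdr_far(alarms, is_fault)
```

In this toy case two of the three faulty samples and one of the three normal samples exceed the limit, giving an FDR of about 66.7% and a FAR of about 33.3%.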

5. Case Studies

5.1. Case 1: BSM1

Before model training, grid search was performed for the sliding window length, latent dimension, memory module parameters, prediction branch parameters, and loss weights. In particular, the number of memory units K was selected by considering the trade-off between representation capacity and model complexity. To further analyze the influence of the number of memory units, a sensitivity analysis was conducted by testing K in {5, 10, 20, 30, 40} on the BSM1 dataset. As shown in Figure 8, the average FDR increased as K increased from 5 to 20, indicating that a larger memory bank can better represent the diversity of normal operating patterns. However, when K was further increased to 30 and 40, the improvement in FDR became marginal, while model complexity increased. Therefore, K = 20 was selected as a balanced setting considering both detection performance and false alarm control.

The optimal parameters were selected based on the cross-validation results. The final parameter settings are listed in Table 4.

Table 5 and Table 6 present the fault detection results of different methods on the BSM1 dataset, and the corresponding monitoring curves are shown in Figure 9 and Figure 10. In addition to the ablated variants, AE-Transformer and LSTM-GAN are included as representative deep learning baselines. The proposed MI-CVAE achieves the best average performance under both monitoring statistics. Specifically, the average FDRs based on SPE and MD² are 97.1% and 98.9%, respectively, while the corresponding average FARs are 3.4% and 3.8%. Relative to AE-Transformer and LSTM-GAN, MI-CVAE increases the average SPE-based FDR by 10.7 and 18.0 percentage points, respectively. For the MD² statistic, the corresponding gains are 3.8 and 7.0 percentage points. These results suggest that the proposed framework learns more discriminative fault representations than models using only Transformer-based temporal modeling or GAN-based distribution learning.

For Faults 1, 5, and 8, most methods achieve high detection rates, suggesting that these faults cause obvious deviations from normal operating conditions. MI-CVAE reaches 100.0% FDR for these three faults under both statistics. Faults 5 and 8 are directly related to dissolved oxygen control or sensor failure, which can induce pronounced changes in the monitored variables and are therefore relatively easy to detect. As shown in Figure 9 and Figure 10, the SPE and MD² statistics increase rapidly after fault occurrence and remain above the control limits for most faulty samples.

The advantages of MI-CVAE are more evident for Faults 4, 6, and 7. Fault 4 is caused by an abnormal increase in the nitrate actuator output signal, which may affect the system state through the control loop and internal recirculation. Such abnormal variations are not always sufficiently reflected by reconstruction errors alone. Under SPE, the FDR of CVAE for Fault 4 is only 36.9%, whereas MI-CVAE increases it to 89.9%. Under MD², the FDR is further improved from 45.6% to 94.1%. This improvement indicates that the proposed model is more sensitive to actuator-related dynamic disturbances. The periodic peaks in the monitoring curves also show that MI-CVAE can capture repeated abnormal deviations caused by this fault.

Faults 6 and 7, corresponding to dissolved oxygen sensor bias and drift, are also difficult to detect because their effects may be partially compensated by the feedback control system, resulting in weak or gradual abnormal changes. For Fault 6, the SPE-based FDR of CVAE is only 8.2%, while MI-CVAE increases it to 96.3%; under MD², MI-CVAE further reaches 100.0%. For Fault 7, MI-CVAE achieves FDRs of 94.4% and 100.0% under SPE and MD², respectively. These results indicate that the prediction branch effectively captures abnormal temporal evolution, while the memory module strengthens the distinction between normal fluctuations and fault-induced deviations. Thus, MI-CVAE shows clear advantages in detecting weak sensor abnormalities and slow dynamic deviations.

The FAR results in Table 6 further demonstrate the robustness of the proposed method. MI-CVAE obtains the lowest average FARs under SPE and MD², with values of 3.4% and 3.8%, respectively. Although MD² is generally more sensitive to subtle distributional shifts, it does not cause excessive false alarms in MI-CVAE. For Faults 2 and 3, the MD²-based FARs are both 0.0%, while the corresponding FDRs remain above 97%, indicating a good balance between detection sensitivity and false alarm control. These results confirm that MI-CVAE has clear advantages in detecting weak disturbances, sensor bias and drift faults, and actuator-related dynamic faults, while maintaining stable performance for faults with more distinct abnormal patterns.

5.2. Case 2: Papermaking Process Monitoring Dataset

As described in Section 2.2, the papermaking process monitoring dataset contains 442 samples. In this experiment, the dataset was divided into training and test sets at a ratio of 6:4. The training set contains 265 samples collected under normal operating conditions, while the test set contains 177 samples, of which the last 150 samples correspond to fault data. Although its size is relatively limited due to the practical constraints of industrial data acquisition, it is used as a real industrial case to examine the applicability of the proposed method. The evaluation is further supported by the standard BSM1 benchmark dataset, and the conclusions are drawn from the combined results of both datasets.

Table 7 and Table 8 present the detection results of different methods on the papermaking dataset, and the corresponding monitoring curves are shown in Figure 11 and Figure 12. MI-CVAE achieves the highest average detection performance among all evaluated methods. Under the SPE statistic, its average FDR reaches 94.8%, exceeding those of CVAE, CVAE-Memory, CVAE-Informer, AE-Transformer, LSTM-GAN, and MI-VAE by 21.0, 14.0, 6.2, 3.9, 10.0, and 3.4 percentage points, respectively. Under the MD² statistic, the average FDR further reaches 96.1%, with improvements of 18.0, 9.3, 5.3, 3.0, 7.2, and 3.1 percentage points over the corresponding comparison methods. Meanwhile, the average FARs are only 0.0% and 1.1%, indicating that the improvement in detection sensitivity is not achieved at the expense of excessive false alarms.

The performance differences can be understood from the fault characteristics and model structures. For Faults 1 and 2, the abnormal information is mainly reflected in relatively evident changes in material-flow and consistency-related variables, which can be partially captured by reconstruction-based methods. However, MI-CVAE uses memory-enhanced normal prototypes to strengthen the distinction between normal fluctuations and fault-induced deviations. For Fault 4, the cyclic pressure disturbance may be mixed with normal pressure variations, so its abnormality is not always prominent in reconstruction errors. In this case, CVAE and CVAE-Memory are limited because they do not explicitly model whether the process state evolves according to normal temporal patterns. By introducing the prediction branch, MI-CVAE can better capture such abnormal dynamic evolution. For Faults 5–7, the faults involve coupled changes in vacuum, drying-temperature, and steam-pressure-related variables. The integration of reconstruction modeling, memory representation, and temporal prediction enables MI-CVAE to describe both variable deviations and dynamic correlation changes, leading to more stable detection performance on the papermaking dataset.

6. Conclusions and Limitations

6.1. Conclusions

This study proposed an unsupervised fault detection framework named MI-CVAE for nonlinear and dynamic industrial processes. By incorporating condition information, memory-constrained reconstruction, and Informer-based temporal prediction into a unified framework, the proposed method improves the representation of normal operating patterns and enhances the detection sensitivity to weak and dynamic faults.

The effectiveness of MI-CVAE was validated on the BSM1 wastewater treatment benchmark and a real papermaking process dataset. On the BSM1 dataset, MI-CVAE achieved average FDR values of 97.1% and 98.9% under the SPE and MD² statistics, respectively, while maintaining average FAR values of 3.4% and 3.8%. On the papermaking dataset, MI-CVAE obtained average FDR values of 94.8% and 96.1% under the SPE and MD² statistics, respectively, with average FAR values of 0.0% and 1.1%. These results demonstrate that MI-CVAE achieves more stable detection performance than the comparison methods, especially for weak and dynamic faults, without causing excessive false alarms.

6.2. Limitations and Future Work

Despite these encouraging results, several limitations should be acknowledged. Due to practical constraints in industrial data acquisition, the real papermaking dataset used in this study is relatively limited in size. Model adaptability to evolving operating conditions also requires further consideration. When intentional process improvements or operational adjustments change the statistical distribution of normal data, new normal patterns outside the training data may be temporarily identified as faults. The current MI-CVAE framework supports offline retraining with newly collected normal operating data, whereas online adaptive updating has not yet been fully implemented. Another limitation lies in its data-driven nature, as physical mechanisms of the monitored process are not explicitly incorporated, leaving room for further improvement in interpretability. Since FDR and FAR are mainly reported as point estimates in this study, systematic uncertainty analysis should also be further strengthened.

Future work will address the following aspects. More papermaking process data under broader operating conditions will be collected to extend the validation scope of MI-CVAE. To improve its adaptability to evolving industrial processes, retraining and adaptive updating strategies will be further investigated. The integration of MI-CVAE with physical models and signal processing features will also be explored to enhance model interpretability and dynamic feature representation. In addition, statistical uncertainty analysis based on repeated experiments with different random seeds, confidence interval estimation, and bootstrap-based evaluation will be introduced to quantify the uncertainty of FDR and FAR more comprehensively.

Author Contributions

Methodology, L.W.; data collection, X.W.; supervision, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Shandong Provincial Natural Science Foundation, China (ZR2021MF135) and Natural Science Foundation of Jiangsu Provincial Universities, China (22KJA530003).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Restrictions apply to the availability of these data. Data were obtained from a third party and are available from the authors with the permission of the third party.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Venkatasubramanian, V.; Rengaswamy, R.; Yin, K.; Kavuri, S.N. A review of process fault detection and diagnosis: Part I: Quantitative model-based methods. Comput. Chem. Eng. 2003, 27, 293–311.
2. Isermann, R. Model-based fault-detection and diagnosis—Status and applications. Annu. Rev. Control 2005, 29, 71–85.
3. Qin, S.J. Survey on data-driven industrial process monitoring and diagnosis. Annu. Rev. Control 2012, 36, 220–234.
4. Teppola, P.; Mujunen, S.-P.; Minkkinen, P.; Puijola, T.; Pursiheimo, P. Principal component analysis, contribution plots and feature weights in the monitoring of sequential process data from a paper machine’s wet end. Chemom. Intell. Lab. Syst. 1998, 44, 307–317.
5. Godoy, J.L.; Vega, J.R.; Marchetti, J.L. A fault detection and diagnosis technique for multivariate processes using a PLS-decomposition of the measurement space. Chemom. Intell. Lab. Syst. 2013, 128, 25–36.
6. Lee, J.-M.; Yoo, C.; Lee, I.-B. Statistical process monitoring with independent component analysis. J. Process Control 2004, 14, 467–485.
7. Russell, E.L.; Chiang, L.H.; Braatz, R.D. Fault detection in industrial processes using canonical variate analysis and dynamic principal component analysis. Chemom. Intell. Lab. Syst. 2000, 51, 81–93.
8. MacGregor, J.F.; Kourti, T. Statistical process control of multivariate processes. Control Eng. Pract. 1995, 3, 403–414.
9. He, Q.P.; Wang, J. Fault Detection Using the k-Nearest Neighbor Rule for Semiconductor Manufacturing Processes. IEEE Trans. Semicond. Manuf. 2007, 20, 345–354.
10. Pang, G.; Shen, C.; Cao, L.; Hengel, A.V.D. Deep Learning for Anomaly Detection: A Review. ACM Comput. Surv. 2021, 54, 1–38.
11. Yu, W.; Zhao, C.; Huang, B. MoniNet with Concurrent Analytics of Temporal and Spatial Information for Fault Detection in Industrial Processes. IEEE Trans. Cybern. 2022, 52, 8340–8351.
12. Lomov, I.; Lyubimov, M.; Makarov, I.; Zhukov, L.E. Fault detection in Tennessee Eastman process with temporal deep learning models. J. Ind. Inf. Integr. 2021, 23, 100216.
13. Sun, W.; Paiva, A.R.C.; Xu, P.; Sundaram, A.; Braatz, R.D. Fault detection and identification using Bayesian recurrent neural networks. Comput. Chem. Eng. 2020, 141, 106991.
14. Chen, S.; Yu, J. Deep recurrent neural network-based residual control chart for autocorrelated processes. Qual. Reliab. Eng. Int. 2019, 35, 2687–2708.
15. Kumar, S.R.; Devakumar, J. Recurrent neural network based sensor fault detection and isolation for nonlinear systems: Application in PWR. Prog. Nucl. Energy 2023, 163, 104836.
16. Pang, C.; Duan, D.; Zhou, Z.; Han, S.; Yao, L.; Zheng, C.; Yang, J.; Gao, X. An integrated LSTM-AM and SPRT method for fault early detection of forced-oxidation system in wet flue gas desulfurization. Process Saf. Environ. Prot. 2022, 160, 242–254.
17. Brito, L.C.; Susto, G.A.; Brito, J.N.; Duarte, M.A.V. An explainable artificial intelligence approach for unsupervised fault detection and diagnosis in rotating machinery. Mech. Syst. Signal Process. 2022, 163, 108105.
18. Zeng, L.; Jin, Q.; Lin, Z.; Zheng, C.; Wu, Y.; Wu, X.; Gao, X. Dual-attention LSTM autoencoder for fault detection in industrial complex dynamic processes. Process Saf. Environ. Prot. 2024, 185, 1145–1159.
19. El Mokhtari, K.; McArthur, J.J. Autoencoder-Based fault detection using building automation system data. Adv. Eng. Inf. 2024, 62, 102810.
20. Qian, J.; Song, Z.; Yao, Y.; Zhu, Z.; Zhang, X. A review on autoencoder based representation learning for fault detection and diagnosis in industrial processes. Chemom. Intell. Lab. Syst. 2022, 231, 104711.
21. Yu, J.; Zheng, X.; Wang, S. A deep autoencoder feature learning method for process pattern recognition. J. Process Control 2019, 79, 1–15.
22. Li, J.; Yan, X. Process monitoring using principal component analysis and stacked autoencoder for linear and nonlinear coexisting industrial processes. J. Taiwan Inst. Chem. Eng. 2020, 112, 322–329.
23. Tan, S.; Zhou, X.; Shi, H.; Song, B. Adaptive slow feature analysis—Sparse autoencoder based fault detection for time-varying processes. J. Taiwan Inst. Chem. Eng. 2023, 142, 104599.
24. Ba-Alawi, A.H.; Vilela, P.; Loy-Benitez, J.; Heo, S.; Yoo, C. Intelligent sensor validation for sustainable influent quality monitoring in wastewater treatment plants using stacked denoising autoencoders. J. Water Process Eng. 2021, 43, 102206.
25. Peng, C.; Kai, W.; Kun, Z.; Fanchao, M. Monitoring of wastewater treatment process based on multi-stage variational autoencoder. Expert Syst. Appl. 2022, 207, 117919.
26. Spigler, G. Denoising Autoencoders for Overgeneralization in Neural Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 998–1004.
27. Ma, Z.L.; Li, X.J.; Nian, F.Q. An Interpretable Fault Detection Approach for Industrial Processes Based on Improved Autoencoder. IEEE Trans. Instrum. Meas. 2025, 74, 3518813.
28. Gong, D.; Liu, L.; Le, V.; Saha, B.; Mansour, M.R.; Venkatesh, S.; van den Hengel, A. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1705–1714.
29. Li, G.; Ge, M.; Wan, J.; Han, D.; Li, M.; Zhou, M. MemMambaAD: Memory-augmented state space model for multivariate time series anomaly detection. Eng. Appl. Artif. Intell. 2025, 158, 111308.
30. Dong, Y.; Qin, S.J. A novel dynamic PCA algorithm for dynamic data modeling and process monitoring. J. Process Control 2018, 67, 1–11.
31. Choi, S.W.; Martin, E.B.; Morris, A.J.; Lee, I.-B. Adaptive Multivariate Statistical Process Control for Monitoring Time-Varying Processes. Ind. Eng. Chem. Res. 2006, 45, 3108–3118.
32. Zhao, C.; Sun, H. Dynamic Distributed Monitoring Strategy for Large-Scale Nonstationary Processes Subject to Frequently Varying Conditions Under Closed-Loop Control. IEEE Trans. Ind. Electron. 2019, 66, 4749–4758.
33. Zhao, C. Perspectives on nonstationary process monitoring in the era of industrial artificial intelligence. J. Process Control 2022, 116, 255–272.
34. Alex, J.; Benedetti, L.; Copp, J.; Gernaey, K.; Jeppsson, U.; Nopens, I.; Pons, M.-N.; Rieger, L.; Rosen, C.; Steyer, J. Benchmark Simulation Model No. 1 (BSM1); Report by the IWA Taskgroup on benchmarking of control strategies for WWTPs; Lund University: Lund, Sweden, 2008; Volume 1.
35. Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114.
36. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; pp. 11106–11115.
37. Cai, L.; Tian, X.; Chen, S. Monitoring Nonlinear and Non-Gaussian Processes Using Gaussian Mixture Model-Based Weighted Kernel Independent Component Analysis. IEEE Trans. Neural Netw. Learn. Syst. 2017, 28, 122–135.
38. Deng, X.; Cai, P.; Cao, Y.; Wang, P. Two-Step Localized Kernel Principal Component Analysis Based Incipient Fault Diagnosis for Nonlinear Industrial Processes. Ind. Eng. Chem. Res. 2020, 59, 5956–5968.

Figure 1. Sewage treatment process BSM1 benchmark platform.

Figure 2. Sample data of fault 1 in BSM1.

Figure 3. Sample data of fault 1 in papermaking dataset.

Figure 4. Structure of the conditional variational autoencoder.

Figure 5. Structure of probabilistic sparse self-attention.

Figure 6. Structure of the MI-CVAE network.

Figure 7. Flowchart of the MI-CVAE fault detection framework.

Figure 8. Sensitivity analysis of the number of memory units K on the BSM1 dataset: (a) Average FDR under different K values; (b) Average FAR under different K values.

Figure 9. SPE monitoring results of MI-CVAE on the BSM1 dataset.

Figure 10. MD² monitoring results of MI-CVAE on the BSM1 dataset.

Figure 11. Fault detection results of MI-CVAE using SPE on the papermaking dataset.

Figure 12. Fault detection results of MI-CVAE using MD² on the papermaking dataset.

Table 1. Variables for modeling and monitoring in BSM1.

No.	Variable	Description	Sampling Location	Unit
1	Q_in	Influent flow rate	Influent of wastewater	m³/day
2	S_NHin	Influent ammonia nitrogen concentration	Influent of wastewater	mg N/m³
3	S_NO2	Nitrite concentration	Reactor 2	mg N/m³
4	S_O3	Dissolved oxygen concentration	Reactor 3	mg COD/m³
5	S_O4	Dissolved oxygen concentration	Reactor 4	mg COD/m³
6	S_O5	Dissolved oxygen concentration	Reactor 5	mg COD/m³
7	T_SS4	Total suspended solids	Reactor 4	mg SS/m³
8	T_SS5	Total suspended solids	Reactor 5	mg SS/m³
9	T_SSe	Total suspended solids	External recirculation	mg SS/m³
10	T_SSw	Total suspended solids	Effluent of wastewater	mg SS/m³
11	T_SSr	Total suspended solids	Internal recirculation	mg SS/m³
12	KLa₅	Oxygen transfer coefficient	Reactor 5	1/day
13	Q_intr	Internal recirculation flow rate	Effluent of reactor 5	m³/day
14	S_NHeff	Effluent ammonia nitrogen concentration	Effluent of wastewater	mg N/m³
15	S_NOeff	Effluent nitrite concentration	Effluent of wastewater	mg N/m³

Table 2. Types of faults of the BSM1 model.

Fault No.	Fault Description	Simulation Mode
1	Decrease the maximum specific growth rate of autotrophic bacteria	Step plus ramp compound disturbance
2	Decrease the maximum specific growth rate of heterotrophic bacteria	Step plus ramp compound disturbance
3	Decrease the settling velocity of the secondary clarifier	Step plus ramp compound disturbance
4	Increase the output signal of the nitrate actuator	Step plus ramp compound disturbance
5	Change the setpoint of the DO controller	Step
6	Bias fault of the DO sensor	Bias
7	Drift fault of the DO sensor	Drift
8	Complete failure of the DO sensor	Constant value

Table 3. Fault scenarios and affected variables in the papermaking dataset.

Fault No.	Section	Fault Name	Fault Type	Affected Variables
1	Approach flow section	High-consistency pulp blockage	Drift	High-consistency pulp flow rate, headbox pressure, headbox stock consistency
2	Approach flow section	Dilution-water screen failure	Scale up	Dilution-water screen inlet pressure, white-water consistency
3	Wire section	Vacuum system leakage	Drift	Bottom-/top-wire vacuum-box level
4	Press section	First-press shoe overpressure	Cycle	First-press shoe loading-zone oil pressure, shoe internal pressure, edge-pressure ratio
5	Press section	Insufficient vacuum dewatering	Drift	Suction-box vacuum pressure, transfer-felt suction-roll vacuum level
6	Drying section	Exhaust system failure	Scale down	Exhaust-air temperature, supply-air temperature
7	Drying section	Steam pressure fluctuation	Drift	Inlet steam flow to the dryer drum, drum steam pressure

Table 4. Parameter configuration for the MI-CVAE model.

Parameter Name	Symbol or Variable	Parameter Setting
Sliding window length	L	48
CVAE latent dimension	d	32
Number of memory units	K	20
Memory module temperature coefficient	τ	0.10
Memory fusion coefficient	α	0.25
Reconstruction loss weight	λ_rec	1.0
Prediction loss weight	λ_pred	0.8
KL divergence weight	β	0.005
Memory pull constraint weight	λ_pull	0.10
Attention entropy constraint weight	λ_ent	0.001
Memory weight decay coefficient	λ_decay	0.0001
Optimizer	/	Adam
Learning rate	Lr	0.001
Batch size	BATCH_SIZE	32
Number of training epochs	EPOCHS	120
Reconstruction branch fusion weight	ω_rec	0.6
Prediction branch fusion weight	ω_pred	0.4

Table 5. Comparison of FDR (%) for different methods on the BSM1 dataset.

Method	Statistic	Fault 1	Fault 2	Fault 3	Fault 4	Fault 5	Fault 6	Fault 7	Fault 8	Average FDR
CVAE	SPE	95.7	81.1	54.5	36.9	100.0	8.2	45.9	100.0	65.3
CVAE	MD²	100.0	94.8	97.1	45.6	100.0	64.2	92.2	100.0	86.7
CVAE-Memory	SPE	98.9	87.5	78.1	39.3	100.0	14.2	58.4	100.0	72.1
CVAE-Memory	MD²	100.0	94.2	97.9	46.2	100.0	63.4	93.1	100.0	86.9
CVAE-Informer	SPE	98.6	91.4	87.1	71.8	100.0	43.1	74.6	100.0	83.3
CVAE-Informer	MD²	100.0	94.7	97.6	80.4	100.0	80.4	96.7	100.0	93.7
AE-Transformer	SPE	99.0	92.5	89.3	76.8	100.0	54.6	79.0	100.0	86.4
AE-Transformer	MD²	100.0	96.0	98.5	83.6	100.0	85.4	97.2	100.0	95.1
LSTM-GAN	SPE	97.5	88.0	82.6	61.5	99.5	38.5	64.8	100.0	79.1
LSTM-GAN	MD²	100.0	94.0	96.0	74.5	100.0	74.0	96.5	100.0	91.9
MI-VAE	SPE	99.5	95.2	94.3	81.3	100.0	68.9	85.2	100.0	90.6
MI-VAE	MD²	100.0	97.1	99.0	88.0	100.0	91.9	99.5	100.0	96.9
MI-CVAE	SPE	100.0	97.0	99.5	89.9	100.0	96.3	94.4	100.0	97.1
MI-CVAE	MD²	100.0	97.4	99.8	94.1	100.0	100.0	100.0	100.0	98.9
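
The "Average FDR" column in Table 5 appears to be the arithmetic mean of the eight per-fault values, rounded to one decimal place. The short check below reproduces the two MI-CVAE averages from the tabulated entries. The helper name `mean_rate` is hypothetical, and the reading of the column as an unweighted mean is an assumption that these two rows happen to satisfy.

```python
def mean_rate(values):
    """Unweighted mean of per-fault rates (%), rounded to one decimal place."""
    return round(sum(values) / len(values), 1)


# Per-fault FDR values (%) copied from the MI-CVAE rows of Table 5.
mi_cvae_spe = [100.0, 97.0, 99.5, 89.9, 100.0, 96.3, 94.4, 100.0]
mi_cvae_md2 = [100.0, 97.4, 99.8, 94.1, 100.0, 100.0, 100.0, 100.0]

avg_spe = mean_rate(mi_cvae_spe)  # 777.1 / 8 = 97.1375
avg_md2 = mean_rate(mi_cvae_md2)  # 791.3 / 8 = 98.9125
print(avg_spe, avg_md2)
```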

Table 6. Comparison of FAR (%) for different methods on the BSM1 dataset.

Method	Statistic	Fault 1	Fault 2	Fault 3	Fault 4	Fault 5	Fault 6	Fault 7	Fault 8	Average FAR
CVAE	SPE	3.9	3.9	3.9	3.9	7.0	8.1	2.5	7.7	5.1
CVAE	MD²	4.6	2.1	2.5	5.6	5.6	6.3	12.7	6.3	5.7
CVAE-Memory	SPE	6.7	4.6	4.6	7.7	8.1	8.8	2.5	8.8	6.5
CVAE-Memory	MD²	4.9	2.5	2.5	6.0	6.0	6.7	15.5	6.7	6.4
CVAE-Informer	SPE	4.6	2.8	2.5	5.6	6.0	6.3	1.1	6.0	4.4
CVAE-Informer	MD²	4.2	1.4	1.4	5.3	5.3	5.6	3.2	5.3	4.0
AE-Transformer	SPE	4.2	2.1	1.8	5.0	5.5	6.0	1.4	5.7	4.0
AE-Transformer	MD²	4.0	1.2	1.0	5.4	5.2	4.6	4.0	5.4	3.9
LSTM-GAN	SPE	5.0	3.2	3.0	6.2	6.5	7.0	2.0	6.8	5.0
LSTM-GAN	MD²	4.8	2.8	2.8	6.0	6.2	6.5	5.8	6.2	5.1
MI-VAE	SPE	3.9	1.1	1.1	4.9	5.3	5.6	0.7	5.3	3.5
MI-VAE	MD²	3.9	5.3	1.1	1.1	4.9	5.3	5.3	6.0	4.1
MI-CVAE	SPE	4.4	0.4	0.4	4.8	5.2	5.6	0.4	5.6	3.4
MI-CVAE	MD²	4.0	0.0	0.0	5.2	4.8	5.2	6.0	5.2	3.8
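
For readers who want to reproduce numbers of the kind reported in Tables 5 and 6, the sketch below computes FDR and FAR from per-sample alarm flags and ground-truth labels. It assumes the conventional process-monitoring definitions (FDR as the share of faulty samples that raise an alarm, FAR as the share of normal samples that raise an alarm); the function name `fdr_far` and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple


def fdr_far(alarms: Sequence[bool], is_fault: Sequence[bool]) -> Tuple[float, float]:
    """Return (FDR, FAR) in percent, under the conventional definitions.

    alarms[i]   -- True if the monitoring statistic exceeded its control limit
    is_fault[i] -- True if sample i was recorded under faulty conditions
    """
    if len(alarms) != len(is_fault):
        raise ValueError("alarms and is_fault must have the same length")
    n_fault = sum(1 for f in is_fault if f)
    n_normal = len(is_fault) - n_fault
    if n_fault == 0 or n_normal == 0:
        raise ValueError("need at least one faulty and one normal sample")
    detected = sum(1 for a, f in zip(alarms, is_fault) if a and f)
    false_alarms = sum(1 for a, f in zip(alarms, is_fault) if a and not f)
    return 100.0 * detected / n_fault, 100.0 * false_alarms / n_normal


# Toy example: four normal samples followed by four faulty samples.
labels = [False, False, False, False, True, True, True, True]
flags = [False, True, False, False, True, True, True, False]
fdr, far = fdr_far(flags, labels)
print(f"FDR = {fdr:.1f}%, FAR = {far:.1f}%")
```

In the toy example, three of the four faulty samples and one of the four normal samples raise an alarm, giving an FDR of 75.0% and a FAR of 25.0%.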

Table 7. Comparison of FDR (%) for different methods on the papermaking dataset.

Method	Statistic	Fault 1	Fault 2	Fault 3	Fault 4	Fault 5	Fault 6	Fault 7	Average FDR
CVAE	SPE	71.3	82.7	68.7	74.0	79.3	63.3	77.3	73.8
CVAE	MD²	76.0	88.7	73.3	81.3	85.3	64.0	78.0	78.1
CVAE-Memory	SPE	75.3	82.0	73.3	78.7	79.3	87.3	90.0	80.8
CVAE-Memory	MD²	85.3	76.0	82.0	89.3	90.7	91.3	93.3	86.8
CVAE-Informer	SPE	87.3	90.7	85.3	92.7	84.7	85.3	94.0	88.6
CVAE-Informer	MD²	89.3	92.0	88.7	95.3	90.0	89.3	91.3	90.8
AE-Transformer	SPE	89.3	93.3	88.0	93.3	88.7	90.7	92.7	90.9
AE-Transformer	MD²	91.3	96.0	90.7	96.7	92.0	91.3	94.0	93.1
LSTM-GAN	SPE	84.0	89.3	82.7	88.0	82.7	82.0	84.7	84.8
LSTM-GAN	MD²	86.7	92.0	87.3	92.7	88.0	86.7	88.7	88.9
MI-VAE	SPE	90.0	96.0	87.3	92.0	92.7	91.3	90.7	91.4
MI-VAE	MD²	91.3	98.0	92.0	94.0	94.7	88.0	93.3	93.0
MI-CVAE	SPE	90.7	99.3	90.0	98.7	94.7	94.0	96.0	94.8
MI-CVAE	MD²	92.7	100.0	94.7	100.0	94.0	95.3	96.0	96.1

Table 8. Comparison of FAR (%) for different methods on the papermaking dataset.

Method	Statistic	Fault 1	Fault 2	Fault 3	Fault 4	Fault 5	Fault 6	Fault 7	Average FAR
CVAE	SPE	11.1	3.7	7.4	14.8	0.0	18.5	3.7	8.5
CVAE	MD²	14.8	7.4	11.1	18.5	3.7	22.2	7.4	12.2
CVAE-Memory	SPE	7.4	3.7	7.4	11.1	0.0	14.8	3.7	6.9
CVAE-Memory	MD²	11.1	3.7	7.4	14.8	3.7	18.5	7.4	9.5
CVAE-Informer	SPE	3.7	0.0	3.7	7.4	0.0	11.1	0.0	3.7
CVAE-Informer	MD²	7.4	0.0	3.7	11.1	0.0	14.8	0.0	5.3
AE-Transformer	SPE	3.7	0.0	3.7	7.4	0.0	7.4	0.0	3.2
AE-Transformer	MD²	3.7	0.0	0.0	7.4	0.0	11.1	0.0	3.2
LSTM-GAN	SPE	7.4	3.7	7.4	11.1	0.0	11.1	3.7	6.3
LSTM-GAN	MD²	0.0	3.7	0.0	11.1	3.7	14.8	7.4	5.8
MI-VAE	SPE	3.7	0.0	0.0	3.7	0.0	7.4	0.0	2.1
MI-VAE	MD²	3.7	0.0	0.0	7.4	0.0	0.0	3.7	2.1
MI-CVAE	SPE	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
MI-CVAE	MD²	0.0	3.7	0.0	3.7	0.0	0.0	0.0	1.1

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

