Article

A Time-Frequency Fusion Fault Diagnosis Framework for Nuclear Power Plants Oriented to Class-Incremental Learning Under Data Imbalance

1 School of Computing/Software, University of South China, Hengyang 421001, China
2 School of Electrical Engineering, University of South China, Hengyang 421001, China
* Authors to whom correspondence should be addressed.
Computers 2026, 15(1), 22; https://doi.org/10.3390/computers15010022
Submission received: 10 December 2025 / Revised: 24 December 2025 / Accepted: 30 December 2025 / Published: 5 January 2026

Abstract

In nuclear power plant fault diagnosis, traditional machine learning models (e.g., SVM and KNN) require full retraining on the entire dataset whenever new fault categories are introduced, resulting in prohibitive computational overhead. Deep learning models, on the other hand, are prone to catastrophic forgetting under incremental learning settings, making it difficult to simultaneously preserve recognition performance on both old and newly added classes. In addition, nuclear power plant fault data typically exhibit significant class imbalance, further constraining model performance. To address these issues, this study employs SHAP-XGBoost to construct a feature evaluation system, enabling feature extraction and interpretable analysis on the NPPAD simulation dataset, thereby enhancing the model’s capability to learn new features. To mitigate insufficient temporal feature capture and sample imbalance among incremental classes, we propose a cascaded spatiotemporal feature extraction network: LSTM is used to capture local dependencies, and its hidden states are passed as position-aware inputs to a Transformer for modeling global relationships, thus alleviating Transformer overfitting on short sequences. By further integrating frequency-domain analysis, an improved Adaptive Time–Frequency Network (ATFNet) is developed to enhance the robustness of discriminating complex fault patterns. Experimental results show that the proposed method achieves an average accuracy of 91.36% across five incremental learning stages, representing an improvement of approximately 20.7% over baseline models, effectively mitigating the problem of catastrophic forgetting.

1. Introduction

The safe operation of nuclear power plants is one of the most important tasks in the development of nuclear energy. Over the past decades, nuclear power fault diagnosis technologies have made significant progress. Through various methods and algorithms—such as machine learning, deep learning, and genetic algorithms—nuclear power plant fault diagnosis has continuously evolved to improve diagnostic accuracy and efficiency [1,2]. At present, nuclear fault diagnosis mainly follows two technical paradigms: knowledge-driven and data-driven approaches.
Knowledge-driven methods rely on expert experience to construct knowledge bases. Early techniques primarily adopted rule-based expert systems, such as Signed Directed Graphs (SDGs) and fault tree analysis. In recent years, Dynamic Bayesian Networks (DBNs) and Dynamic Uncertain Causality Graphs (DUCGs) have further enhanced reasoning capabilities for complex causal relationships. In contrast, data-driven approaches automatically extract features and patterns from large amounts of historical data (e.g., hybrid algorithms involving SVM, PCA, and ANN [3,4]), making them more suitable for addressing nonlinear, complex, and highly coupled fault patterns in nuclear systems.
Consequently, data-driven techniques have become widely applied in various types of fault diagnosis. For example, Mu Yu (2011) investigated technical challenges in nuclear power plant fault diagnosis and explored the use of decision trees and parameter reduction algorithms [3]. Y. Shi and X. Xue (2022) proposed a two-stage fault detection approach based on image data and deep learning, achieving high accuracy, fast response, and strong efficiency [5]. A model-based fault detection system jointly developed by Argonne National Laboratory of the U.S. Department of Energy and Florida Power & Light was employed for online monitoring and fault detection of nuclear power plant signals [6]. China General Nuclear Power Group applied DUCG-based methods to nuclear power plant fault diagnosis and conducted practical verification [7]. In addition, Wang et al. proposed an end-to-end LSTM-based fault diagnosis method for small modular pressurized water reactors, achieving high accuracy in identifying sensor and actuator faults [8].
Despite the remarkable success of data-driven techniques in fault diagnosis, several challenges remain in the nuclear power domain. First, fault data vary across operating conditions, and the distribution differences between working scenarios often cause some fault types to have far more samples than others, while normal data usually exceed fault data by a large margin. This leads to severe class imbalance [9], greatly increasing the difficulty of accurate fault identification. To address this, Dai et al. used Generative Adversarial Networks to generate and augment imbalanced nuclear fault data, thereby constructing a more balanced training set and improving diagnostic accuracy [10]. Furthermore, Liu et al. proposed a nuclear power plant fault diagnosis method based on SDG, improving diagnostic reliability by analyzing feature correlations [11]. Although approaches based on deep subdomain adaptation and improved subdomain adaptation alleviate class imbalance to some extent, the continual increase in fault categories with the development of the nuclear industry presents new challenges. Existing methods focus mainly on class imbalance but largely overlook a critical issue: the model's adaptive capability to recognize new fault categories, particularly marginal or rare samples. To overcome this, Zhang et al. proposed a supervised contrastive knowledge distillation framework for class-incremental fault diagnosis under limited and imbalanced data conditions [19].
Although existing studies have made progress in nuclear fault diagnosis, they still fall short in capturing dynamic temporal evolution and complex coupling relationships inherent in nuclear time-series data. Current feature extraction methods struggle to fully handle temporal characteristics and class imbalance, preventing them from meeting the practical requirements for accuracy and reliability under real operating conditions. To address the threefold challenges of class imbalance, class incremental learning, and temporal complexity, this work proposes an integrated “feature–time–frequency–incremental” solution for nuclear power plants, with the main contributions summarized as follows:
  • To effectively address sample imbalance and catastrophic forgetting in nuclear fault diagnosis, we propose an improved class-incremental fault diagnosis framework based on SCLIFD. This framework integrates an LSTM–Transformer mechanism to capture long-term dependencies in nuclear fault data, thereby mitigating catastrophic forgetting.
  • To resolve feature selection challenges within the time-series nuclear dataset NPPAD, a SHAP-XGBoost-based feature interpretation and evaluation system is constructed, enhancing both accuracy and interpretability in the feature selection process.
  • To capture both local and global dependencies in time-series data, we enhance the original feature extractor by integrating an improved ATFNet framework that combines time-domain and frequency-domain modules, thereby further strengthening the model’s feature recognition capability.
Scope of contribution. This work does not aim to introduce a fundamentally new continual/incremental learning algorithm. Instead, our contribution lies in systematically integrating and adapting established components (physics-guided screening, interpretable feature ranking, hybrid temporal modeling, and time–frequency fusion) into a unified class-incremental fault diagnosis pipeline tailored to safety-critical nuclear power plant monitoring, and validating it with extensive incremental-session experiments.

2. Experimental Motivation

In this experiment, we adopt the open-source NPPAD dataset provided by Tsinghua University. To ensure broader coverage of physical quantities (pressure, temperature, flow rate, and power), five representative fault types (FLB, LOCA, SGATR, SGBTR, and SLBIC) are selected. To preserve feature randomness, five feature parameters are randomly sampled multiple times, and their data characteristics are examined. Based on the observed feature distributions, t-SNE visualizations are generated for analysis, as shown in Figure 1, where different colors represent distinct fault categories. Unless otherwise stated, all t-SNE visualizations in this paper are generated using sklearn.manifold.TSNE with a unified configuration: perplexity = 30, learning_rate = 200, n_iter = 1500, early_exaggeration = 12, init = "pca", metric = "euclidean", and random_state = 0; the input is the normalized feature vectors.
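For reference, the following is a minimal sketch of this unified t-SNE configuration; the array names `features` and `labels` are placeholders for the normalized feature vectors and their fault-category labels, not identifiers from our codebase.

```python
# Minimal sketch of the unified t-SNE configuration stated above.
# `features` and `labels` are placeholder names.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features: np.ndarray, labels: np.ndarray) -> None:
    tsne = TSNE(
        n_components=2,
        perplexity=30,
        learning_rate=200,
        n_iter=1500,            # renamed to max_iter in recent scikit-learn
        early_exaggeration=12,
        init="pca",
        metric="euclidean",
        random_state=0,
    )
    embedding = tsne.fit_transform(features)
    # Color points by fault category, as in Figure 1.
    plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab10", s=5)
    plt.show()
```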
The visualization reveals that different fault types exhibit markedly distinct structural characteristics in the low-dimensional feature space. Specifically, FLB features appear as multi-colored points distributed across several branches, showing typical multimodal and dispersed patterns that may correspond to different stages or subtypes of the fault. LOCA features form a smooth, continuous curve, indicating an ordered and gradually evolving relationship, suggesting stable trend variations in its fault parameters. SGATR mainly follows a primary curve accompanied by a few small clusters, which may represent anomalous samples or special operating conditions, reflecting a bimodal data structure. SGBTR features follow an undulating, wave-like pattern along a curved trajectory, possibly revealing periodicity or nonlinear fluctuations in its internal parameters. In contrast, SLBIC displays both a main curve and a dense cluster located in the lower-right region; the strong color mixture within the local cluster suggests two typical distribution modes, potentially associated with different fault mechanisms or sample subsets.
From a global perspective, some fault features overlap and intertwine, while others deviate significantly from the main trend, creating ambiguous class boundaries and increasing classification difficulty. Moreover, certain features may be affected by periodic factors, further complicating feature extraction and pattern discrimination. The morphological differences among fault categories—such as the multi-branch structure of FLB, the single-curve structure of LOCA, and the wave-like distribution of SGBTR—as well as variations in spatial scale, density, and structural coherence, collectively contribute to a highly diverse and complex feature space.
Therefore, this experiment urgently requires a feature extraction method capable of effectively capturing marginalized and structurally complex patterns while adapting to potential periodic variations. However, existing feature extractors often fail to adequately address these structural discrepancies and periodic characteristics, resulting in performance degradation when confronted with intricate distributions.
Furthermore, due to the substantial differences in feature distributions across fault categories, the addition of new classes in class-incremental learning scenarios can easily trigger catastrophic forgetting [12], impairing the model’s ability to retain previously learned representations. Thus, achieving stable feature transfer and knowledge preservation under class-incremental conditions is one of the critical challenges addressed in this study.

3. Methodology

3.1. Data Preparation and Preprocessing

3.1.1. NPPAD Dataset

In this study, we use the open-source Nuclear Power Plant Accident Data (NPPAD) dataset released by Qi et al. (Tsinghua University, 2022) as the training and testing data for the proposed model [13]. The dataset is an open time-series simulation dataset generated using the nuclear power plant simulation software PcTran, which simulates a three-loop PWR nuclear power plant under different operating conditions. It contains multiple potential fault scenarios that may occur in nuclear power plants, such as loss of coolant accidents, steam line ruptures, and fuel rod failures. The dataset records the state variations and operation logs of different subsystems at the time of accidents, including the temporal evolution of key parameters such as pressure, temperature, and flow rate. Each sample is represented as a 2D parameter matrix containing 97 parameters in total. The dataset can be accessed at: https://github.com/thu-inet/NuclearPowerPlantAccidentData, accessed on 30 December 2020.
To ensure that the model can effectively discriminate changes in different physical quantities (pressure, temperature, flow rate, and power), we aim to cover as broad a range of physical variables as possible so as to emulate the unknown nature of new, unseen faults. Accordingly, five fault types are selected as experimental data: FLB, LOCA, SGATR, SGBTR, and SLBIC, together with a normal (fault-free) operating condition, as listed in Table 1. To align with the MFF dataset setting in the original SCLIFD framework, we reselect 23 features from the original 97 parameters for each fault type. The detailed feature selection procedure is described in the subsequent sections.

3.1.2. Data Selection and Processing

In the original SCLIFD framework, the simulated Tennessee–Eastman Process (TEP) dataset and the Multiphase Flow Facility (MFF) dataset were adopted. The TEP dataset [14], which simulates a realistic chemical process, has been widely used and recognized in the fault diagnosis community. It contains 52 variables, including temperatures and pressures monitored continuously by sensors. In the original SCLIFD experiments, nine fault types and one normal type (type 0) were selected to demonstrate the effectiveness of SCLIFD. The MFF dataset [15] originates from the three-phase flow facility at Cranfield University and contains 24 process variables sampled at 1 Hz. It involves six fault types and 23 variables collected from the real MFF system. In the original framework, faults 1, 2, 3, and 4, as well as the normal class 0, were selected. Since the TEP dataset is more consistent with the time-series characteristics of nuclear power plants, we adopt the TEP dataset as a benchmark dataset for comparative analysis in this work.
It should be emphasized that the TEP dataset originates from a chemical process rather than a nuclear power plant. Therefore, TEP is not regarded as a direct surrogate for nuclear operating conditions, but rather as a standardized, publicly available benchmark dataset. TEP and nuclear power plants both belong to complex continuous industrial process systems, sharing common characteristics such as strong multivariable coupling, nonlinear dynamics, and significant temporal evolution of operating/fault modes. More importantly, TEP has been widely used in studies on class imbalance and class-incremental fault diagnosis. Based on these considerations, we retain the TEP setting used in the original SCLIFD work. This design enables a fair performance comparison under consistent data conditions. Moreover, it allows us to evaluate the generalization capability of the proposed method across different industrial process scenarios. The NPPAD dataset, in turn, is employed to evaluate the applicability and engineering significance of the method in the target nuclear domain.
For both the TEP and NPPAD datasets, we configure the number of new classes introduced per incremental session (i.e., the novel-class shot), the total number of incremental sessions, the training/testing set sizes for normal and faulty classes, and the size of the memory buffer K. Specifically, for the NPPAD dataset, we use 5 incremental sessions (S1–S5), where each session introduces one new fault category. Each new class contains only one training sample (1-shot), which is used as an extreme stress-test setting to examine the proposed framework under severe data scarcity. In addition, we report results with larger shot numbers (e.g., 2-shot and 5-shot) to better reflect more practical data availability (Table 2). The detailed settings are summarized in Table 3. The implications of these attributes will be further clarified in the model improvement section. For both TEP and NPPAD, samples are randomly drawn multiple times to ensure statistical reliability and experimental robustness.
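For clarity, the protocol attributes listed above can be captured in a small configuration structure. The sketch below uses hypothetical field names; the values follow the NPPAD setting (5 sessions, one new class per session, the 1-shot stress test, and the buffer size K = 40 used in the shot-sensitivity study of Table 2).

```python
# Hypothetical configuration sketch for the class-incremental protocol.
# Field names are illustrative; values follow the NPPAD setting above.
from dataclasses import dataclass

@dataclass
class IncrementalConfig:
    n_sessions: int              # number of incremental sessions (S1-S5)
    new_classes_per_session: int
    shot: int                    # training samples per new class
    memory_buffer_k: int         # exemplar buffer size K

nppad_1shot = IncrementalConfig(
    n_sessions=5,
    new_classes_per_session=1,
    shot=1,                 # extreme stress test; 2-shot/5-shot also reported
    memory_buffer_k=40,     # K = 40 as in the sensitivity study (Table 2)
)
```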

3.2. Feature Selection Method

3.2.1. Initial Feature Screening

Efficient feature selection is a key step in improving the performance of fault diagnosis models. It not only reduces data redundancy and computational cost but also significantly enhances the generalization capability of the model. Liu et al. [11] proposed a nuclear power plant fault diagnosis method based on the graphical properties of Signed Directed Graphs (SDGs). Its advantage lies in its intuitive and transparent interpretation of fault propagation processes and parameter correlations, which helps operators quickly locate problems. In the initial stage of our experiments, the feature selection procedure is based on the SDG model developed by Liu et al. [16]. On top of the NPPAD dataset, 23 candidate variables are first obtained via SDG-based physical screening (see Table 4). In the second stage, SHAP-XGBoost is employed to compute feature importance scores for these candidate variables, and the top 23 features are retained as the final feature set (see Table 5), ensuring consistency with the original SCLIFD input dimensionality.

3.2.2. Interpretable Model and SHAP-XGBoost

The goal of interpretable machine learning is to enable researchers to understand how a model makes predictions, including the relationships between inputs and outputs and which features contribute most to the predictions. Existing interpretable machine learning methods can be roughly divided into two categories: model-specific and model-agnostic methods. Model-specific approaches explain specific types of models by analyzing their internal structures and parameters, whereas model-agnostic methods do not rely on the internal form of the model and instead infer explanations from input–output behavior.
XGBoost is one of the most widely adopted methods for supervised classification and regression tasks [17]. It is a gradient boosting algorithm that sequentially ensembles decision trees using gradient descent optimization to minimize prediction error [18].
For tree-based models such as XGBoost, the SHAP value of the j-th feature of a sample x is defined as:
$$\phi_j(x) = \sum_{t \in T_j} \frac{1}{2^{\,d_t - 1}} \left( v_t - \bar{v}_t \right),$$
where $T_j$ is the set of all decision-tree nodes involving feature $j$, $d_t$ is the depth of node $t$, $v_t$ is the prediction value of node $t$, and $\bar{v}_t$ is the prediction value of the parent node of $t$.
The importance score S j of feature j is then defined as the mean absolute SHAP value over all samples:
$$S_j = \frac{1}{N} \sum_{i=1}^{N} \left| \phi_j(x_i) \right|,$$
where $N$ is the total number of samples and $x_i$ is the $i$-th sample. A feature is retained if $S_j$ exceeds a predefined threshold $\tau$.

3.2.3. Feature Evaluation System Based on SHAP-XGBoost

The construction of SDG relies on expert prior knowledge and, in the presence of measurement noise, may struggle to capture high-order nonlinear couplings and implicit correlations. To address these issues, we propose a two-stage “SDG-Shield” strategy, as illustrated in Figure 2.
In the first stage, we still follow the SDG model in the literature to select a feature subset $\Omega_{\mathrm{SDG}}$ from the 19 candidate variables that are strongly physically correlated with LOCA, SGATR, and SGBTR faults, thereby ensuring physical interpretability of the selected features.
In the second stage, we introduce a SHAP-XGBoost-based framework to further filter redundant features within $\Omega_{\mathrm{SDG}}$ in a data-driven manner and to mine deeper intrinsic relationships among features. This process improves model performance without sacrificing physical interpretability. The detailed procedure is summarized in Algorithm 1.
Algorithm 1 SDG-Shield Two-Stage Feature Selection Algorithm
Input: Original feature set $X = \{x_1, x_2, \dots, x_D\}$ ($D = 23$), label $y$, threshold $\tau$
Output: Final selected feature set $X^*$
1: Stage 1 (SDG Physical Screening):
2: Select the feature subset strongly correlated with faults based on the SDG model:
3: $\Omega_{\mathrm{SDG}} = \{ x_j \in X \mid \mathrm{corr}(x_j, y) > \theta_{\mathrm{SDG}} \}$
4: where $\mathrm{corr}(\cdot)$ denotes the physical correlation metric and $\theta_{\mathrm{SDG}}$ is the SDG screening threshold.
5: Stage 2 (SHAP Data-Driven Screening):
6: a. Train an XGBoost model $M$ and calculate the SHAP values $\phi_j$ of the features in $\Omega_{\mathrm{SDG}}$ (see the SHAP definition above);
7: b. Compute the feature importance score $S_j = \frac{1}{N} \sum_{i=1}^{N} |\phi_j(x_i)|$;
8: c. Retain features with scores higher than the threshold: $X^* = \{ x_j \in \Omega_{\mathrm{SDG}} \mid S_j > \tau \}$
9: Return: $X^*$
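A compact sketch of the data-driven stage of this pipeline is given below, assuming `X_sdg` is a pandas DataFrame holding the SDG-screened candidate variables and `y` the fault labels (both placeholder names); it uses the Top-k variant from the comparison that follows in place of the threshold $\tau$, and the XGBoost hyperparameters are illustrative rather than our tuned values.

```python
# Sketch of Stage 2 (SHAP data-driven screening) of Algorithm 1.
# `X_sdg` (DataFrame of SDG-screened candidates) and `y` are placeholders.
import numpy as np
import pandas as pd
import shap
import xgboost as xgb

def shap_screen(X_sdg: pd.DataFrame, y: pd.Series, top_k: int = 23) -> list:
    model = xgb.XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
    model.fit(X_sdg, y)
    # TreeSHAP computes exact SHAP values for tree ensembles.
    shap_values = shap.TreeExplainer(model).shap_values(X_sdg)
    if isinstance(shap_values, list):   # older SHAP API: one array per class
        importance = np.mean(
            [np.abs(sv).mean(axis=0) for sv in shap_values], axis=0
        )
    else:
        importance = np.abs(shap_values).mean(axis=0)
        if importance.ndim > 1:         # (n_features, n_classes) -> per-feature
            importance = importance.mean(axis=-1)
    ranked = np.argsort(importance)[::-1][:top_k]
    return [X_sdg.columns[i] for i in ranked]
```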
To quantitatively validate the effectiveness of the feature selection strategy, we design the following comparative experiments:
  • Baseline-ALL: directly uses all 23 original variables;
  • Baseline-SDG: trains XGBoost only with the SDG-selected subset $\Omega_{\mathrm{SDG}}$;
  • Proposed: further applies SHAP-based screening on $\Omega_{\mathrm{SDG}}$ and retains the Top-k features with the largest mean absolute SHAP values.
Studies have shown that, after appropriate parameter tuning, XGBoost often outperforms alternative methods such as random forests or deep neural networks in many tasks. Moreover, XGBoost is highly compatible with SHAP: its TreeSHAP algorithm allows efficient and exact computation of SHAP values for tree-based models. Motivated by this, we adopt SHAP-XGBoost as the interpretable feature selection method in this work to enhance feature discriminability and model generalization. The experimental results presented later further confirm the superiority of SHAP-XGBoost in feature selection and importance evaluation.
Building on the above, we construct a two-stage “SDG-Shield → SHAP-XGBoost” feature selection framework. The Top-k physically consistent features resulting from this pipeline are used as the model input, forming a complete “feature → representation → classification” learning chain and enabling an end-to-end processing flow from physically consistent feature extraction to deep representation learning and, finally, to fault classification.
Through the SHAP-XGBoost strategy, we successfully remove redundant variables and identify a compact subset of key features with strong physical relevance. However, high-quality input features are only the first step. To fully exploit these time-series features, we still require a more powerful feature extractor capable of capturing deep dynamic dependencies while mitigating forgetting in incremental learning. Therefore, Section 3.3 will introduce the proposed model improvements in detail.

3.3. Model Improvements

3.3.1. Class-Incremental Fault Diagnosis Framework Based on Supervised Contrastive Knowledge Distillation

The Supervised Contrastive Knowledge Distillation-based class-incremental fault diagnosis framework (SCLIFD) is a machine learning framework designed for industrial fault diagnosis [19], particularly suitable for scenarios with limited data, class imbalance, and long-tailed distributions. It enables the model to retain previously learned knowledge while improving recognition performance for minority and newly introduced fault classes. In this work, we adopt SCLIFD as the baseline framework for nuclear fault diagnosis and extend it to better fit the characteristics and requirements of nuclear power systems.
The core idea of SCLIFD is to combine supervised contrastive learning with knowledge distillation. The contrastive loss $\mathcal{L}_{\mathrm{con}}$ encourages samples of the same class to be close in the feature space while pushing samples of different classes apart. The distillation loss $\mathcal{L}_{\mathrm{dis}}$ forces the new model (student) to mimic the responses of the old model (teacher) on old classes, thereby preserving decision boundaries for previous classes when learning new ones and effectively alleviating catastrophic forgetting. In addition, margin-based exemplar selection (MES) is used to store representative samples in the memory buffer of size K, and a Balanced Random Forest (BRF) is used as the final classifier to mitigate class imbalance. Specifically, the memory buffer is managed following a class-balanced exemplar strategy. When a new class arrives, candidate exemplars are first evaluated using the MES criterion, which favors samples located close to the class decision boundary that better represent intra-class variability. The buffer is then updated by retaining a fixed number of exemplars per class. If the total buffer capacity K is exceeded, older classes proportionally reduce their stored exemplars to ensure balanced memory allocation across all learned classes. This per-class capacity constraint prevents dominance of early classes and helps maintain stable performance throughout incremental sessions. The overall framework is illustrated in Figure 3.
The total loss of Supervised Contrastive Knowledge Distillation (SCKD) consists of the contrastive loss $\mathcal{L}_{\mathrm{con}}$ and the distillation loss $\mathcal{L}_{\mathrm{dis}}$:
$$\mathcal{L}_{\mathrm{SCKD}} = \lambda \cdot \mathcal{L}_{\mathrm{con}} + (1 - \lambda) \cdot \mathcal{L}_{\mathrm{dis}},$$
where $\lambda \in [0.3, 0.7]$ is a trade-off coefficient.
The contrastive loss that pulls positive (same-class) samples together and pushes negative (different-class) samples apart is given by [20]:
$$\mathcal{L}_{\mathrm{con}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\sum_{j \in P_i} \exp\!\left( s(f_i, f_j) / \tau \right)}{\sum_{k \in P_i \cup N_i} \exp\!\left( s(f_i, f_k) / \tau \right)},$$
where $P_i$ and $N_i$ denote the sets of positive and negative samples of sample $i$, respectively; $s(f_i, f_j)$ is the cosine similarity between feature vectors; and $\tau \in [0.05, 0.2]$ is the temperature parameter.
The distillation loss encourages the student model to imitate the output distribution of the teacher model:
$$\mathcal{L}_{\mathrm{dis}} = -\sum_{c=1}^{C} p_{\mathrm{tea}}(c) \log p_{\mathrm{stu}}(c),$$
where $p_{\mathrm{tea}}(c)$ and $p_{\mathrm{stu}}(c)$ denote the softmax probabilities of class $c$ predicted by the teacher and student models, respectively.
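As an illustration, a minimal PyTorch sketch of this combined loss is shown below. It assumes every sample has at least one same-class positive in the batch, and the function and tensor names are ours, not from the SCLIFD code.

```python
# Minimal PyTorch sketch of the SCKD loss defined above. Names and batch
# conventions are illustrative; assumes every sample has >= 1 positive.
import torch
import torch.nn.functional as F

def sckd_loss(feats, labels, logits_stu, logits_tea, lam=0.5, tau=0.1):
    # Supervised contrastive term over cosine similarities s(f_i, f_k)/tau.
    f = F.normalize(feats, dim=1)
    sim = f @ f.t() / tau
    pos_mask = (labels[:, None] == labels[None, :]).float()
    pos_mask.fill_diagonal_(0)                        # exclude self-pairs
    all_mask = 1.0 - torch.eye(len(labels), device=f.device)
    exp_sim = torch.exp(sim) * all_mask               # k in P_i union N_i
    l_con = -torch.log((exp_sim * pos_mask).sum(1) / exp_sim.sum(1)).mean()
    # Distillation term: student mimics the teacher's softmax outputs.
    p_tea = F.softmax(logits_tea, dim=1)
    l_dis = -(p_tea * F.log_softmax(logits_stu, dim=1)).sum(1).mean()
    return lam * l_con + (1.0 - lam) * l_dis
```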

3.3.2. Improved LSTM-Transformer Fusion Layer

This section first introduces the basic structures of LSTM and Transformer, and then discusses the importance of combining them into a hybrid LSTM-Transformer layer for time-series modeling.
Long Short-Term Memory (LSTM) networks are a special type of recurrent neural network (RNN) that incorporates gating mechanisms (forget gate, input gate, output gate) and a cell state [21], effectively addressing the gradient explosion and vanishing problems encountered by conventional RNNs when training on long sequences. Their strong adaptability to time-series forecasting has been widely demonstrated [21,22,23], with a key advantage being the ability to precisely control the retention and forgetting of temporal information. LSTM networks have been applied to various domains. For instance, Karim et al. proposed the LSTM-FCN and Attention LSTM-FCN (ALSTM-FCN) architectures for multivariate time-series classification, achieving strong performance on challenging benchmarks [24]. Gers et al. also showed that LSTM can address many time-series tasks that fixed-window feedforward networks cannot [25]. Another related model, the Gated Recurrent Unit (GRU), has also been widely used for time-series prediction. Liu and Chen [26] proposed MIXGU, a linear mixed gating unit that combines GRU and MGU. The overall architecture of LSTM is shown in Figure 4.
Transformer is a deep learning architecture based on the self-attention mechanism, composed of an encoder and a decoder [27]. The encoder leverages self-attention to capture global dependencies within the input sequence, while the decoder generates outputs conditioned on encoder representations and uses masked attention to prevent information leakage. The model employs multi-head attention to enhance multi-dimensional semantic understanding and positional encoding to compensate for the lack of sequential order information, enabling highly efficient parallel computation. Transformer has become a fundamental architecture for tasks such as machine translation and text generation. Its overall structure is shown in Figure 5.
Compared with LSTM/CNN, Transformer relies on self-attention to model global dependencies and can be easily trained in parallel on modern hardware [27]. However, its attention computation has $O(n^2)$ complexity with respect to the sequence length $n$, which can be costly for long sequences. On the other hand, when sequences are relatively short or data are limited, LSTM may be more computationally efficient. Unlike LSTM and CNN, Transformer allows each query to attend to all keys, and the resulting attention weights are applied to all values in the sequence, enabling flexible global interaction.
Nevertheless, a standalone Transformer may suffer from high computational cost when handling long sequences and is constrained by its encoder–decoder architecture design [28]. A standalone LSTM, in turn, may be less effective in modeling highly variable nuclear time-series data. Therefore, we modify the backbone network of the SCLIFD framework and propose a new time-series architecture, the LSTM-Transformer, by fusing the two models [29]. The resulting hybrid layer forms a powerful representation module that strengthens feature dependence modeling, better addresses the classification challenges of nuclear time-series data, and further alleviates the impact of class imbalance. The improved architecture is illustrated in Figure 6.
In the proposed design, the LSTM first processes the input sequence and extracts local temporal patterns. The resulting LSTM hidden states, which encode local temporal information, are then fed into the Transformer. The Transformer further captures global dependencies and high-level semantic information from the sequence of LSTM hidden states. This hierarchical feature extraction mechanism enables the model to jointly capture local and global characteristics of time-series data, which is essential for accurately modeling complex and highly coupled temporal patterns.
Unlike naïve model ensembles, our LSTM-Transformer layer allows dynamic information interaction between the LSTM and Transformer components. During training, the Transformer attention mechanism can guide the LSTM to focus on more relevant temporal segments, while the sequential processing of the LSTM provides rich contextual information for the Transformer’s attention computation. This dynamic interplay enables the model to better adapt to the varying properties of time-series data and improves its ability to model complex temporal relationships.
By integrating LSTM and Transformer into a unified layer, we simultaneously strengthen feature representation and temporal dependency modeling. The short-term dependency modeling capability of LSTM and the long-range dependency modeling capability of Transformer complement each other. As a result, the LSTM-Transformer layer can more effectively represent complex dependencies in time-series data, which is beneficial for handling the multi-time-scale and multi-process-coupling characteristics of nuclear power plant time-series signals.
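A minimal sketch of this cascaded layer is given below, assuming the hidden size of 512 from Section 4.3; the number of encoder layers and attention heads are illustrative choices, not our tuned values.

```python
# Sketch of the cascaded LSTM-Transformer layer described above.
# Layer counts and head numbers are illustrative.
import torch
import torch.nn as nn

class LSTMTransformerLayer(nn.Module):
    def __init__(self, d_in: int, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # The LSTM captures local temporal dependencies.
        self.lstm = nn.LSTM(d_in, d_model, batch_first=True)
        # Its hidden states act as position-aware tokens for the Transformer,
        # which models global dependencies across the sequence.
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_in)
        h, _ = self.lstm(x)             # (batch, seq_len, d_model)
        z = self.transformer(h)         # (batch, seq_len, d_model)
        return z[:, -1]                 # last time step as sequence summary
```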

3.3.3. Improved Framework with ATFNet-Based Time–Frequency Fusion

Since nuclear fault data typically exhibit pronounced periodic characteristics, we further introduce a frequency-domain feature branch on top of the original time-domain modeling, forming a dual-branch time–frequency framework. In incremental learning scenarios, purely time-domain features are prone to distribution drift across operating conditions, which may cause significant fluctuations in model parameters. In contrast, frequency-domain features (e.g., specific harmonic components) are often more translation-invariant and robust to operating variations. Therefore, incorporating a frequency-domain branch as an auxiliary feature pathway not only enables global modeling of periodic information from a frequency perspective and provides relatively stable “memory anchors”, but also maintains strong discriminative capability under class imbalance and gradually evolving data distributions. Based on this motivation, we adopt an ATFNet architecture that fuses time- and frequency-domain features [30] to enhance robustness and generalization performance for incremental nuclear fault diagnosis.
The improved approach integrates predictions from time-domain and frequency-domain modules, leveraging the strengths of both. For time series with different periodic characteristics, the model can dynamically adjust the relative weights of the two modules according to the strength of periodicity in the nuclear data, effectively tuning the contribution of time and frequency representations. By extending the DFT and complex-valued spectral attention mechanisms, the improved ATFNet also addresses the “frequency misalignment” issue in conventional methods and achieves superior long-term forecasting performance on multiple real-world datasets. A notable feature of the enhanced ATFNet is the harmonic energy weighting mechanism, which adaptively adjusts the contribution of each module based on the specific periodic patterns present in the time series.
To avoid instability in the computation of $\alpha$ due to short-sequence fluctuations, we apply exponential moving averages (EMAs) to smooth the energy terms $E_T$ and $E_F$ and introduce a small constant $\epsilon = 10^{-6}$ in the denominator of the energy normalization to improve numerical stability.
In our experiments, we employ the improved ATFNet framework to analyze nuclear fault data. Specifically, we feed the original time-series data into the time-domain module (T-block) while simultaneously transforming the data into the frequency domain via the Fourier transform and feeding it into the frequency-domain module (F-block). In this way, the time-domain module captures the dynamic evolution along the time axis, whereas the frequency-domain module reveals periodic behaviors and frequency components. The frequency-domain module can effectively extract key frequency components associated with faults, which may correspond to equipment operating cycles, potential vibration frequencies, or other periodic disturbance factors in nuclear fault data. By jointly exploiting time- and frequency-domain information, the improved ATFNet framework achieves more comprehensive fault feature extraction. The overall architecture of the improved ATFNet is shown in Figure 7.
Let the time-series signal be $x \in \mathbb{R}^{T}$ with entries $x(t)$, where $T$ is the sequence length. Its discrete Fourier transform (DFT) is defined as
$$X(k) = \sum_{t=0}^{T-1} x(t)\, e^{-j 2 \pi k t / T}, \quad k = 0, 1, \dots, T-1,$$
where $j$ is the imaginary unit and $X(k)$ is the complex amplitude at frequency index $k$.
Let $y_T$ and $y_F$ denote the outputs of the time-domain and frequency-domain modules, respectively. Their fusion is given by
$$y = \alpha \cdot y_T + (1 - \alpha) \cdot y_F,$$
where the fusion weight $\alpha$ is dynamically adjusted according to the energies of the time- and frequency-domain representations:
$$\alpha = \frac{E_T}{E_T + E_F},$$
with
$$E_T = \sum_{t=0}^{T-1} y_T(t)^2, \qquad E_F = \sum_{k=0}^{T-1} |X(k)|^2$$
being the time-domain energy and frequency-domain energy, respectively.
For the harmonic energy weighting, the weight of the $m$-th harmonic component is defined as
$$\omega_m = \frac{|X(m f_0)|^2}{\sum_{n=1}^{M} |X(n f_0)|^2},$$
where $f_0$ is the fundamental frequency and $M$ is the maximum harmonic order. The harmonic weights satisfy
$$\sum_{m=1}^{M} \omega_m = 1.$$
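To make the fusion rule concrete, the sketch below computes the energy-based weight $\alpha$ with the EMA smoothing and $\epsilon$ described above; the function and state names are ours, and the branch outputs `y_t`/`y_f` are assumed to share the same shape.

```python
# Sketch of the energy-based time-frequency fusion with EMA smoothing.
# `y_t`, `y_f`: outputs of the time-/frequency-domain branches (same shape);
# `x`: the raw input window (batch, T); `state`: persistent EMA buffers.
import torch

EPS = 1e-6  # stability constant epsilon from the text

def fuse(y_t, y_f, x, state, beta=0.9):
    X = torch.fft.fft(x, dim=-1)                  # DFT X(k) of the input
    e_t = (y_t ** 2).sum().detach()               # time-domain energy E_T
    e_f = (X.abs() ** 2).sum().detach()           # frequency-domain energy E_F
    # Exponential moving averages smooth short-sequence fluctuations.
    state["e_t"] = beta * state.get("e_t", e_t) + (1 - beta) * e_t
    state["e_f"] = beta * state.get("e_f", e_f) + (1 - beta) * e_f
    alpha = state["e_t"] / (state["e_t"] + state["e_f"] + EPS)
    return alpha * y_t + (1 - alpha) * y_f        # y = a*y_T + (1-a)*y_F
```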

4. Experimental Design

4.1. Experimental Environment Configuration

The experiments were conducted on macOS 14.x with an Apple M2 (8-core CPU), 16 GB RAM, and PyTorch 2.1.2. All results were averaged over three runs using random seeds 0, 1, 2, and reported as mean ± standard deviation. In addition, 95% confidence intervals were estimated using the t-distribution (df = 2), and their relative magnitudes indicate stable performance across runs. The normalization parameters were estimated solely from the training set and shared across sessions to avoid temporal leakage.

4.2. Data Preparation

To evaluate the performance of the proposed improved class-incremental fault diagnosis framework under class-imbalanced conditions for nuclear power plant fault detection, this study selected five typical fault types from the NPPAD dataset: FLB, LOCA, SGATR, SGBTR, and SLBIC. First, the Signed Directed Graph (SDG) method was used to identify 23 key feature parameters for each fault type, as shown in Table 4.
Subsequently, the SHAP-XGBoost feature importance analysis framework was applied to extract the optimal 23-dimensional feature subset from all candidate features. The final selected key features are summarized in Table 5. This two-stage feature selection strategy effectively enhances the discriminative capability of the model inputs while suppressing the influence of redundant features.
After feature selection, an independent data processing directory was created to store the preprocessed data. The original NPPAD data were provided in CSV format; to satisfy the requirements of the experimental workflow, all samples were converted to TXT format, and the corresponding labels were stored separately. A sliding-window method was then employed to segment the time-series data, with a window length of L = 50 and a stride of s = 1 . The training and testing sets were split strictly following the non-leakage principle across sessions and temporal order to ensure reliable evaluation. Finally, the fault and normal data were individually normalized and subsequently merged to construct the final sample set used for model training and testing.
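The sliding-window segmentation is straightforward; a minimal sketch, assuming `series` holds one preprocessed run as a (time_steps, n_features) array, is as follows.

```python
# Sketch of the sliding-window segmentation (L = 50, s = 1).
# `series` is a placeholder: one preprocessed NPPAD run, (time_steps, n_features).
import numpy as np

def sliding_windows(series: np.ndarray, length: int = 50, stride: int = 1):
    windows = [
        series[start : start + length]
        for start in range(0, len(series) - length + 1, stride)
    ]
    return np.stack(windows)  # (n_windows, length, n_features)
```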

4.3. Model Training

The model was trained on both the TEP and NPPAD datasets following the class-incremental learning protocol described above. A suitable memory buffer size was configured, the number of training iterations was set to 100, one new fault class was introduced at each incremental session, and the learning rate was fixed at 0.001 throughout all experiments.
Table 6 summarizes the key hyperparameters used in the proposed framework, including those of the LSTM-Transformer and ATFNet modules, as well as the incremental-learning related coefficients (e.g., λ and τ ). Unless otherwise stated, all hyperparameters were selected based on validation performance for each dataset and then kept fixed across all incremental sessions to avoid session-dependent tuning and ensure experimental reproducibility.
The network takes time-series input of shape (batch size, sequence length, 23) and processes it through the following steps; a minimal shape-level sketch follows the list:
  • Multiple residual blocks in the encoder extract the input features, producing an output of (batch size, sequence length, 512).
  • The extracted features are fed into an LSTM layer (hidden size 512, unidirectional or bidirectional), yielding (batch size, sequence length, 512 × directions ).
  • After reshaping, the features are passed through a Transformer layer, maintaining the same shape. The output is then reshaped back, and the last time step is taken, resulting in (batch size, 512 × directions ).
  • If the frequency-domain processing module (F_Block) is enabled, the above features are reshaped into (batch size, feature dimension, 1) and passed into the F_Block. The predicted sequence is generated, and the last time step is taken to obtain frequency-domain features of shape (batch size, output feature dimension).
  • The time-domain and frequency-domain features are concatenated to form the fused representation of shape (batch size, 512 × directions + output feature dimension ) .
  • The fused features are fed into a fully connected layer to perform classification, producing predictions of shape (batch size, number of classes).
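The sketch below traces these shapes through a generic forward pass; the module names (encoder, lstm, transformer, f_block, classifier) are placeholders rather than our exact implementation.

```python
# Shape-level sketch of the forward pass described above; module names
# are illustrative placeholders.
import torch

def forward(x, encoder, lstm, transformer, f_block, classifier,
            use_f_block=True):
    # x: (batch, seq_len, 23)
    h = encoder(x)                       # (batch, seq_len, 512)
    h, _ = lstm(h)                       # (batch, seq_len, 512 * directions)
    h = transformer(h)                   # same shape
    t_feat = h[:, -1]                    # (batch, 512 * directions)
    if use_f_block:
        f_in = t_feat.unsqueeze(-1)      # (batch, feature_dim, 1)
        f_feat = f_block(f_in)[:, -1]    # (batch, output_feature_dim)
        t_feat = torch.cat([t_feat, f_feat], dim=1)  # fused representation
    return classifier(t_feat)            # (batch, num_classes)
```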

5. Experimental Results and Analysis

This section first compares different combinations of network structures on the TEP and NPPAD datasets to verify the cross-scenario generalization capability of the proposed method. Next, the improved method is compared horizontally with representative class-incremental learning algorithms, and its discriminative ability for different fault categories is further analyzed using confusion matrices. Finally, from the perspective of noise robustness, we assess the applicability of the model under realistic operating conditions in the nuclear power industry. On this basis, Section 5.2 presents systematic ablation studies from the three aspects of feature design, model architecture, and memory configuration.

5.1. Result Analysis

To validate the generalization ability of the proposed method across industrial scenarios, in addition to experiments on the NPPAD dataset, this study first performed the same model-structure comparison on the public benchmark TEP dataset. The corresponding results are shown in Table 7. It can be observed that the trend on TEP is consistent with that on NPPAD: as the LSTM, Transformer, and ATFNet modules are gradually introduced, the accuracy in all five incremental sessions continuously improves.
Specifically, after introducing the LSTM, the accuracy in sessions S2–S3 increases significantly. When the Transformer is further incorporated, long-range dependencies are better captured. Finally, after integrating ATFNet, the introduction of stable frequency-domain features leads to the most notable improvement in session S5, and the average accuracy reaches 93.16%.
On this basis, to further evaluate the applicability of the method in nuclear power scenarios, we apply the same structure combinations to the NPPAD dataset, with the results summarized in Table 8. A similar trend is observed on the NPPAD dataset, where progressively integrating model components leads to steady performance improvements. This result indicates that the proposed architecture is consistently effective across different scenarios and provides a solid foundation for the more detailed ablation analysis that follows.
In addition, to better illustrate the feature separability under class imbalance, we visualize the t-SNE embeddings of each class in each incremental session, as shown in Figure 8. To ensure comparability across incremental sessions, the t-SNE plots in Figure 8 use the same configuration as Figure 1 with a fixed random seed (random_state = 0), and are computed on the model-extracted feature vectors under the same normalization strategy. The plots show the distribution and separation of the feature vectors extracted by the improved SCLIFD framework. As the number of learned classes increases, the model can continuously extract discriminative features and maintain clear inter-class boundaries.
To further investigate the benefits of the structural improvements at the category level, we plot the confusion matrices of session 5 on the NPPAD dataset, as shown in Figure 9. It can be seen that, for the improved model, the old classes learned in earlier sessions still maintain high correct recognition rates (diagonal entries ≥ 85%). Misclassifications are mainly concentrated between the new class SLBIC and a few neighboring old classes (such as FLB and SGBTR). Although a small amount of cross-class confusion remains due to feature similarity, the overall number of misclassifications is significantly lower than that of the unimproved model, indicating that the improved architecture better preserves old-class knowledge while learning new classes.
After completing the comparison of different network module combinations, we further evaluate the overall competitiveness of the proposed method on class-incremental fault diagnosis tasks by comparing it with representative existing class-incremental learning methods. To this end, we select several widely used baselines, including Fine-tuning, LwF, iCaRL, and the original SCLIFD, and conduct additional experiments under the same incremental-session configuration. The results are reported in Table 9.
Here, we introduce the forgetting metric F to quantify the degradation of the model’s recognition performance on old classes in later sessions, and use it as an additional evaluation indicator:
$$F = \frac{1}{T-1} \sum_{t=1}^{T-1} \left[ \max_{k \in \{1, \dots, T\}} a_t(k) - a_t(T) \right],$$
where $a_t(k)$ denotes the accuracy on task $t$ after training on session $k$, and $T$ is the total number of sessions.
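Computing F from a session-by-task accuracy matrix takes only a few lines; the sketch below assumes `acc[k, t]` stores the accuracy on task t after session k (0-indexed), a layout we introduce here for illustration.

```python
# Sketch of the forgetting metric F defined above. `acc` is a (T, T) array
# with acc[k, t] = accuracy on task t after session k (0-indexed); entries
# for sessions before task t is introduced should be filled with 0.
import numpy as np

def forgetting(acc: np.ndarray) -> float:
    T = acc.shape[0]
    drops = [acc[:, t].max() - acc[T - 1, t] for t in range(T - 1)]
    return float(np.mean(drops))
```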
From Table 9, it can be observed that the proposed method achieves the best performance in terms of average accuracy, last-session accuracy, and average macro-F1, and exhibits a particularly pronounced advantage in terms of forgetting. This demonstrates that, compared with traditional class-incremental strategies that rely only on knowledge distillation or sample replay, the proposed joint “feature–structure–memory” design can more effectively alleviate catastrophic forgetting across sessions.
In addition to network architecture and incremental strategy, measurement signals in nuclear power industry scenarios are often subject to noise and disturbances, making model stability under noisy conditions highly important. Therefore, we inject additive Gaussian noise with different intensities into the NPPAD dataset and compare the performance degradation of the original SCLIFD model and the proposed improved model. The experimental results are summarized in Table 10.
Furthermore, to more intuitively present the impact of noise on model performance, we plot the average accuracy versus noise level, as shown in Figure 10. It can be seen that the accuracy of the conventional model drops rapidly as the noise intensity increases, whereas the proposed model exhibits a much smoother degradation trend, indicating stronger robustness under complex disturbance conditions. Overall, the proposed method achieves stable and significant performance gains across different datasets, in comparison with mainstream class-incremental learning frameworks, and under noisy conditions. This provides strong support for the subsequent ablation analysis from the perspectives of features, structure, and memory strategies.

5.2. Ablation Study and Analysis

To systematically verify the effectiveness and independent contribution of the feature selection strategy, network components, and memory buffer configuration in the proposed framework for class-incremental fault diagnosis, this section designs multiple groups of ablation experiments along the three dimensions of “feature–structure–memory”.

5.2.1. Effectiveness of the Feature Selection Strategy

Table 11 reports the impact of different feature construction strategies on diagnostic performance. The baseline method (All-23) retains the original features only via random sampling, resulting in an accuracy of merely 69.00%, and both macro-F1 and balanced accuracy stagnate around 68%. This indicates that the original high-dimensional feature space contains a large amount of redundant information and noise, which severely interferes with the classifier’s decision boundary.
After introducing the SDG-Shield physics-guided causal selection, all evaluation metrics improve by more than 13%, demonstrating the effectiveness of physical prior knowledge in removing irrelevant variables. Furthermore, the proposed hybrid strategy that combines “physics constraints + data-driven selection” (SDG-Shield + SHAP) achieves the best performance, with accuracy reaching 89.70%. This result verifies that incorporating SHAP-XGBoost importance evaluation can further refine the feature subset and provide high signal-to-noise ratio inputs for downstream temporal modeling.

5.2.2. Contribution Analysis of Model Structure Components

To assess the contribution of different network modules in modeling long-term dependencies and handling incremental tasks, Table 8 presents the performance of various structural combinations across continuous incremental sessions. The results show that a model relying only on the basic encoder suffers from severe degradation in later sessions (Session 5 accuracy of 69.00%) and fails to cope with complex fault patterns effectively.
Under the Initial Feature Selection setting, introducing the LSTM module enhances the ability to capture local temporal features and increases the average accuracy to 88.50%. Further adding the Transformer module leverages self-attention to capture long-range dependencies, improving the overall performance to 89.78%. When ATFNet is additionally integrated, the accuracy in Session 5 is boosted from 69.00% to 82.20%, and the model exhibits more stable performance in later sessions. Under the SHAP-XGBoost feature framework, the performance of all structural combinations is further improved, and the complete architecture achieves an average accuracy of 91.36%.
To more intuitively observe the gains brought by “gradually stacking modules”, Figure 11 shows the performance curves of different structural combinations across incremental sessions. It can be seen that, without the improved structure, the accuracy decreases from incremental session 1 to 3, with a brief rebound in session 4, followed by another clear drop in session 5. In contrast, with the completely improved architecture, the accuracy keeps increasing after session 3, and the final-session accuracy is significantly higher than that of the unimproved model. This phenomenon indicates that the improved structure not only enhances incremental learning capability but also effectively alleviates catastrophic forgetting.

5.2.3. Sensitivity Analysis of Memory Buffer and Sample Configuration

Table 2 further investigates the effect of memory buffer size K and the number of new-class samples (shot) on mitigating catastrophic forgetting. When shot = 1 is fixed, increasing K from 10 to 100 raises the average accuracy from 82.10% to 91.05%, and the macro-F1 in the final session improves from 76.20% to 88.30%. This confirms that enlarging the memory buffer allows the model to preserve old-class distribution more comprehensively, thereby maintaining previously learned knowledge.
On the other hand, when K = 40 is fixed, increasing the number of new-class samples (shot from 1 to 5) significantly reinforces the model’s generalization ability: the average accuracy eventually reaches 94.10%, and the macro-F1 in session 5 rises to 92.30%. This trend indicates that sufficient new-class samples help form clearer decision boundaries and markedly improve recognition performance for long-tail classes such as SLBIC.

5.2.4. Overall Analysis

Synthesizing the ablation results in Table 2, Table 11, and Figure 11, the following conclusions can be drawn:
  • Feature level: an appropriate feature selection strategy yields the most significant performance gain (about +20.7%), and serves as the foundation for raising the performance ceiling.
  • Model level: the improved hybrid architecture contributes an additional performance gain of about +8.4%, substantially enhancing the representation capacity for complex temporal patterns.
  • Strategy level: optimizing the memory buffer and new-class sample configuration further reduces forgetting and improves recognition performance for long-tail classes.
Overall, the proposed joint optimization framework of “feature–model–memory” demonstrates strong robustness and generalization capability in addressing key challenges such as class imbalance and incremental updates in nuclear power plant fault diagnosis.

5.3. Limitations and Directions for Improvement

To further analyze the sources of misclassification, we perform t-SNE visualization and spectral feature analysis on the failure cases, leading to the following observations:

5.3.1. Feature Overlap Between Old and New Classes Causes Cross-Class Misclassification

SLBIC and classes such as FLB and SGBTR exhibit obvious overlap in the high-dimensional feature space, and their frequency-domain representations also show similar harmonic structures. This intrinsic distributional proximity makes the model more prone to cross-class misclassification in the later stages of incremental learning.

5.3.2. Insufficient Memory Buffer Size K Leads to Forgetting of Old-Class Knowledge

When K = 10 , the model fails to sufficiently cover boundary samples of earlier classes, causing the representation of old-class samples to gradually collapse in later sessions. As shown in Table 2, when K increases from 10 to 100, the average macro-F1 improves from 80.45% to 90.80%, indicating that a larger memory capacity can significantly mitigate catastrophic forgetting.

5.3.3. Insufficient Shot Number Results in Poor Fitting of New Classes

When shot = 1 , the number of new-class samples is extremely limited, and their distribution in the feature space becomes highly sparse. As a result, the model cannot adequately learn their internal structure. When the shot number increases from 1 to 5, the macro-F1 in session 5 improves from 85.40% to 92.30%, verifying that moderately increasing the number of new-class samples can substantially enhance the model’s capability to model boundary classes.

5.3.4. Computational Cost and Deployment Considerations

The proposed framework combines an encoder, an LSTM-Transformer fusion layer, and an ATFNet-based time–frequency branch. While this improves diagnostic robustness, it also increases computational overhead compared with the original SCLIFD backbone. To facilitate practical deployment, we report the model size (number of parameters and memory footprint) and the average inference latency measured on our experimental platform (see Table 12). These results highlight the trade-off between accuracy/anti-forgetting gains and real-time feasibility, and motivate future lightweighting and buffer-optimization efforts.

6. Conclusions and Future Work

This study focuses on the problem of class-incremental fault diagnosis based on nuclear power plant operational data and proposes an integrated “feature–temporal–time–frequency fusion” diagnostic framework that jointly considers feature construction, class imbalance, and temporal dependencies. By combining SDG-Shield and SHAP-XGBoost, the framework performs key feature selection and importance evaluation; an LSTM–Transformer architecture is introduced to capture both short- and long-term correlations in fault signals; and an improved ATFNet is further integrated to adaptively model time- and frequency-domain information. Experiments on the NPPAD and other datasets demonstrate that the proposed method achieves high fault recognition accuracy and strong anti-forgetting performance, thereby validating its effectiveness and indicating promising potential for engineering applications.
Despite these advantages, this work still has several limitations. First, most of the experimental data are based on simulated operating conditions, and the coverage of fault types and operating scenarios remains limited. Second, the model architecture is relatively complex, which may lead to high deployment costs in real-time online monitoring scenarios. In future work, we plan to incorporate real operational data from nuclear power plants and conduct more comprehensive validation under multiple operating conditions and diverse fault patterns. In addition, we will explore model lightweighting and memory management optimization under the premise of maintaining diagnostic performance, as well as deeper integration with mechanistic models.

Author Contributions

Methodology, Q.Z.; Validation, Q.Z.; Data curation, Q.Z.; Writing—original draft, Q.Z.; Writing—review & editing, Q.Z.; Supervision, H.L.; Project administration, Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hu, G.; Zhou, T.; Liu, Q. Data-driven machine learning for fault detection and diagnosis in nuclear power plants: A review. Front. Energy Res. 2021, 9, 663296. [Google Scholar] [CrossRef]
  2. Qi, B.; Liang, J.; Tong, J. Fault diagnosis techniques for nuclear power plants: A review from the artificial intelligence perspective. Energies 2023, 16, 1850. [Google Scholar] [CrossRef]
  3. Mu, Y. Research on Fault Diagnosis Technology of Nuclear Power Plants Based on Data Mining. Ph.D. Thesis, Harbin Engineering University, Harbin, China, 2011. (In Chinese). [Google Scholar]
  4. Wang, Z.; Wei, H.; Tian, R.; Tan, S. A review of data-driven fault diagnosis method for nuclear power plant. Prog. Nucl. Energy 2025, 186, 105785. [Google Scholar] [CrossRef]
  5. Shi, Y.; Xue, X.; Xue, J.; Qu, Y. Fault Detection in Nuclear Power Plants Using Deep Learning-Based Image Classification with Imaged Time-Series Data. Int. J. Comput. Commun. Control 2022, 17, 4714. [Google Scholar] [CrossRef]
  6. Gross, K.; Singer, R.; Wegerich, S.; Herzog, J.; VanAlstine, R.; Bockhorst, F. Application of a Model-Based Fault Detection System to Nuclear Plant Signals; Technical Report; Argonne National Lab. (ANL): Argonne, IL, USA, 1997. [Google Scholar]
  7. Zhang, Q.; Geng, S. Dynamic uncertain causality graph applied to dynamic fault diagnoses of large and complex systems. IEEE Trans. Reliab. 2015, 64, 910–927. [Google Scholar] [CrossRef]
  8. Wang, P.; Zhang, J.; Wan, J.; Wu, S. A fault diagnosis method for small pressurized water reactors based on long short-term memory networks. Energy 2022, 239, 122298. [Google Scholar] [CrossRef]
  9. Guo, J.; Wang, Y.; Sun, X.; Liu, S.; Du, B. Imbalanced data fault diagnosis method for nuclear power plants based on convolutional variational autoencoding Wasserstein generative adversarial network and random forest. Nucl. Eng. Technol. 2024, 56, 5055–5067. [Google Scholar] [CrossRef]
  10. Dai, Y.; Peng, L.; Juan, Z.; Liang, Y.; Shen, J.; Wang, S.; Tan, S.; Yu, H.; Sun, M. An intelligent fault diagnosis method for imbalanced nuclear power plant data based on generative adversarial networks. J. Electr. Eng. Technol. 2023, 18, 3237–3252. [Google Scholar] [CrossRef]
  11. Yongkuo, L.; Zhen, L.; Xiaotian, W. Research on fault diagnosis with SDG method for nuclear power plant. At. Energy Sci. Technol. 2014, 48, 1646–1653. [Google Scholar]
  12. Rebuffi, S.A.; Kolesnikov, A.; Sperl, G.; Lampert, C.H. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2001–2010. [Google Scholar]
  13. Qi, B.; Xiao, X.; Liang, J.; Po, L.c.C.; Zhang, L.; Tong, J. An open time-series simulated dataset covering various accidents for nuclear power plants. Sci. Data 2022, 9, 766. [Google Scholar] [CrossRef]
  14. Ricker, N.L. Decentralized control of the Tennessee Eastman challenge process. J. Process Control 1996, 6, 205–221. [Google Scholar] [CrossRef]
  15. Ruiz-Cárcel, C.; Cao, Y.; Mba, D.; Lao, L.; Samuel, R. Statistical process monitoring of a multiphase flow facility. Control Eng. Pract. 2015, 42, 74–88. [Google Scholar] [CrossRef]
  16. Liu, Y.-K.; Ayodeji, A.; Wen, Z.-B.; Wu, M.-P.; Peng, M.-J.; Yu, W.-F. A cascade intelligent fault diagnostic technique for nuclear power plants. J. Nucl. Sci. Technol. 2018, 55, 254–266. [Google Scholar]
  17. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. arXiv 2016, arXiv:1603.02754. [Google Scholar] [CrossRef]
  18. Fanaee-T, H.; Gama, J. Event labeling combining ensemble detectors and background knowledge. Prog. Artif. Intell. 2014, 2, 113–127. [Google Scholar] [CrossRef]
  19. Zhang, H.; Yao, Y.; Wang, Z.; Su, J.; Li, M.; Peng, P.; Wang, H. Class Incremental Fault Diagnosis Under Limited Fault Data via Supervised Contrastive Knowledge Distillation. IEEE Trans. Ind. Inform. 2025, 21, 4344–4354. [Google Scholar] [CrossRef]
  20. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised contrastive learning. Adv. Neural Inf. Process. Syst. 2020, 33, 18661–18673. [Google Scholar]
  21. Greff, K.; Srivastava, R.K.; Koutník, J.; Steunebrink, B.R.; Schmidhuber, J. LSTM: A search space odyssey. IEEE Trans. Neural Netw. Learn. Syst. 2016, 28, 2222–2232. [Google Scholar] [CrossRef]
  22. Kao, I.F.; Zhou, Y.; Chang, L.C.; Chang, F.J. Exploring a Long Short-Term Memory based Encoder-Decoder framework for multi-step-ahead flood forecasting. J. Hydrol. 2020, 583, 124631. [Google Scholar] [CrossRef]
  23. Yokoo, K.; Ishida, K.; Ercan, A.; Tu, T.; Nagasato, T.; Kiyama, M.; Amagasaki, M. Capabilities of deep learning models on learning physical relationships: Case of rainfall-runoff modeling with LSTM. Sci. Total Environ. 2022, 802, 149876. [Google Scholar] [CrossRef]
  24. Karim, F.; Majumdar, S.; Darabi, H.; Harford, S. Multivariate LSTM-FCNs for time series classification. Neural Netw. 2019, 116, 237–245. [Google Scholar] [CrossRef] [PubMed]
  25. Gers, F.A.; Eck, D.; Schmidhuber, J. Applying LSTM to time series predictable through time-window approaches. In Proceedings of the International Conference on Artificial Neural Networks, Vienna, Austria, 21–25 August 2001; Springer: Berlin/Heidelberg, Germany, 2001; pp. 669–676. [Google Scholar]
  26. Liu, J.; Chen, S. Non-stationary multivariate time series prediction with selective recurrent neural networks. In Proceedings of the Pacific Rim International Conference on Artificial Intelligence, Cuvu, Fiji, 26–30 August 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 636–649. [Google Scholar]
  27. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 5998–6008. [Google Scholar]
  28. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 11106–11115. [Google Scholar]
  29. Kow, P.Y.; Liou, J.Y.; Yang, M.T.; Lee, M.H.; Chang, L.C.; Chang, F.J. Advancing climate-resilient flood mitigation: Utilizing transformer-LSTM for water level forecasting at pumping stations. Sci. Total Environ. 2024, 927, 172246. [Google Scholar] [CrossRef]
  30. Ye, H.; Chen, J.; Gong, S.; Jiang, F.; Zhang, T.; Chen, J.; Gao, X. ATFNet: Adaptive time-frequency ensembled network for long-term time series forecasting. arXiv 2024, arXiv:2404.05192. [Google Scholar]
Figure 1. The t-SNE feature distribution comparison chart of five typical faults in the NPPAD nuclear power plant fault diagnosis dataset after dimensionality reduction.
Figure 2. Flowchart of the feature evaluation system.
Figure 3. General framework of the proposed method, SCLIFD, for fault diagnosis under limited fault data.
Figure 4. Structure of the long short-term memory (LSTM) network.
Figure 5. Transformer architecture diagram.
Figure 6. LSTM-Transformer architecture diagram.
Figure 7. Improved fusion process based on ATFNet.
Figure 8. Feature distribution visualization of the improved SCLIFD framework across five incremental sessions. Each session shows the model's extracted feature embeddings, demonstrating that the proposed method maintains clear inter-class separation and stable representation quality as new classes are continuously learned.
Figure 9. Confusion matrix comparison of improved and unimproved models in session 5 (NPPAD dataset).
Figure 10. Accuracy variation under different Gaussian noise levels.
Figure 11. Incremental accuracy improvement from individual network modules.
Table 1. Selected NPPAD Fault Types and Their Training/Testing Sample Distribution.

ID | Labels | Operation Conditions                | Training Set | Testing Set
0  | NORMAL | Normal Operation                    | 200          | 200
1  | LOCA   | Loss of Coolant Accident (Hot Leg)  | 5            | 200
2  | SGATR  | Steam Generator A Tube Rupture      | 5            | 200
3  | SGBTR  | Steam Generator B Tube Rupture      | 5            | 200
4  | FLB    | Feedwater Line Break                | 5            | 200
5  | SLBIC  | Steam Line Break Inside Containment | 5            | 200
Table 2. Impact of memory buffer capacity K and new-class shot number on incremental learning performance.

K   | Shot | Avg. Accuracy (%) | Avg. Macro-F1 (%) | Avg. Bal-Acc (%) | Final-Session Macro-F1 (%)
10  | 1    | 82.10             | 80.45             | 80.98            | 76.20
40  | 1    | 89.36             | 88.75             | 89.10            | 85.40
100 | 1    | 91.05             | 90.80             | 90.52            | 88.30
40  | 2    | 92.25             | 91.83             | 91.50            | 89.75
40  | 5    | 94.10             | 93.65             | 93.25            | 92.30
Table 3. Overall dataset statistics and incremental learning settings for TEP and NPPAD.

Dataset | Total Classes | Incremental Sessions | Shot | Training Set (Normal/Fault) | Testing Set | Memory Buffer K
TEP     | 10            | 5                    | 2    | 525/248                     | 1600        | 40/100
NPPAD   | 5             | 5                    | 1    | 225                         | 400         | 5/10
Table 4. Key Parameters for Nuclear Power Plant Fault Diagnosis Selected by SDG.

ID | Node Label | ID | Node Label | ID | Node Label
1  | WLR        | 9  | MCRT       | 17 | QMGA
2  | CNH2       | 10 | MGAS       | 18 | QMGB
3  | RHRD       | 11 | TDBR       | 19 | NSGA
4  | RBLK       | 12 | TSLP       | 20 | NSGB
5  | SGLK       | 13 | TCRT       | 21 | FRCL
6  | MBK        | 14 | QMWT       | 22 | PRB
7  | EBK        | 15 | LSGA       | 23 | TRB
8  | MDBR       | 16 | LSGB       |    |
Table 5. Final Optimal Feature Subset Selected by the SHAP-XGBoost Framework.

ID | Node Label | ID | Node Label | ID | Node Label
1  | HUP        | 9  | WTRA       | 17 | LSGB
2  | TAVG       | 10 | SCMA       | 18 | SCMB
3  | WFWA       | 11 | WTRB       | 19 | WHPI
4  | VOL        | 12 | WECS       | 20 | LSGA
5  | WRCA       | 13 | WRCB       | 21 | QMWT
6  | TSAT       | 14 | STRB       | 22 | PSGB
7  | THB        | 15 | PSGA       | 23 | WFWB
8  | LVPZ       | 16 | WSTA       |    |
Table 6. Key hyperparameter settings used in the proposed framework.

Module           | Hyperparameter           | Value
LSTM-Transformer | LSTM layers/hidden size  | 1/512
                 | Transformer layers/heads | 2/4
                 | d_model/dropout          | 512/0.1
ATFNet           | FFT window length L      | 50
                 | Max harmonic order M     | 5
                 | EMA decay/ε              | 0.9/10⁻⁶
SCKD             | Contrastive weight λ     | 0.5
                 | Temperature τ            | 0.1
Training         | Optimizer/learning rate  | Adam/0.001
                 | Batch size/iterations    | 32/100
                 | Random seeds             | {0, 1, 2}
Table 7. Accuracy Comparison of Different Methods on the TEP Dataset.

Dataset | Feature Selection Method  | LSTM | Transformer | ATFNet | S1     | S2     | S3     | S4     | S5     | Average
TEP     | Initial Feature Selection |      |             |        | 98.20% | 91.50% | 85.30% | 83.10% | 79.40% | 87.50%
        |                           |      |             |        | 98.50% | 93.40% | 87.20% | 85.10% | 81.00% | 89.04%
        |                           |      |             |        | 98.60% | 94.00% | 88.10% | 86.40% | 82.70% | 89.96%
        |                           |      |             |        | 98.70% | 95.20% | 89.50% | 87.10% | 85.30% | 91.16%
        | SHAP-XGBoost Framework    |      |             |        | 99.20% | 95.60% | 90.10% | 88.00% | 87.20% | 92.02%
        |                           |      |             |        | 99.30% | 95.80% | 90.40% | 88.30% | 87.50% | 92.26%
        |                           |      |             |        | 99.40% | 96.10% | 90.80% | 88.60% | 88.10% | 92.60%
        |                           |      |             |        | 99.40% | 96.40% | 91.30% | 89.20% | 89.50% | 93.16%
Table 8. Accuracy Comparison of Different Methods on NPPAD Dataset.

Dataset | Feature Selection Method  | LSTM | Transformer | ATFNet | S1      | S2     | S3     | S4      | S5     | Average
NPPAD   | Initial Feature Selection |      |             |        | 100.00% | 96.50% | 82.83% | 88.12%  | 69.00% | 87.29%
        |                           |      |             |        | 100.00% | 95.75% | 89.17% | 85.88%  | 71.70% | 88.50%
        |                           |      |             |        | 100.00% | 95.25% | 88.00% | 91.88%  | 73.80% | 89.78%
        |                           |      |             |        | 100.00% | 93.25% | 87.50% | 90.25%  | 82.20% | 90.64%
        | SHAP-XGBoost Framework    |      |             |        | 100.00% | 92.00% | 85.33% | 86.75%  | 90.10% | 90.84%
        |                           |      |             |        | 100.00% | 92.25% | 84.50% | 87.625% | 91.20% | 91.11%
        |                           |      |             |        | 100.00% | 92.25% | 86.17% | 87.50%  | 88.60% | 90.90%
        |                           |      |             |        | 100.00% | 91.75% | 86.33% | 89.00%  | 89.70% | 91.36%
Table 9. Comparison with representative class-incremental learning methods on NPPAD dataset.

Method      | Avg. Acc (%) | Last Acc (%) | Avg. Macro-F1 (%) | Forgetting F (%)
Fine-tuning | 78.5         | 65.2         | 76.8              | 22.3
LwF         | 82.4         | 72.1         | 81.5              | 17.8
iCaRL       | 84.1         | 76.3         | 83.6              | 15.2
SCLIFD      | 88.9         | 84.0         | 88.1              | 9.7
Proposed    | 91.4         | 89.7         | 90.9              | 6.3
Table 10. Robustness of different methods to additive Gaussian noise on NPPAD dataset.

Noise Std (%) | Method   | Avg. Acc (%) | Avg. Macro-F1 (%)
0             | SCLIFD   | 88.9         | 88.1
0             | Proposed | 91.4         | 90.9
5             | SCLIFD   | 87.1         | 86.5
5             | Proposed | 90.2         | 89.3
10            | SCLIFD   | 84.0         | 83.1
10            | Proposed | 88.3         | 87.4
20            | SCLIFD   | 79.5         | 78.2
20            | Proposed | 84.7         | 83.5
Table 11. Performance comparison of different feature selection strategies on the NPPAD dataset.

Feature Selection Strategy | Accuracy (%) | Macro-F1 (%) | Bal-Acc (%)
All-23 (Baseline)          | 69.00        | 68.33        | 68.25
SDG-Shield                 | 82.00        | 81.56        | 81.78
SDG-Shield + SHAP          | 89.70        | 89.12        | 89.45
Table 12. Computational cost and inference efficiency comparison.

Model                | Params (M) | Model Size (MB) | Latency (ms/Sample)
SCLIFD (baseline)    | 1.3        | 5.2             | 7.4
Ours w/o ATFNet      | 3.1        | 12.4            | 14.8
Ours w/o Transformer | 3.4        | 13.6            | 16.2
Ours (full model)    | 4.9        | 19.6            | 24.7
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
