Self-Modulated KAN-DCA for Incipient Fault Detection in Industrial Processes

Yu, Xiaomin; Gong, Yingchuan; Chen, Maoyin

doi:10.3390/pr14101512

by Xiaomin Yu ¹, Yingchuan Gong ¹ and Maoyin Chen ²,*

¹ College of Artificial Intelligence, China University of Petroleum (Beijing), Beijing 102249, China

² College of Safety and Ocean Engineering, China University of Petroleum (Beijing), Beijing 102249, China

* Author to whom correspondence should be addressed.

Processes 2026, 14(10), 1512; https://doi.org/10.3390/pr14101512

Submission received: 8 April 2026 / Revised: 1 May 2026 / Accepted: 6 May 2026 / Published: 7 May 2026

(This article belongs to the Section Process Control, Modeling and Optimization)

Abstract

Incipient fault detection in industrial processes remains challenging, particularly for notorious faults 3, 9, and 15 in a chemical process benchmark, namely the Tennessee Eastman process (TEP). This paper proposes a novel unsupervised framework, namely self-modulated KAN-enhanced direct cross-attention (SMK-DCA). It constructs heterogeneous features by integrating raw data, process-aware features, and sliding-window singular values. Kolmogorov–Arnold networks (KAN) enhance nonlinear expressiveness before a cyclic DCA mechanism enables comprehensive interactions among heterogeneous features. A feature-wise linear modulation (FiLM) adaptively calibrates representations, while a sparse autoencoder with multi-target reconstruction amplifies subtle fault signatures. By leveraging KAN’s superior approximation capability and cyclic multi-view fusion, the proposed method effectively captures incipient fault-induced variations often overlooked by conventional approaches. Extensive experiments on TEP demonstrate that SMK-DCA effectively detects incipient faults 3, 9, and 15, while obtaining the best average detection rate across all faults among the compared MSPM and deep learning methods. Furthermore, validation on real-world data from an IGBT power system confirms the generalization capability of the proposed method across different industrial domains.

Keywords: incipient fault detection; deep learning; Kolmogorov–Arnold networks; direct cross-attention; feature-wise linear modulation

1. Introduction

The growing complexity of modern industrial processes raises the risk of catastrophic failures caused by undetected incipient faults. Consequently, timely and accurate detection of incipient faults is essential for preventing such failures. However, incipient faults are inherently difficult to detect [1]. As a widely adopted chemical process benchmark that simulates a realistic industrial plant, the Tennessee Eastman process (TEP) presents significant challenges for incipient fault detection. Among its 21 programmed faults, faults 3, 9, and 15 correspond to disturbances in reactor cooling water inlet temperature, feed temperature, and condenser cooling water inlet temperature, respectively [2]. They are widely recognized as notoriously difficult to detect with most multivariate statistical process monitoring (MSPM) methods [3,4,5,6,7].

Over the past decades, numerous MSPM methods have been developed to improve fault detection. For example, deep probabilistic principal component analysis (DePPCA) [8], canonical variate residual statistics analysis (CVRSA) [9], and slow feature statistics analysis (SFSA) [10] have shown progress. However, their performance on incipient faults remains limited. To our knowledge, only a few MSPM methods, such as CUSUM [11] and canonical variate analysis (CVA) [12], have shown some effectiveness in detecting these faults. Nevertheless, their performance remains unsatisfactory. For example, the reported detection rate for TEP fault 3 is only 73.03%. Moreover, CUSUM often suffers from long detection delays for these incipient faults, whereas CVA relies heavily on variable selection. A twofold weighted statistical feature kernel entropy component analysis (TWSFKECA) method [13] recently achieved improved performance for faults 3 and 9, but still struggled with fault 15. Kolmogorov–Arnold networks (KANs) have also been introduced as an alternative to multi-layer perceptrons (MLPs) for fault detection, for example in KAN autoencoders (KAN-AEs) [14] and KAN with a test-time training mechanism [15]. Numerous ensemble learning approaches have been proposed as well, such as distributed-ensemble stacked autoencoder (DE-SAE) [16] and improved stacking (IStacking) [17]. They can integrate multiple models but are limited by the individual detectors. Feature ensemble net (FENet) shows excellent performance by integrating multiple MSPM statistics [18], but it is not truly a deep learning model, limiting its applicability to complex industrial processes.

Due to the limitations of MSPM methods, researchers have increasingly focused on unsupervised deep learning (DL) models [19,20,21,22,23]. Architectures like dual-attention LSTM autoencoder (DALSTM-AE) [24], residual autoencoder-based transformer (RA-Transformer) [25], adversarial autoencoder (AAE) [26], deep-autoencoder-based principal component analysis (DAE-PCA) [27], and Kantorovich distance-multiblock variational autoencoder (KD-MBVAE) [28] have been proposed for fault detection. Recently, a causal graph spatial–temporal autoencoder (CGSTAE) was proposed, which integrates causal graph learning and spatial–temporal reconstruction to improve the reliability and interpretability of industrial process monitoring [23]. Both artificial neural correlation analysis (ANCA) [29] and decentralized adaptively weighted stacked AE (DAWSAE) [30] achieve relatively high detection rates for faults 3 and 9 in TEP. Nevertheless, these models still struggle with incipient faults 3, 9, and 15 in TEP. This is primarily because convolution and pooling operations tend to suppress low-amplitude fault signatures, treating them as noise. Although supervised DL methods can achieve better performance [31,32,33,34,35], they require labeled fault data. Hence, developing an effective unsupervised DL model for incipient fault detection in chemical and related industrial processes remains an urgent need.

Physics-guided and knowledge-guided modeling provides an important direction for reliable fault detection. By introducing physical constraints, degradation trends, or fault evolution knowledge, these methods can improve model interpretability and robustness [36,37,38]. Recent studies also show the value of prior knowledge under limited data and varying operating conditions [39,40]. However, explicit physical modeling remains difficult for complex chemical processes due to nonlinear coupling and unknown fault propagation mechanisms. Therefore, this work uses process-aware statistical features and SWSV-based local structural features as practical forms of process-related guidance. They provide statistical and structural prior information for incipient fault detection.

In this paper, we propose a novel unsupervised framework named self-modulated KAN-enhanced direct cross-attention (SMK-DCA). We construct a heterogeneous feature set that combines raw time series, process-aware features derived from MSPM statistics, and local structure features obtained via sliding-window singular values. Kolmogorov–Arnold networks (KAN) [41] are introduced to enhance nonlinear expressiveness before attention. Then, a cyclic DCA mechanism is designed to enable comprehensive interactions among heterogeneous features by cyclically rotating their roles as query, key, and value. This cyclic role assignment is a key design of SMK-DCA, allowing each feature stream to actively interact with the other two streams. A feature-wise linear modulation (FiLM) [42] adaptively calibrates the KAN-enhanced representations, and a sparse autoencoder (SAE) reconstruction is used to capture subtle fault patterns. Unlike simple feature concatenation or module stacking, SMK-DCA achieves complete cross-domain fusion of heterogeneous features while preserving their complementary information. It delivers superior performance on TEP, with consistently high detection rates on the difficult incipient faults 3, 9, and 15.

The remainder of this paper is organized as follows. Section 2 presents the main problem. Section 3 introduces the construction of heterogeneous features. Section 4 describes the proposed SMK-DCA framework in detail. Section 5 introduces the logic of fault detection. Section 6 reports the experimental results and analysis on the TEP and IGBT power system. Finally, Section 7 summarizes the conclusions.

2. Problem Formulation and Preliminary

2.1. Problem Formulation

This paper mainly considers the detection of incipient faults in industrial processes. Intuitively, an incipient fault is a subtle and progressive deviation from normal operating conditions. It does not immediately result in significant performance deterioration or system failure, but it may evolve into a severe fault if left undetected.

The main task of this paper is to address the challenge of incipient fault detection, especially the notorious incipient faults 3, 9, and 15 in the chemical process benchmark TEP. To this end, we propose a novel integration framework, namely SMK-DCA, which aims to improve the overall detection performance for difficult incipient faults by integrating complementary heterogeneous features and enabling cyclic interactions among them.

2.2. Preliminary—KAN

KAN offers superior function approximation through learnable activations, in contrast to the fixed activations of multi-layer perceptrons (MLPs) [41]. A KAN layer is given by $\Phi = \{F_{q,p}\}$ for $p = 1, 2, \dots, n_{in}$ and $q = 1, 2, \dots, n_{out}$, where $F_{q,p}$ is a learnable activation function. The output of layer $l$ passing through layer $l + 1$ is expressed as

$$x_{l+1} = \Phi_{l} x_{l} = \begin{pmatrix} F_{l,1,1}(\cdot) & \dots & F_{l,1,n_{l}}(\cdot) \\ F_{l,2,1}(\cdot) & \dots & F_{l,2,n_{l}}(\cdot) \\ \vdots & \ddots & \vdots \\ F_{l,n_{l+1},1}(\cdot) & \dots & F_{l,n_{l+1},n_{l}}(\cdot) \end{pmatrix} x_{l} \tag{1}$$

where $\Phi_{l}$ is the activation function matrix for layer $l$. Hence, the output of a KAN consisting of $L$ layers is described by $\mathrm{KAN}(x) = (\Phi_{L-1} \circ \Phi_{L-2} \circ \dots \circ \Phi_{1} \circ \Phi_{0}) x$, where $x = x_{0} \in \mathbb{R}^{n}$ is the input.

For KAN, the activation function $F(x)$ is represented by $F(x) = w_{b} \cdot b(x) + w_{S} \cdot \mathrm{spline}(x)$ with the sigmoid linear unit (SiLU) $b(x) = \mathrm{SiLU}(x) = \frac{x}{1 + e^{-x}}$ and $\mathrm{spline}(x)$ being a linear combination of B-spline basis functions, given by $\mathrm{spline}(x) = \sum_{i} c_{i} B_{i}(x)$. Here, $c_{i}$ are trainable parameters. Unlike MLPs, KAN supports the dynamic expansion of grid points for spline functions during training. Thus, KAN can effectively extract key information from the data.
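As an illustration of the activation $F(x) = w_{b} \cdot b(x) + w_{S} \cdot \mathrm{spline}(x)$, the following NumPy sketch evaluates a single KAN edge function. The uniform knot grid, the cubic degree, and the names `bspline_basis` and `kan_activation` are assumptions made for illustration; they are not taken from the reference implementation in [41].

```python
import numpy as np

def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions at x with the Cox-de Boor recursion.

    Assumes strictly increasing knots (no repeated knots), so no division by zero.
    Returns an array of shape (len(x), len(knots) - degree - 1).
    """
    x = np.asarray(x, dtype=float)[:, None]       # (N, 1)
    t = np.asarray(knots, dtype=float)[None, :]   # (1, G)
    # Degree 0: indicator of the half-open interval [t_i, t_{i+1}).
    B = ((x >= t[:, :-1]) & (x < t[:, 1:])).astype(float)
    for p in range(1, degree + 1):
        left = (x - t[:, :-(p + 1)]) / (t[:, p:-1] - t[:, :-(p + 1)]) * B[:, :-1]
        right = (t[:, p + 1:] - x) / (t[:, p + 1:] - t[:, 1:-p]) * B[:, 1:]
        B = left + right
    return B

def silu(x):
    """b(x) = x / (1 + exp(-x))."""
    x = np.asarray(x, dtype=float)
    return x / (1.0 + np.exp(-x))

def kan_activation(x, coeffs, knots, degree=3, w_b=1.0, w_s=1.0):
    """F(x) = w_b * SiLU(x) + w_s * sum_i c_i B_i(x)."""
    return w_b * silu(x) + w_s * (bspline_basis(x, knots, degree) @ coeffs)

# Hypothetical grid: 13 uniform knots on [-3, 3] give 9 cubic basis functions.
knots = np.linspace(-3.0, 3.0, 13)
x = np.linspace(-1.0, 1.0, 5)
coeffs = np.ones(9)   # trainable in a real KAN; fixed here for the demo
y = kan_activation(x, coeffs, knots)
```

With all coefficients equal to one, the spline term equals one inside the interior of the grid (partition of unity), so `y` reduces to `silu(x) + 1` there; in a trained KAN the coefficients `c_i` shape the learned nonlinearity.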

3. Heterogeneous Features

3.1. Construction of Heterogeneous Feature Streams

Different from using a single statistical feature or a direct feature concatenation, this work constructs a three-stream heterogeneous representation for incipient fault detection. To characterize process dynamics, we use a heterogeneous feature set consisting of raw time-series data, process-aware statistical features, and local structure features extracted via sliding-window singular values (SWSVs). Each feature captures distinct aspects of process behavior, and their fusion provides a rich representation for subsequent deep learning.

(1) Raw data features.

Raw data directly reflect the instantaneous state of the industrial process. Let $X \in \mathbb{R}^{m \times n}$ denote the raw time series, where $m$ is the number of monitored variables and $n$ is the number of samples. Although raw data contain all original information, they are often high-dimensional and noisy. This makes it difficult to detect subtle incipient faults.

(2) Process-aware features.

Process-aware features are derived from MSPM methods and encapsulate the underlying correlations among variables. For $x_{i} \in \mathbb{R}^{m}$ ($i = 1, 2, \dots, n$), the process-aware feature is expressed as $g(x_{i}) = x_{i}^{T} M x_{i}$, where $M \in \mathbb{R}^{m \times m}$ is a positive semi-definite (PSD) matrix, relying heavily on process characteristics. Taking PCA as an example, the $T^{2}$ and $Q$ statistics are $g_{1}(x_{i}) = \| \Lambda_{pc}^{-\frac{1}{2}} P_{pc}^{T} x_{i} \|_{2}^{2}$ and $g_{2}(x_{i}) = \| P_{res}^{T} x_{i} \|_{2}^{2}$, respectively, where $\Lambda_{pc}$, $P_{pc}$, and $P_{res}$ are determined by process characteristics. Thus, $M_{1} = P_{pc} \Lambda_{pc}^{-1} P_{pc}^{T}$ and $M_{2} = P_{res} P_{res}^{T}$ are two PSD matrices [18]. Both $g_{1}(x_{i})$ and $g_{2}(x_{i})$ are called process-aware features for $x_{i}$.

Using MSPM techniques, we obtain a set of $k$ different process-aware features. Let $S_{i} = [g_{1}(x_{i}), g_{2}(x_{i}), \dots, g_{k}(x_{i})]^{T} \in \mathbb{R}^{k}$ represent a process-aware feature vector of $x_{i}$. Stacking all samples yields the process-aware feature matrix:

$$S = \begin{bmatrix} g_{1}(x_{1}) & g_{2}(x_{1}) & \dots & g_{k}(x_{1}) \\ g_{1}(x_{2}) & g_{2}(x_{2}) & \dots & g_{k}(x_{2}) \\ \vdots & \vdots & \ddots & \vdots \\ g_{1}(x_{n}) & g_{2}(x_{n}) & \dots & g_{k}(x_{n}) \end{bmatrix} \in \mathbb{R}^{n \times k} \tag{2}$$

These features may be sensitive to deviations from normal operation, but they may still overlook local temporal dynamics.
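A minimal sketch of how the two PCA-based process-aware features could be computed as quadratic forms $x_{i}^{T} M x_{i}$ is given below. The synthetic data, the number of retained components `n_pc`, and the row-per-sample layout (the transpose of $X \in \mathbb{R}^{m \times n}$) are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for normal training data: 500 samples, m = 6 variables.
X_train = rng.standard_normal((500, 6))
X_train = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

# PCA through the eigendecomposition of the sample covariance matrix.
cov = X_train.T @ X_train / (X_train.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]                # sort eigenvalues in descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

n_pc = 3                                         # hypothetical number of retained components
P_pc, P_res = eigvecs[:, :n_pc], eigvecs[:, n_pc:]

M1 = P_pc @ np.diag(1.0 / eigvals[:n_pc]) @ P_pc.T   # T^2 statistic matrix
M2 = P_res @ P_res.T                                 # Q statistic matrix

def process_aware_features(x):
    """Return [g_1(x), g_2(x)] = [x^T M1 x, x^T M2 x] for one sample x."""
    return np.array([x @ M1 @ x, x @ M2 @ x])

# Stack the feature vectors row-wise to form S with shape (n, k), here k = 2.
S = np.array([process_aware_features(x) for x in X_train])
```

Additional statistics (for example from ICA) would add further columns to `S` in the same way, one PSD matrix per statistic.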

(3) Local structure features.

To capture local temporal evolution, we introduce SWSVs computed on the process-aware feature matrix $S$. A full sliding window of width $w$ ($w \geq k$) is applied along the rows, represented by

$$S_{q} = [S_{q-w+1}, S_{q-w+2}, \dots, S_{q}]^{T} = \begin{bmatrix} g_{1}(x_{q-w+1}) & g_{2}(x_{q-w+1}) & \dots & g_{k}(x_{q-w+1}) \\ g_{1}(x_{q-w+2}) & g_{2}(x_{q-w+2}) & \dots & g_{k}(x_{q-w+2}) \\ \vdots & \vdots & \ddots & \vdots \\ g_{1}(x_{q}) & g_{2}(x_{q}) & \dots & g_{k}(x_{q}) \end{bmatrix} \tag{3}$$

Here, $S_{q} \in \mathbb{R}^{w \times k}$ is standardized to a zero-mean matrix:

$$\bar{S}_{q} = S_{q} - \frac{1}{w} 1_{w} 1_{w}^{T} S_{q}, \quad 1_{w} = [1, 1, \dots, 1]^{T} \in \mathbb{R}^{w} \tag{4}$$

Then, singular value decomposition (SVD) is performed on $\bar{S}_{q}$. The singular value vector $\sigma_{q} \in \mathbb{R}^{k}$ can capture the local characteristics within sliding windows. As the window slides, we obtain a sequence of singular value vectors:

$$V = [\sigma_{w}, \sigma_{w+1}, \dots, \sigma_{n}]^{T} \in \mathbb{R}^{(n-w+1) \times k} \tag{5}$$

These SWSVs provide a dynamic view complementary to the static raw and process-aware features.
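The SWSV construction in Eqs. (3)–(5) can be sketched in a few lines of NumPy. The function name `sliding_window_singular_values` is a hypothetical helper, and the batch loop below is only one straightforward way to realize the procedure.

```python
import numpy as np

def sliding_window_singular_values(S, w):
    """Compute SWSVs from a process-aware feature matrix S of shape (n, k).

    For each window end q = w, ..., n the window of the last w rows is
    column-centered (Eq. 4) and its k singular values are stored (Eq. 5).
    Returns an array of shape (n - w + 1, k).
    """
    S = np.asarray(S, dtype=float)
    n, k = S.shape
    if not (k <= w <= n):
        raise ValueError("window width w must satisfy k <= w <= n")
    V = np.empty((n - w + 1, k))
    for row, q in enumerate(range(w, n + 1)):
        window = S[q - w:q]                               # rows q-w+1, ..., q (1-based)
        centered = window - window.mean(axis=0, keepdims=True)
        V[row] = np.linalg.svd(centered, compute_uv=False)  # descending singular values
    return V
```

Because each window is centered before the SVD, adding the same constant offset to every row of `S` leaves the SWSVs unchanged.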

3.2. Computational Complexity of Process-Aware Features and SWSVs

The computational cost of SMK-DCA mainly comes from two preprocessing steps, namely the construction of process-aware features and the extraction of SWSVs. Let $x_{i} \in \mathbb{R}^{m}$ denote the $i$-th raw sample, where $m$ is the number of monitored variables. The process-aware features are constructed from $k$ statistical functions. Each feature can be written as $g_{j}(x_{i}) = x_{i}^{T} M_{j} x_{i}$, where $j = 1, 2, \dots, k$, and $M_{j} \in \mathbb{R}^{m \times m}$ is determined by the corresponding statistical model. After PCA and ICA are trained on normal data, these matrices or projection operators are fixed. Thus, the process-aware features can be computed directly in the testing stage.

If the quadratic form is calculated directly, the cost of one process-aware feature is $O(m^{2})$ for each sample. Therefore, the cost of constructing $k$ process-aware features for one sample is $O(k m^{2})$. For a sequence with $n$ samples, the total cost is $O(n k m^{2})$. In practical implementation, the statistics derived from PCA and ICA are usually computed through low-dimensional projections. In this case, the cost is further reduced to approximately $O(m r_{j})$ for the $j$-th statistic, where $r_{j}$ is the retained projection dimension. Hence, the process-aware feature construction is lightweight because both $m$ and $k$ are small in the studied systems.

After obtaining the process-aware feature matrix $S \in \mathbb{R}^{n \times k}$, SWSVs are extracted from sliding windows. For each sample index $q$, a local window $S_{q} \in \mathbb{R}^{w \times k}$ is constructed, where $w$ is the window width. Since $k$ is much smaller than the number of original variables, SVD is performed on a low-dimensional matrix rather than on the raw data matrix. For one sliding window, the standardization step requires $O(w k)$ operations. The dominant cost comes from SVD. When $w \geq k$, computing the singular values of $S_{q}$ requires approximately $O(2 w k^{2})$ floating-point operations. Therefore, for a sequence with $n$ samples, the total offline cost of SWSV extraction is $O((n - w + 1) 2 w k^{2})$.

In the online stage, a first-in-first-out buffer with length $w$ is maintained. When a new sample arrives, only the latest window is updated. Thus, only one SVD operation is required for the current window. The per-sample cost of SWSV extraction is $O(2 w k^{2})$. Combining process-aware feature construction and SWSV extraction, the online preprocessing cost per sample can be written as $O(k m^{2} + 2 w k^{2})$, or lower when projection-based computation is used for PCA and ICA statistics. Since $k$ is small in this work, the SVD operation is conducted on a compact process-aware feature matrix. Therefore, the computational burden introduced by SWSVs remains moderate and is suitable for online fault detection under the tested sampling conditions.
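The online first-in-first-out update described above might be organized as in the following sketch. The class name `OnlineSWSV` and its interface are hypothetical; the only point illustrated is that each new sample triggers exactly one SVD of a $w \times k$ window.

```python
from collections import deque

import numpy as np

class OnlineSWSV:
    """Keep the last w process-aware feature vectors and return their SWSVs."""

    def __init__(self, w):
        self.buffer = deque(maxlen=w)   # FIFO: the oldest row is dropped automatically

    def update(self, s_new):
        """Append one feature vector of length k; return k singular values or None."""
        self.buffer.append(np.asarray(s_new, dtype=float))
        if len(self.buffer) < self.buffer.maxlen:
            return None                 # window not yet full
        window = np.array(list(self.buffer))            # shape (w, k)
        centered = window - window.mean(axis=0)
        return np.linalg.svd(centered, compute_uv=False)
```

Until `w` samples have arrived, `update` returns `None`; afterwards every call yields the singular value vector of the current window.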

3.3. Sensitivity Analysis of Process-Aware Features and SWSVs

To further clarify the sensitivity of the process-aware features and SWSVs, we consider an additive fault model. Suppose that the faulty sample is expressed as $x_{e,t} = x_{t} + \epsilon_{t} d$, where $x_{t}$ denotes the normal sample, $\epsilon_{t}$ is a scalar fault magnitude, and $d$ is the fault direction vector. For incipient faults, $|\epsilon_{t}|$ is usually small.

For the $j$-th process-aware feature defined in Section 3.1, the fault-induced variation is $\Delta g_{j} = g_{j}(x_{e,t}) - g_{j}(x_{t})$. Substituting $x_{e,t} = x_{t} + \epsilon_{t} d$ into the quadratic form and using the symmetry of $M_{j}$, we have

$$\begin{aligned} \Delta g_{j} &= (x_{t} + \epsilon_{t} d)^{T} M_{j} (x_{t} + \epsilon_{t} d) - x_{t}^{T} M_{j} x_{t} \\ &= x_{t}^{T} M_{j} x_{t} + \epsilon_{t} x_{t}^{T} M_{j} d + \epsilon_{t} d^{T} M_{j} x_{t} + \epsilon_{t}^{2} d^{T} M_{j} d - x_{t}^{T} M_{j} x_{t} \\ &= 2 \epsilon_{t} d^{T} M_{j} x_{t} + \epsilon_{t}^{2} d^{T} M_{j} d . \end{aligned} \tag{6}$$

For incipient faults with small $|\epsilon_{t}|$, the second-order term is relatively small, and the first-order term dominates. Thus, $\Delta g_{j} \approx 2 \epsilon_{t} d^{T} M_{j} x_{t}$. This indicates that the sensitivity of the process-aware feature depends on the projection of the fault direction $d$ onto the statistical direction $M_{j} x_{t}$. By combining PCA and ICA statistics, multiple statistical directions are considered, which increases the possibility of capturing weak fault-induced deviations.
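The decomposition in Eq. (6) can be checked numerically. In the sketch below, the random symmetric PSD matrix, the sample, and the fault direction are synthetic placeholders, and `quadratic_feature_change` is a hypothetical helper name.

```python
import numpy as np

def quadratic_feature_change(M, x, d, eps):
    """Return (exact change, first-order term, second-order term) of g(x) = x^T M x
    under the additive fault x_e = x + eps * d, assuming M is symmetric."""
    g = lambda v: v @ M @ v
    exact = g(x + eps * d) - g(x)
    first_order = 2.0 * eps * (d @ M @ x)
    second_order = eps ** 2 * (d @ M @ d)
    return exact, first_order, second_order

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
M = A @ A.T                      # symmetric PSD by construction
x = rng.standard_normal(5)       # stand-in for a normal sample
d = rng.standard_normal(5)
d /= np.linalg.norm(d)           # unit fault direction

exact, first_order, second_order = quadratic_feature_change(M, x, d, eps=1e-3)
```

The exact change equals `first_order + second_order` up to rounding, and shrinking `eps` by a factor of ten shrinks the second-order term by a factor of one hundred, which is why the first-order term governs the response to incipient faults.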

Let $S_{t} = [g_{1}(x_{t}), g_{2}(x_{t}), \dots, g_{k}(x_{t})]^{T} \in \mathbb{R}^{k}$ denote the process-aware feature vector at time $t$. Since $\Delta S_{t} = [\Delta g_{1}, \Delta g_{2}, \dots, \Delta g_{k}]^{T}$, stacking the first-order approximation of all process-aware features gives $\Delta S_{t} \approx \epsilon_{t} c_{t}$, where

$$c_{t} = 2 [d^{T} M_{1} x_{t}, d^{T} M_{2} x_{t}, \dots, d^{T} M_{k} x_{t}]^{T} \in \mathbb{R}^{k} . \tag{7}$$

Here, $c_{t}$ collects the first-order sensitivities of all process-aware features.

Now consider the SWSV features. Let $S_{q} \in \mathbb{R}^{w \times k}$ denote the process-aware feature matrix formed by stacking $w$ consecutive process-aware feature vectors in the $q$-th sliding window. After centering, $\bar{S}_{q} = H_{w} S_{q}$, where $H_{w} = I_{w} - \frac{1}{w} 1_{w} 1_{w}^{T}$ is the centering matrix. The matrix $H_{w}$ is symmetric and idempotent. The SWSVs are obtained as the singular values of $\bar{S}_{q}$. Under the simplifying assumption that the operating point varies slowly within the sliding window, $c_{t}$ can be approximated by a coherent direction $c$. Suppose that the window contains $r$ faulty samples ($r \leq w$) with a constant fault magnitude $\epsilon$. Then, the sliding-window perturbation matrix can be approximated by the rank-one form $\Delta S_{q} = \epsilon z_{r} c^{T}$, where $z_{r} \in \mathbb{R}^{w}$ is a binary indicator vector with ones at the $r$ faulty positions and zeros elsewhere. After centering, the perturbation becomes $E_{q} = H_{w} \Delta S_{q} = \epsilon (H_{w} z_{r}) c^{T}$. Since $H_{w}$ is symmetric and idempotent, the following relation holds:

$$\begin{aligned} \| H_{w} z_{r} \|_{2}^{2} &= z_{r}^{T} H_{w}^{T} H_{w} z_{r} \\ &= z_{r}^{T} H_{w} z_{r} \\ &= z_{r}^{T} \left(I_{w} - \frac{1}{w} 1_{w} 1_{w}^{T}\right) z_{r} \\ &= z_{r}^{T} z_{r} - \frac{1}{w} z_{r}^{T} 1_{w} 1_{w}^{T} z_{r} \\ &= r - \frac{r^{2}}{w} = r \left(1 - \frac{r}{w}\right) . \end{aligned} \tag{8}$$

Therefore, the Frobenius norm of the centered perturbation is

$$\begin{aligned} \| E_{q} \|_{F} &= \| \epsilon (H_{w} z_{r}) c^{T} \|_{F} \\ &= |\epsilon| \, \| H_{w} z_{r} \|_{2} \, \| c \|_{2} \\ &= |\epsilon| \sqrt{r \left(1 - \frac{r}{w}\right)} \, \| c \|_{2} . \end{aligned} \tag{9}$$

When $r \ll w$, this term can be approximated as $\| E_{q} \|_{F} \approx |\epsilon| \sqrt{r} \, \| c \|_{2}$. This result shows that the centered perturbation can increase with the number of coherent faulty samples in the sliding window.
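As a quick numerical sanity check of Eqs. (8) and (9), the sketch below builds the centering matrix explicitly and compares both sides. The window width, fault magnitude, and direction `c` are arbitrary illustrative values, and the helper names are hypothetical.

```python
import numpy as np

def centered_indicator_norm_sq(w, r):
    """Return ||H_w z_r||_2^2 for an indicator vector with r ones (last r positions)."""
    H = np.eye(w) - np.ones((w, w)) / w
    z = np.zeros(w)
    z[w - r:] = 1.0
    return float(np.sum((H @ z) ** 2))

def centered_perturbation_norm(w, r, eps, c):
    """Return ||E_q||_F for E_q = eps * (H_w z_r) c^T."""
    H = np.eye(w) - np.ones((w, w)) / w
    z = np.zeros(w)
    z[w - r:] = 1.0
    return float(np.linalg.norm(eps * np.outer(H @ z, c)))   # Frobenius norm

w = 10
c = np.array([0.3, -1.2, 0.5])   # arbitrary coherent sensitivity direction
norms = [centered_perturbation_norm(w, r, eps=0.01, c=c) for r in range(w + 1)]
```

The values in `norms` start at zero for `r = 0`, peak at `r = w / 2`, and return to zero at `r = w`, so the centered perturbation grows roughly like $\sqrt{r}$ only while the faulty samples occupy a small part of the window.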

According to first-order singular value perturbation analysis, for a simple non-degenerate singular value, $\Delta \sigma_{\ell,q} \approx u_{\ell,q}^{T} E_{q} v_{\ell,q}$, where $u_{\ell,q}$ and $v_{\ell,q}$ are the corresponding left and right singular vectors of the nominal centered window matrix. Thus, SWSVs are sensitive to the centered structural perturbation $E_{q}$ induced by the fault. When the perturbation is coherent across consecutive samples and aligned with dominant singular directions, SWSVs can reflect the accumulated local structural change within the window. It should also be noted that if the same constant offset affects the entire window, i.e., $r = w$, the centered perturbation vanishes in this simplified case. Therefore, SWSVs mainly characterize local temporal and structural changes rather than pure constant shifts over the whole window.
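The first-order relation $\Delta \sigma_{\ell,q} \approx u_{\ell,q}^{T} E_{q} v_{\ell,q}$ can be illustrated on a synthetic centered window. Everything in this sketch is an assumption chosen for a clean demonstration: the window is constructed to have well-separated singular values 3, 2, and 1, and the fault direction `c` is random.

```python
import numpy as np

rng = np.random.default_rng(0)
w, k, r, eps = 20, 3, 4, 1e-5

H = np.eye(w) - np.ones((w, w)) / w                      # centering matrix H_w
U0, _ = np.linalg.qr(H @ rng.standard_normal((w, k)))    # orthonormal, zero-mean columns
V0, _ = np.linalg.qr(rng.standard_normal((k, k)))
S_bar = U0 @ np.diag([3.0, 2.0, 1.0]) @ V0.T             # nominal centered window

z = np.zeros(w)
z[w - r:] = 1.0                                          # last r samples are faulty
c = rng.standard_normal(k)
c /= np.linalg.norm(c)
E = eps * np.outer(H @ z, c)                             # centered rank-one perturbation

U, s, Vt = np.linalg.svd(S_bar, full_matrices=False)
predicted = np.array([U[:, l] @ E @ Vt[l] for l in range(k)])   # u^T E v per singular value
actual = np.linalg.svd(S_bar + E, compute_uv=False) - s
```

For this small perturbation, `actual` and `predicted` agree up to second-order terms; the agreement degrades when singular values are nearly equal, which is the degenerate case excluded in the text.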

In summary, process-aware features respond to instantaneous statistical deviations through $M_{j} x_{t}$, while SWSVs further capture local temporal and structural changes through the centered perturbation $E_{q}$. The two types of features are therefore complementary. This complementarity does not imply that SWSVs can compensate for arbitrarily poor process-aware features. Reasonable statistical features are still required, while SWSVs further enhance sensitivity to persistent and evolving fault patterns.

4. Self-Modulated KAN-DCA (SMK-DCA)

The SMK-DCA architecture (Figure 1) first uses KAN to enhance heterogeneous features’ nonlinearity. Cyclic DCA then enables direct cross-domain interaction among raw data, process-aware features, and SWSVs. FiLM adaptively calibrates the KAN-enhanced features based on DCA outputs. Finally, the enhanced features are fused into a unified representation for SAE-based training and fault detection.

4.1. DCA

After constructing the heterogeneous feature set (Section 3), fusing these features effectively is crucial. We adopt DCA to perform direct cross-domain interactions via dedicated projections.

Definition 1 (DCA [43]). Consider a heterogeneous feature set $H = \{f_{i}, f_{j}, f_{k}\}$, where $f_{i} \in \mathbb{R}^{1 \times d_{i}}$, $f_{j} \in \mathbb{R}^{1 \times d_{j}}$, and $f_{k} \in \mathbb{R}^{1 \times d_{k}}$. Different from standard cross-attention (SCA) [44], DCA does not first map heterogeneous features into a shared latent representation by a common projection. Instead, it builds feature interactions through role-specific and path-dependent projection matrices. The DCA operation is formulated as

$$\mathrm{DCA}(f_{i}, f_{j}, f_{k}; W_{f_{i}}^{Q}, W_{f_{j}}^{K}, W_{f_{k}}^{V}) = \mathrm{softmax}\left(\frac{(f_{i} W_{f_{i}}^{Q}) (f_{j} W_{f_{j}}^{K})^{T}}{\sqrt{d_{k}}}\right) (f_{k} W_{f_{k}}^{V}), \tag{10}$$

where $W_{f_{i}}^{Q} \in \mathbb{R}^{d_{i} \times d_{q}}$, $W_{f_{j}}^{K} \in \mathbb{R}^{d_{j} \times d_{k}}$, and $W_{f_{k}}^{V} \in \mathbb{R}^{d_{k} \times d_{v}}$ are learnable projection matrices for query, key, and value, respectively. These matrices are determined by the specific interaction path and the assigned roles of the three feature streams.

Compared with SCA, DCA allows heterogeneous features to exchange information without using an explicit shared homogenization mapping. Although projection matrices are still used, they are not shared across all feature domains. Instead, each interaction path has its own projection set $W_{f_{i}}^{Q}$, $W_{f_{j}}^{K}$, and $W_{f_{k}}^{V}$. This design can be regarded as a flexible path-aware alignment strategy. It is more adaptive than a fixed common projection. Moreover, SCA usually involves two feature sources in one attention operation, where one feature provides the query and the other provides both the key and value. In contrast, DCA assigns query, key, and value to three different feature streams. Thus, it can model more decoupled interactions among heterogeneous feature domains within a single attention operation.
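A NumPy sketch of one DCA operation in the spirit of Eq. (10) follows. As an assumption for illustration, each stream is treated as a short sequence of `T` tokens (rows) so that the softmax is non-trivial; with `T = 1` this reduces to the $1 \times d$ features of Definition 1. The query and key projections are assumed to share the same output width, and the function name `dca` is hypothetical.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def dca(f_q, f_k, f_v, W_q, W_k, W_v):
    """Direct cross-attention with three different source streams.

    f_q: (T_q, d_i) query stream; f_k: (T, d_j) key stream; f_v: (T, d_k) value stream.
    Each stream has its own projection, so no shared embedding is required.
    Returns the attended output (T_q, d_v) and the attention weights (T_q, T).
    """
    Q = f_q @ W_q
    K = f_k @ W_k
    V = f_v @ W_v
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d_i, d_j, d_val, d_att, d_v = 4, 6, 3, 2, 5, 5   # hypothetical sizes
f_i = rng.standard_normal((T, d_i))
f_j = rng.standard_normal((T, d_j))
f_k = rng.standard_normal((T, d_val))
W_q = rng.standard_normal((d_i, d_att))
W_k = rng.standard_normal((d_j, d_att))
W_v = rng.standard_normal((d_val, d_v))
out, weights = dca(f_i, f_j, f_k, W_q, W_k, W_v)
```

Note that the three input widths differ while only the projected query and key widths must match, which is what lets heterogeneous streams interact without a common input space.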

4.2. SMK-DCA

To improve nonlinear expressiveness, we insert a KAN transformation [41] to construct the query, key, and value representations. Given a triplet of heterogeneous features $(f_{i}, f_{j}, f_{k})$, we first obtain their KAN-enhanced embeddings:

$$\tilde{f}_{i} = \mathrm{KAN}_{i}(f_{i}), \quad \tilde{f}_{j} = \mathrm{KAN}_{j}(f_{j}), \quad \tilde{f}_{k} = \mathrm{KAN}_{k}(f_{k}) \tag{11}$$

Here, $\mathrm{KAN}(\cdot)$ denotes a KAN layer with a learnable basis and spline components. In particular, for the $l$-th KAN layer, the transformation can be written as

$$X^{(l)} = \delta(X^{(l-1)}) W_{base}^{(l)} + \mathrm{Spline}(X^{(l-1)}) W_{spline}^{(l)} \tag{12}$$

where $\delta(\cdot)$ is the basis activation and $\mathrm{Spline}(\cdot)$ is the learnable B-spline expansion. Then, the query, key, and value are constructed on the KAN-enhanced features via path-specific projections:

$$Q = \tilde{f}_{i} W_{f_{i}}^{Q}, \quad K = \tilde{f}_{j} W_{f_{j}}^{K}, \quad V = \tilde{f}_{k} W_{f_{k}}^{V} \tag{13}$$

Finally, the KAN-enhanced DCA (K-DCA) is defined as

$$\text{K-DCA}(f_{i}, f_{j}, f_{k}) = \mathrm{softmax}\left(\frac{\tilde{f}_{i} W_{f_{i}}^{Q} (\tilde{f}_{j} W_{f_{j}}^{K})^{T}}{\sqrt{d_{k}}}\right) (\tilde{f}_{k} W_{f_{k}}^{V}) \tag{14}$$

where $d_{k}$ is the scaling factor. It should be noted that DCA employs learnable projection matrices $W_{f_{i}}^{Q}, W_{f_{j}}^{K}, W_{f_{k}}^{V}$ for each interaction path.

As stated above, we construct a heterogeneous feature set:

$$H = \{f_{r}, f_{p}, f_{s}\} \tag{15}$$

where $f_{r}$ denotes the raw time series, and $f_{p}$ denotes the process-aware feature. Moreover, SWSVs are extracted from the process-aware sequence, yielding a singular-value representation denoted by $f_{s}$.

To fully exploit different types of features, we do not rely on a single $(Q, K, V)$ assignment. Instead, we perform three cyclic DCA operations by rotating the roles of the feature domains. Specifically, let

$$P = \{(f_{r}, f_{p}, f_{s}), (f_{p}, f_{s}, f_{r}), (f_{s}, f_{r}, f_{p})\} \tag{16}$$

where each tuple in $P$ indicates the ordering of $(f_{i}, f_{j}, f_{k})$. For each $(a, b, c) \in P$, we compute one K-DCA branch $A_{a}$ by taking $f_{a}$ as the query source, using $f_{b}$ to provide keys, and retrieving the values from $f_{c}$:

$$O_{a} = f_{a} + A_{a} = f_{a} + \text{K-DCA}(f_{a}, f_{b}, f_{c}) . \tag{17}$$

This cyclic design is a key difference between SMK-DCA and a conventional single branch cross-attention fusion. Instead of fixing one feature as the query source, the proposed cyclic DCA allows each feature stream to actively query the other two streams. Therefore, the interaction is bidirectional at the feature domain level and more suitable for heterogeneous process monitoring features.
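The cyclic role rotation of Eqs. (16) and (17) can be sketched as follows. Several simplifications are assumptions of this sketch rather than the paper's configuration: all three streams share the same token count and width so that the residual addition is well defined, and `np.tanh` stands in for the KAN embedding of each stream.

```python
import numpy as np

def cyclic_roles(names=("r", "p", "s")):
    """Return the three (query, key, value) orderings of Eq. (16)."""
    return [tuple(names[(i + j) % 3] for j in range(3)) for i in range(3)]

def attention(Q, K, V):
    """Scaled dot-product attention with a numerically stable softmax."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def cyclic_k_dca(features, embed, params):
    """Compute O_a = f_a + K-DCA(f_a, f_b, f_c) for each cyclic ordering (a, b, c).

    features: dict name -> array (T, d); embed: dict name -> callable (KAN stand-in);
    params: dict keyed by the query stream, holding that path's W_q, W_k, W_v.
    """
    outputs = {}
    for a, b, c in cyclic_roles(tuple(features)):
        Q = embed[a](features[a]) @ params[a]["W_q"]
        K = embed[b](features[b]) @ params[a]["W_k"]
        V = embed[c](features[c]) @ params[a]["W_v"]
        outputs[a] = features[a] + attention(Q, K, V)   # residual form of Eq. (17)
    return outputs

rng = np.random.default_rng(0)
T, d = 4, 6
features = {name: rng.standard_normal((T, d)) for name in ("r", "p", "s")}
embed = {name: np.tanh for name in features}            # placeholder for KAN layers
params = {name: {key: rng.standard_normal((d, d)) for key in ("W_q", "W_k", "W_v")}
          for name in features}
outputs = cyclic_k_dca(features, embed, params)
```

Each path owns its projection set, so the three branches do not share weights even though they reuse the same three streams.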

4.3. Self-Modulation by DCA

Although K-DCA produces a cross-attention output $A_{a}$, the contribution of KAN on the query stream may vary across operating conditions and time segments. To adaptively calibrate the KAN outputs while preserving a lightweight structure, we introduce a feature-wise linear modulation (FiLM) mechanism conditioned on $A_{a}$ [42].

Specifically, for each branch $(a, b, c) \in P$, we first obtain a KAN-enhanced query feature $\tilde{f}_{a} = \mathrm{KAN}_{a}(f_{a})$. Then, the FiLM parameters are generated from the attention residual $A_{a}$ via a learnable generator $g_{a}(\cdot)$:

$$(\gamma_{a}, \beta_{a}) = g_{a}(A_{a}) \tag{18}$$

Here, $\gamma_{a}$ and $\beta_{a}$ denote feature-wise scaling and shifting factors. Thus, FiLM modulates the KAN output as

$$\tilde{f}_{a}^{F} = \mathrm{FiLM}(\tilde{f}_{a} \mid \gamma_{a}, \beta_{a}) = \gamma_{a} \odot \tilde{f}_{a} + \beta_{a} \tag{19}$$

with $\odot$ being the Hadamard product. Finally, the modulated KAN feature is fused with the cross-attention output through a residual pathway followed by layer normalization: $R_{a} = \tilde{f}_{a}^{F} + A_{a}$.
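A compact sketch of the FiLM calibration in Eqs. (18) and (19) is shown below. The generator $g_{a}(\cdot)$ is assumed here to be a pair of linear maps, and the layer normalization omits learnable gain and bias; both are illustrative choices, not details confirmed by the text.

```python
import numpy as np

def film(f_tilde, A, W_gamma, b_gamma, W_beta, b_beta):
    """FiLM conditioned on the attention output A.

    gamma and beta are generated from A (Eq. 18) and applied feature-wise (Eq. 19).
    """
    gamma = A @ W_gamma + b_gamma
    beta = A @ W_beta + b_beta
    return gamma * f_tilde + beta        # Hadamard product plus shift

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learnable parameters)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def modulated_branch(f_tilde, A, W_gamma, b_gamma, W_beta, b_beta):
    """R_a = FiLM(f_tilde | gamma, beta) + A, followed by layer normalization."""
    return layer_norm(film(f_tilde, A, W_gamma, b_gamma, W_beta, b_beta) + A)
```

Initializing the generator so that `gamma` is one and `beta` is zero makes FiLM start as the identity, which is a common way to let the modulation be learned gradually.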

Thus, the cyclic design makes each feature act as the query once. Therefore, every feature domain can be enhanced by information selectively attended from the other two domains, while maintaining their original meanings rather than forcing them into a single homogenized space. Accordingly, we obtain three types of outputs:

$$\begin{aligned} R_{r} &= \tilde{f}_{r}^{F} + \text{K-DCA}(f_{r}, f_{p}, f_{s}), \\ R_{p} &= \tilde{f}_{p}^{F} + \text{K-DCA}(f_{p}, f_{s}, f_{r}), \\ R_{s} &= \tilde{f}_{s}^{F} + \text{K-DCA}(f_{s}, f_{r}, f_{p}) . \end{aligned} \tag{20}$$

After the cyclic K-DCA with FiLM calibration, we obtain three enhanced representations $\{R_{r}, R_{p}, R_{s}\}$. They preserve complementary dynamics from $f_{r}$, $f_{p}$, and $f_{s}$, respectively. Instead of forcing them into a single shared latent embedding through heavy projection, we employ a linear fusion layer to retain their complementary cues. Specifically, the final representation is constructed by aggregating these outputs into a unified feature vector:

$$U = \Psi(R_{r}, R_{p}, R_{s}), \tag{21}$$

where $\Psi(\cdot)$ is a feature aggregation function. The fused representation $U$ is then used for subsequent fault detection.

5. Fault Detection Using SMK-DCA

5.1. Loss Function

We use SAE to design the loss function [45]. In particular, after an encoder $E$ maps $Z$ to a latent variable, a decoder $D$ performs the reconstruction by

$$H = E(Z), \quad \hat{Z} = D(H), \tag{22}$$

where $H$ denotes the latent embedding. To provide multi-granularity supervision, the model simultaneously reconstructs $\{f_{r}, f_{p}, f_{s}\}$ as follows:

$$\hat{f}_{r} = \psi_{r}(\hat{Z}), \quad \hat{f}_{p} = \psi_{p}(\hat{Z}), \quad \hat{f}_{s} = \psi_{s}(\hat{Z}) \tag{23}$$

Here, $\psi_{r}(\cdot)$, $\psi_{p}(\cdot)$, and $\psi_{s}(\cdot)$ are linear decoders. Accordingly, the overall training objective is formulated as

$$L = \| f_{r} - \hat{f}_{r} \|_{2}^{2} + \| f_{p} - \hat{f}_{p} \|_{2}^{2} + \| f_{s} - \hat{f}_{s} \|_{2}^{2} + \beta \sum_{j=1}^{d_{h}} \mathrm{KL}(\rho \,\|\, \hat{\rho}_{j}) \tag{24}$$

where $d_{h}$ is the dimension of the latent representation, $\beta$ controls the strength of the sparsity, $\rho$ is a predefined sparsity target, and $\hat{\rho}_{j}$ denotes the average activation of the $j$-th latent unit, given by

$$\hat{\rho}_{j} = \frac{1}{N} \sum_{i=1}^{N} H_{j}^{(i)} \tag{25}$$

In addition, $\mathrm{KL}(\rho \,\|\, \hat{\rho}_{j})$ is defined by

$$\mathrm{KL}(\rho \,\|\, \hat{\rho}_{j}) = \rho \log \frac{\rho}{\hat{\rho}_{j}} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_{j}} \tag{26}$$

The KL divergence term constrains the average activation of each latent unit to remain close to the sparsity target. Thus, it suppresses redundant factors while highlighting subtle abnormal patterns.
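The objective in Eqs. (24)–(26) can be sketched as below. Two points are assumptions of this sketch: the latent activations are taken to lie in $(0, 1)$ (for example after a sigmoid) so that the KL term is well defined, and the squared reconstruction errors are summed over the batch without averaging. The default values of `beta` and `rho` are placeholders.

```python
import numpy as np

def kl_sparsity(H, rho=0.05, eps=1e-8):
    """Sum over latent units of KL(rho || rho_hat_j), Eqs. (25) and (26).

    H: latent activations of shape (N, d_h) with values in (0, 1).
    """
    rho_hat = np.clip(H.mean(axis=0), eps, 1.0 - eps)    # average activation per unit
    kl = rho * np.log(rho / rho_hat) + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat))
    return float(kl.sum())

def smk_dca_loss(targets, recons, H, beta=1e-3, rho=0.05):
    """Multi-target reconstruction error plus the weighted sparsity penalty, Eq. (24).

    targets and recons are dicts with keys "r", "p", "s" holding arrays of equal shape.
    """
    recon = sum(float(np.sum((targets[u] - recons[u]) ** 2)) for u in ("r", "p", "s"))
    return recon + beta * kl_sparsity(H, rho)
```

The penalty is zero when every unit's average activation equals `rho` and grows as units become active more often than the target.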

5.2. Detection Logic

For SMK-DCA, the mean squared error (MSE) serves as the detection index. It is calculated from the reconstruction errors of the three heterogeneous features:

J_{q} = \sum_{u \in \{r, p, s\}} {∥f_{u}^{q} - {\hat{f}}_{u}^{q}∥}_{F}^{2},

(27)

where $f_{u}^{q}$ and ${\hat{f}}_{u}^{q}$ denote the feature and its corresponding reconstruction for the $q$-th sample, respectively, with $u \in \{r, p, s\}$ corresponding to the raw-data, process-aware, and SWSV features.
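As a small worked instance of Eq. (27), the sketch below sums the squared Frobenius norms of the three reconstruction residuals for one sample. The function name and the toy matrices are hypothetical, and each feature is stored as a list of rows purely for illustration.

```python
def detection_index(features, recons):
    """J_q of Eq. (27): sum over u in {r, p, s} of ||f_u - f_u_hat||_F^2."""
    total = 0.0
    for u in ("r", "p", "s"):
        for row_f, row_hat in zip(features[u], recons[u]):
            total += sum((a - b) ** 2 for a, b in zip(row_f, row_hat))
    return total


# One sample with three heterogeneous feature blocks (illustrative values).
features = {"r": [[1.0, 2.0], [3.0, 4.0]], "p": [[0.5]], "s": [[2.0, 1.0]]}
recons = {"r": [[1.0, 2.0], [3.0, 3.0]], "p": [[0.0]], "s": [[2.0, 0.0]]}
j_q = detection_index(features, recons)  # 1.0 + 0.25 + 1.0 = 2.25
```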

Based on $J_{q}$, kernel density estimation (KDE) is adopted to calculate the detection threshold $J^{lim}$ at a given confidence level $Θ$ in the offline training stage. KDE estimates the probability density function $f_{p} (x)$ of the monitoring statistic $x$. Given a collection of independent and identically distributed samples $x_{1}, x_{2}, \dots, x_{n}$, the estimate is

{\hat{f}}_{p, h} (x) = \frac{1}{n h} \sum_{i = 1}^{n} K (\frac{x - x_{i}}{h}),

where $K$ is the kernel function and $h$ is the bandwidth. The threshold $J^{lim}$ is the value at which the cumulative estimated density reaches $Θ$.
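The thresholding step can be sketched with a Gaussian kernel, for which the cumulative estimate is an average of normal distribution functions. The bandwidth below is the sample standard deviation times $n^{-1/5}$, one common form of Scott's rule in one dimension. Whether the authors use exactly this constant is an assumption, as are the function names and the synthetic training statistics.

```python
import math
import statistics


def gaussian_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def scott_bandwidth(samples, scale=1.0):
    """Bandwidth h = scale * std * n^(-1/5) (one form of Scott's rule; assumption)."""
    n = len(samples)
    return scale * statistics.stdev(samples) * n ** (-1.0 / 5.0)


def kde_cdf(x, samples, h):
    """Cumulative distribution implied by a Gaussian-kernel density estimate."""
    return sum(gaussian_cdf((x - xi) / h) for xi in samples) / len(samples)


def kde_threshold(samples, confidence=0.99, scale=1.0):
    """Find J_lim such that the KDE cumulative probability equals `confidence`."""
    h = scott_bandwidth(samples, scale)
    lo = min(samples) - 10.0 * h  # cumulative probability is ~0 here
    hi = max(samples) + 10.0 * h  # cumulative probability is ~1 here
    for _ in range(100):  # bisection on a monotone function
        mid = 0.5 * (lo + hi)
        if kde_cdf(mid, samples, h) < confidence:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Synthetic detection indices from normal training data (illustrative only).
train_j = [0.5 + 0.01 * ((37 * i) % 100) for i in range(200)]
j_lim = kde_threshold(train_j, confidence=0.99)
```

The `scale` argument multiplies the bandwidth, so a call such as `kde_threshold(train_j, scale=0.75)` gives the threshold under a narrower kernel.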

In the online testing stage, the status of a new sample $x_{q}$ ($q \geq n + 1$) is determined by the following logic:

\begin{cases} x_{q} \text{ is faulty}, & \text{if } J_{q} > J^{lim} \\ x_{q} \text{ is normal}, & \text{otherwise} \end{cases}

(28)

6. Case Study

The famous benchmark TEP and real-world data from an insulated-gate bipolar transistor (IGBT) power system are used to verify the effectiveness of SMK-DCA. All simulations are conducted on a local machine with the following specifications: the operating system is Windows 11 (64-bit), the processor is an Intel Core i7-14700HX, the graphics card is an NVIDIA RTX 4070, and the software used for data processing is Python 3.9.23.

To demonstrate the advantages of SMK-DCA, several representative methods are used for comparison. The compared methods include MSPM methods such as CVRSA [9], probability transformation Mahalanobis distance (PTMD) [46], SFSA [10], OTSMD [5], DePPCA [8], and TWSFKECA [13]. In addition, unsupervised deep learning models such as RATransformer [25], AAE [26], DALSTM-AE [24], KD-MBVAE [28], DAE-PCA [27], CGSTAE [23], and ANCA [29] are chosen for comparison. The false alarm rate (FAR) and fault detection rate (FDR) are adopted as evaluation metrics to assess fault detection performance. For a fair and transparent comparison, the sources of the baseline results are clarified here. For the TEP experiment, CVRSA, PTMD, SFSA, and OTSMD are reimplemented according to their original papers, and their main parameter settings are described in this section. The results of the other compared methods on TEP are taken from the corresponding original references under the standard TEP benchmark setting. For the IGBT power system, the experiments are conducted on the experimental platform described in this paper, and the parameter settings of the compared methods are specified accordingly.

6.1. Tennessee Eastman Process

For TEP, incipient faults 3, 9, and 15 are notoriously difficult to detect [3,7,27]. The input variables comprise 33 variables (XMEAS (1-22) and XMV (1-11)). Detailed descriptions of the flowsheet of the TEP and faults can be found in [47]. The training set consists of 960 normal samples, while the test set contains 21 fault scenarios with 960 samples each, where the first 160 samples correspond to normal operating conditions and the remaining samples correspond to fault conditions.
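Given the test layout described above (160 normal samples followed by 800 faulty samples), FAR and FDR can be computed from the alarm sequence produced by the decision rule in Eq. (28). The sketch below uses the usual definitions, namely the percentage of normal samples that raise an alarm and the percentage of faulty samples that raise an alarm. The paper does not spell these out in this section, so treat them as an assumption, along with the function name and the synthetic statistic.

```python
def far_fdr(stat, j_lim, fault_start=160):
    """Return (FAR, FDR) in percent for one test run.

    stat: detection index J_q for every test sample, in time order.
    fault_start: number of leading normal samples (160 for the TEP test sets).
    """
    alarms = [j > j_lim for j in stat]  # decision rule of Eq. (28)
    normal = alarms[:fault_start]
    faulty = alarms[fault_start:]
    far = 100.0 * sum(normal) / len(normal)
    fdr = 100.0 * sum(faulty) / len(faulty)
    return far, fdr


# Synthetic 960-sample run (illustrative only): one false alarm, 40 missed samples.
stat = [0.5] * 160 + [2.0] * 800
stat[10] = 3.0                 # a false alarm in the normal segment
stat[160:200] = [0.5] * 40     # the first 40 faulty samples go undetected
far, fdr = far_fdr(stat, j_lim=1.0)  # far = 0.625, fdr = 95.0
```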

6.1.1. Parameter Settings

For TEP, PCA and independent component analysis (ICA) [48] are adopted to provide process-aware features. For PCA, the principal component and residual subspaces are divided based on the cumulative percentage variance (CPV) of 90% [49], and the $T^{2}$ and $Q$ statistics are calculated. For ICA, the $I^{2}$ and $I_{e}^{2}$ statistics are used, with the number of independent components being 10. The confidence level $Θ$ is uniformly set as 99%, and the control limit is calculated by KDE. Thus, the number of process-aware features used in SMK-DCA is $k = 4$, with PCA($T^{2}$), PCA($Q$), ICA($I^{2}$), and ICA($I_{e}^{2}$). Adam is adopted as the optimizer with a learning rate of 0.001, determined by hyperparameter grid search. For K-DCA, parameter $G$ and spline order are set as 5 and 3, respectively. For SAE, the sparsity parameter $ρ$ is 0.15, and $β$ is 0.3. For SMK-DCA, the width of sliding windows is set as 200. For CVRSA, SFSA, PTMD, and OTSMD, the window width is set to 35, 35, 100, and 100, respectively.

To evaluate the computational cost introduced by KAN and FiLM, four DCA variants are compared. SMM-DCA w/o FiLM denotes the MLP-based DCA model without FiLM modulation. SMM-DCA replaces KAN with MLP while retaining FiLM. SMK-DCA w/o FiLM keeps the KAN-enhanced DCA module but removes FiLM. SMK-DCA denotes the full proposed model with both KAN and FiLM. Table 1 reports the computational cost of different DCA variants on TEP. Compared with SMM-DCA w/o FiLM, SMM-DCA only introduces a small increase in parameters, training time, memory usage, and inference time. This indicates that the FiLM module is lightweight. In contrast, replacing MLP with KAN increases the computational cost more clearly. For example, the trainable parameters increase from 123,865 in SMM-DCA to 221,785 in SMK-DCA. The training time also increases from 14.94 s to 94.83 s. The peak GPU memory rises from 1110.78 MB to 5087.68 MB. This additional cost mainly comes from the B-spline-based learnable activation functions in KAN. However, the inference time of SMK-DCA is still only 0.8035 ms per sample. This value remains acceptable for online fault detection under the tested setting. Moreover, the comparison between SMK-DCA w/o FiLM and SMK-DCA shows that FiLM adds only a minor computational burden. Therefore, the main cost of SMK-DCA comes from KAN, while FiLM provides adaptive modulation with limited extra complexity. Considering the improved detection performance shown in the ablation study, the additional cost of SMK-DCA is acceptable.

6.1.2. Detection Performance

Table 2 and Table 3 compare SMK-DCA with representative MSPM and DL methods on TEP, respectively. MSPM methods such as CVRSA, PTMD, SFSA, and OTSMD perform adequately on several common faults, but remain limited on incipient faults 3, 9, and 15. For faults 3 and 9, most conventional MSPM methods yield relatively low FDRs. TWSFKECA shows improved detection capability on these two faults, achieving 82.23% and 91.33%, respectively. However, its performance on fault 15 is still unsatisfactory, with an FDR of only 60.40%. Among the DL models, ANCA achieves high FDRs for faults 3 (98.0%) and 9 (98.3%), but its performance on fault 15 remains limited at 38.0%. CVA achieves FDRs of 73.03%, 92.26%, and 99.50% for faults 3, 9, and 15, respectively, so its performance on fault 3 is still unsatisfactory. In contrast, SMK-DCA achieves FDRs of 91.13%, 95.87%, and 91.50% for faults 3, 9, and 15, respectively. Moreover, it attains the highest average FDR (AFDR) of 97.37% among all compared methods while maintaining a relatively low average FAR (AFAR) of 2.29%. Although SMK-DCA is not the best for every individual incipient fault, it provides the strongest overall balance across all TEP faults. For incipient fault 9, the detailed performance of SMK-DCA and some compared methods is shown in Figure 2.

To further evaluate the early detection capability of SMK-DCA, the detection delay (DD) is reported in Table 4. DD is defined as $DD = t_{det} - t_{fault}$, where $t_{fault}$ denotes the actual fault occurrence sample and $t_{det}$ denotes the first alarm sample after the fault occurs. A smaller DD indicates earlier fault detection. For TEP, the difficult faults 3, 9, and 15 occur at the 160th sample. SMK-DCA detects fault 3 at the fault occurrence sample, with a DD of 0 samples, while fault 9 is detected after only 2 samples, and fault 15 is detected after 68 samples. Although the DD of fault 15 is larger than those of faults 3 and 9, it remains acceptable considering the weak and slowly evolving nature of this fault. These results show that SMK-DCA can detect faults 3 and 9 almost immediately and still provides effective early warning for fault 15.
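The detection delay defined above can be computed directly from the monitoring statistic. In the sketch below, the function name, the 0-based indexing of the fault occurrence sample, and the use of a single threshold crossing as the first alarm are illustrative assumptions. The paper does not state whether consecutive alarms are required.

```python
def detection_delay(stat, j_lim, t_fault):
    """Return DD = t_det - t_fault in samples, or None if no alarm is raised.

    stat: detection index J_q for every test sample, in time order.
    t_fault: 0-based index of the fault occurrence sample (assumption).
    """
    for t in range(t_fault, len(stat)):
        if stat[t] > j_lim:  # first alarm at or after the fault occurrence
            return t - t_fault
    return None


# Synthetic run (illustrative only): the statistic crosses the limit 2 samples late.
stat = [0.5] * 160 + [0.8, 0.9] + [2.0] * 798
dd = detection_delay(stat, j_lim=1.0, t_fault=160)  # dd = 2
```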

6.1.3. Ablation Experiments

To clarify the ablation settings, $f_{r}$-SMKSA, $f_{p}$-SMKSA, and $f_{s}$-SMKSA denote the self-modulated KAN-enhanced self-attention models using raw data, process-aware features, and SWSVs as input, respectively. SMK-DCA w/o Cyclic removes the cyclic interaction and keeps only one cross-attention branch. SMK-CSA replaces DCA with feature concatenation followed by self-attention. SMM-DCA replaces KAN with MLP. SMK-DCA w/o FiLM removes the FiLM-based self-modulation module. SMK-DCA-AE replaces the sparse autoencoder with a standard autoencoder. SMK-DCA w/o MTR removes multi-target reconstruction and reconstructs only raw data.

To verify the contribution of each module, a systematic ablation study is conducted on TEP, as summarized in Table 5. The three single-feature SMKSA variants show clear limitations, especially for difficult incipient faults 3, 9, and 15. Although the SWSV-based variant achieves the highest AFDR among the single-feature models, its FDRs for faults 3, 9, and 15 are only 41.50%, 37.00%, and 47.00%, respectively. This indicates that a single feature stream is insufficient for reliable incipient fault detection. By introducing cross-feature interaction, SMK-DCA w/o Cyclic improves the AFDR to 96.05%, but its AFAR increases to 7.02%, and its FDR for fault 15 is only 64.12%. In contrast, the complete SMK-DCA reduces the AFAR to 2.29% and improves the FDR of fault 15 to 91.50%, showing the necessity of cyclic interaction among heterogeneous feature streams. The comparison between SMK-CSA and SMK-DCA further shows that simple concatenation followed by self-attention cannot fully exploit complementary information across heterogeneous feature domains. SMM-DCA obtains a competitive AFDR of 96.93%, but SMK-DCA still achieves a higher AFDR and a lower AFAR, indicating the benefit of KAN-based nonlinear feature transformation. After removing FiLM, the AFDR decreases from 97.37% to 93.70%, and the FDRs of faults 9 and 15 drop to 82.00% and 41.00%, respectively. This suggests that FiLM-based self-modulation helps calibrate KAN-enhanced features and retain weak fault-related variations. When the sparse autoencoder is replaced by a standard autoencoder, the AFDR decreases to 95.70%, and the FDR of fault 15 decreases to 73.75%. When multi-target reconstruction is removed, the AFDR drops to 89.08%, and the FDRs of faults 3, 9, and 15 decrease markedly.

Overall, SMK-DCA achieves the highest AFDR of 97.37%, with a low AFAR of 2.29%. These results demonstrate that the strong performance of SMK-DCA comes from the joint contribution of heterogeneous features, cyclic DCA, KAN, FiLM, sparse AE, and multi-target reconstruction.

6.1.4. Hyperparameter Discussion

To evaluate the robustness of SMK-DCA with regard to the initial process-aware feature settings, Table 6 reports the sensitivity analysis of PCA/ICA hyperparameters on TEP. The CPV of PCA and the number of independent components in ICA, $n_{IC}$, are varied to generate six process-aware feature settings: S1 uses CPV = 0.85 and $n_{IC} = 8$; S2 uses CPV = 0.85 and $n_{IC} = 9$; S3 uses CPV = 0.90 and $n_{IC} = 8$; S4 uses CPV = 0.90 and $n_{IC} = 10$; S5 uses CPV = 0.95 and $n_{IC} = 9$; and S6 uses CPV = 0.95 and $n_{IC} = 10$. For each setting, SMK-DCA w/o Cyc. uses only one cross-attention branch, where raw data, process-aware features, and SWSVs are used as query, key, and value, respectively. Cyc. denotes the complete cyclic DCA mechanism. Under all settings, the cyclic DCA version maintains high AFDRs above 95%. This indicates that SMK-DCA is robust to moderate changes in PCA/ICA hyperparameters. More importantly, cyclic DCA provides a better balance between FDR and FAR. Although the w/o Cyc. variants often obtain high AFDRs, their AFARs increase sharply under several settings. For example, the AFARs of the w/o Cyc. variants reach 39.82% and 75.09% under S1 and S2, respectively. In contrast, the cyclic DCA variants keep AFARs much lower under the same settings. Under the selected setting S4, SMK-DCA achieves an AFDR of 97.37% with an AFAR of 2.29%. It also obtains FDRs of 91.13%, 95.87%, and 91.50% for faults 3, 9, and 15, respectively. These results show that cyclic DCA can enhance complementary information from heterogeneous feature streams. They also show that the proposed method remains stable when the PCA/ICA hyperparameters change within a reasonable range. However, the quality of process-aware features still affects monitoring performance. Thus, cyclic DCA improves robustness, but reasonable PCA/ICA settings are still needed.

The KDE thresholding procedure is also clarified. In this work, the Gaussian kernel is used for KDE, and Scott’s rule is adopted as the default bandwidth selection method. The control limit $J^{lim}$ is calculated only from the reconstruction errors of normal training samples under the same confidence level. For all experiments of the proposed method, the same KDE setting is used for fault detection. To evaluate the influence of bandwidth selection, several bandwidth settings are further tested, as reported in Table 7. When the bandwidth changes from Scott $\times$ 0.75 to Scott $\times$ 1.25, the AFDR remains stable, ranging from 97.30% to 97.43%, and the AFAR varies only from 2.08% to 2.38%. Silverman’s rule also gives similar results, with an AFDR of 97.35% and an AFAR of 2.17%. These results indicate that the proposed method is not highly sensitive to moderate KDE bandwidth changes.

To examine whether the sparsity constraint suppresses weak fault signatures, we further analyze the effect of the target sparsity level $ρ$ in the KL sparsity term. Here, $ρ$ denotes the target average activation level of latent units. Different $ρ$ values are tested while keeping the other hyperparameters unchanged, as shown in Table 8.

As shown in Table 8, $ρ$ has a clear influence on difficult incipient faults. When $ρ$ is set to 0.05 or 0.10, the FDRs of faults 3 and 9 remain high, but the FDR of fault 15 decreases to 72.25% and 77.00%, respectively. When $ρ$ is increased to 0.20, the FDR of fault 15 is still limited at 73.75%. This indicates that an inappropriate sparsity level may weaken some low-amplitude fault signatures. With $ρ = 0.15$, SMK-DCA achieves the best overall performance, with FDRs of 91.13%, 95.87%, and 91.50% for faults 3, 9, and 15, respectively. It also obtains the highest AFDR of 97.37% and the lowest AFAR of 2.29%. These results show that the sparsity constraint does not simply suppress weak fault information. With a proper target sparsity level, it can reduce redundant latent responses while retaining informative abnormal patterns.

6.2. Insulated-Gate Bipolar Transistor (IGBT) Power System

The insulated-gate bipolar transistor (IGBT) is a high-performance semiconductor device widely used in power electronics. It combines high input impedance with strong current drive capability. In high-power-density applications, IGBTs often operate continuously under high voltage and high current. This makes them highly vulnerable to electrical stress. Long-term electrical stress may distort the internal electric field. This can reduce carrier injection efficiency, increase chip temperature, and eventually cause overheating or breakdown.

Experiments are conducted on a three-phase full-bridge inverter system that employs six IGBTs to convert direct current (DC) to three-phase alternating current (AC). The experimental platform consists of six IGBT modules with fast recovery diodes and generates a three-phase sine wave AC output through sinusoidal pulse width modulation (SPWM) and LC filters. Three phase-symmetric resistors simulate load conditions.

During the experiment, three-phase load voltage waveforms are sampled every second, with six features recorded per phase: frequency, average value, peak-to-peak value, root mean square value, clearance indicator, and kurtosis. Thus, each data sample contains 18 elements. We simulate two types of faults, corresponding to stationary and non-stationary processes. To simulate stationary fault 1, a 900 Ω shunt resistor is connected. To simulate non-stationary fault 2, the resistance value of the platform is linearly varied from 16 Ω to 25 Ω during the experiment, with the fault condition introduced by connecting a 700 Ω resistor in parallel with the inductor. A total of 8000 data samples are collected, covering the periods before and after the occurrence of the fault. The first 4000 samples are utilized for model training, while the remaining samples are used for testing. In the test data, the fault is introduced at the 2001st sampling time and lasts until the end of the experiment.

To further improve the interpretability of the IGBT validation, t-SNE visualization is used to compare different feature spaces. As shown in Figure 3, the raw time series shows severe overlap between normal and faulty samples. This indicates that the weak fault information is difficult to distinguish in the original measurement space. The process-aware features enlarge the distribution difference by reflecting statistical deviations and correlation changes. The local structure features further describe the local evolution of the data structure within sliding windows. After cyclic DCA, the normal and faulty samples become more clearly separated. This shows that cyclic feature interaction can integrate raw responses, statistical deviations, and local structural information. Therefore, the fused representation is more discriminative for IGBT fault detection.

From an engineering perspective, IGBT faults and degradation are usually related to electrical and thermal stresses. Typical mechanisms include bond-wire degradation, solder fatigue, gate-oxide-related degradation, and parameter drift. These mechanisms may affect several measured responses at the same time, rather than changing only one single variable. Therefore, a single feature space may be insufficient to describe the fault behavior. In this work, the raw time series preserves the direct electrical response. The process-aware features describe abnormal statistical deviations from normal operating conditions. The local structure features capture the local structural evolution of multivariate measurements. These complementary feature streams provide a practical engineering interpretation for the proposed heterogeneous feature fusion.

6.2.1. Parameter Settings

Here, PCA and ICA are adopted to generate process-aware features. For ICA, the number of independent components is set to nine. Other parameters, including the sliding window width $w$ and the configuration of the laptop computer, are the same as those used for the TEP simulations. For CVRSA, SFSA, PTMD, and OTSMD, the window width is set to 10, 10, 50, and 50, respectively. For DAE-PCA, the feature dimension after the PCA operation is set to 15. The 18 variables are partitioned into six sub-blocks, each containing three variables, and the sliding window size for KD-MBVAE is set to 50.

6.2.2. Detection Performance

Table 9 reports the online latency of SMK-DCA on the IGBT power system. The SVD operation is performed on the low-dimensional process-aware feature matrix rather than the original raw data. Thus, the computational burden of SWSV extraction is limited. The measured SVD time is 0.0241 ms per sample. The network inference time is 0.4488 ms per sample. The total online latency is 0.4729 ms per sample. This value is much smaller than the sampling period of the IGBT system, which is 1000 ms. Therefore, the combined latency of SWSV extraction and SMK-DCA inference is compatible with the sampling period in this case. Although the peak GPU memory reaches 5829.03 MB, the online latency remains low under the tested setting. These results indicate that the proposed method has practical feasibility for online fault detection under the tested sampling condition.
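The latency budget above is simple arithmetic, and the short sketch below reproduces it from the reported per-sample timings. The function name is hypothetical; the three numbers are the values quoted in the paragraph.

```python
def latency_budget(svd_ms, inference_ms, sampling_period_ms):
    """Return (total online latency in ms, fraction of the sampling period used)."""
    total_ms = svd_ms + inference_ms
    return total_ms, total_ms / sampling_period_ms


# Reported per-sample timings on the IGBT system and its 1000 ms sampling period.
total_ms, fraction = latency_budget(0.0241, 0.4488, 1000.0)
# total_ms is about 0.4729 ms, i.e. well under 0.1% of the sampling period.
```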

A comprehensive performance comparison across all compared methods is given in Table 10. It demonstrates the superior performance of SMK-DCA in detecting stationary and non-stationary faults in the IGBT power system. For fault 1, MSPM methods such as CVRSA, SFSA, PTMD, and OTSMD achieve FDRs above 96%. However, their FARs all exceed 1%. In comparison, SMK-DCA achieves an FDR of 97.50% with an FAR of only 0.05%. Compared with DL methods, SMK-DCA also exhibits clear superiority. For example, RATransformer and AAE achieve FDRs of only 47.55% and 2.25%, respectively, both much lower than that of SMK-DCA. For fault 2, the FDRs of most methods degrade markedly and the FARs increase significantly. For example, DePPCA exhibits an FAR above 16%, while KD-MBVAE($BIC_{KD}$) reaches an FDR of 100.00% but with an extremely high FAR of 89.30%. In contrast, SMK-DCA maintains a high FDR of 98.60% with a relatively low FAR of 1.55%. DL methods such as DALSTM-AE and DAE-PCA(Q) also show clearly lower FDRs on fault 2. For fault 2, the detailed monitoring results of SMK-DCA and several compared methods are shown in Figure 4. In summary, SMK-DCA achieves consistently high FDRs and low FARs on both stationary and non-stationary faults, demonstrating its reliability in the IGBT power system.

To further evaluate the early detection capability of SMK-DCA, the detection delay (DD) is reported in Table 11. DD is defined as the difference between the first alarm sample and the actual fault occurrence sample. A smaller DD indicates that the fault can be detected earlier. For the IGBT power system, the fault is introduced at the 2001st sample. SMK-DCA detects fault 1 at the 2011th sample, with a DD of 10 samples. It detects fault 2 at the 2029th sample, with a DD of 28 samples. These results show that SMK-DCA can detect the two IGBT faults shortly after fault occurrence.

6.2.3. Ablation Experiments

The ablation variants follow the same definitions as those in the TEP ablation study. The ablation experiment results are summarized in Table 12. $f_{r}$-SMKSA and $f_{p}$-SMKSA fail to detect these two faults effectively, yielding extremely low FDRs. In contrast, $f_{s}$-SMKSA achieves an FDR of 97.50% for fault 1. This indicates that SWSVs provide more discriminative information than raw data and process-aware features. However, its FDR for fault 2 is only 21.30%. This implies that a single feature stream is still insufficient for non-stationary fault detection. SMK-DCA w/o Cyclic achieves high FDRs for both faults, but its FARs increase to 21.05% and 78.25%, respectively. This shows that using only one cross-attention branch may cause many false alarms. SMK-CSA obtains a low FAR for fault 1, but its FAR for fault 2 is still 12.15%. This suggests that simple feature concatenation with self-attention cannot provide stable monitoring performance. SMM-DCA achieves FDRs of 97.65% and 98.80% for faults 1 and 2, respectively. However, its FAR for fault 2 is 5.30%, which is higher than that of SMK-DCA. After removing FiLM, the FDR remains high, but the FAR for fault 2 increases to 17.80%. This demonstrates the importance of FiLM-based self-modulation in suppressing false alarms. When the sparse AE or multi-target reconstruction is removed, the FARs also increase clearly. In particular, SMK-DCA w/o MTR yields FARs of 73.05% and 84.20% for faults 1 and 2, respectively. This confirms that multi-target reconstruction is important for stable fault detection. Overall, SMK-DCA achieves the best balance between FAR and FDR. It obtains FAR/FDR values of 0.05%/97.50% and 1.55%/98.60% for faults 1 and 2, respectively. These results further verify the effectiveness of cyclic DCA, KAN, FiLM, sparse AE, and multi-target reconstruction on the IGBT power system.

7. Conclusions

We propose a novel unsupervised DL model (SMK-DCA) for incipient fault detection. By fusing heterogeneous features, namely raw data, process-aware features, and SWSV features, through KAN-enhanced cyclic DCA, SMK-DCA enhances the sensitivity to incipient faults. Simulation results demonstrate that SMK-DCA achieves strong overall fault detection performance on TEP. It maintains consistently high FDRs on the challenging faults 3, 9, and 15 and achieves the highest AFDR among the compared MSPM and unsupervised DL methods. Moreover, validation on real-world data from an IGBT power system supports its effectiveness and generalization capability. Future research can focus on further optimizing the model structure and exploring its broader applicability across different industrial settings.

Author Contributions

Conceptualization, methodology, software, validation, formal analysis, X.Y. and M.C.; investigation, M.C.; resources, data curation, X.Y. and Y.G.; writing—original draft preparation, writing—review and editing, X.Y., Y.G. and M.C.; visualization, supervision, project administration, M.C.; funding acquisition, M.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 62373213, and in part by Science Foundation of China University of Petroleum, Beijing (No. 2462024YJRC0006).

Data Availability Statement

The data are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Miao, Y.; Li, Z.; Chen, M. Time/Frequency Feature-Driven Ensemble Learning for Fault Detection. Processes 2024, 12, 2099.
Xu, Y.; Bai, Z.; Chen, M. Incipient fault detection based on feature adaptive ensemble net. Processes 2025, 13, 1474.
Yin, S.; Ding, S.X.; Haghani, A.; Hao, H.; Zhang, P. A comparison study of basic data-driven fault diagnosis and process monitoring methods on the benchmark Tennessee Eastman process. J. Process Control 2012, 22, 1567–1581.
Li, Y.; Yang, D. Local component based principal component analysis model for multimode process monitoring. Chin. J. Chem. Eng. 2021, 34, 116–124.
Ji, C.; Ma, F.; Wang, J.; Sun, W. Orthogonal projection based statistical feature extraction for continuous process monitoring. Comput. Chem. Eng. 2024, 183, 108600.
Zheng, D.; Zhou, L.; Liu, Y.; Wu, Z.G.; Song, Z. Total Structure Multirate Autoregressive Dynamic Latent Variable Model for Multirate Dynamic Process Fault Detection. IEEE Trans. Syst. Man Cybern. Syst. 2025, 55, 6397–6408.
Zheng, J.; Zhang, Y.; Fang, Z.; Bi, Z. Structure-Adaptive Graph Embedding with Fractal Analysis for Industrial Process Monitoring. IEEE Trans. Instrum. Meas. 2025, 74, 3533313.
Kong, X.; He, Y.; Song, Z.; Liu, T.; Ge, Z. Deep Probabilistic Principal Component Analysis for Process Monitoring. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 7422–7436.
Ji, H.; Hou, Q.; Shao, Y.; Zhang, Y. Incipient fault detection for dynamic processes with canonical variate residual statistics analysis. Chemom. Intell. Lab. Syst. 2024, 252, 105189.
Ji, H.; Wang, R. Incipient fault detection and isolation for dynamic processes with slow feature statistics analysis. Chem. Eng. Sci. 2024, 298, 120386.
Shams, M.B.; Budman, H.; Duever, T. Fault detection, identification and diagnosis using CUSUM based PCA. Chem. Eng. Sci. 2011, 66, 4488–4498.
Chen, H.; Li, L.; Shang, C.; Huang, B. Fault detection for nonlinear dynamic systems with consideration of modeling errors: A data-driven approach. IEEE Trans. Cybern. 2022, 53, 4259–4269.
Li, T.; Han, Y.; Hu, X.; Ma, B.; Geng, Z. Twofold Weighted-Based Statistical Feature KECA for Nonlinear Industrial Process Fault Diagnosis. IEEE Trans. Autom. Sci. Eng. 2025, 22, 3901–3910.
Luna-Villagómez, E.; Mahalec, V. Exploring Kolmogorov–Arnold Networks for Unsupervised Anomaly Detection in Industrial Processes. Processes 2025, 13, 3672.
Li, D.; Dong, J.; Peng, K.; Simani, S.; Zhang, C.; Hua, D. Fault detection in nonstationary industrial processes via Kolmogorov-Arnold networks with test-time training. ISA Trans. 2025, in press.
Li, Z.; Tian, L.; Jiang, Q.; Yan, X. Distributed-ensemble stacked autoencoder model for non-linear process monitoring. Inf. Sci. 2021, 542, 302–316.
Li, G.; Zheng, Y.; Liu, J.; Zhou, Z.; Xu, C.; Fang, X.; Yao, Q. An improved stacking ensemble learning-based sensor fault detection method for building energy systems using fault-discrimination information. J. Build. Eng. 2021, 43, 102812.
Liu, D.; Wang, M.; Chen, M. Feature ensemble net: A deep framework for detecting incipient faults in dynamical processes. IEEE Trans. Ind. Inform. 2022, 18, 8618–8628.
Li, T.; Han, Y.; Wang, Y.; Geng, Z. A Self-Attention Mechanism Integrating Adaptive Double Subspace for Fault Detection in Industrial Processes. IEEE Trans. Syst. Man Cybern. Syst. 2025, 55, 540–549.
Yu, J.; Li, S.; Liu, X.; Li, H.; Ma, M.; Liu, P.; You, L. Residual squeeze-and-excitation convolutional auto-encoder for fault detection and diagnosis in complex industrial processes. Eng. Appl. Artif. Intell. 2024, 136, 108872.
Wu, P.; Ni, Y.; Wang, H.; Hu, X.; Wu, Z.; Jiang, J.; Hu, Y. Quality related fault detection based on dynamic-inner convolutional autoencoder and partial least squares and its application to ironmaking process. Chin. J. Chem. Eng. 2025, 89, 267–276.
Ji, C.; Ma, F.; Wang, J.; Sun, W.; Palazoglu, A. Industrial Process Fault Detection Based on Siamese Recurrent Autoencoder. Comput. Chem. Eng. 2025, 192, 108887.
Zhang, X.; Song, C.; Dai, W.; Zhang, Z.; Gao, K.; Gao, F. Causal Graph Spatial-Temporal Autoencoder for Reliable and Interpretable Process Monitoring. IEEE Trans. Neural Netw. Learn. Syst. 2026, 1–13.
Zeng, L.; Jin, Q.; Lin, Z.; Zheng, C.; Wu, Y.; Wu, X.; Gao, X. Dual-attention LSTM autoencoder for fault detection in industrial complex dynamic processes. Process Saf. Environ. Prot. 2024, 185, 1145–1159.
Shang, J.; Yu, J. A residual autoencoder-based transformer for fault detection of multivariate processes. Appl. Soft Comput. 2024, 163, 111896.
Jang, K.; Pilario, K.E.S.; Lee, N.; Moon, I.; Na, J. Explainable Artificial Intelligence for Fault Diagnosis of Industrial Processes. IEEE Trans. Ind. Inform. 2025, 21, 4–11.
Ren, Z.; Jiang, Y.; Yang, X.; Tang, Y.; Zhang, W. Learnable faster kernel-PCA for nonlinear fault detection: Deep autoencoder-based realization. J. Ind. Inf. Integr. 2024, 40, 100622.
Yao, Z.; Jiang, Q.; Gu, X. Distributed process monitoring based on Kantorovich distance-multiblock variational autoencoder and Bayesian inference. Chin. J. Chem. Eng. 2024, 73, 311–323.
Chen, Q.; Liu, Z.; Ma, X.; Wang, Y. Artificial Neural Correlation Analysis for Performance-Indicator-Related Nonlinear Process Monitoring. IEEE Trans. Ind. Inform. 2022, 18, 1039–1049.
Gao, H.; Huang, W.; Gao, X.; Han, H. Decentralized adaptively weighted stacked autoencoder-based incipient fault detection for nonlinear industrial processes. ISA Trans. 2023, 139, 216–228.
Zhao, S.; Duan, Y.; Roy, N.; Zhang, B. A deep learning methodology based on adaptive multiscale CNN and enhanced highway LSTM for industrial process fault diagnosis. Reliab. Eng. Syst. Saf. 2024, 249, 110208.
Liu, K.; Lu, N.; Wu, F.; Zhang, R.; Gao, F. Model fusion and multiscale feature learning for fault diagnosis of industrial processes. IEEE Trans. Cybern. 2022, 53, 6465–6478.
Liu, Y.; Xu, Z.; Wang, K.; Zhao, J.; Song, C.; Shao, Z. Incipient fault detection enhancement based on spatial-temporal multi-mode siamese feature contrast learning for industrial dynamic process. Comput. Ind. 2024, 155, 104062.
Amini, N.; Zhu, Q. Fault detection and diagnosis with a novel source-aware autoencoder and deep residual neural network. Neurocomputing 2022, 488, 618–633.
Tang, S.; Shi, H.; Song, B.; Tao, Y.; Tan, S. Physically-consistent-WGAN based small sample fault diagnosis for industrial processes. Chin. J. Chem. Eng. 2025, 78, 163–174.
Yin, C.; Dong, Y.; He, J.; Wang, Y. A fault evolution knowledge-driven adversarial meta-learning method for few-shot tool state recognition under variable working conditions. Eng. Appl. Artif. Intell. 2026, 167, 113806.
Lu, W.; Wang, Y.; Zhang, M.; Gu, J. Physics guided neural network: Remaining useful life prediction of rolling bearings using long short-term memory network through dynamic weighting of degradation process. Eng. Appl. Artif. Intell. 2024, 127, 107350.
Yin, C.; Li, Y.; Wang, Y.; Dong, Y. Physics-guided degradation trajectory modeling for remaining useful life prediction of rolling bearings. Mech. Syst. Signal Process. 2025, 224, 112192.
Yin, C.; Sun, T.; Wu, H.; Dong, Y. Trustworthy multistep-ahead remaining useful life prediction for rolling bearings with limited data. Reliab. Eng. Syst. Saf. 2026, 267, 111902.
Jia, N.; Huang, W.; Guo, P.; Ding, C.; Huangfu, Y.; Shen, C.; Zhu, Z. A physics-guided memory enhancement and causality-inspired generalization framework for continual fault diagnosis. Knowl.-Based Syst. 2025, 325, 114044.
Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljacic, M.; Hou, T.Y.; Tegmark, M. KAN: Kolmogorov–Arnold Networks. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025.
Perez, E.; Strub, F.; Vries, H.D.; Dumoulin, V.; Courville, A. FiLM: Visual Reasoning with a General Conditioning Layer. In Proceedings of the AAAI Conference on Artificial Intelligence; Association for the Advancement of Artificial Intelligence (AAAI): Washington, DC, USA, 2018; Volume 32. [Google Scholar]
Shi, Z.; Zhang, L.; Chen, M. Polarized Direct Cross-Attention Message Passing in GNNs for Machinery Fault Diagnosis. arXiv 2026, arXiv:2603.06303. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
Gao, T.; Yang, J.; Jiang, S. A novel fault detection model based on vector quantization sparse autoencoder for nonlinear complex systems. IEEE Trans. Ind. Informat. 2022, 19, 2693–2704. [Google Scholar] [CrossRef]
Ji, H.; Zhao, W.; Sheng, N. Incipient fault detection with probability transformation and statistical feature analysis. Automatica 2024, 166, 111706. [Google Scholar] [CrossRef]
Downs, J.J.; Vogel, E.F. A plant-wide industrial process control problem. Comput. Chem. Eng. 1993, 17, 245–255. [Google Scholar] [CrossRef]
Lakshmi Priya Palla, G.; Kumar Pani, A. Independent component analysis application for fault detection in process industries: Literature review and an application case study for fault detection in multiphase flow systems. Measurement 2023, 209, 112504. [Google Scholar] [CrossRef]
Valle, S.; Li, W.; Qin, S.J. Selection of the number of principal components: The variance of the reconstruction error criterion with a comparison to other methods. Ind. Eng. Chem. Res. 1999, 38, 4389–4401. [Google Scholar] [CrossRef]

Figure 1. The scheme of SMK-DCA.

Figure 2. Detection performance of fault 9 in TEP: (a) CVRSA ($D_{S}$), (b) PTMD (M), (c) SFSA, (d) $f_{p}$-SMKSA, (e) $f_{S}$-SMKSA, (f) SMK-DCA. The yellow dashed line denotes the fault start time.

Figure 3. t-SNE visualization of different feature spaces for fault 2 in the IGBT power system: (a) raw time series features, (b) process-aware features, (c) local structure features, (d) cyclic DCA fused features.

Figure 4. Detection performance of fault 2 in IGBT: (a) CVRSA ($D_{S}$), (b) PTMD (M), (c) DAE-PCA (Q), (d) $f_{p}$-SMKSA, (e) $f_{S}$-SMKSA, (f) SMK-DCA.

Table 1. Complexity comparison of different DCA variants on TEP.

Model	Backbone	FiLM	Trainable Parameters	Training Time (s)	Peak GPU Memory (MB)	Inference Time (ms/Sample)
SMM-DCA w/o FiLM	MLP	No	98,905	14.17	1090.82	0.1222
SMM-DCA	MLP	Yes	123,865	14.94	1110.78	0.1269
SMK-DCA w/o FiLM	KAN	No	196,825	93.34	5070.42	0.7996
SMK-DCA	KAN	Yes	221,785	94.83	5087.68	0.8035

Note: “w/o” denotes “without”. The inference time is averaged over test samples.
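
As a quick arithmetic reading of Table 1, the sketch below computes the parameter overhead of FiLM and the KAN-to-MLP cost ratios from the reported figures. The dictionary layout and the helper names `film_overhead` and `cost_ratio` are illustrative assumptions, not part of the original work; only the numbers come from the table.

```python
# Figures copied from Table 1; the dictionary layout is an illustrative choice.
table1 = {
    "SMM-DCA w/o FiLM": {"params": 98_905, "train_s": 14.17, "infer_ms": 0.1222},
    "SMM-DCA": {"params": 123_865, "train_s": 14.94, "infer_ms": 0.1269},
    "SMK-DCA w/o FiLM": {"params": 196_825, "train_s": 93.34, "infer_ms": 0.7996},
    "SMK-DCA": {"params": 221_785, "train_s": 94.83, "infer_ms": 0.8035},
}


def film_overhead(with_film: str, without_film: str) -> int:
    """Extra trainable parameters attributable to the FiLM module."""
    return table1[with_film]["params"] - table1[without_film]["params"]


def cost_ratio(key: str) -> float:
    """Ratio of the KAN variant (SMK-DCA) to the MLP variant (SMM-DCA)."""
    return table1["SMK-DCA"][key] / table1["SMM-DCA"][key]


mlp_overhead = film_overhead("SMM-DCA", "SMM-DCA w/o FiLM")
kan_overhead = film_overhead("SMK-DCA", "SMK-DCA w/o FiLM")

print("FiLM overhead (MLP backbone):", mlp_overhead)
print("FiLM overhead (KAN backbone):", kan_overhead)
for key in ("params", "train_s", "infer_ms"):
    print(f"KAN / MLP ratio for {key}: {cost_ratio(key):.2f}")
```

Both differences come to 24,960 parameters, so in these figures the FiLM overhead is the same for either backbone, while the KAN backbone accounts for most of the additional training and inference cost.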

Table 2. FDRs (%) for different MSPM methods on TEP.

Fault	CVRSA [9]	PTMD [46]	SFSA [10]	OTSMD [5]	SAGE-FA [7]	CVA [12]	DePPCA [8] ($SPE$)	DePPCA [8] ($LFS$)	TWSFKECA [13] ($T^{2}$)	TWSFKECA [13] ($SPE$)	SMK-DCA
1	99.75	99.50	100.00	99.12	99.88	99.75	99.90	99.90	100.00	100.00	99.50
2	99.00	97.88	98.50	97.50	98.13	99.50	99.20	99.20	100.00	100.00	97.75
4	100.00	100.00	100.00	100.00	100.00	99.88	100.00	100.00	100.00	93.21	100.00
5	88.75	99.88	100.00	99.88	100.00	99.88	100.00	100.00	46.97	27.02	99.88
6	99.88	100.00	100.00	100.00	100.00	99.88	100.00	100.00	100.00	100.00	100.00
7	99.88	100.00	100.00	100.00	100.00	99.88	100.00	100.00	100.00	100.00	100.00
8	98.00	97.62	98.75	99.75	98.25	99.88	98.60	98.50	100.00	100.00	97.50
10	77.00	97.25	95.62	81.25	89.88	96.63	96.20	96.40	90.75	81.79	96.88
11	95.75	99.25	96.38	99.12	75.13	99.38	93.20	93.60	100.00	94.22	99.25
12	99.62	99.75	100.00	100.00	99.88	99.50	100.00	100.00	100.00	100.00	100.00
13	95.38	94.75	96.00	94.12	95.25	96.13	96.40	96.40	100.00	100.00	98.38
14	99.25	99.88	100.00	99.88	100.00	99.88	100.00	100.00	100.00	100.00	99.88
16	86.75	100.00	97.62	97.25	92.88	99.13	97.10	97.10	96.10	70.38	100.00
17	97.62	97.38	98.00	96.88	96.63	98.13	97.80	97.90	100.00	100.00	97.62
18	90.88	89.75	91.12	90.00	89.75	99.25	91.90	91.60	100.00	100.00	89.50
19	52.62	99.88	99.88	98.00	90.38	99.88	99.90	99.90	89.02	87.57	99.75
20	84.62	91.50	92.38	93.25	91.13	97.63	92.10	92.10	97.11	97.11	91.37
21	57.63	68.62	58.13	95.25	56.63	–	63.40	64.80	55.64	42.49	99.00
3	10.00	14.50	16.75	25.12	2.25	73.03	9.20	10.40	82.23	63.01	91.13
9	13.12	9.88	14.25	14.25	2.13	92.26	7.90	9.00	91.33	61.71	95.87
15	10.50	28.62	24.62	23.50	7.75	99.50	44.80	52.90	60.40	44.36	91.50
AFAR	4.94	5.65	6.67	4.11	1.16	–	3.00	3.20	0.00	0.09	2.29
AFDR	78.86	85.04	84.67	85.91	80.28	–	85.10	85.70	90.93	83.95	97.37

Note: AFAR and AFDR denote the average FAR and FDR for faults 1–21, respectively. “–” denotes that the corresponding metric is not reported in the original reference.
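
The averaging in the note above can be checked directly against the SMK-DCA column of Table 2. The sketch below is a minimal illustration that assumes the usual definitions: FDR is the percentage of faulty samples whose monitoring statistic exceeds the control limit, and FAR is the same percentage over normal samples. The helper names `alarm_rate` and `average_rate` are hypothetical.

```python
from typing import Sequence


def alarm_rate(stat: Sequence[float], limit: float) -> float:
    """Percentage of samples whose statistic exceeds the control limit.

    Applied to faulty samples this gives the FDR; applied to normal
    samples it gives the FAR (assumed standard definitions).
    """
    return 100.0 * sum(s > limit for s in stat) / len(stat)


def average_rate(rates: Sequence[float]) -> float:
    """Plain arithmetic mean over the per-fault rates."""
    return sum(rates) / len(rates)


# SMK-DCA FDRs (%) from Table 2, in table order:
# faults 1, 2, 4-8, 10-14, 16-21, then the incipient faults 3, 9, 15.
smk_dca_fdr = [
    99.50, 97.75, 100.00, 99.88, 100.00, 100.00, 97.50,
    96.88, 99.25, 100.00, 98.38, 99.88, 100.00, 97.62,
    89.50, 99.75, 91.37, 99.00, 91.13, 95.87, 91.50,
]

afdr = average_rate(smk_dca_fdr)
print(f"AFDR over {len(smk_dca_fdr)} faults: {afdr:.2f}%")
```

The mean of the 21 entries rounds to 97.37, which agrees with the AFDR row reported for SMK-DCA.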

Table 3. FDRs (%) for different DL methods on TEP.

Fault	RATransformer [25]	CGST-AE [23]	TKAN [15]	AAE [26]	DALSTM-AE [24]	KD-MBVAE [28] (${BIC}_{T^{2}}$)	KD-MBVAE [28] (${BIC}_{KD}$)	DAE-PCA [27] ($PS$)	DAE-PCA [27] ($RS$)	ANCA [29] ($D_{1}$)	ANCA [29] ($D_{2}$)	SMK-DCA
1	99.87	99.60	99.30	99.75	100.00	100.00	99.00	100.00	99.87	98.00	35.50	99.50
2	98.50	98.50	97.50	98.88	98.00	99.00	98.00	98.62	98.00	96.50	90.00	97.75
4	100.00	99.90	99.80	98.88	100.00	100.00	100.00	100.00	92.00	97.30	5.60	100.00
5	98.62	99.90	99.30	41.63	100.00	100.00	100.00	100.00	9.87	98.10	27.10	99.88
6	100.00	99.90	100.00	100.00	100.00	100.00	100.00	100.00	100.00	98.00	95.30	100.00
7	100.00	99.90	100.00	100.00	100.00	100.00	100.00	100.00	99.12	98.40	33.50	100.00
8	99.37	98.40	97.10	99.13	98.00	99.00	97.00	98.25	97.75	100.00	94.90	97.50
10	85.00	88.00	66.60	68.75	94.00	86.00	78.00	88.87	74.00	93.10	29.80	96.88
11	83.63	95.60	69.00	81.25	96.00	81.00	99.00	83.25	57.62	97.80	22.00	99.25
12	99.75	99.80	98.90	99.38	100.00	100.00	100.00	99.75	99.75	98.50	99.10	100.00
13	95.50	95.80	94.40	95.38	95.00	95.00	95.00	95.37	94.00	93.00	92.40	98.38
14	100.00	99.90	100.00	100.00	100.00	100.00	100.00	100.00	99.87	98.50	0.00	99.88
16	91.12	91.30	75.50	63.13	98.00	94.00	89.00	92.50	89.87	100.00	13.60	100.00
17	97.00	97.80	95.10	93.38	98.00	96.00	97.00	97.12	82.12	100.00	27.50	97.62
18	91.62	91.90	89.30	93.13	91.00	91.00	95.00	90.62	90.25	90.90	88.00	89.50
19	74.00	67.90	22.00	20.63	100.00	90.00	95.00	94.50	88.62	92.60	0.40	99.75
20	77.88	81.40	68.60	72.75	92.00	90.00	90.00	86.25	74.12	90.00	20.60	91.37
21	49.75	64.50	65.90	47.88	58.00	62.00	63.00	61.00	34.37	73.40	47.00	99.00
3	16.00	17.90	–	21.13	11.00	8.00	4.00	7.50	4.00	98.00	11.90	91.13
9	15.00	16.80	–	16.63	8.00	8.00	5.00	5.50	2.88	98.30	1.90	95.87
15	23.50	21.90	–	31.50	17.00	20.00	8.00	13.25	5.00	38.00	11.10	91.50
AFAR	4.58	5.90	–	8.66	4.33	–	–	2.41	2.05	1.50	1.50	2.29
AFDR	85.53	82.20	85.50	73.49	83.52	81.86	81.52	81.54	71.12	92.78	40.34	97.37

Note: “–” denotes that the corresponding metric is not reported in the original reference.

Table 4. Detection delay analysis of SMK-DCA on TEP.

Fault	$t_{fault}$	$t_{\det}$	DD (Samples)
Fault 3	160	160	0
Fault 9	160	162	2
Fault 15	160	228	68
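
The DD column in Table 4 is the difference $t_{\det} - t_{fault}$. The sketch below shows one way such a delay could be computed from a monitoring statistic. It assumes that $t_{\det}$ is the first sample at or after fault onset at which the statistic exceeds the control limit; the optional `consecutive` argument is a hypothetical extension for rules that require several alarms in a row, and the detection rule actually used in this work may differ.

```python
from typing import Optional, Sequence


def detection_delay(
    stat: Sequence[float],
    limit: float,
    t_fault: int,
    consecutive: int = 1,
) -> Optional[int]:
    """Return t_det - t_fault in samples, or None if the fault is never detected.

    t_det is taken as the start of the first run of `consecutive` alarms
    (stat > limit) at or after t_fault. Both indices use the same convention
    as the input sequence.
    """
    run = 0
    for t in range(t_fault, len(stat)):
        run = run + 1 if stat[t] > limit else 0
        if run >= consecutive:
            t_det = t - consecutive + 1
            return t_det - t_fault
    return None


# Toy statistic: quiet until the fault at index 160, first alarm at index 162.
toy_stat = [0.0] * 162 + [2.0] * 10
print(detection_delay(toy_stat, limit=1.0, t_fault=160))

# Delays implied by the (t_fault, t_det) pairs reported in Table 4.
table4 = {"Fault 3": (160, 160), "Fault 9": (160, 162), "Fault 15": (160, 228)}
delays = {name: t_det - t_fault for name, (t_fault, t_det) in table4.items()}
print(delays)
```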

Table 5. The ablation study of SMK-DCA on TEP.

Fault	$f_{r}$-SMKSA	$f_{p}$-SMKSA	$f_{s}$-SMKSA	SMK-DCA w/o Cyclic	SMK-CSA	SMM-DCA	SMK-DCA w/o FiLM	SMK-DCA-AE	SMK-DCA w/o MTR	SMK-DCA
1	99.75	99.75	99.62	99.75	99.38	99.50	99.38	99.50	99.25	99.50
2	98.38	98.12	97.88	98.12	97.38	97.75	97.38	97.62	97.75	97.75
4	100.00	99.62	100.00	100.00	99.88	100.00	99.75	100.00	99.62	100.00
5	28.88	100.00	99.88	100.00	99.75	99.88	99.75	99.88	99.75	99.88
6	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	99.88	100.00
7	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	99.88	100.00
8	97.88	97.62	97.50	97.62	97.25	97.38	97.25	97.38	97.25	97.50
10	51.75	79.87	97.25	97.12	96.63	96.88	96.50	96.75	93.88	96.88
11	78.75	65.12	99.25	99.25	99.12	99.62	99.00	99.25	98.75	99.25
12	99.12	99.75	99.88	99.88	100.00	100.00	100.00	100.00	99.62	100.00
13	95.25	94.87	94.75	94.88	96.63	100.00	95.38	94.87	93.88	98.38
14	100.00	100.00	99.88	99.88	99.88	99.88	99.88	99.88	99.75	99.88
16	42.38	82.50	100.00	100.00	100.00	99.25	100.00	100.00	100.00	100.00
17	95.00	93.63	99.25	100.00	97.12	97.38	97.00	97.38	97.00	97.62
18	90.00	89.62	90.12	90.25	89.25	89.50	89.25	89.50	89.12	89.50
19	38.37	75.62	99.88	99.88	99.38	99.75	99.25	99.50	94.88	99.75
20	55.87	86.00	91.75	91.75	91.13	91.37	91.00	91.25	89.50	91.37
21	51.12	42.00	71.13	84.50	90.38	97.62	90.00	90.00	67.50	99.00
3	4.25	0.00	41.50	100.00	93.50	85.62	93.88	95.63	66.62	91.13
9	3.75	0.13	37.00	100.00	81.87	92.38	82.00	87.62	59.38	95.87
15	5.25	0.50	47.00	64.12	40.88	91.75	41.00	73.75	27.50	91.50
AFAR	2.05	0.00	6.82	7.02	2.80	2.56	2.68	3.54	3.15	2.29
AFDR	68.37	76.41	88.74	96.05	93.78	96.93	93.70	95.70	89.08	97.37

Note: “w/o” denotes “without”. The entries for faults 1–21 denote FDRs. AFAR and AFDR denote the average FAR and FDR for faults 1–21, respectively.

Table 6. Sensitivity analysis of PCA/ICA hyperparameter combinations with and without cyclic DCA on TEP.

Fault	S1 (w/o)	S1 (Cyc.)	S2 (w/o)	S2 (Cyc.)	S3 (w/o)	S3 (Cyc.)	S4 (w/o)	S4 (Cyc.)	S5 (w/o)	S5 (Cyc.)	S6 (w/o)	S6 (Cyc.)
1	100.00	99.62	100.00	99.62	99.75	99.50	99.75	99.50	99.75	99.62	100.00	99.62
2	100.00	97.75	100.00	97.88	98.25	97.75	98.12	97.75	99.12	97.75	100.00	97.75
4	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
5	100.00	100.00	100.00	99.88	100.00	99.88	100.00	99.88	100.00	99.88	100.00	99.88
6	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
7	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
8	97.62	97.38	100.00	97.50	97.62	97.38	97.62	97.50	97.62	97.38	97.50	97.38
10	99.00	96.88	100.00	96.88	97.25	96.88	97.12	96.88	97.00	96.88	97.25	96.88
11	100.00	99.25	100.00	99.25	100.00	99.25	99.25	99.25	100.00	99.25	100.00	99.25
12	100.00	100.00	100.00	100.00	100.00	99.88	99.88	100.00	100.00	100.00	100.00	100.00
13	100.00	95.38	100.00	95.25	97.00	94.88	94.88	98.38	100.00	94.88	100.00	95.00
14	100.00	100.00	100.00	100.00	100.00	99.88	99.88	99.88	100.00	99.88	100.00	99.88
16	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
17	100.00	100.00	100.00	100.00	100.00	97.38	100.00	97.62	100.00	97.38	100.00	97.38
18	100.00	90.50	100.00	91.38	90.25	89.62	90.25	89.50	99.62	90.00	100.00	89.62
19	100.00	99.75	100.00	99.88	99.88	99.62	99.88	99.75	99.88	99.75	100.00	99.75
20	91.75	91.25	100.00	91.25	91.62	91.25	91.75	91.37	91.62	91.25	91.62	91.25
21	100.00	84.75	100.00	84.62	100.00	80.25	84.50	99.00	100.00	100.00	100.00	100.00
3	100.00	100.00	100.00	100.00	100.00	95.25	100.00	91.13	100.00	99.12	100.00	96.88
9	100.00	99.50	100.00	100.00	97.12	88.38	100.00	95.87	100.00	88.25	100.00	88.00
15	100.00	70.88	100.00	68.88	94.88	70.25	64.12	91.50	100.00	83.50	100.00	81.38
AFAR	39.82	6.70	75.09	6.34	13.27	2.20	7.02	2.29	11.46	5.54	19.38	5.27
AFDR	99.45	96.33	100.00	96.30	98.27	95.11	96.05	97.37	99.27	96.89	99.35	96.66

Note: “w/o” denotes “without”, and “Cyc.” denotes cyclic DCA. The entries for faults 1–21 denote FDRs. AFAR and AFDR denote the average FAR and FDR for faults 1–21, respectively.

Table 7. Sensitivity analysis of KDE bandwidth selection on TEP.

Bandwidth Setting	$J_{\lim}$	FDR (%) of Fault 3	FDR (%) of Fault 9	FDR (%) of Fault 15	AFAR (%)	AFDR (%)
Scott $\times 0.75$	1.389884	91.50	96.12	91.75	2.38	97.43
Scott $\times 1.00$	1.401600	91.13	95.87	91.50	2.29	97.37
Scott $\times 1.25$	1.415872	90.88	95.62	91.38	2.08	97.30
Silverman	1.404795	91.12	95.88	91.38	2.17	97.35

Note: Scott $\times 1.00$ denotes the default bandwidth setting used in this work.
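
To illustrate how the bandwidth setting in Table 7 can shift the control limit $J_{\lim}$, the sketch below fits a one-dimensional Gaussian KDE to training statistics and reads the limit off the estimated CDF. It is a sketch under stated assumptions, not the procedure of this work: the confidence level `alpha = 0.99`, the rule-of-thumb factors in the form commonly used by KDE libraries, the bisection search, and all function names are illustrative choices.

```python
import math
from statistics import stdev
from typing import Sequence


def rule_of_thumb_bandwidth(x: Sequence[float], rule: str = "scott", scale: float = 1.0) -> float:
    """1-D rule-of-thumb bandwidth; `scale` multiplies it (e.g. 0.75, 1.00, 1.25)."""
    n = len(x)
    if rule == "scott":
        factor = n ** (-1.0 / 5.0)
    elif rule == "silverman":
        factor = (0.75 * n) ** (-1.0 / 5.0)
    else:
        raise ValueError(f"unknown rule: {rule}")
    return scale * factor * stdev(x)


def kde_cdf(x: Sequence[float], h: float, t: float) -> float:
    """CDF of a Gaussian KDE with bandwidth h, evaluated at t."""
    root2 = math.sqrt(2.0)
    return sum(0.5 * (1.0 + math.erf((t - xi) / (h * root2))) for xi in x) / len(x)


def kde_control_limit(x: Sequence[float], h: float, alpha: float = 0.99, tol: float = 1e-9) -> float:
    """Smallest t with KDE CDF(t) >= alpha, located by bisection."""
    lo, hi = min(x) - 10.0 * h, max(x) + 10.0 * h
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kde_cdf(x, h, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Synthetic stand-in for normal-condition statistics (not data from the paper).
train_stat = [0.1 * i for i in range(1, 51)]
for scale in (0.75, 1.00, 1.25):
    h = rule_of_thumb_bandwidth(train_stat, "scott", scale)
    print(f"Scott x {scale:.2f}: h = {h:.4f}, limit = {kde_control_limit(train_stat, h):.4f}")
```

On this synthetic data a wider kernel pushes the limit upward, which is the same direction as the $J_{\lim}$ trend across the three Scott settings in Table 7.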

Table 8. Sensitivity analysis of the target sparsity level $ρ$ on TEP.

$ρ$	Fault 3 FDR (%)	Fault 9 FDR (%)	Fault 15 FDR (%)	AFAR (%)	AFDR (%)
0.05	97.13	88.00	72.25	2.53	95.58
0.10	100.00	88.88	77.00	3.33	96.18
0.15	91.13	95.87	91.50	2.29	97.37
0.20	99.75	88.50	73.75	2.74	95.86

Note: For each $ρ$ setting, the model is retrained with the same random seed and other hyperparameters.

Table 9. Online latency analysis of SMK-DCA on the IGBT power system.

Item	Value
Method	SMK-DCA
SVD time (ms/sample)	0.0241
Network inference time (ms/sample)	0.4488
Total latency (ms/sample)	0.4729
Sampling period	1000 ms
Compatible with sampling period	Yes
Test peak GPU memory (MB)	5829.03
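
The latency budget in Table 9 is a simple sum and comparison, sketched below. The function name `latency_budget` and its return layout are hypothetical; the inputs are the per-sample SVD time, the network inference time, and the sampling period reported in the table.

```python
from typing import Tuple


def latency_budget(svd_ms: float, net_ms: float, period_ms: float) -> Tuple[float, float, bool]:
    """Return (total latency in ms, fraction of the sampling period used, fits-in-period flag)."""
    total_ms = svd_ms + net_ms
    return total_ms, total_ms / period_ms, total_ms < period_ms


# Values from Table 9 (IGBT power system).
total_ms, used_fraction, compatible = latency_budget(svd_ms=0.0241, net_ms=0.4488, period_ms=1000.0)
print(f"total latency: {total_ms:.4f} ms/sample")
print(f"share of sampling period: {100.0 * used_fraction:.4f}%")
print("compatible with sampling period:", compatible)
```

With these figures the per-sample processing takes well under 0.1% of the 1000 ms sampling period.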

Table 10. FARs/FDRs (%) of different methods on the IGBT power system.

Methods	Fault 1	Fault 2
CVRSA [9]	1.45/97.90	5.40/89.80
SFSA [10]	1.95/97.60	6.05/88.40
PTMD [46]	1.40/96.80	0.00/65.50
OTSMD [5]	7.80/96.95	13.25/83.70
DePPCA ($SPE$) [8]	2.15/23.70	16.35/68.55
DePPCA ($LFS$) [8]	2.95/95.45	4.35/57.45
SAGE-FA [7]	1.15/8.00	11.65/48.45
RATransformer [25]	1.85/47.55	11.45/72.90
AAE [26]	1.35/2.25	2.30/10.35
DALSTM-AE [24]	0.30/0.30	0.75/5.35
DAE-PCA ($T^{2}$) [27]	1.60/0.80	0.60/0.50
DAE-PCA (Q) [27]	2.05/88.15	8.10/59.40
KD-MBVAE (${BIC}_{T^{2}}$) [28]	4.00/23.80	7.65/46.55
KD-MBVAE (${BIC}_{KD}$) [28]	3.55/97.35	89.30/100.00
SMK-DCA	0.05/97.50	1.55/98.60

Table 11. Detection delay of SMK-DCA on the IGBT power system.

Fault	$t_{fault}$	$t_{\det}$	DD (Samples)
Fault 1	2001	2011	10
Fault 2	2001	2029	28

Table 12. FARs/FDRs (%) of the ablation experiments on the IGBT power system.

Method	Fault 1 FAR (%)	Fault 1 FDR (%)	Fault 2 FAR (%)	Fault 2 FDR (%)
$f_{r}$-SMKSA	0.00	0.00	0.15	0.50
$f_{p}$-SMKSA	0.00	0.00	0.00	0.00
$f_{s}$-SMKSA	1.45	97.50	0.00	21.30
SMK-DCA w/o Cyclic	21.05	97.45	78.25	100.00
SMK-CSA	0.00	97.40	12.15	96.55
SMM-DCA	1.25	97.65	5.30	98.80
SMK-DCA w/o FiLM	0.00	97.40	17.80	100.00
SMK-DCA-AE	1.25	97.40	21.95	97.05
SMK-DCA w/o MTR	73.05	97.65	84.20	100.00
SMK-DCA	0.05	97.50	1.55	98.60

Note: “w/o” denotes “without”. FAR and FDR denote the false alarm rate and fault detection rate, respectively.

Share and Cite

MDPI and ACS Style

Yu, X.; Gong, Y.; Chen, M. Self-Modulated KAN-DCA for Incipient Fault Detection in Industrial Processes. Processes 2026, 14, 1512. https://doi.org/10.3390/pr14101512
