1. Introduction
Fault diagnosis serves as a cornerstone in maintaining the operational stability and safety of industrial systems [1, 2]. With the increasing popularity of monitoring sensors and data acquisition equipment, abundant sensor data has made condition monitoring more feasible [3], which has aroused widespread interest among experts and scholars in adopting data-driven fault diagnosis methods [4].
Among the prevailing data-driven strategies for fault diagnosis, learning-based methods have become the mainstream due to their ability to automatically extract discriminative features from complex, high-dimensional sensor data [5]. For example, Zuo et al. proposed extraction networks and dual-model geometric calibration to monitor the status of bolts in industrial magnetic separation systems [6]; Mukherjee et al. developed deep learning models to detect pipeline faults from sensor data [7]; Cheng et al. trained a physics-informed Fourier neural operator as a surrogate solver for pantograph–catenary dynamics [8]; Chen et al. developed a parallel attention module within a flexible residual structure, showing improved diagnostic accuracy under noisy conditions [9]; and Chen et al. proposed a multi-scale ScConv and quaternion Transformer method based on a hybrid attention mechanism for noise-resistant bearing fault diagnosis [10]. These models eliminate the need for hand-crafted feature design and offer scalable solutions across diverse equipment types and operating conditions [11]. In laboratory settings or data-rich environments, such methods have shown remarkable accuracy and robustness, fueling their rapid adoption in predictive maintenance and intelligent monitoring systems [12]. Nevertheless, translating these successes to real-world applications presents persistent challenges [13]. Fault data, by nature, are difficult to obtain in sufficient quantity and diversity, often limited to a few cases captured under specific conditions [14]. Therefore, defining fault diagnosis as a few-shot learning problem is more in line with real-world requirements [15].
This paradigm is fundamentally more challenging than traditional data-driven approaches for several reasons. First, while conventional methods excel in data-rich environments, they struggle to generalize from the limited samples available in real-world industrial settings, where fault events are inherently rare. Second, the process of labeling fault data requires costly and time-consuming manual inspection by domain experts, making the creation of large-scale labeled datasets infeasible. Therefore, few-shot fault diagnosis (FSFD) directly confronts these practical constraints by aiming to build robust models from inherently scarce data, a scenario where traditional methods often fall short.
To address the challenges of FSFD, three mainstream solutions have emerged: domain generalization, self-supervised learning, and meta-learning. Domain generalization enhances cross-domain adaptability by learning invariant features from multiple source domains, reducing reliance on target-specific data [16]. For instance, Li et al. first demonstrated this principle by adversarially aligning feature distributions in four source domains [17]. Self-supervised learning leverages intrinsic data structures to generate pseudo-labels, addressing annotation scarcity through contrastive learning or auxiliary tasks [18]. For example, Fu et al. proposed a self-supervised learning approach for time-series anomaly detection, which was the first attempt to apply masked self-supervised learning to multivariate time-series anomaly detection [19]. Lastly, meta-learning explicitly models how to learn from a few instances across multiple tasks. Furthermore, meta-learning introduces a highly adaptive learning paradigm that addresses the few-shot problem by focusing on the acquisition of learning capabilities rather than just learning itself [20]. Therefore, meta-learning-based fault diagnosis has gradually become a potential solution for FSFD [21].
In the meta-learning paradigm, it is generally assumed that knowledge acquired from multiple auxiliary fault diagnosis tasks can be leveraged to rapidly adapt to new, related tasks with scarce labeled samples. The goal is to build a diagnosis model that can swiftly generalize to unseen fault conditions with minimal fine-tuning effort. In recent years, increasing attention has been paid to the application of optimization-based meta-learning algorithms in the FSFD domain. For instance, a first-order approximation of model-agnostic meta-learning (MAML) was employed to reduce training complexity while maintaining adaptation efficiency in mechanical fault diagnosis scenarios [20]; another line of work proposed a meta-regularization framework to constrain parameter updates and enhance generalization in noisy industrial environments [22]; additionally, a hybrid meta-learning model combining task-specific adaptation and shared representation learning was designed for cross-domain fault identification [23]. These methods typically aim to establish a set of initialization parameters that are broadly optimal across tasks, thereby enabling efficient adaptation to target fault types using few labeled examples. However, when the task distribution is highly heterogeneous, these approaches may struggle to maintain stable convergence and generalization performance [24]. Striking an effective balance between rapid adaptability and model stability thus remains a critical challenge for practical deployment.
In contrast, metric-based meta-learning (MBML) has demonstrated strong robustness in FSFD scenarios, particularly under low-data conditions where task heterogeneity and label scarcity can hinder optimization-based approaches. By learning an embedding space in which intra-class features are compact and inter-class features are well separated, MBML shifts the adaptation burden from parameter fine-tuning to metric computation, thereby improving stability and reducing overfitting risk [25, 26]. Among MBML methods, prototypical networks (PNs) [25] remain a widely adopted and conceptually simple baseline; however, their effectiveness in realistic FSFD is constrained by implicit assumptions of class-separable and task-invariant embeddings, which are often violated under non-stationary conditions, sensor noise, and severe class imbalance.
To mitigate these PN limitations, recent studies have explored diverse strategies, including PN with center loss (PN-CL) [27] and PN with center triplet loss (PN-CTL) [28]. Both add loss-function constraints that pull samples of the same class closer together and push samples of different classes apart. In addition, transductive inference, distribution calibration, or cross-domain alignment can be used to alleviate distribution shift [29, 30]. While methods like PN-CL and PN-CTL have improved performance by refining the loss function to better structure the embedding space, they often rely on conventional encoders that struggle with long-range temporal dependencies and employ static regularization constraints. Two key challenges remain underexplored in FSFD: the lack of strong temporal embeddings capable of modeling long-range dependencies in noisy industrial time series, and the absence of task-difficulty-aware regularization that adapts prototype geometry to varying episodic complexities. Recent studies have explored Transformer-based variants for industrial fault diagnosis and reported promising results, which motivates us to further investigate their potential for capturing temporal dependencies in complex industrial data [31, 32].
To address the challenges of effectively modeling long-range temporal dependencies in noisy industrial time series and adapting prototype representations to varying task difficulties, we propose a novel framework termed Transformer-embedded Task-Adaptive-Regularized Prototypical Network (TETARPN). Unlike conventional approaches relying on convolutional encoders, TETARPN leverages the self-attention mechanism of the Transformer to extract richer and more robust feature embeddings that better capture temporal context and subtle fault characteristics. Meanwhile, a task-adaptive prototype regularization strategy is introduced to dynamically guide the prototype learning process according to the complexity of each diagnostic task, encouraging more compact intra-class distributions and clearer inter-class boundaries. By jointly integrating these components, our method aims to enhance generalization capability and diagnostic accuracy in few-shot fault diagnosis scenarios under realistic industrial conditions.
The main contributions of this article are summarized as follows.
1. To overcome the limitations of conventional encoders in modeling complex industrial signals, we design a Transformer-based Temporal Encoder (TBTE). This module tokenizes time series into segments and applies the self-attention mechanism to them, enabling it to capture not only local dynamics but also critical long-range temporal dependencies. This results in more discriminative feature embeddings, which are crucial for distinguishing subtle fault characteristics.
2. We propose a novel Task-Adaptive Prototype Regularization (TAPR) strategy to address the challenge of task heterogeneity. This mechanism dynamically adjusts the regularization strength based on the difficulty of each few-shot task, promoting more compact intra-class distributions and clearer inter-class boundaries. This enhances the model’s adaptability and robustness in diverse and challenging diagnostic scenarios.
Rolling bearing benchmark data are used to analyze the FSFD performance of the proposed TETARPN approach. Extensive experiments demonstrate its effectiveness and superiority in comparison with state-of-the-art methodologies.
3. Fault Diagnosis Based on TETARPN
3.1. Overview of the Proposed Method
Although PN and its variants have demonstrated strong capability in MBML-based FSFD, their performance can still be hindered by two key limitations. First, PN-based methods’ embedding modules, typically based on shallow or convolutional architectures, are insufficient for capturing long-range temporal dependencies inherent in sequential signals. Second, they often struggle to cope with substantial task variability, which can lead to unstable feature distributions across different tasks.
To address the challenges of long-range temporal dependency and task variability, we propose a general-purpose framework named Transformer-embedded Task-Adaptive-Regularized Prototypical Network (TETARPN). This framework integrates two complementary enhancements into the conventional PN structure: a Transformer-based Temporal Encoder (TBTE) tailored for time series and a Task-Adaptive Prototype Regularization (TAPR) mechanism, as shown in Figure 1. The framework’s workflow proceeds as follows. First, each task’s data is divided into a support set and a query set. Both sets are processed by the TBTE to extract temporal features. For the support set, these features are then used to compute class prototypes. Concurrently, the TAPR module calculates a regularization loss that encourages intra-class compactness and inter-class separability. This loss is dynamically weighted based on the task’s initial difficulty and is combined with the primary classification loss from the query set to form a comprehensive task loss, which guides the model’s optimization.
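The support/query split at the start of this workflow can be illustrated with a minimal sketch. It assumes the data are held as a dictionary from class label to a list of signals; the function name `sample_episode`, its parameters, and the toy class names are hypothetical and not taken from the paper.

```python
import random

def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Draw one episode: n_way classes, k_shot support and n_query query samples each."""
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label in classes:
        # Sample without replacement so support and query never share a signal.
        picked = rng.sample(data_by_class[label], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query

# Toy dataset (placeholder values): 4 hypothetical fault classes, 10 signals each.
toy = {f"fault_{c}": [[float(c)] * 8 for _ in range(10)] for c in range(4)}
support, query = sample_episode(toy, n_way=3, k_shot=2, n_query=3,
                                rng=random.Random(0))
```

Each call yields one few-shot task; repeating it many times produces the episodic training stream that the rest of the framework consumes.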
The first component of TETARPN is the TBTE module, which serves as a high-capacity embedding function for time series. Traditional CNN or LSTM-based encoders struggle to capture global temporal features due to local receptive field limitations or sequential processing bottlenecks. In contrast, our embedding module utilizes a self-attention mechanism with positional encoding to extract long-term dependencies across time steps. To better adapt the Transformer to fault diagnosis scenarios, we adopt a time-series segmentation strategy as preprocessing, which partitions complex signals into meaningful intervals, enabling the attention mechanism to focus on discriminative temporal patterns.
The second component is the TAPR module, designed to dynamically adapt to the difficulty of each individual few-shot task. Unlike conventional PNs that optimize uniformly across tasks, this module computes an intra-class and inter-class prototype loss and adjusts its contribution according to the initial task loss. This allows the model to maintain compact intra-class distributions while enhancing inter-class separability, thereby improving generalization under varying task complexities.
By combining these two components, TETARPN couples expressive temporal embeddings from TBTE with task-adaptive regularization from TAPR, two critical factors for better FSFD performance across a variety of industrial conditions.
3.2. Transformer-Based Temporal Encoder
Transformers have achieved remarkable success in computer vision, particularly with vision Transformers (ViTs), which model images as sequences of patches and capture long-range dependencies through self-attention [39]. Inspired by this paradigm, we extend the Transformer architecture to time-series analysis by treating temporal subsequences as analogous to patches. This adaptation allows the model to capture both local patterns and global temporal dependencies, which is critical for FSFD tasks where subtle temporal correlations may indicate different fault conditions.
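Let each input sample be denoted as $x_i \in \mathbb{R}^{l}$, $i = 1, \dots, N$, where $l$ is the sequence length and $N$ is the number of samples. The sequence is first segmented into $n$ temporal subsequences:
$$x_i = \left[x_i^{(1)}, x_i^{(2)}, \dots, x_i^{(n)}\right] \tag{4}$$
Each segment is projected into an embedding space using a learnable weight matrix $W_e$ to obtain temporal tokens. A class token $x_{\text{cls}}$ is prepended, and positional encoding $E_{\text{pos}}$ is added:
$$Z_0 = \left[x_{\text{cls}};\; x_i^{(1)} W_e;\; x_i^{(2)} W_e;\; \dots;\; x_i^{(n)} W_e\right] + E_{\text{pos}} \tag{5}$$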
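The token sequence is processed by a Transformer encoder consisting of self-attention and feed-forward layers. The single-head scaled dot-product attention is defined as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \tag{6}$$
where $Q$, $K$, and $V$ are learnable projections of the token sequence and $d_k$ is the key dimension.
Residual connections and layer normalization are applied after each sublayer to enhance training stability:
$$Z' = \mathrm{LayerNorm}\big(Z + \mathrm{Sublayer}(Z)\big) \tag{7}$$
where $Z$ is the input to the sublayer.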
For clarity, the embedded feature extraction described above can be written compactly as
$$z = f_{\theta}(x) \tag{8}$$
where $z$ represents the output embedding, and $f_{\theta}(\cdot)$ denotes the TBTE module parameterized by $\theta$.
This design enables the model to adaptively capture both short-term dynamics and long-term dependencies in time series, producing discriminative embeddings suitable for MBML. By drawing inspiration from ViT and tailoring it to temporal data, the TBTE module provides a robust feature extraction backbone for downstream metric-based classification.
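The tokenization and attention steps described in this subsection can be sketched in plain Python. This is only an illustration under several assumptions: segments are equal-length and non-overlapping, the class token is a zero placeholder rather than a learned vector, all weights are random, and the residual, layer-normalization, and feed-forward sublayers are omitted. The helper names (`segment`, `single_head_attention`, `random_matrix`) and the sizes are hypothetical.

```python
import math
import random

def segment(signal, n_segments):
    """Split a 1-D signal into equal, non-overlapping segments (an assumption)."""
    seg_len = len(signal) // n_segments
    return [signal[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def single_head_attention(tokens, w_q, w_k, w_v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = math.sqrt(len(k[0]))
    scores = matmul(q, transpose(k))
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights

rng = random.Random(0)

def random_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

signal = [math.sin(0.2 * t) for t in range(64)]      # toy vibration-like signal
segments = segment(signal, n_segments=8)             # 8 segments of length 8
d_model = 4
tokens = matmul(segments, random_matrix(8, d_model)) # linear projection of segments
tokens = [[0.0] * d_model] + tokens                  # prepend placeholder class token
pos = random_matrix(len(tokens), d_model)            # stand-in positional encoding
tokens = [[t + p for t, p in zip(tok, enc)] for tok, enc in zip(tokens, pos)]
attended, weights = single_head_attention(
    tokens,
    random_matrix(d_model, d_model),
    random_matrix(d_model, d_model),
    random_matrix(d_model, d_model),
)
embedding = attended[0]  # reading out the class-token row follows ViT convention (assumed)
```

Because every token attends to every other token, each row of `weights` mixes information from all segments in a single step, which is the property the encoder relies on for long-range dependencies.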
3.3. Task-Adaptive Prototype Regularization
Given output embeddings $z$ from TBTE, we adopt a metric-based classification framework inspired by PN. In each few-shot episode, the support set contains $K$ samples for each of $M$ classes (an $M$-way $K$-shot task). The class prototypes $c_m$, $m = 1, \dots, M$, are first computed as the mean of support embeddings, as defined in Equation (1). While the original PN assumes a task-invariant and class-separable embedding space, such assumptions often fail in practice due to noise, inter-class ambiguity, and intra-class variation. To address this, we propose a TAPR strategy that shapes the prototype space according to the intrinsic difficulty of each episode.
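We simultaneously encourage intra-class compactness and inter-class separability. The intra-class loss penalizes dispersion of samples around their prototype:
$$\mathcal{L}_{\text{intra}} = \frac{1}{MK} \sum_{m=1}^{M} \sum_{k=1}^{K} \left\lVert z_{m,k} - c_m \right\rVert_2^2 \tag{9}$$
where $z_{m,k}$ denotes the embedding of the $k$-th support sample from class $m$ and $c_m$ is the prototype of class $m$.
The inter-class loss rewards separation between prototypes:
$$\mathcal{L}_{\text{inter}} = \frac{2}{M(M-1)} \sum_{m=1}^{M-1} \sum_{m'=m+1}^{M} \left\lVert c_m - c_{m'} \right\rVert_2^2 \tag{10}$$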
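To balance these objectives, we define the prototype regularization as the ratio of intra-class dispersion to inter-class separation:
$$\mathcal{L}_{\text{reg}} = \frac{\mathcal{L}_{\text{intra}}}{\mathcal{L}_{\text{inter}} + \epsilon} \tag{11}$$
where $\mathcal{L}_{\text{intra}}$ and $\mathcal{L}_{\text{inter}}$ are the intra-class and inter-class losses, and a small constant $\epsilon$ prevents numerical instability when $\mathcal{L}_{\text{inter}} \to 0$. Minimizing $\mathcal{L}_{\text{reg}}$ encourages embeddings to form tight clusters within classes while maximizing distances between prototypes.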
Our TAPR formulation can be interpreted as a Fisher-style ratio of intra-class dispersion to inter-class separation, $\mathcal{L}_{\text{intra}} / \mathcal{L}_{\text{inter}}$, similar in spirit to center loss and triplet loss that encourage intra-class compactness and inter-class separability. However, prior works such as PN-CL and PN-CTL adopt static penalty terms that apply uniformly across all tasks. In contrast, to adapt the strength of this regularization to task difficulty, we introduce a dynamic weight
controlled by the initial query classification loss
:
where
sets a baseline regularization strength and
determines sensitivity to task difficulty. Higher initial query loss leads to stronger regularization, allowing the model to adjust prototype compactness and separability more aggressively in challenging episodes. This adaptation ensures that difficult tasks receive stronger prototype constraints, while easier tasks are not over-regularized, thus enhancing stability and generalization across heterogeneous episodes.
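The final episodic training objective is represented by
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CE}} + \lambda \, \mathcal{L}_{\text{reg}} \tag{13}$$
where $\mathcal{L}_{\text{CE}}$ is the cross-entropy loss on the query set, $\lambda$ is the dynamic weight, and $\mathcal{L}_{\text{reg}}$ is the prototype regularization term. By dynamically shaping the prototype space during episodic training, TAPR produces embeddings that remain discriminative in complex tasks while avoiding over-constraining simpler ones, thus improving few-shot generalization.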
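The interaction between task difficulty and regularization strength can be sketched as follows. The exact functional form of the dynamic weight is not recoverable from this text, so the linear form used here, `lam0 * (1 + alpha * loss)`, is purely an assumption chosen because it has a baseline term and grows with the initial query loss. The function names and default values are likewise hypothetical.

```python
import math

def query_cross_entropy(sq_dists, true_index):
    """PN-style loss: softmax over negative squared distances to the prototypes."""
    logits = [-d for d in sq_dists]
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[true_index]

def dynamic_weight(initial_query_loss, lam0=0.1, alpha=0.5):
    # ASSUMED linear form: baseline lam0, sensitivity alpha to task difficulty.
    return lam0 * (1.0 + alpha * initial_query_loss)

def episode_objective(query_loss, reg_loss, initial_query_loss, lam0=0.1, alpha=0.5):
    """Query classification loss plus dynamically weighted prototype regularization."""
    return query_loss + dynamic_weight(initial_query_loss, lam0, alpha) * reg_loss

# An easy episode: the query sits close to its own prototype and far from the other.
easy = query_cross_entropy([0.1, 9.0], true_index=0)
# A hard episode: the two prototypes are almost equally distant.
hard = query_cross_entropy([4.0, 4.5], true_index=0)

w_easy = dynamic_weight(easy)
w_hard = dynamic_weight(hard)
total_hard = episode_objective(hard, reg_loss=0.5, initial_query_loss=hard)
```

Under this assumed form, the hard episode receives the larger weight, so its prototype geometry is constrained more strongly than that of the easy episode.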
3.4. Fault Diagnosis Procedure
The TETARPN-based FSFD flowchart is shown in Figure 2, where the blue part represents the offline training phase and the orange part depicts the online diagnosis phase.
In the training phase, the proposed TETARPN framework is trained on episodically sampled tasks from the training dataset $\mathcal{D}_{\text{train}}$, which includes the known fault categories. As detailed in Algorithm 1, each training episode consists of a support set and a query set sampled from a subset of classes. During each episode, the model first extracts Transformer embeddings for both support and query samples, and then constructs class-wise prototypes using the support embeddings. The TAPR strategy is subsequently applied to jointly optimize classification performance and embedding space structure. The episodic optimization process encourages the model to generalize well to novel diagnostic tasks by learning a transferable embedding space and a robust metric-based decision boundary. After sufficient training episodes, the model parameters are optimized to $\theta^{*}$.
Algorithm 1 Learning Algorithm for TETARPN
Input: Training set $\mathcal{D}_{\text{train}}$; model parameters $\theta$
Output: Learned parameters $\theta^{*}$
1: Initialize the embedding network with random weights
2: Sample episodic tasks, where each task consists of a support set and a query set
3: for each task do
4:   Segment and tokenize time-series samples via Equations (4) and (5)
5:   Encode support and query samples into embeddings using the TBTE via Equations (6)–(8)
6:   Compute class prototypes from support set using Equation (1)
7:   Compute TAPR loss via Equations (9)–(13)
8:   Update model parameters
9: end for
10: Return: $\theta^{*}$
In the diagnosis (inference and deployment) phase, the trained TETARPN model is used to perform fault diagnosis on new, unseen tasks sampled from $\mathcal{D}_{\text{test}}$, containing previously unobserved fault types. In practical deployment, a few labeled instances from the new task are leveraged solely to define class prototypes in the embedding space. These instances are passed through the frozen Transformer encoder $f_{\theta^{*}}$ to extract their embeddings, which are then averaged to obtain task-specific prototypes $c_m$. For any incoming data sample $x$, its embedding is computed and compared against the prototypes to determine its class, enabling efficient, real-time diagnosis based on the learned metric:
$$\hat{y} = \arg\min_{m} \; d\big(f_{\theta^{*}}(x),\, c_m\big) \tag{14}$$
where $d(\cdot, \cdot)$ denotes the squared Euclidean distance.
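The nearest-prototype decision rule used at inference can be sketched in a few lines. The example works directly on embedding vectors and assumes the frozen encoder has already been applied. The function names and fault labels are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def build_prototypes(labeled_embeddings):
    """labeled_embeddings: dict class -> list of embeddings of the few labeled instances."""
    return {m: [sum(col) / len(vecs) for col in zip(*vecs)]
            for m, vecs in labeled_embeddings.items()}

def diagnose(embedding, protos):
    """Return the class whose prototype is closest to the query embedding."""
    return min(protos, key=lambda m: sq_dist(embedding, protos[m]))

# Two labeled instances per new fault type (already encoded, toy 2-D embeddings).
few_shot = {
    "normal": [[0.0, 0.0], [0.0, 2.0]],
    "outer_race": [[4.0, 0.0], [4.0, 2.0]],
}
protos = build_prototypes(few_shot)
pred_a = diagnose([3.5, 1.0], protos)
pred_b = diagnose([0.5, 1.2], protos)
```

No gradient step is taken here: adapting to a new task only requires averaging a handful of embeddings, which is what keeps the online phase lightweight.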
Through this procedure, the proposed TETARPN method enables rapid adaptation to new diagnostic conditions with minimal labeled data. By integrating TBTE and TAPR, it achieves better FSFD performance across diverse and challenging working scenarios.