A Disentangled Prototype-Driven Continual Learning Framework for Fault Diagnosis of Cotton Harvester Picking-Head Drivetrains Under Gradually Expanding Operating Conditions

Jiao, Huachao; Sun, Wenlei; Wang, Hongwei; Wan, Xiaojing

doi:10.3390/agriculture16050566

¹ Intelligent Manufacturing Modern Industry College, Xinjiang University, Urumqi 830046, China

² School of Mechanical and Electrical Engineering, Xinjiang Institute of Engineering, Urumqi 830091, China

* Author to whom correspondence should be addressed.

Agriculture 2026, 16(5), 566; https://doi.org/10.3390/agriculture16050566

Submission received: 17 January 2026 / Revised: 24 February 2026 / Accepted: 28 February 2026 / Published: 2 March 2026

(This article belongs to the Section Agricultural Technology)

Abstract

The picking-head drivetrain is a critical transmission component of cotton harvesters, and monitoring and diagnosing its faults is essential for stable and reliable operation of the equipment. In practical engineering applications, diagnostic models for picking-head drivetrains are typically initialized using data collected under a limited number of representative operating conditions. Although sufficient fault samples can often be obtained during the initial training stage, the coverage of operating conditions is inherently restricted. Once the model is deployed in the field, fault samples from new operating conditions are acquired gradually, stage by stage. Updating the diagnostic model stably while operating-condition coverage continuously expands, without performance degradation or catastrophic forgetting, remains a critical challenge. To address these issues, this paper proposes a continual learning method, termed DP-CL (Disentangled Prototype-Driven Continual Learning), for fault diagnosis of cotton harvester picking-head drivetrains under gradually expanding operating conditions. The proposed method is built upon an explicit disentanglement of condition-invariant features and condition-specific features. Within a unified framework, three types of structured prototypes (class prototypes, condition prototypes, and condition-aware class prototypes) are constructed to form a multi-level representation hierarchy. A prototype-driven structured update mechanism is then employed to impose stable constraints on fault-discriminative semantics across different operating conditions. In addition, an operating-condition similarity measure based on condition-specific features is introduced, and a proportion-adaptive sample selection strategy is designed on top of it. This strategy enables controlled knowledge transfer and preservation of discriminative structures during multi-stage model updates. Experimental results obtained under a laboratory-constructed cumulative operating-condition expansion scenario demonstrate that the proposed method achieves superior performance in terms of overall performance retention, cross-stage stability, and resistance to performance degradation. Moreover, as the number of operating conditions increases, the performance of the proposed method varies relatively smoothly, while class structures remain clear and the level of confusion remains controllable. These results validate the effectiveness of the proposed approach for stable fault diagnosis under expanding operating-condition coverage.

Keywords: continual learning; fault diagnosis; feature disentanglement; prototype learning; cotton harvester

1. Introduction

Cotton is an important strategic agricultural commodity and a primary raw material for the textile industry, and its stable and efficient production is of great significance to agricultural economics and related industrial sectors. In recent years, cotton production efficiency in the Xinjiang region has been significantly improved through continuous advances in cultivation techniques and the mechanization upgrade of agricultural equipment. In this process, domestically produced cotton harvesters, benefiting from their high cost-effectiveness, have gradually become essential equipment in cotton production in Xinjiang [1].

During the cotton harvesting season, cotton harvesters are required to operate continuously under high loads over large working areas within relatively short harvesting windows. As a key power transmission unit in the picking system, the picking-head drivetrain is exposed to complex operating conditions characterized by high dust concentration, strong impacts, and fluctuating loads. On the one hand, dust ingress in field environments accelerates lubrication degradation and abrasive wear; on the other hand, operating conditions such as crop entanglement and local blockage of the picking head can introduce frequent impact loads, leading to deterioration of gear meshing conditions and abnormal bearing contact stresses. The combined effects of these factors accelerate the accumulation of fatigue damage in drivetrain components, causing early latent faults such as gear pitting, bearing spalling, and increased clearances to progressively evolve into severe structural damage [2]. Once abnormalities occur in the picking-head drivetrain, the stable operation of the picking system is directly affected, potentially resulting in machine shutdowns, operational delays, and significantly increased maintenance costs. Therefore, developing intelligent fault diagnosis methods for picking-head drivetrains and formulating appropriate maintenance strategies in a timely manner are of great engineering significance for ensuring the operational reliability of domestically produced cotton harvesters under complex operating conditions [3].

Vibration-signal-based fault diagnosis provides an effective pathway to address the above challenges due to its clear physical basis, flexible sensor deployment, and rich information content [4]. Abnormal gear meshing, localized bearing defects, and transmission-chain stiffness fluctuations can generate impact and modulation effects during operation, which can be captured through vibration responses. In the field of rotating machinery condition monitoring, numerous vibration-based diagnosis methods have been proposed and widely applied [5,6]. Traditional vibration analysis methods are typically grounded in rigorous mathematical theories; they extract fault-related information such as impact features, modulation components, or energy distributions using techniques including envelope demodulation, wavelet analysis, and time–frequency transforms, and then employ appropriate algorithms to identify gear or bearing faults. These methods are computationally efficient and highly interpretable, and they perform well when operating conditions are relatively stable or fault patterns are simple, which explains their broad adoption in engineering practice.

With the increasing complexity of transmission systems and the growing uncertainty of operating environments, data-driven intelligent fault diagnosis has become a major research focus. Deep learning models can automatically learn nonlinear modulation patterns and complex time–frequency structures from vibration signals in an end-to-end manner, demonstrating superior performance over traditional methods in diagnosing faults of gearboxes, bearings, and other rotating components. However, such performance gains typically rely on large amounts of labeled data and often assume that training and test data follow the same distribution. This assumption is difficult to satisfy in fault diagnosis for the main transmission system of cotton pickers. The statistical characteristics of vibration signals in this system can continuously change due to speed fluctuations, load variations, assembly differences, and the evolution of component wear; meanwhile, fault data acquisition is often stage-wise and sporadic, making it difficult to cover all potential operating conditions during training. Consequently, static diagnostic models built on fixed datasets often fail to maintain stable performance in long-term operation, facing significant challenges in both generalization and continual adaptability. Continual learning (CL), as a machine learning paradigm that continuously acquires new knowledge from data streams while retaining previously learned capabilities, offers a promising direction for developing adaptive and evolvable fault diagnosis models for cotton-picker transmission systems [7,8].

In recent years, continual learning has been widely adopted in intelligent fault diagnosis to mitigate catastrophic forgetting and to improve model stability and adaptability over long-term, multi-stage operation. Existing methods in this direction can be broadly categorized into three groups. The first group introduces regularization constraints during model updates to limit the interference of new knowledge with previously learned knowledge. For example, elastic weight consolidation (EWC) and its variants based on parameter-importance estimation have been used to suppress drastic changes in critical parameters during incremental learning, thereby alleviating forgetting. Building upon this idea, some studies further impose structured constraints at the feature/representation level; for instance, reserving embedding space to accommodate potential new fault types helps preserve the stability of learned feature structures while learning new knowledge [9]. To address class imbalance and scarce incremental samples, continual diagnosis models combining prototype contrastive learning and weight alignment have been proposed to improve discriminative stability under feature distribution shifts [10,11]. In addition, uncertainty modeling has been incorporated into continual learning by constraining predictive evidence or output-distribution consistency to suppress performance degradation during incremental updates [12,13,14]. Although these approaches can enhance anti-forgetting capability to some extent, their effectiveness often depends on the design of regularization terms and the accuracy of parameter-importance estimation; under changing operating-condition distributions or severely limited samples, it remains difficult to simultaneously achieve stability and adaptability.

The second group tackles continual learning from the network-architecture perspective by parameter isolation or dynamic architecture expansion. Typical strategies introduce independent branches or sub-modules at different incremental stages and maintain historical diagnostic capability via knowledge distillation or structural fusion mechanisms, such as dynamic branch fusion networks [15], elastic expandable continual diagnosis models [16,17], and incremental fault diagnosis methods based on evolvable graph structures [18,19]. Moreover, for unknown operating conditions or system-level evolution, mixture-of-experts models and dynamic expert selection strategies have been applied to continual diagnosis tasks [20,21,22]. While these methods are effective in suppressing forgetting, model structures often keep expanding as learning progresses, leading to increased computational complexity and long-term maintenance costs. In addition, their reliance on explicit task or operating-condition boundaries can limit flexibility in scenarios where operating conditions evolve continuously.

The third group relies on replay and memory mechanisms, enhancing retention of old knowledge by injecting historical information during incremental learning. A variety of strategies have been studied, including sample replay, feature replay, and generative replay. Under memory constraints in industrial applications, some works focus on efficient memory management, such as sample selection driven jointly by gradient importance and class coverage [23], or selective replay and loss allocation based on task similarity measures [24]. Furthermore, contrastive-learning or reinforcement-learning ideas have been introduced into continual diagnosis, using historical healthy states or stable patterns as references to strengthen representation consistency and discriminative robustness [25,26]. For unlabeled or cross-domain distribution drift, continual unsupervised domain adaptation has also been explored to mitigate long-term performance degradation [27]. Despite their flexibility in alleviating forgetting, replay-based methods depend heavily on the representativeness of stored (or generated) information, and they still face challenges in long-term deployment, including storage overhead, noise interference, and stability control.

Overall, existing continual learning methods provide various effective solutions for model updating in intelligent fault diagnosis from different perspectives, including regularization constraints, architectural isolation, and memory replay mechanisms. However, in practical fault diagnosis scenarios of cotton harvester picking-head drivetrains, diagnostic models are often initially trained using data collected under a limited number of representative operating conditions. As model deployment and field operation progress, fault samples corresponding to different rotational speeds and equipment-specific variations are gradually introduced in a stage-wise manner. Under such continuously expanding operating-condition coverage, how to achieve stable model updating while effectively suppressing performance degradation and catastrophic forgetting remains a critical challenge that must be addressed.

To tackle the above issues, this study focuses on the fault diagnosis of cotton harvester picking-head drivetrains and proposes a continual learning-based diagnostic method, termed DP-CL (Disentangled Prototype-Driven Continual Learning), tailored for scenarios with gradually expanding operating-condition coverage. The proposed approach aims to maintain stable fault-discriminative capability throughout multi-stage model updates by means of structured modeling and controlled update mechanisms. The main contributions of this work are summarized as follows:

(1) A feature disentanglement modeling strategy for continual learning is proposed, which explicitly separates sample representations into condition-invariant features that capture stable fault-discriminative semantics across operating conditions and condition-specific features that characterize differences among operating conditions. This strategy provides a structured foundation for coordinating knowledge retention and new-condition modeling under gradually expanding operating-condition coverage.
(2) Within a unified framework, three types of structured prototype representations, including class prototypes, condition prototypes, and condition-aware class prototypes, are constructed and jointly optimized through a consistent pull–push constraint mechanism. These prototypes serve complementary roles in anchoring global semantic consistency and enhancing local discriminative capability during continual learning.
(3) To address the trade-off between replay efficiency and discriminative stability in continual learning stages, an operating-condition similarity measurement method based on condition-specific features is developed. Based on this measurement, a proportion-adaptive top–bottom hybrid sample selection strategy is designed to improve knowledge transfer efficiency while suppressing catastrophic forgetting.
(4) During continual learning updates, a prototype-driven structured parameter update mechanism is combined with feature-level knowledge distillation to enable controlled adjustment of model parameters and feature distributions, thereby enhancing the long-term stability and cross-condition diagnostic performance of the proposed diagnostic model under multi-stage operating-condition expansion.

The remainder of this paper is organized as follows. Section 2 presents the proposed DP-CL method, including the feature disentanglement strategy, prototype construction, and the continual learning update mechanism. Section 3 describes the experimental setup and evaluation results under cumulative operating-condition expansion scenarios. Section 4 concludes the paper and discusses potential directions for future research.

2. Continual Learning Model Based on Feature Disentanglement and Prototype Updating

In practical agricultural machinery fault diagnosis, the same type of mechanical fault may repeatedly occur under different operating conditions, such as variations in rotational speed, load levels, or working environments. These operating-condition changes often lead to significant shifts in the statistical characteristics of vibration signals, causing diagnostic models trained on historical data to experience noticeable performance degradation when deployed under new conditions. Although the fault categories themselves remain unchanged, the learned feature representations typically entangle fault-related information with operating-condition-dependent variations. As a result, changes in operating conditions may induce distributional shifts in the feature space and consequently distort previously established decision boundaries, leading to degradation in the recognition capability for previously learned faults. In practical deployment, directly fine-tuning the model using data from newly introduced operating conditions often causes the model to bias toward the most recent operating state while weakening diagnostic performance on previously learned conditions, which manifests as knowledge forgetting in continual learning scenarios. Therefore, when operating conditions evolve over time while fault categories remain fixed, achieving adaptation to new conditions without compromising previously acquired diagnostic knowledge becomes the central challenge of Operating-Condition Incremental Learning (OCIL).

To address this challenge, this study investigates the fault diagnosis problem of cotton-harvester picking-head drivetrains and develops a continual model updating framework tailored for progressively evolving operating conditions. The core idea is to disentangle stable fault-related semantic information from feature components induced by operating-condition variations, enabling the model to distinguish between invariant diagnostic characteristics and condition-dependent distribution changes. Based on this representation strategy, prototype-based constraints are employed to maintain consistent structural relationships among fault classes in the feature space across different learning stages, thereby preserving stable diagnostic references. Meanwhile, representative historical operating-condition information and knowledge distillation mechanisms are incorporated to continuously retain previously acquired knowledge during adaptation to newly introduced conditions. Guided by this conceptual framework, the continual learning task is first formally defined in the following subsection, followed by detailed descriptions of the proposed model architecture and optimization procedure.

2.1. Continual Learning Task Definition and Problem Formulation

Let $D_{0}, D_{1}, \dots, D_{t}$ denote the sequence of data domains used for model updating over time. The data domain at time step $t$ is defined as $D_{t} = \{(x_{i}^{t}, y_{i}^{t}, z_{i}^{t})\}_{i = 1}^{n_{t}}$, where $x_{i}^{t}$ represents the $i$-th signal sample collected at time $t$, $y_{i}^{t} \in Y^{t}$ denotes the corresponding fault class label of $x_{i}^{t}$, and $z_{i}^{t} \in Z^{t}$ indicates the operating-condition label associated with sample $x_{i}^{t}$. Here, $n_{t}$ denotes the number of samples in $D_{t}$.

At $t = 0$, the initial data domain $D_{0}$ contains a sufficient number of labeled samples collected under multiple operating conditions, which are primarily used for initial model training and prototype construction.

When $t > 0$, the newly introduced data domains $D_{1}, D_{2}, \dots, D_{t}$ typically contain only a small number of labeled samples collected under a single operating condition, which is regarded as a newly introduced condition relative to the historical training stages.
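The task definition above can be sketched as a small data structure. This is only an illustration under stated assumptions: the names `Sample` and `new_conditions` and the toy values are hypothetical and do not come from the paper. Each sample carries a signal, a fault label, and an operating-condition label, and a helper checks which condition labels of an incoming domain have not appeared in earlier stages.

```python
from dataclasses import dataclass
from typing import List, Sequence, Set


@dataclass(frozen=True)
class Sample:
    x: Sequence[float]  # signal sample x_i^t
    y: int              # fault class label y_i^t
    z: int              # operating-condition label z_i^t


def new_conditions(history: List[List[Sample]], incoming: List[Sample]) -> Set[int]:
    """Return the condition labels in `incoming` that no earlier domain contains."""
    seen = {s.z for domain in history for s in domain}
    return {s.z for s in incoming} - seen


# D_0: several operating conditions; D_1: a few samples from one new condition.
D0 = [Sample([0.1, 0.2], y=0, z=0), Sample([0.3, 0.1], y=1, z=0),
      Sample([0.2, 0.4], y=0, z=1), Sample([0.5, 0.2], y=1, z=1)]
D1 = [Sample([0.9, 0.8], y=0, z=2), Sample([0.7, 0.6], y=1, z=2)]

print(new_conditions([D0], D1))                      # the condition introduced by D_1
print({s.y for s in D1} <= {s.y for s in D0})        # fault classes stay fixed
```

In this toy setup the fault label set is unchanged across stages while the condition label set grows, which matches the operating-condition incremental setting described at the start of Section 2.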

2.2. Feature Disentanglement and Prototype Construction at the Initial Stage

Within the continual learning framework, the initial training stage at $t = 0$ plays a crucial role, as it aims to establish stable and discriminative feature disentanglement modules under multi-condition settings and to lay a solid foundation for subsequent prototype updating and knowledge retention. To this end, inspired by the idea of feature disentanglement, the sample representations are explicitly decomposed into condition-invariant features and condition-specific features, which are designed to capture fault-related category information shared across operating conditions and environment-related characteristics associated with specific operating conditions, respectively.

In addition, fused features are introduced to jointly incorporate global and local information, and the initialization of various types of prototypes is completed at this stage. The overall structure of feature disentanglement and prototype construction at the initial stage is illustrated in Figure 1.

2.2.1. Feature Disentanglement

At $t = 0$, each sample $x_{i}^{0}$ is first processed by a feature encoder $E$ to extract a feature representation $F_{i}^{0} = E(x_{i}^{0}; \theta_{E}^{0})$, where $\theta_{E}^{0}$ denotes the trainable parameters of the encoder. The extracted feature $F_{i}^{0}$ is then fed into two independent feature disentanglement branches to obtain a condition-invariant feature $F_{i}^{di, 0} = \mathrm{FC}_{di}(F_{i}^{0}; \theta_{Fdi}^{0})$ and a condition-specific feature $F_{i}^{ds, 0} = \mathrm{FC}_{ds}(F_{i}^{0}; \theta_{Fds}^{0})$, respectively, where $\theta_{Fdi}^{0}$ and $\theta_{Fds}^{0}$ are the trainable parameters of the condition-invariant and condition-specific feature separators.

The two disentangled features $F_{i}^{di, 0}$ and $F_{i}^{ds, 0}$ are then concatenated to form a fused feature representation $F_{i}^{df, 0} = [F_{i}^{di, 0} \oplus F_{i}^{ds, 0}]$, which is subsequently fed into a feature decoder $DE$ to reconstruct the original sample as $\hat{x}_{i}^{0} = DE(F_{i}^{df, 0}; \theta_{DE}^{0})$. Here, $\oplus$ denotes the feature concatenation operation along a specified dimension, and $\theta_{DE}^{0}$ represents the trainable parameters of the decoder. The reconstruction loss is defined as

$$L_{re} = \frac{1}{n_{0}} \sum_{i = 1}^{n_{0}} \| x_{i}^{0} - \hat{x}_{i}^{0} \|_{2}^{2} \tag{1}$$

In this expression, $n_{0}$ denotes the number of training samples at $t = 0$, and $\| \cdot \|_{2}$ represents the $L_{2}$ norm. By minimizing the reconstruction loss $L_{re}$, the model is encouraged to jointly exploit both condition-invariant and condition-specific features for sample reconstruction, thereby ensuring that both feature branches preserve effective semantic information relevant to the original signal. This constraint helps suppress information collapse caused by a single feature branch dominating the reconstruction task and promotes complementary representations between the two types of features during training.
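As a rough illustration of Equation (1), the sketch below concatenates two feature vectors and averages the squared $L_{2}$ reconstruction error over a batch. The function names `fuse` and `reconstruction_loss` and the toy vectors are hypothetical; the encoder and decoder networks themselves are not specified here and are left out.

```python
from typing import List

Vector = List[float]


def fuse(f_di: Vector, f_ds: Vector) -> Vector:
    """Concatenate condition-invariant and condition-specific features."""
    return f_di + f_ds


def reconstruction_loss(originals: List[Vector], reconstructions: List[Vector]) -> float:
    """Mean over samples of the squared L2 distance between x and its reconstruction."""
    total = 0.0
    for x, x_hat in zip(originals, reconstructions):
        total += sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return total / len(originals)


x = [[1.0, 2.0], [0.0, 1.0]]
x_hat = [[1.0, 1.0], [0.0, 3.0]]  # stand-in for decoder outputs
loss = reconstruction_loss(x, x_hat)  # (1 + 4) / 2 = 2.5
print(fuse([0.5], [0.1, 0.2]), loss)
```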

For the extracted condition-invariant features $F_{i}^{di, 0}$, a discriminator $D_{di}$ is employed to perform operating-condition classification, aiming to identify the operating condition under which each feature was obtained. The discriminator loss is computed using the cross-entropy function as

$$L_{D_{di}} = - \frac{1}{n_{0}} \sum_{i = 1}^{n_{0}} \ell(\hat{d}_{i}^{di, 0}, d_{i}^{0}) \tag{2}$$

where $\hat{d}_{i}^{di, 0} = D_{di}(F_{i}^{di, 0})$ denotes the discriminator output, $d_{i}^{0}$ is the ground-truth operating-condition label of sample $x_{i}^{0}$ (written $z_{i}^{0}$ in Section 2.1), and $\ell(\cdot, \cdot)$ represents the cross-entropy loss function. The condition-invariant features are intended to capture fault-discriminative semantics shared across different operating conditions and should ideally not contain information that can be used to distinguish specific operating conditions. To explicitly suppress potential operating-condition discriminative capability in condition-invariant features, a gradient reversal layer (GRL) is introduced between the feature extraction module and the operating-condition discriminator $D_{di}$. Through adversarial optimization, the feature extractor is guided to maximize the operating-condition classification error while minimizing the task-related losses, thereby progressively removing operating-condition-related components from $F^{di}$.
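The behavior of a gradient reversal layer can be sketched without any deep learning framework: the forward pass is the identity, and the backward pass flips the sign of the gradient arriving from the discriminator before it reaches the feature extractor. The class below is a minimal, hypothetical illustration; the scaling coefficient `lam` is an assumption (the text does not state one), and in practice this logic would live inside an automatic-differentiation framework.

```python
from typing import List


class GradientReversal:
    """Identity in the forward pass; multiplies gradients by -lam in the backward pass."""

    def __init__(self, lam: float = 1.0) -> None:
        self.lam = lam  # reversal strength (assumed; not given in the text)

    def forward(self, features: List[float]) -> List[float]:
        # Features pass through unchanged to the condition discriminator.
        return list(features)

    def backward(self, grad_from_discriminator: List[float]) -> List[float]:
        # The extractor receives the negated gradient, so a step that lowers the
        # discriminator loss for the discriminator raises it for the extractor.
        return [-self.lam * g for g in grad_from_discriminator]


grl = GradientReversal(lam=0.5)
out = grl.forward([0.2, -1.0])
grad = grl.backward([4.0, -2.0])
print(out, grad)
```

With this arrangement the discriminator still learns to classify operating conditions, while the branch producing the condition-invariant features is pushed to make that classification harder.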

For the condition-specific features $F_{i}^{ds, 0}$, a separate discriminator $D_{ds}$ is employed to perform operating-condition classification, and the corresponding loss is also computed using cross-entropy:

$$L_{D_{ds}} = - \frac{1}{n_{0}} \sum_{i = 1}^{n_{0}} \ell(\hat{d}_{i}^{ds, 0}, d_{i}^{0}) \tag{3}$$

Unlike $D_{di}$, the discriminator $D_{ds}$ does not involve a gradient reversal layer. Its objective is to enhance the separability of condition-specific features, enabling accurate discrimination among samples from different operating conditions and facilitating more effective capture of fault-related information that is unique to each operating condition.

To ensure that condition-invariant features retain sufficient fault class discriminative capability while suppressing operating-condition information, a category-level discriminative supervision is further introduced on the condition-invariant feature branch. After feature disentanglement, the condition-invariant feature $F_{i}^{di, 0}$ is fed into a fault classification head $C$ to predict the fault category of each sample, yielding the output $\hat{y}_{i}^{0} = C(F_{i}^{di, 0}; \theta_{C}^{0})$.

The corresponding classification loss is defined as

$$L_{cls} = - \frac{1}{n_{0}} \sum_{i = 1}^{n_{0}} \ell(\hat{y}_{i}^{0}, y_{i}^{0}) \tag{4}$$

where $y_{i}^{0}$ denotes the ground-truth fault class label of sample $x_{i}^{0}$, and $\ell(\cdot, \cdot)$ represents the cross-entropy loss function.

By minimizing the classification loss $L_{cls}$, the model is explicitly constrained to form a well-separated and discriminative class structure in the condition-invariant feature space, thereby preventing fault-related discriminative information from being weakened during the suppression of operating-condition-related representations.
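The cross-entropy terms used by the two discriminators and the fault classification head can be illustrated with a small softmax cross-entropy routine. This sketch assumes the heads output unnormalized logits and returns the non-negative quantity that is minimized (the negative log-likelihood of the true label), which is one reading of Equations (2) to (4). The function names and toy logits are hypothetical.

```python
import math
from typing import List


def cross_entropy(logits: List[float], label: int) -> float:
    """Softmax cross-entropy for one sample, using the log-sum-exp trick for stability."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[label]


def mean_cross_entropy(batch_logits: List[List[float]], labels: List[int]) -> float:
    """Average the per-sample cross-entropy over a batch."""
    losses = [cross_entropy(logits, y) for logits, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


batch_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # stand-in classifier outputs
labels = [0, 2]
loss = mean_cross_entropy(batch_logits, labels)
print(round(loss, 4))
```

The same routine applies to fault labels for the classification head and to operating-condition labels for either discriminator; only the labels and the feature branch feeding the head differ.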

2.2.2. Prototype Construction

In the proposed continual learning-based fault diagnosis model, three types of prototypes are designed, including the Class Prototype based on condition-invariant features, the Condition Prototype based on condition-specific features, and the Condition-aware Class Prototype based on fused features. During the construction process, all three types of prototypes are required to maintain intra-prototype compactness through pull constraints and inter-prototype separability through push constraints. To avoid repetitive definitions in subsequent sections, a unified formulation of prototype constraints is first presented, which can be directly instantiated for different prototype types by substituting the corresponding variables.

Let the prototype type be denoted as $P \in \{di, ds, df\}$, and the corresponding prototype set be represented as $\{c_{m}^{P}\}_{m = 1}^{M_{P}}$, where $M_{P}$ denotes the number of prototypes of type $P$. Specifically, when $P = di$, $M_{P} = K$ corresponds to the number of fault classes; when $P = ds$, $M_{P} = S$ corresponds to the number of operating conditions; and when $P = df$, $M_{P}$ corresponds to the number of class-condition combinations.

Let $F_{i}^{P}$ denote the feature representation of sample $x_{i}$ in the feature space associated with prototype type $P$, and let $y_{i}^{P}$ denote the index of the corresponding prototype in the set $\{c_{m}^{P}\}_{m = 1}^{M_{P}}$. The value of $y_{i}^{P}$ depends on the definition of $P$ and corresponds to the fault class label, the operating-condition label, or the class-condition combination label, respectively. Under this formulation, the pull constraint enforcing intra-prototype aggregation and the push constraint enforcing inter-prototype separation can be expressed as

$$L_{pull}^{P} = \frac{1}{N} \sum_{i = 1}^{N} \| F_{i}^{P} - c_{y_{i}^{P}}^{P} \|_{2}^{2} \tag{5}$$

$$L_{push}^{P} = \frac{1}{M_{P} (M_{P} - 1)} \sum_{\substack{p, q = 1 \\ p \neq q}}^{M_{P}} \max\left(0, \delta - \| c_{p}^{P} - c_{q}^{P} \|_{2}\right)^{2} \tag{6}$$

where $N$ denotes the number of samples over which the loss is computed, $p$ and $q$ denote two distinct prototype indices in the set $\{c_{m}^{P}\}_{m = 1}^{M_{P}}$, and $\delta$ is a predefined minimum margin threshold. The loss term $L_{pull}^{P}$ encourages samples to be pulled toward their corresponding prototypes, thereby enhancing intra-class compactness, while $L_{push}^{P}$ pushes different prototypes apart to improve inter-class separability. By jointly optimizing these two loss terms, consistent constraints across multiple prototype types can be achieved within a unified framework. In subsequent descriptions of specific prototype types, the general formulation in Equations (5) and (6) can be directly applied by instantiating $P$ with the corresponding feature space and prototype category, where $di$ denotes class prototypes, $ds$ denotes condition prototypes, and $df$ denotes condition-aware class prototypes.
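The unified pull and push constraints in Equations (5) and (6) can be sketched in plain Python as follows. The function names, the toy features, and the margin values are hypothetical; the sketch only evaluates the two losses for given features and prototypes and does not perform any optimization.

```python
import math
from typing import List

Vector = List[float]


def l2_distance(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def pull_loss(features: List[Vector], proto_index: List[int],
              prototypes: List[Vector]) -> float:
    """Equation (5): mean squared distance from each feature to its own prototype."""
    total = sum(l2_distance(f, prototypes[m]) ** 2
                for f, m in zip(features, proto_index))
    return total / len(features)


def push_loss(prototypes: List[Vector], delta: float) -> float:
    """Equation (6): squared hinge on prototype pairs closer than the margin delta."""
    m = len(prototypes)
    total = 0.0
    for p in range(m):
        for q in range(m):
            if p != q:
                gap = delta - l2_distance(prototypes[p], prototypes[q])
                total += max(0.0, gap) ** 2
    return total / (m * (m - 1))


prototypes = [[0.0, 0.0], [3.0, 4.0]]       # two prototypes, 5 units apart
features = [[1.0, 0.0], [3.0, 4.0]]
proto_index = [0, 1]                        # y_i^P for each feature

print(pull_loss(features, proto_index, prototypes))  # (1 + 0) / 2
print(push_loss(prototypes, delta=6.0))              # margin violated by 1
print(push_loss(prototypes, delta=2.0))              # margin satisfied
```

The push term is zero whenever every pair of prototypes is at least $\delta$ apart, so it only acts on prototypes that have drifted too close together.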

Construction of Class Prototypes Based on Condition-Invariant Features

The condition-invariant features $F_{i}^{di, 0}$ primarily contain fault-discriminative semantic information that is shared across different operating conditions. In the initial data domain $D_{0}$, samples belong to $K$ fault classes, and the fault label set is denoted as $Y = \{y_{1}, y_{2}, \dots, y_{K}\}$. Accordingly, the class prototype vector for each fault class is defined as

$$c_{k}^{di, 0} = \frac{1}{| \Omega_{k}^{0} |} \sum_{x_{i}^{0} \in \Omega_{k}^{0}} F_{i}^{di, 0}, \quad \Omega_{k}^{0} = \{x_{i}^{0} \mid y_{i}^{0} = y_{k}\} \tag{7}$$

Here, $\Omega_{k}^{0}$ denotes the set of all samples in $D_{0}$ whose fault label is $y_{k}$.

To ensure scale consistency among prototypes of different classes and to avoid distance measurement bias caused by variations in feature magnitudes, each class prototype vector is further normalized using $L_{2}$ normalization after computation:

$$c_{k}^{di, 0} \leftarrow \frac{c_{k}^{di, 0}}{\| c_{k}^{di, 0} \|_{2}} \tag{8}$$

In this expression, $\| c_{k}^{di, 0} \|_{2} = \sqrt{\sum_{j = 1}^{d} (c_{k, j}^{di, 0})^{2}}$ denotes the $L_{2}$ norm of vector $c_{k}^{di, 0}$, where $d$ is the feature dimensionality. This normalization operation is consistently applied during both the initialization stage and subsequent prototype updating stages, thereby ensuring that prototype representations remain within a unified metric space throughout the entire continual learning process.
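Equations (7) and (8) amount to a per-class mean followed by $L_{2}$ normalization. A minimal sketch is given below; the function name `class_prototypes`, the toy features, and the zero-norm guard are assumptions added for illustration and are not taken from the paper.

```python
import math
from collections import defaultdict
from typing import Dict, List

Vector = List[float]


def class_prototypes(features: List[Vector], labels: List[int]) -> Dict[int, Vector]:
    """Per-class mean of condition-invariant features (Eq. 7), L2-normalized (Eq. 8)."""
    groups: Dict[int, List[Vector]] = defaultdict(list)
    for feature, label in zip(features, labels):
        groups[label].append(feature)

    prototypes: Dict[int, Vector] = {}
    for label, members in groups.items():
        mean = [sum(column) / len(members) for column in zip(*members)]
        norm = math.sqrt(sum(v * v for v in mean))
        # Guard against a zero vector; this safeguard is an assumption of the sketch.
        prototypes[label] = [v / norm for v in mean] if norm > 0 else mean
    return prototypes


features = [[3.0, 0.0], [3.0, 8.0], [0.0, 2.0]]  # stand-in condition-invariant features
labels = [0, 0, 1]
protos = class_prototypes(features, labels)
print(protos)  # class 0 mean is [3, 4], which normalizes to [0.6, 0.8]
```

After normalization every prototype has unit length, so distances between prototypes reflect direction in the feature space rather than feature magnitude.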

Moreover, to enhance the discriminative capability of class prototypes, both pull and push constraints are imposed on the class prototypes. Specifically, by substituting $P = di$ into Equations (5) and (6), the corresponding pull loss and push loss for class prototypes can be directly obtained, which jointly enforce intra-class compactness and inter-class separability.

Construction of Condition Prototypes Based on Condition-Specific Features

The condition-specific features $F_{i}^{ds, 0}$ primarily reflect information that is specific to the current operating condition. In the initial data domain $D_{0}$, samples are collected under $S$ operating conditions, and the operating-condition label set is denoted as $Z = \{z_{1}, z_{2}, \dots, z_{S}\}$. For notation consistency, all operating-condition prototypes in this study are directly indexed by operating-condition labels, such that each condition label $z_{s} \in Z$ uniquely corresponds to a condition prototype $c_{z_{s}}^{ds}$. Accordingly, for each operating condition $z_{s}$, the corresponding condition prototype is defined as:

$$c_{z_{s}}^{ds} = \frac{1}{| \Omega_{z_{s}}^{0} |} \sum_{x_{i}^{0} \in \Omega_{z_{s}}^{0}} F_{i}^{ds, 0}, \quad \Omega_{z_{s}}^{0} = \{x_{i}^{0} \mid z_{i}^{0} = z_{s}\} \tag{9}$$

Here, $\Omega_{z_{s}}^{0}$ denotes the set of samples collected under operating condition $z_{s}$, and $z_{i}^{0}$ represents the operating-condition label associated with sample $x_{i}^{0}$.

Similar to class prototypes, pull and push constraints are also imposed on condition prototypes to enhance their discriminative capability. In this case,

P = ds

in Equations (5) and (6). These constraints ensure that samples collected under the same operating condition are more compactly clustered, while condition prototypes corresponding to different operating conditions remain sufficiently separated, thereby improving discrimination among operating conditions.

Construction of Condition-Aware Class Prototypes Based on Fused Features

The fused feature representation

F_{i}^{df, 0} = [F_{i}^{di, 0} \oplus F_{i}^{ds, 0}]

jointly contains condition-invariant shared information and condition-specific information associated with the current operating condition. To enable more accurate discrimination of different fault types under different operating conditions in subsequent stages, a condition-aware class prototype is constructed for each fault class

y_{k}

under operating condition

z_{s}

as

c_{k, z_{s}}^{df} = \frac{1}{| Ω_{k, z_{s}}^{0} |} \sum_{x_{i}^{0} \in Ω_{k, z_{s}}^{0}} F_{i}^{df, 0}, Ω_{k, z_{s}}^{0} = {x_{i}^{0} ∣ y_{i}^{0} = y_{k}, z_{i}^{0} = z_{s}}

(10)

Here,

Ω_{k, z_{s}}^{0}

denotes the set of samples in

D_{0}

that belong to fault class

y_{k}

and are collected under operating condition

z_{s}

.

Similarly, by setting

P = df

, the pull and push constraints defined in Equations (5) and (6) are applied to condition-aware class prototypes. These constraints promote compact clustering of samples belonging to the same fault class under the same operating condition, while maintaining clear separation from other fault classes within the same operating condition.

Through the above three prototype construction procedures, all prototypes exhibit strong intra-class compactness and inter-class separability at the initial stage. Consequently, when entering the continual learning stage, newly introduced operating conditions or samples can be effectively incorporated through unified prototype-driven knowledge transfer and updating, thereby enabling long-term fault diagnosis capability across operating conditions.
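The three constructions in Equations (7), (9), and (10) share one pattern: a mean over the samples that match a grouping key. The following is a minimal NumPy sketch of that pattern, not the authors' implementation. The helper names (`l2_normalize`, `group_prototypes`) and the toy data are hypothetical, and applying the normalization of Equation (8) to all three prototype types is an assumption made here for uniformity.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit L2 norm (Eq. 8); eps guards against a zero vector."""
    return v / max(np.linalg.norm(v), eps)

def group_prototypes(features, keys, normalize=True):
    """Mean feature vector per distinct key (pattern of Eqs. 7, 9, and 10)."""
    protos = {}
    for key in set(keys):
        mask = np.array([k == key for k in keys])
        proto = features[mask].mean(axis=0)
        protos[key] = l2_normalize(proto) if normalize else proto
    return protos

# Toy data (illustrative only): 6 samples, 2 fault classes, 2 operating conditions.
rng = np.random.default_rng(0)
f_di = rng.normal(size=(6, 4))               # condition-invariant features F^{di,0}
f_ds = rng.normal(size=(6, 3))               # condition-specific features F^{ds,0}
f_df = np.concatenate([f_di, f_ds], axis=1)  # fused features F^{df,0}
y = [0, 0, 1, 1, 0, 1]                       # fault labels y_i
z = ["z1", "z1", "z1", "z2", "z2", "z2"]     # operating-condition labels z_i

class_protos = group_prototypes(f_di, y)                # keyed by fault class
cond_protos = group_prototypes(f_ds, z)                 # keyed by operating condition
aware_protos = group_prototypes(f_df, list(zip(y, z)))  # keyed by (class, condition)
```

Keying the condition-aware prototypes by the pair (class, condition) mirrors the index set of Equation (10), so a pair that has no samples simply has no prototype.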

It should be noted that the condition-invariant features, condition-specific features, and their corresponding three types of prototypes are not independently stacked functional modules. Instead, they are jointly designed to address the core trade-off between cross-condition generalization and condition adaptability in continual learning scenarios. Specifically, condition-invariant features and class prototypes provide shared discriminative references across operating conditions to maintain long-term semantic stability; condition-specific features and condition prototypes characterize operating-condition differences to prevent disruption of the global discriminative structure during adaptation to new conditions; and fused features together with condition-aware class prototypes offer localized discriminative supplementation when cross-condition fault discrimination becomes ambiguous. Together, these components constitute a hierarchical representation framework that balances global consistency and local adaptability, which is essential for the stable operation of the proposed continual learning framework.

2.3. Sample Selection Mechanism in the Continual Learning Stage

After completing the initial training stage, the diagnostic model has learned relatively stable feature representations and fault-discriminative capability under a limited set of operating conditions. When entering the continual learning stage with

t > 0

, the system gradually receives a small number of labeled fault samples collected under newly introduced operating conditions. As a result, the model is required to expand its operating-condition coverage while maintaining diagnostic performance on previously learned conditions. Under such circumstances, how to effectively integrate knowledge from new and historical operating conditions under limited sample availability, and how to suppress performance degradation caused by stage-wise updates, become the core challenges in the continual learning stage.

To address these issues, a condition-similarity-driven sample selection mechanism is introduced without altering the overall model architecture. This mechanism selects representative and discriminative samples from historical operating conditions and jointly incorporates them with samples from the current operating condition during model updating, thereby enabling stable knowledge transfer and sustained preservation of discriminative structures throughout multi-stage continual learning.

2.3.1. Operating-Condition Similarity Measurement

During the continual learning process, the degree of relevance between different operating conditions directly affects the effectiveness of sample selection and the recognition performance for unknown fault patterns. When two operating conditions exhibit high similarity in feature distributions, knowledge transfer between them is more likely to be beneficial; conversely, when operating conditions differ significantly, more independent discriminative information should be preserved. Therefore, constructing a reasonable operating-condition similarity metric not only guides efficient sample selection from the memory buffer but also provides auxiliary information for unknown fault discrimination. In this study, condition-specific features are adopted as the basis for similarity computation, as they effectively characterize environment- and load-related attributes unique to each operating condition and better reflect inter-condition differences than condition-invariant features.

Assume that at time step t, the system has cumulatively encountered S historical operating conditions, with the corresponding condition label set denoted as

Z^{t} = {z_{1}, z_{2}, \dots, z_{S}}

. For a given operating condition

z_{s}

, the corresponding sample subset is represented as

Ω_{s}^{t} = {x_{i}^{t} ∣ z_{i}^{t} = z_{s}} \subset D_{t}

, and the associated condition-specific feature set is given by

{F_{i}^{ds, t} ∣ x_{i}^{t} \in Ω_{s}^{t}}

. Based on these definitions, the similarity between different operating conditions is evaluated from two complementary perspectives: statistical distribution discrepancy and geometric structure consistency in the feature space, in order to characterize the relevance between newly introduced and historical operating conditions.

From the statistical distribution perspective, this study focuses on the overall distributional differences of condition-specific features across operating conditions. To obtain stable and analytically tractable distribution representations under limited sample availability, a second-order statistical approximation based on multivariate Gaussian modeling is adopted to characterize the overall distributional patterns of condition-specific features under different operating conditions. It should be emphasized that this modeling strategy is introduced to obtain an analytically computable representation of distribution discrepancy, rather than imposing a strict Gaussian assumption on the true feature distributions. Under this approximation, the condition-specific feature distributions associated with operating conditions

z_{i}

and

z_{j}

are represented as

Q^{z_{i}} \sim N (μ_{z_{i}}, Σ_{z_{i}})

and

Q^{z_{j}} \sim N (μ_{z_{j}}, Σ_{z_{j}})

, respectively, where

μ

and

Σ

denote the mean vector and covariance matrix of the features. Based on this second-order representation, the 2-Wasserstein distance is adopted as the distribution-level similarity metric to quantify statistical discrepancies between operating conditions. Compared with information-theoretic divergence measures, the 2-Wasserstein distance possesses symmetry and favorable geometric interpretability, enabling joint characterization of distributional differences in both mean displacement and covariance structure. This property makes it particularly suitable for capturing cross-condition distribution evolution in continual learning scenarios. The closed-form expression of the squared 2-Wasserstein distance between the two Gaussian distributions is given as

D_{W_{2}}^{2} (Q^{z_{i}}, Q^{z_{j}}) = {∥ μ_{z_{i}} - μ_{z_{j}} ∥}_{2}^{2} + tr (Σ_{z_{i}} + Σ_{z_{j}} - 2 {(Σ_{z_{i}}^{\frac{1}{2}} Σ_{z_{j}} Σ_{z_{i}}^{\frac{1}{2}})}^{\frac{1}{2}})

(11)

Here,

tr (\cdot)

denotes the trace operator of a matrix, and

{(\cdot)}^{1 / 2}

represents the principal matrix square root.

Since the 2-Wasserstein distance is essentially a discrepancy metric, the above distributional discrepancy is further mapped into a similarity form using an exponential decay function:

S_{W} (z_{i}, z_{j}) = \exp (- α_{W} \cdot D_{W_{2}}^{2} (Q^{z_{i}}, Q^{z_{j}}))

(12)

where

α_{W} = 0.01

is a scaling factor that controls the decay rate of similarity with respect to distributional discrepancy. The resulting similarity value lies in the interval

(0, 1]

, and smaller distributional differences correspond to higher statistical similarity between operating conditions.
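Equations (11) and (12) can be evaluated directly from sample statistics. The sketch below is one possible NumPy realization under stated assumptions: the covariance estimator (`np.cov`, which uses the unbiased 1/(n-1) normalization) is a choice made here because the text does not specify one, the function names are hypothetical, and the matrix square root is computed by eigendecomposition, which is valid for symmetric positive semi-definite matrices.

```python
import numpy as np

def psd_sqrt(mat):
    """Principal square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh((mat + mat.T) / 2.0)  # symmetrize against round-off
    vals = np.clip(vals, 0.0, None)                   # remove tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def w2_squared(mu_i, cov_i, mu_j, cov_j):
    """Squared 2-Wasserstein distance between two Gaussians (Eq. 11)."""
    root_i = psd_sqrt(cov_i)
    cross = psd_sqrt(root_i @ cov_j @ root_i)
    mean_term = float(np.sum((mu_i - mu_j) ** 2))
    cov_term = float(np.trace(cov_i + cov_j - 2.0 * cross))
    return mean_term + max(cov_term, 0.0)  # clamp is a numerical guard, not part of Eq. 11

def similarity_w(feats_i, feats_j, alpha_w=0.01):
    """Distribution-level similarity S_W (Eq. 12) from two (n, d) feature arrays."""
    mu_i, mu_j = feats_i.mean(axis=0), feats_j.mean(axis=0)
    cov_i = np.cov(feats_i, rowvar=False)  # assumed estimator
    cov_j = np.cov(feats_j, rowvar=False)
    return float(np.exp(-alpha_w * w2_squared(mu_i, cov_i, mu_j, cov_j)))
```

Two quick sanity checks follow from Equation (11): for identity covariances the distance reduces to the squared mean displacement, and for equal means with covariances I and 4I in d dimensions the trace term equals d.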

The similarity metric based on the 2-Wasserstein distance captures global distributional differences of condition-specific features in terms of both mean location and covariance structure, providing a stable distribution-level reference for subsequent fine-grained similarity constraints based on structural consistency.

However, relying solely on distribution-level discrepancy is insufficient to fully characterize geometric structural variations of condition-specific features in the feature space across operating conditions. Although distribution-level similarity reflects differences in mean position and overall dispersion, it has limited sensitivity to the consistency of principal variation directions and internal correlation structures. To address this limitation, geometric structure-level similarity is further introduced to complement the distribution-based similarity measure.

Considering that the covariance matrix

Σ_{z_{s}}

implicitly encodes feature correlations and principal variation directions under operating condition

z_{s}

, structural similarity between operating conditions is assessed by analyzing the consistency of principal subspaces. Specifically, eigen-decomposition is performed on the covariance matrix

Σ_{z_{s}}

, and the top r eigenvectors are selected to form an orthogonal basis matrix

U_{z_{s}} = [u_{1}^{z_{s}}, u_{2}^{z_{s}}, \dots, u_{r}^{z_{s}}]

, where r denotes the dimensionality of the principal subspace and characterizes the dominant variation directions of condition-specific features under operating condition

z_{s}

.

Based on this definition, the Frobenius distance between subspace projection matrices is adopted to measure structural discrepancy between two operating conditions, which is defined as

D_{sub} (z_{i}, z_{j}) = {∥U_{z_{i}} U_{z_{i}}^{⊤} - U_{z_{j}} U_{z_{j}}^{⊤}∥}_{F}

(13)

Here,

{∥ \cdot ∥}_{F}

denotes the Frobenius norm. This distance effectively reflects the consistency of principal variation directions and correlation structures between two operating conditions, independent of specific sample ordering.

To maintain consistency with the distribution-level similarity formulation, an exponential decay function is also employed to map the structural discrepancy into a structural similarity metric:

S_{sub} (z_{i}, z_{j}) = \exp (- β_{sub} \cdot D_{sub} (z_{i}, z_{j}))

(14)

where

β_{sub} = 1.0

is a scaling factor that controls the sensitivity of structural similarity to geometric discrepancy. The resulting similarity value lies in

(0, 1]

, with higher similarity corresponding to greater consistency of principal subspaces.
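The structural similarity of Equations (13) and (14) can be sketched in a few lines. The code below is illustrative only: the function names are hypothetical, and the subspace dimension r is left as an argument because its value is not stated in this section.

```python
import numpy as np

def principal_basis(cov, r):
    """Top-r eigenvectors of a covariance matrix as the columns of U."""
    vals, vecs = np.linalg.eigh(cov)       # eigh returns eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:r]     # indices of the r largest eigenvalues
    return vecs[:, order]

def subspace_distance(cov_i, cov_j, r):
    """Frobenius distance between the two projection matrices U U^T (Eq. 13)."""
    u_i, u_j = principal_basis(cov_i, r), principal_basis(cov_j, r)
    return float(np.linalg.norm(u_i @ u_i.T - u_j @ u_j.T, ord="fro"))

def similarity_sub(cov_i, cov_j, r, beta_sub=1.0):
    """Structure-level similarity S_sub (Eq. 14)."""
    return float(np.exp(-beta_sub * subspace_distance(cov_i, cov_j, r)))
```

Because both projection matrices have rank r, the squared distance equals 2r minus twice the trace of their product, so the distance never exceeds sqrt(2r). With the exponential mapping, S_sub is therefore bounded below by exp(-β_sub · sqrt(2r)) for a fixed r.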

It should be emphasized that structural similarity based on principal subspace consistency is not a direct replacement for statistical distribution similarity, but rather a complementary characterization from a geometric perspective. In practice, when feature distributions under operating conditions are relatively stable, distribution-level similarity alone may provide reliable similarity estimation; however, in scenarios involving complex condition evolution or significant structural changes in feature representations, incorporating structural similarity improves the robustness of the overall similarity assessment.

By jointly considering statistical distribution discrepancy and structural consistency, the overall similarity between operating conditions

z_{i}

and

z_{j}

is finally defined as

S_{τ} (z_{i}, z_{j}) = S_{W} (z_{i}, z_{j}) \times S_{sub} (z_{i}, z_{j})

(15)

This formulation simultaneously accounts for distributional consistency and structural pattern matching, such that a larger value of

S_{τ} (z_{i}, z_{j})

directly indicates higher similarity between operating conditions. In practical implementation, the similarity relationships among all historical operating conditions are organized in matrix form to construct an operating-condition similarity graph:

S_{G} = [\begin{matrix} 0 & S_{0, 1}^{τ} & \dots & S_{0, S}^{τ} \\ 0 & 0 & \dots & S_{1, S}^{τ} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ 0 & 0 & \dots & 0 \end{matrix}]

(16)

where

S_{i, j}^{τ}

denotes the similarity value between operating conditions

z_{i}

and

z_{j}

. This matrix provides the basis for subsequent sample selection and unknown fault inference.
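As a small illustration of Equations (15) and (16), the fused similarity is an element-wise product, and the matrix form stores each unordered pair once in the strictly upper triangle. The sketch below assumes the two pairwise similarity matrices have already been computed; `similarity_graph` and `lookup` are hypothetical helper names.

```python
import numpy as np

def similarity_graph(s_w, s_sub):
    """Element-wise product (Eq. 15), keeping only the strictly upper triangle (Eq. 16)."""
    s_tau = np.asarray(s_w) * np.asarray(s_sub)
    return np.triu(s_tau, k=1)

def lookup(s_g, i, j):
    """Read the similarity of an unordered pair (i, j), i != j, from the triangular storage."""
    return s_g[min(i, j), max(i, j)]
```

Note that the diagonal of this matrix is zero by construction, so it should not be read as a self-similarity; under Equations (12) and (14) the similarity of a condition with itself is 1.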

2.3.2. Sample Selection Strategy

Based on the fused similarity measurement from statistical distribution and feature-space structure, the operating-condition similarity matrix

S_{G}

is obtained, which effectively reveals the relationships among operating conditions in terms of both feature distribution and structural characteristics. Leveraging this similarity information, a sample selection mechanism is constructed to facilitate knowledge transfer and suppress operating-condition forgetting during continual learning.

Existing continual learning methods typically perform replay sample selection by uniformly or randomly sampling from the memory buffer of all historical operating conditions, without explicitly considering condition relevance. Although such indiscriminate replay strategies can preserve historical knowledge to some extent, they exhibit clear limitations in terms of transfer efficiency and discriminative boundary maintenance. On the one hand, replay samples may contain an excessive number of irrelevant samples, diluting knowledge that is most relevant to the current task. On the other hand, if replay samples are overly concentrated on similar conditions, information from distant operating conditions may be insufficiently retained, leading to inter-class confusion and semantic collapse. To address these issues, a proportion-adaptive Top–Bottom hybrid sample selection strategy based on operating-condition similarity ranking is proposed, which maintains consistency and applicability across different scales of historical operating conditions.

Let the set of historical operating conditions at time step t be denoted as

Z_{his}^{t} = Z^{t} ∖ {z_{t}}

, with

H_{t} = |Z_{his}^{t}|

representing the number of available historical operating conditions. Based on the similarity ranking results, two subsets corresponding to the most similar and the most dissimilar operating conditions are selected from the historical condition set to participate in sample selection:

Top- $M_{sim}^{t}$ similar condition set: The $M_{sim}^{t}$ historical operating conditions with the highest similarity to the current condition $z_{t}$ are selected to form the similar-condition subset. Here, $M_{sim}^{t} = ⌈ρ_{top} \cdot H_{t}⌉$ , where $ρ_{top} \in (0, 1)$ is the proportion coefficient for similar operating conditions. Samples from this subset exhibit condition-specific feature distributions that are closer to the current operating condition, which helps improve knowledge transfer efficiency and accelerates alignment between new and historical feature spaces.
Bottom- $M_{far}^{t}$ dissimilar condition set: The $M_{far}^{t}$ historical operating conditions with the lowest similarity to the current condition $z_{t}$ are selected to form the dissimilar-condition subset. Here, $M_{far}^{t} = ⌈ρ_{bot} \cdot H_{t}⌉$ , where $ρ_{bot} \in (0, 1)$ is the proportion coefficient for dissimilar operating conditions. Samples from this subset provide long-range contrastive signals during the distillation process, which helps maintain the integrity of discriminative boundaries and suppress feature-space collapse.

To avoid extreme cases where the number of selected conditions is insufficient or the computational cost becomes excessive, upper and lower bounds are imposed on the values of

M_{sim}^{t}

and

M_{far}^{t}

in practice, while ensuring that

M_{sim}^{t} + M_{far}^{t} \leq H_{t}

. When the number of historical operating conditions is small, the proposed strategy naturally degenerates into utilizing all available historical conditions, thereby ensuring stability and robustness in the early stages of continual learning. As the number of historical operating conditions gradually increases, the sample selection mechanism progressively focuses on operating-condition subsets that are most relevant and most discriminative with respect to the current task.
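The condition-level selection described above can be sketched as follows. This is a hedged illustration rather than the authors' code: the proportion coefficients, the bounds `m_min` and `m_max`, and the rule used to enforce the constraint that the two counts sum to at most H_t (here, the dissimilar set is shrunk first) are all assumptions, since their concrete values are not given in this section.

```python
import math

def select_conditions(similarity, rho_top=0.25, rho_bot=0.25, m_min=1, m_max=5):
    """Pick Top-M_sim and Bottom-M_far historical conditions.

    similarity: dict mapping each historical condition label to S_tau(z_t, z_s).
    Returns (top, bottom) as two lists of labels.
    """
    h_t = len(similarity)
    if h_t == 0:
        return [], []

    def clip(m):
        # Hypothetical lower/upper bounds on the number of selected conditions.
        return max(m_min, min(m_max, m))

    ranked = sorted(similarity, key=similarity.get, reverse=True)  # most similar first
    m_sim = min(clip(math.ceil(rho_top * h_t)), h_t)
    m_far = min(clip(math.ceil(rho_bot * h_t)), h_t - m_sim)  # keeps m_sim + m_far <= h_t
    return ranked[:m_sim], ranked[h_t - m_far:]
```

With these illustrative settings, one or two historical conditions are all selected, which matches the degenerate early-stage behavior described in the text; with eight historical conditions, two similar and two dissimilar conditions are chosen.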

After completing condition-level selection, replay data are further constructed in a controlled manner at the sample level. At time step t, let the sample set collected under the current operating condition be denoted as

D_{t}

, with sample size

| D_{t} |

. To balance the gradient contributions of new and historical samples during model updating, the total number of replay samples is set to

| R | = | D_{t} |

and is evenly allocated among the selected historical operating conditions. Within each selected historical condition, a fixed number of representative samples are randomly sampled from the corresponding sample pool to participate in subsequent feature distillation and model updating. In this way, the replay sample size is kept under control while enabling stable retention and effective transfer of historical knowledge across multiple operating conditions.
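A minimal sketch of this sample-level allocation is given below. It assumes that each historical condition keeps a pool of stored samples as a NumPy array; the function name, the handling of the remainder when |D_t| is not divisible by the number of selected conditions (the remainder is dropped here), and the cap at the pool size are assumptions.

```python
import numpy as np

def build_replay(pools, selected, n_current, seed=0):
    """Draw an equal share of replay samples from each selected historical condition.

    pools: dict mapping condition label -> array of stored samples (first axis = samples).
    selected: labels chosen at the condition level (similar and dissimilar sets combined).
    n_current: |D_t|, which is also the target replay size |R|.
    """
    if not selected:
        return {}
    rng = np.random.default_rng(seed)
    quota = n_current // len(selected)  # even split; any remainder is dropped (assumption)
    replay = {}
    for label in selected:
        pool = pools[label]
        take = min(quota, len(pool))    # a pool smaller than the quota is used in full
        idx = rng.choice(len(pool), size=take, replace=False)
        replay[label] = pool[idx]
    return replay
```

For example, with |D_t| = 30 and three selected conditions, each condition contributes 10 samples, so the replay set matches the size of the new-condition data.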

2.4. Prototype Updating and Knowledge Distillation in the Continual Learning Stage

During the continual learning stage, the model is required to gradually expand its operating-condition coverage while maintaining stable discrimination of historical operating conditions and fault categories under the constraint of only a small number of newly introduced samples. To this end, two complementary mechanisms are introduced in this stage. On the one hand, structured prototype updating is employed to adaptively adjust representation centers of features under both new and historical operating conditions, thereby preserving the overall discriminative structure of the feature space. On the other hand, knowledge distillation constraints based on replay samples are incorporated to restrict excessive parameter and feature distribution drift during stage-wise updates, effectively suppressing catastrophic forgetting.

In this work, the sample set obtained through operating-condition similarity ranking in the previous stage—including samples from the current operating condition, Top-

M_{sim}^{t}

similar operating conditions, and Bottom-

M_{far}^{t}

dissimilar operating conditions—is used as a unified data source for both prototype updating and distillation constraints at the current stage. This design ensures that the two mechanisms operate cooperatively under the same sample context. Specifically, the prototype updating mechanism focuses on constructing robust representation centers in the condition-invariant, condition-specific, and fused feature spaces, providing structural anchoring for cross-condition fault discrimination. In contrast, the knowledge distillation mechanism imposes soft constraints on historical model behavior, thereby preserving learned knowledge while incorporating new operating-condition information. The prototype updating strategy is introduced first, and the overall procedure of this stage is illustrated in Figure 2.

2.4.1. Prototype Updating for Condition-Invariant Features

At time step t, the system receives labeled samples collected under the newly introduced operating condition

z_{t}

, which are jointly used with replay samples selected from the previous stage for stable updating and knowledge retention. For updating class prototypes in the condition-invariant feature space, samples from the new operating condition are treated as incremental information. Condition-invariant features

F_{i}^{di, t}

are extracted through the feature encoder, and class-wise aggregation is performed to compute the mean representation of condition-invariant features as

{\bar{F}}_{k}^{di, t} = \frac{1}{| Ω_{k}^{t} |} \sum_{x_{i} \in Ω_{k}^{t}} F_{i}^{di, t}, Ω_{k}^{t} = {x_{i}^{t} ∣ y_{i}^{t} = y_{k}}

(17)

Here,

Ω_{k}^{t}

denotes the set of newly introduced samples belonging to the k-th fault class at time step t.

To balance historical knowledge stability and adaptability to new operating conditions, class prototypes are updated using an exponential moving average (EMA) scheme:

c_{k}^{di, t} \leftarrow β^{t} \cdot c_{k}^{di, t - 1} + (1 - β^{t}) \cdot {\bar{F}}_{k}^{di, t}

(18)

The adaptive coefficient is defined as

β^{t} = \frac{N_{old}}{N_{old} + N_{new}}

according to the proportion of historical and newly introduced samples. When the number of new samples is limited, the update relies more heavily on historical prototypes to ensure stability under small-sample conditions; when the proportion of new samples increases, features from new operating conditions are incorporated more rapidly. After updating,

L_{2}

normalization is applied to maintain consistency in the metric space.
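The update in Equations (17) and (18) can be illustrated with a short NumPy sketch. The function name is hypothetical, and interpreting N_old as the number of historical samples already summarized by the prototype is an assumption.

```python
import numpy as np

def ema_update_prototype(proto_old, new_feats, n_old, eps=1e-12):
    """EMA update of one class prototype followed by L2 normalization.

    proto_old: previous prototype c_k^{di,t-1}, shape (d,).
    new_feats: condition-invariant features of the new samples of class k, shape (n_new, d).
    n_old: number of historical samples behind proto_old (assumed meaning of N_old).
    """
    n_new = len(new_feats)
    beta_t = n_old / (n_old + n_new)           # adaptive coefficient
    mean_new = new_feats.mean(axis=0)          # Eq. 17
    proto = beta_t * proto_old + (1.0 - beta_t) * mean_new  # Eq. 18
    return proto / max(np.linalg.norm(proto), eps), beta_t
```

As a worked case, N_old = 90 and N_new = 10 give β^t = 0.9, so the ten new samples shift the prototype by only one tenth of the gap between the old prototype and the new class mean before renormalization.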

Updating only the prototype centers can constrain global drift but is insufficient to regulate the distributional shape of intra-class features in the feature space. Over multiple iterations, this may lead to intra-class drift and instability of discriminative boundaries. To address this issue, a covariance alignment constraint is introduced in addition to mean updating, ensuring that feature distributions of the same class at adjacent time steps remain consistent in shape:

{\bar{Σ}}_{k}^{t} = \frac{1}{| Ω_{k}^{t} |} \sum_{x_{i} \in Ω_{k}^{t}} (F_{i}^{di, t} - {\bar{F}}_{k}^{di, t}) {(F_{i}^{di, t} - {\bar{F}}_{k}^{di, t})}^{⊤}

(19)

Here,

{\bar{Σ}}_{k}^{t}

characterizes the scale and correlation structure of intra-class features.

To avoid abrupt distributional changes caused by direct covariance replacement, covariance matrices are also updated in a smoothed manner using EMA:

Σ_{k}^{t} \leftarrow β \cdot Σ_{k}^{t - 1} + (1 - β) \cdot {\bar{Σ}}_{k}^{t}

(20)

A covariance alignment loss is further introduced by minimizing the squared Frobenius norm of covariance differences between adjacent time steps:

L_{var}^{t} = \frac{1}{K} \sum_{k = 1}^{K} {∥ Σ_{k}^{t} - Σ_{k}^{t - 1} ∥}_{F}^{2}

(21)

where

{∥ \cdot ∥}_{F}

denotes the Frobenius norm. This loss term enforces structural stability of intra-class feature distributions after incorporating samples from new operating conditions, thereby maintaining consistency and separability of the cross-condition feature space. By introducing covariance-level alignment constraints in addition to mean updating, the proposed method avoids intra-class structural degradation that may arise when relying solely on prototype center updates. This design effectively suppresses abnormal prototype dispersion under conditions of noisy or extremely limited new samples, thereby enhancing the stability of the continual learning process.
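Equations (19) to (21) can be sketched numerically as follows. The function names are hypothetical, and the sketch only evaluates the quantities; in training, the loss would be computed inside an automatic-differentiation framework so that gradients reach the encoder.

```python
import numpy as np

def class_covariance(feats):
    """Intra-class covariance with 1/|Omega| normalization (Eq. 19)."""
    centered = feats - feats.mean(axis=0)
    return centered.T @ centered / len(feats)

def ema_covariance(cov_prev, cov_batch, beta):
    """Smoothed covariance update (Eq. 20)."""
    return beta * cov_prev + (1.0 - beta) * cov_batch

def variance_alignment_loss(covs_t, covs_prev):
    """Mean squared Frobenius gap between adjacent-step covariances (Eq. 21)."""
    gaps = [np.linalg.norm(a - b, ord="fro") ** 2 for a, b in zip(covs_t, covs_prev)]
    return float(np.mean(gaps))
```

Substituting Equation (20) into Equation (21) shows that the difference between adjacent smoothed covariances equals (1 - β) times the gap between the newly estimated covariance and the previous one. The loss is therefore (1 - β)² times the mean squared gap to the newly estimated covariances, so a larger β both slows the covariance update and weakens the penalty.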

Through the above strategy, prototype updating for condition-invariant features achieves rapid cross-condition adaptation at the mean level while preserving intra-class structural stability at the covariance level, providing a reliable global reference for subsequent updating of condition-specific and fused feature representations.

2.4.2. Prototype Updating for Condition-Specific Features

At time step t, the condition-specific features

F_{i}^{ds, t}

disentangled from samples primarily encode operating-condition-related information, such as environmental factors, load characteristics, or structural conditions that vary across operating conditions. Therefore, stably constructing the condition prototype of a newly introduced operating condition is crucial for maintaining the discriminative capability and adaptability of the diagnostic model during the continual learning process.

At time step t, the condition prototype

c_{z_{t}}^{ds}

for the new operating condition

z_{t}

is constructed using all samples collected under this condition:

c_{z_{t}}^{ds} = \frac{1}{| Ω_{z_{t}}^{t} |} \sum_{x_{i}^{t} \in Ω_{z_{t}}^{t}} F_{i, z_{t}}^{ds, t}, Ω_{z_{t}}^{t} = {x_{i}^{t} ∣ z_{i}^{t} = z_{t}}

(22)

Here,

Ω_{z_{t}}^{t}

contains only samples collected under the newly introduced operating condition at time step t. The symbol

z_{t}

denotes the operating-condition label introduced at stage t, which follows the same labeling convention as the condition labels

z_{s}

defined in the initial stage.

F_{i, z_{t}}^{ds, t}

represents the condition-specific feature associated with operating condition

z_{t}

at time step t.

Because the prototype is computed exclusively from samples of the new operating condition, it reflects the distribution of condition-specific features under that condition without interference from the feature characteristics of historical operating conditions.

During the optimization of this prototype, the Pull/Push strategy introduced previously is retained. First, the Pull constraint is applied to draw features of the current operating condition toward their corresponding prototype, such that features belonging to the same operating condition form a compact cluster in the feature space:

L_{pull}^{ds} = \frac{1}{B} \sum_{i = 1}^{B} {∥F_{i, z_{t}}^{ds, t} - c_{z_{t}}^{ds}∥}_{2}^{2}

(23)

where B denotes the batch size. The Pull constraint enhances intra-condition compactness and improves the stability of fault discrimination under the current operating condition.

However, relying solely on the Pull constraint may cause feature clusters corresponding to different operating conditions to become overly close, especially for historical operating conditions that are highly similar to the current one. To avoid such inter-condition adhesion, a Push constraint is further introduced. Instead of indiscriminately pushing away all historical operating conditions, this constraint focuses on the

M_{sim}^{t}

most similar historical operating conditions and enforces a minimum safety margin

δ_{ds}^{t}

, ensuring that the prototype of the current operating condition remains sufficiently separated from them:

L_{push}^{ds} = \frac{1}{| Z_{sim}^{t} |} \sum_{z_{s} \in Z_{sim}^{t}} softplus (δ_{ds}^{t} - {∥ c_{z_{t}}^{ds} - c_{z_{s}}^{ds} ∥}_{2})

(24)

Here,

softplus (x) = \ln (1 + e^{x})

is adopted to provide a smooth and differentiable margin constraint. During backpropagation, only the current condition prototype

c_{z_{t}}^{ds}

is updated, while historical condition prototypes

c_{z_{s}}^{ds}

serve as fixed references and do not participate in gradient updates, thereby avoiding unintended perturbations to historical condition representations. The margin

δ_{ds}^{t}

is computed based on a similarity-weighted baseline derived from operating-condition similarity, and the specific procedure is described as follows:

Select the $M_{sim}^{t}$ historical operating conditions that are most similar to the current operating condition $z_{t}$ , forming the similar-condition set $Z_{sim}^{t}$ ;
Compute the set of Euclidean distances between the current condition prototype and the prototypes of these similar operating conditions, denoted as $Ω_{dis}^{ds, t}$ ;
Take the median of these distances, $δ_{base}$ , as the baseline margin;
Scale the baseline margin proportionally according to condition similarity to obtain $δ_{scaled}$ ;
Apply exponential moving average (EMA) smoothing to obtain $δ_{ds}^{t, ema}$ ;
Clip the result within the interval $[δ_{min}, δ_{max}]$ to obtain the final margin $δ_{ds}^{t}$ .

The above procedure can be formulated as follows:

\begin{matrix} Ω_{dis}^{ds, t} & = \{{∥c_{z_{t}}^{ds} - c_{z_{s}}^{ds}∥}_{2} | z_{s} \in Z_{sim}^{t}\}, \\ δ_{base} & = median (Ω_{dis}^{ds, t}), \\ δ_{scaled} & = δ_{base} \cdot (1 + β_{ds}^{t} \cdot \frac{1}{| Z_{sim}^{t} |} \sum_{z_{s} \in Z_{sim}^{t}} S_{(z_{t}, z_{s})}^{τ}), β_{ds}^{t} > 0, \\ δ_{ds}^{t, ema} & \leftarrow α_{ds} \cdot δ_{ds}^{t, ema, old} + (1 - α_{ds}) \cdot δ_{scaled}, α_{ds} \in [0, 1], \\ δ_{ds}^{t} & = min (δ_{max}, max (δ_{min}, δ_{ds}^{t, ema})) \end{matrix}

(25)

For reproducibility, both features and prototypes are normalized using

L_{2}

normalization. Under this setting, the default hyperparameters are set as

α_{ds} = 0.90

,

β_{ds}^{t} = 0.60

,

δ_{min} = 0.60

, and

δ_{max} = 1.60

. When the number of new operating-condition samples is extremely limited or the noise level is relatively high,

α_{ds}

can be moderately increased to

0.95

; when the distributions of similar operating conditions are closer,

β_{ds}^{t}

can be adjusted to

0.80

.
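The margin schedule of Equation (25) can be traced step by step with the default hyperparameters listed above. The sketch below is illustrative: the function name is hypothetical, and the initial value of the running margin (`delta_ema_old`) must be supplied by the caller because its initialization is not specified in this section.

```python
import numpy as np

def adaptive_margin(proto_new, sim_protos, sims, delta_ema_old,
                    alpha_ds=0.90, beta_ds=0.60, delta_min=0.60, delta_max=1.60):
    """Compute the push margin for the new condition prototype (Eq. 25).

    proto_new: L2-normalized prototype of the current condition z_t.
    sim_protos: prototypes of the M_sim most similar historical conditions.
    sims: fused similarities S_tau(z_t, z_s) for the same conditions.
    delta_ema_old: previous smoothed margin (initial value is an assumption).
    """
    dists = [np.linalg.norm(proto_new - p) for p in sim_protos]
    delta_base = float(np.median(dists))                          # baseline margin
    delta_scaled = delta_base * (1.0 + beta_ds * float(np.mean(sims)))
    delta_ema = alpha_ds * delta_ema_old + (1.0 - alpha_ds) * delta_scaled
    delta = min(delta_max, max(delta_min, delta_ema))             # clip to [min, max]
    return delta, delta_ema
```

Since unit-norm prototypes are at most a distance of 2 apart, the scaled margin can exceed the largest attainable distance when similarity is high; the upper clip at δ_max = 1.60 keeps the target margin attainable in that case.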

Overall, the Pull constraint enforces intra-condition compactness for the current operating condition, while the Push constraint prevents excessive overlap with similar operating conditions. The combined optimization objective is defined as

L_{proto}^{ds} = λ_{pull}^{ds} L_{pull}^{ds} + λ_{push}^{ds} L_{push}^{ds}

(26)

Through this dual mechanism of intra-condition cohesion and inter-condition separation, condition-specific features are able to adapt to newly introduced operating conditions while maintaining stable representations within each condition and clear separability from other conditions in the feature space, thereby preserving discriminative capability for cross-condition fault diagnosis.
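For reference, Equations (23), (24), and (26) can be evaluated as in the sketch below. The function names and the default loss weights (both set to 1.0) are hypothetical, as the values of the weighting coefficients are not stated in this section. The sketch computes loss values only; in training, historical prototypes would be treated as constants so that gradients update only the current prototype and the encoder.

```python
import numpy as np

def softplus(x):
    """ln(1 + e^x), computed stably via logaddexp."""
    return np.logaddexp(0.0, x)

def pull_loss(feats, proto):
    """Mean squared distance of batch features to the current prototype (Eq. 23)."""
    return float(np.mean(np.sum((feats - proto) ** 2, axis=1)))

def push_loss(proto_new, sim_protos, delta):
    """Softplus margin penalty against the most similar historical prototypes (Eq. 24)."""
    dists = np.linalg.norm(np.asarray(sim_protos) - proto_new, axis=1)
    return float(np.mean(softplus(delta - dists)))

def proto_loss_ds(feats, proto_new, sim_protos, delta, lam_pull=1.0, lam_push=1.0):
    """Weighted sum of the pull and push terms (Eq. 26); weights are placeholders."""
    return lam_pull * pull_loss(feats, proto_new) + lam_push * push_loss(proto_new, sim_protos, delta)
```

Unlike a hard hinge, the softplus term is never exactly zero: when a historical prototype sits exactly at the margin, its contribution is ln 2, and the penalty decays smoothly as the distance grows beyond the margin.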

2.4.3. Prototype Updating for Fused Features

The fused feature representation

F_{i}^{df, t} = [F_{i}^{di, t} \oplus F_{i}^{ds, t}]

integrates condition-invariant information shared across operating conditions with condition-specific information associated with the current operating condition, where ⊕ denotes the feature concatenation operation. This fused representation simultaneously preserves class-related semantic information and operating-condition-specific environmental characteristics. Unlike cross-condition class prototypes that emphasize global consistency, condition-aware class prototypes are constructed solely based on samples from the current operating condition. Their purpose is to enhance intra-condition class compactness and separability, thereby providing reliable condition-specific supplementary discrimination when cross-condition feature-based diagnosis becomes unreliable.

During training at time step t, for the current operating condition

z_{t}

, class-conditional prototypes under this condition are first constructed based on the corresponding samples. Let

Ω_{k, z_{t}}^{t} = \{x_{i}^{t} | y_{i}^{t} = y_{k}, z_{i}^{t} = z_{t}\}

denote the set of samples belonging to class

y_{k}

under operating condition

z_{t}

. The corresponding condition-aware class prototype is then defined as

c_{k, z_{t}}^{df} = \frac{1}{| Ω_{k, z_{t}}^{t} |} \sum_{x_{i}^{t} \in Ω_{k, z_{t}}^{t}} F_{i}^{df, t}

(27)

The above prototype is computed exclusively from samples collected under the current operating condition, thereby accurately characterizing the distribution of fused features for each class within this condition.

Similar to other types of prototypes, both Pull and Push constraints are imposed in the fused feature space. The Pull constraint draws fused features of samples belonging to the same class toward the corresponding condition-aware class prototype:

L_{pull}^{df} = \frac{1}{B} \sum_{i = 1}^{B} {∥F_{i}^{df, t} - c_{k, z_{t}}^{df}∥}_{2}^{2}

(28)

where B denotes the batch size and k indexes the fault class to which the i-th sample belongs.

The Push constraint enforces a minimum margin between prototypes of different classes within the same operating condition, thereby enhancing intra-condition inter-class separability:

L_{push}^{df} = \frac{1}{K_{z_{t}} (K_{z_{t}} - 1)} \sum_{\substack{p, q = 1 \\ p \neq q}}^{K_{z_{t}}} {\max (0, δ_{df} - {∥ c_{p, z_{t}}^{df} - c_{q, z_{t}}^{df} ∥}_{2})}^{2}

(29)

Here,

K_{z_{t}}

denotes the number of fault classes under operating condition

z_{t}

, and

δ_{df}

is the predefined minimum margin threshold. It should be noted that this Push constraint differs from that used in the prototype updating of condition-specific features: the current constraint operates exclusively among class prototypes within the same operating condition and does not involve historical operating-condition data.

In summary, the optimization objective for prototype updating in the fused feature space is given by

L_{proto}^{df} = λ_{pull}^{df} L_{pull}^{df} + λ_{push}^{df} L_{push}^{df}

(30)

The Pull constraint ensures intra-class compactness within each operating condition, while the Push constraint prevents excessive proximity between feature clusters of different classes under the same operating condition. Together, these constraints enhance the robustness of local discrimination while preserving global semantic consistency across operating conditions.
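Equations (28) to (30) can be made concrete with a small numerical sketch. The code below is a plain-Python illustration under stated assumptions: the helper names, the unit loss weights, and the toy prototypes are hypothetical, and a practical implementation would compute the same quantities on tensors with automatic differentiation.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pull_loss(feats, labels, protos):
    # Eq. (28): mean squared distance from each sample to its own-class prototype.
    return sum(l2(f, protos[y]) ** 2 for f, y in zip(feats, labels)) / len(feats)


def push_loss(protos, margin):
    # Eq. (29): squared hinge penalty over ordered pairs of distinct class prototypes.
    keys = list(protos)
    k = len(keys)
    if k < 2:
        return 0.0  # no pair to separate
    total = sum(
        max(0.0, margin - l2(protos[p], protos[q])) ** 2
        for p in keys for q in keys if p != q
    )
    return total / (k * (k - 1))


def proto_loss(feats, labels, protos, margin, w_pull=1.0, w_push=1.0):
    # Eq. (30); the default weights are placeholders, not values from the paper.
    return w_pull * pull_loss(feats, labels, protos) + w_push * push_loss(protos, margin)


protos = {0: [0.0, 0.0], 1: [3.0, 4.0]}        # prototype distance is 5
feats, labels = [[1.0, 0.0], [3.0, 4.0]], [0, 1]
print(pull_loss(feats, labels, protos))         # 0.5
print(push_loss(protos, margin=6.0))            # 1.0
print(proto_loss(feats, labels, protos, 6.0))   # 1.5
```

With a margin of 5 or less the Push term vanishes in this example, because the two prototypes are already separated by a distance of 5.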

2.4.4. Feature-Level Knowledge Distillation

In the continual learning process, directly performing feature distillation on samples from all historical operating conditions would not only significantly increase computational cost, but may also introduce ineffective or even harmful gradient signals from operating conditions that are weakly related to the current one, thereby impairing model plasticity and convergence efficiency. To address this issue, the proposed distillation sample selection strategy is designed as follows: (1) based on the operating-condition similarity metric defined earlier, select the

M_{sim}^{t}

historical operating conditions that are most similar to the current operating condition

z_{t}

, forming the set

Z_{sim}^{t}

; (2) select the

M_{far}^{t}

historical operating conditions that are most dissimilar to the current operating condition

z_{t}

, forming the fallback set

Z_{far}^{t}

to mitigate the risk of extreme catastrophic forgetting; (3) within the above operating-condition sets, construct the replay sample set R according to the controlled sampling strategy described previously, which is then used for subsequent feature-level knowledge distillation.
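Steps (1) and (2) of the selection strategy can be sketched as follows. The function name, the dictionary of similarity scores, and the choice to keep the two sets disjoint are assumptions made for this illustration; the similarity metric itself is the one defined earlier in the paper and is treated here as a given input.

```python
def select_distillation_conditions(similarity, m_sim, m_far):
    """Split historical conditions into the most similar and most dissimilar sets.

    similarity: {condition_id: similarity to the current condition} (higher = closer).
    """
    ranked = sorted(similarity, key=similarity.get, reverse=True)
    z_sim = ranked[:m_sim]
    remaining = ranked[m_sim:]           # assumption: the two sets are kept disjoint
    z_far = remaining[-m_far:] if m_far > 0 else []
    return z_sim, z_far


# Hypothetical similarity scores of four historical conditions.
scores = {"A": 0.9, "B": 0.2, "C": 0.7, "D": 0.4}
z_sim, z_far = select_distillation_conditions(scores, m_sim=2, m_far=1)
print(z_sim, z_far)  # ['A', 'C'] ['B']
```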

Let

F_{T}^{di} (x_{i})

and

F_{T}^{ds} (x_{i})

denote the condition-invariant and condition-specific features extracted from sample

x_{i}

by the frozen teacher model at time step

t - 1

, respectively, and let

F_{S}^{di} (x_{i})

and

F_{S}^{ds} (x_{i})

denote the corresponding outputs of the student model at time step t. Let

z_{i} = z (x_{i})

represent the operating-condition label associated with sample

x_{i}

, where

z_{i} \in (Z_{sim}^{t} \cup Z_{far}^{t})

. Here, K denotes the total number of fault classes.

For relational distillation, the replay sample set R is first partitioned into multiple subsets according to operating-condition labels as

R_{z} = \{x_{i} \in R ∣ z_{i} = z\}

, where

z \in (Z_{sim}^{t} \cup Z_{far}^{t})

. Then, within each subset

R_{z}

, a set of sample pairs

E_{z} \subseteq R_{z} \times R_{z}

is sampled, and the union of these sets,

E = ⋃_{z} E_{z}

, is used as the sample-pair set for relational distillation. The overall procedure of this stage is illustrated in Figure 3.
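The partition of the replay set by operating condition and the construction of the pair set can be sketched as below. For simplicity, this illustration enumerates all unordered pairs within each condition rather than sampling a subset, which is an assumption; the function name and the toy condition labels are likewise hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_pair_set(replay_conditions):
    """Return index pairs (i, j) whose samples share the same operating condition."""
    by_condition = defaultdict(list)
    for idx, z in enumerate(replay_conditions):
        by_condition[z].append(idx)      # R_z: indices of samples with condition z
    pairs = []
    for indices in by_condition.values():
        pairs.extend(combinations(indices, 2))  # E_z: all unordered pairs in R_z
    return pairs


# Condition label of each replay sample (hypothetical).
replay_conditions = ["z1", "z2", "z1", "z1", "z2"]
pairs = build_pair_set(replay_conditions)
print(pairs)  # [(0, 2), (0, 3), (2, 3), (1, 4)]
```

No pair mixes samples from different operating conditions, so the relational term only compares structure within a condition.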

As illustrated in Figure 3, for condition-invariant features, a sample-wise

L_{2}

distance constraint

L_{KD}^{di}

is adopted to preserve the stability of cross-condition shared semantics. To further enhance statistical stability, a weak prototype-level constraint

L_{KD - proto}^{di}

is additionally introduced. This term is assigned a relatively small weight and is only used to suppress global drift of the condition-invariant feature distribution. The corresponding formulations are given as

\begin{matrix} L_{KD}^{di} & = \frac{1}{| R |} \sum_{x_{i} \in R} {∥F_{S}^{di} (x_{i}) - F_{T}^{di} (x_{i})∥}_{2}^{2} \\ L_{KD - proto}^{di} & = \frac{1}{K} \sum_{k = 1}^{K} {∥c_{k}^{di, t} - c_{k}^{di, t - 1}∥}_{2}^{2} \end{matrix}

(31)

Here, R denotes the replay sample set and K is the total number of fault classes.
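A minimal numerical sketch of the two terms in Equation (31) is given below. The helper names and toy vectors are assumptions for illustration; the teacher features and previous-stage prototypes are treated as fixed inputs.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kd_invariant(student_feats, teacher_feats):
    # First line of Eq. (31): sample-wise squared L2 distance averaged over R.
    total = sum(sq_dist(s, t) for s, t in zip(student_feats, teacher_feats))
    return total / len(student_feats)


def kd_proto_invariant(protos_new, protos_old):
    # Second line of Eq. (31): drift of class prototypes between stages t-1 and t.
    total = sum(sq_dist(protos_new[k], protos_old[k]) for k in protos_old)
    return total / len(protos_old)


student = [[1.0, 1.0], [2.0, 0.0]]
teacher = [[1.0, 0.0], [0.0, 0.0]]
print(kd_invariant(student, teacher))              # (1 + 4) / 2 = 2.5

new_protos = {0: [0.0, 1.0], 1: [2.0, 2.0]}
old_protos = {0: [0.0, 0.0], 1: [2.0, 2.0]}
print(kd_proto_invariant(new_protos, old_protos))  # (1 + 0) / 2 = 0.5
```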

For condition-specific features, considering that adaptability to newly introduced operating conditions must be preserved, strict sample-wise alignment is deliberately avoided. Instead, consistency with the corresponding operating-condition prototype from the teacher model is enforced. In addition, a relational distillation term

L_{KD - rel}^{ds}

is incorporated to maintain the relative structure among samples:

\begin{matrix} L_{KD - proto}^{ds} & = \frac{1}{| R |} \sum_{x_{i} \in R} {∥F_{S}^{ds} (x_{i}) - c_{z_{i}}^{ds, (t - 1)}∥}_{2}^{2} \\ L_{KD - rel}^{ds} & = \frac{1}{| E |} \sum_{(x_{i}, x_{j}) \in E} | {∥ F_{S}^{ds} (x_{i}) - F_{S}^{ds} (x_{j}) ∥}_{2} - {∥ F_{T}^{ds} (x_{i}) - F_{T}^{ds} (x_{j}) ∥}_{2} | \end{matrix}

(32)

Here,

c_{z_{i}}^{ds, (t - 1)}

denotes the condition prototype corresponding to the operating condition

z_{i} = z (x_{i})

of sample

x_{i}

in the teacher model.
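The two condition-specific terms in Equation (32) can be illustrated with the following sketch. The function names and toy features are hypothetical, and the pair list is assumed to have been built beforehand from samples that share an operating condition.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kd_proto_specific(student_feats, conds, teacher_cond_protos):
    # First line of Eq. (32): pull each student feature toward the teacher's
    # prototype of the sample's own operating condition.
    total = sum(l2(f, teacher_cond_protos[z]) ** 2 for f, z in zip(student_feats, conds))
    return total / len(student_feats)


def kd_relational(student_feats, teacher_feats, pairs):
    # Second line of Eq. (32): match pairwise distances of student and teacher.
    total = sum(
        abs(l2(student_feats[i], student_feats[j]) - l2(teacher_feats[i], teacher_feats[j]))
        for i, j in pairs
    )
    return total / len(pairs)


student = [[0.0, 0.0], [3.0, 4.0]]
teacher = [[0.0, 0.0], [0.0, 4.0]]
rel = kd_relational(student, teacher, pairs=[(0, 1)])
proto = kd_proto_specific(student, ["z1", "z1"], {"z1": [0.0, 0.0]})
print(rel, proto)  # 1.0 12.5
```

The relational term penalizes only changes in pairwise distances, so a rigid shift of all student features would leave it at zero while still being visible to the prototype term.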

Since the fused feature representation

F^{df, t}

is obtained by direct concatenation of condition-invariant and condition-specific features, and both components are already constrained through feature-level distillation, no additional distillation is imposed on fused features in order to avoid redundant computational overhead.

By integrating all the above components, the overall feature-level knowledge distillation loss is defined as

\begin{matrix} L_{KD} = & λ_{KD}^{di} L_{KD}^{di} + λ_{KD - proto}^{di} L_{KD - proto}^{di} \\ & + λ_{KD - proto}^{ds} L_{KD - proto}^{ds} + λ_{KD - rel}^{ds} L_{KD - rel}^{ds} \end{matrix}

(33)

2.5. Stage-Wise Optimization Objectives and Overall Training Procedure

To address the continual learning problem in fault diagnosis of cotton harvester picking-head drivetrains, a stage-wise training strategy is adopted in this study. Unlike conventional approaches that jointly optimize all loss terms within a single training stage, the proposed method explicitly distinguishes learning objectives across different stages and designs corresponding optimization targets and training strategies in a targeted manner. The overall training process is divided into an initialization stage (

t = 0

) and an incremental learning stage (

t > 0

). The initialization stage is further subdivided into two sub-stages, which sequentially accomplish feature disentanglement and prototype structure construction and thereby prepare the model for subsequent continual adaptation.

2.5.1. $t = 0$ Stage: Stage 1—Feature Disentanglement

At the

t = 0

stage, training samples are collected under multiple operating conditions, and all samples are annotated with fault type labels. The primary objective of Stage 1 is to establish stable feature disentanglement capability using multi-condition labeled data, enabling the model to effectively separate condition-invariant features with fault-discriminative semantics from condition-specific features that characterize operating-condition differences, while ensuring information completeness and stability during the feature decomposition process. In this stage, the encoder, feature disentanglement modules, classifier, discriminators, and decoder are all jointly optimized.

Given a training sample

(x_{i}^{0}, y_{i}^{0}, z_{i}^{0})

, the encoder first extracts a holistic feature representation, which is then processed by the feature disentanglement modules to obtain condition-invariant features

F_{i}^{di, 0}

and condition-specific features

F_{i}^{ds, 0}

. To achieve the above objectives, the following joint loss function is constructed as the optimization criterion for Stage 1:

L^{{(0)}_{1}} = λ_{re} L_{re} + λ_{cls} L_{cls} + λ_{D_{di}} (e) L_{D_{di}} + λ_{D_{ds}} L_{D_{ds}} + λ_{decorr} L_{decorr}

(34)

Here,

L_{cls}

enforces the class discriminability of condition-invariant features, while

L_{re}

denotes the reconstruction loss used to ensure information fidelity during the feature disentanglement process.

L_{D_{di}}

and

L_{D_{ds}}

correspond to the adversarial condition discrimination loss for condition-invariant features and the operating-condition discrimination loss for condition-specific features, respectively.

Considering that adversarial training may introduce unstable gradients at early training stages, a soft-start scheduling strategy is adopted for

λ_{D_{di}}

, such that its weight gradually increases as training progresses. This design prioritizes discriminative usability of features in the early phase while progressively strengthening the operating-condition invariance constraint in later phases. In addition, a lightweight decorrelation regularization term,

L_{decorr} = {∥\frac{{(F^{di})}^{⊤} F^{ds}}{∥ F^{di} ∥_{F} {∥ F^{ds} ∥}_{F}}∥}_{F}^{2}

is introduced to suppress information leakage between condition-invariant and condition-specific features and to promote their statistical disentanglement.
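The decorrelation term and the soft-start weighting can be sketched as follows. The decorrelation function follows the formula above for feature matrices stored as lists of rows (one row per sample). The linear ramp in `soft_start_weight` is an assumed schedule, since the text states only that the weight increases gradually; all names and values are illustrative.

```python
def frob_sq(matrix):
    """Squared Frobenius norm of a matrix given as a list of rows."""
    return sum(v * v for row in matrix for v in row)


def decorr_loss(f_di, f_ds):
    """Squared Frobenius norm of the normalized cross matrix (F_di^T F_ds)."""
    d1, d2 = len(f_di[0]), len(f_ds[0])
    cross = [
        [sum(a[p] * b[q] for a, b in zip(f_di, f_ds)) for q in range(d2)]
        for p in range(d1)
    ]
    # Dividing by both squared norms equals normalizing before taking the norm.
    return frob_sq(cross) / (frob_sq(f_di) * frob_sq(f_ds))


def soft_start_weight(epoch, warmup_epochs, max_weight):
    # Assumed linear ramp; the source only states that the weight grows gradually.
    return max_weight * min(1.0, epoch / warmup_epochs)


identity = [[1.0, 0.0], [0.0, 1.0]]
print(decorr_loss([[1.0], [0.0]], [[0.0], [1.0]]))  # 0.0 (orthogonal features)
print(decorr_loss(identity, identity))              # 0.5 (fully shared directions)
print(soft_start_weight(5, 10, 0.5))                # 0.25
```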

2.5.2. $t = 0$ Stage: Stage 2—Prototype Structure Formation

After completing Stage 1, the model has acquired a stable capability for feature disentanglement. At this point, the training objective shifts from feature separability and reconstructability to prototype structure formation and discriminative boundary refinement. Accordingly, in Stage 2, the classifier, decoder, and domain discriminators are all frozen. The lower layers of the encoder and the feature disentanglement modules are gradually fixed, while only high-level feature representations and geometric adjustment modules remain trainable to preserve sufficient plasticity.

In this stage, three types of prototypes are constructed in the condition-invariant feature space, condition-specific feature space, and fused feature space, respectively, including class prototypes, condition prototypes, and condition-aware class prototypes. To achieve the above objectives, the following joint loss function is defined as the optimization criterion for Stage 2:

L^{{(0)}_{2}} = λ_{pull}^{di} L_{pull}^{di} + λ_{push}^{di} L_{push}^{di} + λ_{pull}^{ds} L_{pull}^{ds} + λ_{push}^{ds} L_{push}^{ds} + λ_{pull}^{df} L_{pull}^{df} + λ_{push}^{df} L_{push}^{df}

(35)

Through the Pull–Push mechanism, the model progressively enlarges the feature margins between different classes or operating conditions while maintaining intra-class compactness, thereby forming discriminative prototype structures. It should be emphasized that neither classification loss nor reconstruction loss is introduced at this stage, so as to avoid disturbing the already established feature disentanglement semantics and to ensure that the optimization process is fully dedicated to geometric refinement of prototype structures.

2.5.3. $t > 0$ Stage: Prototype-Driven Adaptation and Knowledge Distillation in Incremental Learning

In the incremental learning stage, new operating conditions arrive sequentially, and each condition is associated with only a small number of labeled samples. Compared with the initialization stage, the core challenges in this stage are twofold: on the one hand, the model must rapidly adapt to new operating-condition distributions under small-sample constraints; on the other hand, it must avoid disrupting fault knowledge learned from historical operating conditions, so as to prevent catastrophic forgetting. The objective of the

t > 0

stage is to guide structured adaptation to new operating conditions via prototype constraints while preserving the stability of discriminative structures in condition-invariant features, and to restrict deviations from historical knowledge through knowledge distillation mechanisms, thereby achieving a balance between “adapting to new conditions” and “retaining old knowledge”.

Accordingly, in the

t > 0

stage, most model parameters are frozen, including the entire encoder and the lower layers of the feature disentanglement modules, while only a small number of geometric adjustment modules are kept trainable for adaptation to new operating conditions. This freezing strategy follows the principle of “semantic structure priority and geometric plasticity”: low-level feature extraction and disentanglement modules that are directly responsible for cross-condition semantic modeling are frozen after the initialization stage to prevent discriminative semantic drift, whereas high-level mapping modules related to prototype geometric adjustment and new-condition adaptation retain limited plasticity to enable controlled updates under small-sample conditions.

To achieve the above objectives, the following joint loss function is constructed as the optimization criterion at the t-th incremental stage:

\begin{matrix} L_{inc}^{(t)} = & \underbrace{λ_{var - di} L_{var}}_{\text{class prototypes}} + \underbrace{λ_{pull}^{ds} L_{pull}^{ds} + λ_{push}^{ds} L_{push}^{ds}}_{\text{condition prototypes}} + \underbrace{λ_{pull}^{df} L_{pull}^{df} + λ_{push}^{df} L_{push}^{df}}_{\text{condition-aware class prototypes}} \\ & + \underbrace{λ_{KD}^{di} L_{KD}^{di} + λ_{KD - proto}^{di} L_{KD - proto}^{di}}_{\text{condition-invariant feature distillation}} + \underbrace{λ_{KD - proto}^{ds} L_{KD - proto}^{ds} + λ_{KD - rel}^{ds} L_{KD - rel}^{ds}}_{\text{condition-specific feature distillation}} \end{matrix}

(36)

Here, prototype-based constraints guide the model to form consistent and well-structured representations under new operating conditions by regulating the geometric relationships among class prototypes, condition prototypes, and condition-aware class prototypes. Meanwhile, feature-level and prototype-level distillation losses enforce consistency between the current model and historical models in terms of feature distributions and prototype relations, thereby effectively mitigating catastrophic forgetting during incremental learning.

In summary, through stage-wise objective design and a progressive freezing strategy, the proposed method enables the model to establish stable feature disentanglement and prototype structures during the initialization stage, and to subsequently achieve effective adaptation to new operating conditions in incremental stages while avoiding degradation of historical knowledge.

2.6. Prototype-Based Cross-Condition Diagnostic Inference

After completing prototype construction and feature disentanglement during the training stage, the model enters the inference stage. The core objective of inference is to rapidly and accurately determine the fault category of a given vibration signal sample. Unlike conventional approaches that rely on parametric classifiers, the proposed method performs fault diagnosis in a prototype-driven manner, where fault categories are determined directly by nearest-neighbor measurement with respect to class prototypes obtained during training.

To maintain consistency with the notation defined earlier, let

\{c_{k}^{di}\}_{k = 1}^{K}

denote the set of condition-invariant class prototypes, which represent category centers shared across operating conditions;

\{c_{s}^{ds}\}_{s = 1}^{S}

denote the set of condition prototypes, each characterizing the operating-condition-specific feature center; and

\{c_{k, s}^{df} ∣ y_{k} \in Y_{z_{s}}\}

denote the set of condition-aware class prototypes under operating condition

z_{s}

, where Y is the global class label set and

Y_{z_{s}} \subseteq Y

represents the subset of classes that actually appear under condition

z_{s}

.

During inference, an input sample is first processed by the trained encoder–disentangler to obtain the condition-invariant feature

F^{di}

and the condition-specific feature

F^{ds}

.

For clarity, the overall inference procedure is summarized in the flowchart shown in Figure 4. The inference process follows a confidence-based two-stage decision mechanism, which is detailed as follows:

Step 1: Class Prediction Based on Condition-Invariant Features

First, the condition-invariant feature

F^{di}

is compared with all condition-invariant class prototypes. The Euclidean distance between

F^{di}

and the k-th class prototype is computed as

d_{k}^{di} = {∥ F^{di} - c_{k}^{di} ∥}_{2}, k = 1, \dots, K

(37)

Subsequently, a Softmax normalization based on negative distances is applied to map the distances into class probabilities:

\Pr^{di} (y = k) = \frac{\exp (- γ^{di} d_{k}^{di})}{\sum_{j = 1}^{K} \exp (- γ^{di} d_{j}^{di})}

(38)

Here,

γ^{di}

serves as a temperature-like scaling parameter that controls the smoothness of the resulting probability distribution: larger values sharpen the distribution, whereas smaller values flatten it.

If the maximum class probability

{\Pr}_{\max}^{di} = \max_{k} \Pr^{di} (y = k)

exceeds a predefined confidence threshold

τ

(initially set to

0.75

and adjustable according to application requirements), the predicted class

\hat{k} = \arg \max_{k} \Pr^{di} (y = k)

is directly output as the final diagnosis result. In this case, the model exhibits high confidence in its decision, and no additional operating-condition information is required for further discrimination.
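Step 1 can be illustrated with the following sketch of Equations (37) and (38) and the confidence check. The threshold of 0.75 comes from the text, whereas the scaling value `gamma=1.0`, the function names, and the toy prototypes are assumptions made for illustration.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_softmax(feature, prototypes, gamma):
    """Eqs. (37)-(38): softmax over negative scaled distances to each prototype."""
    logits = [-gamma * l2(feature, c) for c in prototypes]
    shift = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def step_one(feature, prototypes, gamma=1.0, tau=0.75):
    """Return (predicted class or None, probabilities); None means Step 2 is needed."""
    probs = distance_softmax(feature, prototypes, gamma)
    best = max(range(len(probs)), key=probs.__getitem__)
    return (best if probs[best] > tau else None), probs


class_protos = [[0.0, 0.0], [3.0, 4.0]]
confident, _ = step_one([0.0, 0.0], class_protos)      # lies on prototype 0
ambiguous, probs = step_one([1.5, 2.0], class_protos)  # equidistant from both
print(confident, ambiguous, probs)  # 0 None [0.5, 0.5]
```

The second query is equally distant from both prototypes, so its maximum probability of 0.5 falls below the threshold and the sample would be passed on to the condition-aware refinement.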

Step 2: Condition-Aware Fusion Decision Under Low Confidence

When

{\Pr}_{\max}^{di} < τ

, the model considers the cross-condition prediction to be unreliable. In this case, operating-condition information is introduced to refine the diagnosis. The refinement procedure consists of three successive steps:

Operating-condition identification: First, the condition-specific feature $F^{ds}$ is matched against all condition prototypes. The most likely operating condition is determined according to the minimum Euclidean distance criterion:

${\hat{z}}_{s} = \arg \min_{s} {∥ F^{ds} - c_{s}^{ds} ∥}_{2}, s = 1, \dots, S$

(39)

This step is equivalent to identifying the operating state of the current sample, which provides contextual information for subsequent condition-aware diagnosis. To avoid unstable multi-condition interference under limited-sample settings, a hard condition-selection strategy is adopted, where only the most similar operating condition is retained for further inference.
Condition-aware fusion-based classification: Given the identified operating condition ${\hat{z}}_{s}$ , the fused feature $F^{df}$ is compared with the condition-aware class prototypes constructed under this condition. Let $Y_{{\hat{z}}_{s}} \subseteq Y$ denote the set of fault categories that have actually appeared under condition ${\hat{z}}_{s}$ , with cardinality $K_{{\hat{z}}_{s}} = | Y_{{\hat{z}}_{s}} |$ . For each category $y_{k} \in Y_{{\hat{z}}_{s}}$ , the Euclidean distance between the fused feature and the corresponding condition-aware class prototype is computed as

$d_{k, {\hat{z}}_{s}}^{df} = {∥ F^{df} - c_{k, {\hat{z}}_{s}}^{df} ∥}_{2}, k = 1, \dots, K_{{\hat{z}}_{s}}$

(40)

Subsequently, the distances are mapped to condition-specific class probabilities via Softmax normalization based on negative distances:

${\Pr}_{{\hat{z}}_{s}}^{df} (y = k) = \frac{\exp (- γ^{df} d_{k, {\hat{z}}_{s}}^{df})}{\sum_{j = 1}^{K_{{\hat{z}}_{s}}} \exp (- γ^{df} d_{j, {\hat{z}}_{s}}^{df})}, y_{k} \in Y_{{\hat{z}}_{s}}$

(41)

It should be noted that the observable fault category set may differ across operating conditions. Therefore, the condition-aware classification is defined only on $Y_{{\hat{z}}_{s}}$ and must be extended to the global category space Y to ensure consistency during probability fusion. For fault categories that have never appeared under condition ${\hat{z}}_{s}$, a small probability $ϵ$ is assigned. In this work, $ϵ = 10^{- 6}$ is used solely for numerical stability and does not affect the final decision. The extended condition-aware probability distribution ${\bar{\Pr}}_{{\hat{z}}_{s}}^{df} (y)$ is defined as

${\bar{\Pr}}_{{\hat{z}}_{s}}^{df} (y) = \begin{cases} (1 - ϵ \cdot | Y ∖ Y_{{\hat{z}}_{s}} |) {\Pr}_{{\hat{z}}_{s}}^{df} (y), & y \in Y_{{\hat{z}}_{s}} \\ ϵ, & y \in Y ∖ Y_{{\hat{z}}_{s}} \end{cases}$

(42)

Here, $| Y ∖ Y_{{\hat{z}}_{s}} |$ denotes the number of unseen categories.
Probability fusion: Finally, the global category probability $\Pr^{di} (y)$ obtained from the condition-invariant features is fused with the extended condition-aware probability ${\bar{\Pr}}_{{\hat{z}}_{s}}^{df} (y)$ via weighted combination:

${\Pr}_{final} (y) = α \Pr^{di} (y) + (1 - α) {\bar{\Pr}}_{{\hat{z}}_{s}}^{df} (y), y \in Y$

(43)

The fusion weight is defined as $α = \max (0.3, \min (1, \frac{{\Pr}_{\max}^{di}}{τ}))$. Because $α$ never falls below 0.3 and increases with the confidence of the cross-condition prediction, this design ensures that the cross-condition prediction retains substantial influence on the final decision even when condition-specific information is insufficient or unreliable.

The final predicted category is obtained as

\hat{y} = \arg \max_{y \in Y} {\Pr}_{final} (y)

. If the maximum probability remains below a rejection threshold

τ_{final}

, the sample is marked as “low confidence/to be reviewed”, which can trigger manual inspection or incremental model updating in practical systems.
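The probability extension, fusion, and rejection rules of Step 2 (Equations (42) and (43)) can be sketched as below. The dictionary representation of the condition-aware probabilities, the function names, and the rejection threshold `tau_final=0.5` are assumptions for illustration; the text does not specify a value for the rejection threshold. Condition identification by Equation (39) is assumed to have been performed already.

```python
def extend_to_global(p_cond, num_classes, eps=1e-6):
    """Eq. (42): p_cond maps each class seen under the condition to its probability."""
    scale = 1.0 - eps * (num_classes - len(p_cond))
    return [scale * p_cond[k] if k in p_cond else eps for k in range(num_classes)]


def fusion_weight(p_max_di, tau):
    """Fusion weight alpha, bounded to the interval [0.3, 1]."""
    return max(0.3, min(1.0, p_max_di / tau))


def fuse(p_di, p_cond, tau=0.75, tau_final=0.5):
    """Eq. (43) plus the rejection rule; tau_final=0.5 is a placeholder value."""
    alpha = fusion_weight(max(p_di), tau)
    p_ext = extend_to_global(p_cond, len(p_di))
    p_final = [alpha * a + (1.0 - alpha) * b for a, b in zip(p_di, p_ext)]
    best = max(range(len(p_final)), key=p_final.__getitem__)
    needs_review = p_final[best] < tau_final
    return best, p_final, needs_review


p_di = [0.40, 0.35, 0.25]   # low-confidence cross-condition prediction
p_cond = {0: 0.2, 2: 0.8}   # class 1 never appeared under the identified condition
best, p_final, needs_review = fuse(p_di, p_cond)
print(best, needs_review)   # 2 False
```

In this toy case the cross-condition prediction weakly favors class 0, but the condition-aware distribution shifts the fused decision to class 2, and the extended distribution still sums to one.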

Through the above inference procedure, the proposed model realizes a two-stage decision strategy characterized by “direct decision under high confidence and condition-aware refinement under low confidence”. This strategy enables the model to fully exploit condition-invariant features for robust cross-condition generalization, while selectively leveraging condition-specific information to enhance discriminability when necessary, thereby achieving a balanced trade-off between generalization and specialization.

3. Experimental Methodology

3.1. Dataset Description

The experiments in this study involve three datasets: the CWRU bearing fault dataset, the HUST (Huazhong University of Science and Technology) gear fault dataset, and the CHPH-FETB (Cotton Harvester Picking-Head Drivetrain Fault Emulation Test Bench) bearing-gear dataset.

The CWRU (Case Western Reserve University) bearing fault dataset is one of the most widely used public benchmarks in the field of mechanical fault diagnosis. In this work, its multi-operating-condition characteristics are exploited to construct operating-condition incremental learning tasks, which are used to evaluate the fault recognition performance and knowledge retention capability of the proposed model during the gradual introduction of new operating conditions. In the CWRU dataset, localized defects are artificially introduced on the inner race, outer race, and rolling elements of bearings using Electrical Discharge Machining (EDM). These defects correspond to inner race faults (IR), outer race faults (OR), and rolling element faults (B), respectively, together with the normal condition (N), forming four fault categories in total. Regarding defect severity, a defect size of 21 mil is selected in this study. This defect scale represents a moderate level, which is sufficiently large to produce clear and stable fault-related features, while avoiding the issues caused by overly small defects with weak signatures or excessively large defects that introduce strong impulsive components and obscure operating-condition differences. In terms of operating conditions, the CWRU dataset provides four load levels: 0, 1, 2, and 3 hp, corresponding to shaft speeds of approximately 1797, 1772, 1750, and 1730 rpm, respectively. Variations in load directly induce systematic differences in rotational speed and vibration response characteristics, and can therefore be naturally regarded as distinct operating-condition domains. Based on these characteristics, different load conditions are treated as independent incremental tasks in this study, and an operating-condition incremental learning sequence is constructed as summarized in Table 1. At each task stage, the set of fault categories remains unchanged, while new operating conditions are progressively introduced. 
This setting forms a typical continual learning evaluation scenario characterized by “incremental operating conditions with fixed fault categories”.

The HUST gearbox dataset is a publicly available benchmark for gear fault diagnosis [28]. By systematically combining multiple load levels and rotational speed patterns, this dataset constructs a collection of gear vibration signals with pronounced operating-condition differences, making it well suited for evaluating fault recognition and generalization performance under complex condition evolution scenarios. The HUST gearbox data acquisition platform consists of an electric motor, a two-stage gearbox, and a controllable load device, with vibration signals sampled at a frequency of 25.6 kHz. The experimental object involves three gear health states: the normal condition (N), broken tooth fault (BT), and missing tooth fault (MT). All fault conditions are artificially introduced in advance to ensure controllability and consistency in fault type and location. In terms of operating-condition design, the dataset includes five load levels (0, 0.113, 0.226, 0.339, and 0.452 Nm) and six rotational speed patterns, including five constant-speed conditions ranging from 20 to 40 Hz and one time-varying speed condition of 0–40–0 Hz. These combinations result in a total of 30 original operating-condition configurations. Variations in load and rotational speed significantly alter the dynamic response characteristics of the gearbox system and can therefore be naturally regarded as distinct operating-condition domains. Leveraging the rich operating-condition configurations provided by the dataset, this study constructs operating-condition incremental continual learning tasks to evaluate the adaptability and knowledge retention capability of the proposed model during the gradual introduction of new conditions. Seven representative operating conditions are selected and organized into an operating-condition incremental learning sequence consisting of five task stages, as summarized in Table 2. 
The sequence starts from several steady-state operating conditions as the base task and subsequently introduces zero-load, rated-load, cross-load, and finally time-varying speed conditions. This setting forms a continual learning evaluation scenario characterized by “gradually expanding operating conditions with fixed fault categories”, which closely simulates practical applications where gearbox operating conditions continuously evolve and diagnostic models must be updated accordingly.

The CHPH-FETB bearing-gear dataset was collected from a cotton harvester picking-head drivetrain fault emulation test bench designed and constructed by the School of Mechanical Engineering, Xinjiang University. This experimental platform is intended to simulate fault excitations and vibration response characteristics of the picking-head drivetrain under controlled laboratory conditions, thereby providing engineering-oriented data for intelligent fault diagnosis research. The structural schematic and physical layout of the test bench have been reported in the authors’ previous work and are illustrated in Figure 5. The CHPH-FETB platform consists of a driving motor, a drivetrain shafting system, interchangeable bearing and gear components, and a vibration signal acquisition system. It supports repeatable vibration tests for single fault sources under different operating states. In terms of structural configuration and power transmission path, the test bench is consistent with the actual cotton harvester picking-head drivetrain, while operating in a fully controlled laboratory environment. Due to experimental constraints, the platform is not equipped with an independent external load regulation device. Instead, operating conditions are primarily controlled by adjusting the rotational speed of the driving motor, while the system load is naturally formed by the mechanical resistance and friction characteristics of the internal drivetrain components. On this basis, different rotational speeds are used to simulate bearing and gear fault vibration characteristics under various operating conditions. Regarding fault configuration, the CHPH-FETB dataset covers representative health and fault states of two key components in the picking-head drivetrain: bearings and gears. For bearing faults, two bearing types (6205 and 6007) are employed. 
Localized defects are introduced on the inner race and outer race using wire electrical discharge machining, forming groove-shaped defects with controlled width and depth to simulate localized bearing damage caused by fatigue and wear in practical operation. For gear faults, two typical fault types are considered: tooth breakage and root crack. The tooth breakage fault is implemented by milling a notch of a certain depth at the middle of the tooth width, while the root crack fault is simulated by wire cutting at a specified inclination angle near the gear root. Based on the CHPH-FETB test bench, different rotational-speed conditions are organized into an operating-condition incremental learning task sequence, as summarized in Table 3. This setup is used to evaluate the fault recognition performance and knowledge retention capability of the proposed method under conditions that closely approximate practical engineering operation.

3.2. Task Configuration

In this study, an Operating Condition Incremental Learning (OCIL) scenario is adopted to evaluate the effectiveness of the proposed method. OCIL assumes that all fault categories have been learned at the initial stage (

t = 0

), and that only new operating conditions are gradually introduced during subsequent learning stages, while the fault category set remains unchanged throughout all stages. This setting is consistent with practical engineering applications, where operating conditions of equipment evolve over time, whereas fault types are relatively fixed.

To account for differences in the number and distribution of operating conditions across datasets, a unified task design principle is followed when constructing operating-condition incremental learning tasks:

The initial task $S_{0}$ includes several representative operating conditions to ensure that the base model achieves a reasonable level of initial generalization capability;
Each incremental task $S_{i}$ ( $i \geq 1$ ) introduces only one new operating condition, thereby emphasizing the model’s adaptability and knowledge transfer ability under condition changes;
When applicable, variable-speed operating conditions are retained to introduce non-stationary characteristics and increase task difficulty;
The fault category set remains identical across all task stages.

Based on these unified principles, operating-condition incremental learning task sequences are constructed for the CWRU, HUST, and CHPH-FETB datasets, respectively. The detailed task partitions and sample configurations are summarized in Table 1, Table 2 and Table 3.
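The task design principles above can be sketched as a simple partitioning routine. The function name, the number of initial conditions, and the ordering of the four CWRU load levels are hypothetical and do not reproduce the actual partitions in Table 1, Table 2 and Table 3.

```python
def build_ocil_tasks(conditions, n_initial):
    """S_0 holds the first n_initial conditions; each later task adds one condition."""
    tasks = [list(conditions[:n_initial])]
    tasks.extend([c] for c in conditions[n_initial:])
    return tasks


# Hypothetical ordering using the four CWRU load levels; not the split of Table 1.
tasks = build_ocil_tasks(["0hp", "1hp", "2hp", "3hp"], n_initial=2)
print(tasks)  # [['0hp', '1hp'], ['2hp'], ['3hp']]
```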

3.3. Comparison Methods

To evaluate the effectiveness and stability of the proposed method under the operating-condition incremental learning scenario, both conventional baseline strategies and representative continual learning methods are employed for comparison. Specifically, a lower-bound baseline and an upper-bound oracle are first introduced, followed by several classical continual learning approaches that have been extensively validated in prior studies:

Fine-tune: At each incremental stage, the model is fine-tuned using only data from the current operating condition, without introducing any historical information constraints or knowledge retention mechanisms. This approach represents the most basic end-to-end updating strategy and serves as a lower-bound reference, reflecting the performance degradation when continual learning capability is absent.
Joint Oracle: This method assumes that data from all historical operating conditions and the current condition are simultaneously accessible at each stage, and the model is trained jointly on the full dataset. Although infeasible in practical online scenarios, it serves as an upper-bound reference to quantify the performance gap between continual learning methods and the ideal offline training result. It should be noted that the data distribution and operating-condition sequence used by the Joint Oracle are consistent with the incremental task setup, and no additional balancing across stages is applied. The performance gain is therefore solely attributed to the availability of all historical data.
Learning Without Forgetting (LwF) [30]: LwF mitigates catastrophic forgetting by constraining the model to preserve its outputs on previously learned tasks via knowledge distillation, while learning new tasks. This method does not require explicit storage of historical data and introduces minimal additional memory overhead, making it a representative continual learning approach under memory-free settings.
Incremental Learning with Dual Memory (IL2M) [31]: IL2M addresses class imbalance introduced during incremental learning by incorporating a dual-memory mechanism, which leverages both exemplar memory and statistical information to assist model updating. By combining fine-tuning with memory-based correction, IL2M improves learning stability across multiple incremental stages.
End-to-End Incremental Learning (EEIL) [32]: EEIL jointly employs cross-entropy loss and knowledge distillation during incremental training, enabling the model to adapt to new data distributions while preserving discriminative knowledge of previously learned tasks. This method achieves relatively stable performance without requiring explicit expansion of the model architecture.
Incremental Classifier and Representation Learning (iCaRL) [33]: iCaRL is a representative replay-based continual learning method that alleviates catastrophic forgetting by storing a limited number of historical samples as an exemplar memory and replaying them together with new data during training. In addition, iCaRL adopts a nearest-mean-of-exemplars classification strategy, which decouples feature representation learning from classification decisions.
Learning a Unified Classifier Incrementally via Rebalancing (LuCIR) [34]: LuCIR focuses on mitigating classifier bias caused by imbalanced data distributions between old and new tasks. It introduces a cosine-normalized classifier and rebalancing strategies to enhance inter-class separability in the feature space. By jointly incorporating distillation constraints and discriminative losses, LuCIR enables progressive updating of a unified classifier.
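
Several of the baselines above (LwF, EEIL, LuCIR) rely on some form of knowledge distillation between the previous and the current model. The following is a minimal sketch of an output-level distillation term, written here as a KL divergence between temperature-softened outputs. The temperature value and the KL form are assumptions for illustration; each cited method defines its own distillation loss in its original paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(old_logits, new_logits, temperature=2.0):
    """KL(p_old || p_new) on softened outputs; zero when both models agree."""
    p_old = softmax(old_logits, temperature)
    p_new = softmax(new_logits, temperature)
    return sum(p * math.log(p / q) for p, q in zip(p_old, p_new) if p > 0.0)


# The frozen previous model provides old_logits; the current model provides new_logits.
same = distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
drifted = distillation_loss([2.0, 0.5, -1.0], [-1.0, 0.5, 2.0])
```

In the fixed-label OCIL setting the old and new models share the same output dimension, so such a term can be applied to all fault classes without masking.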

Overall, the selected comparison methods span a wide range of incremental learning strategies, from conventional updating without knowledge retention to classical continual learning approaches based on distillation, replay, and memory mechanisms. Through systematic comparison with these methods, the adaptability, stability, and long-term learning capability of the proposed approach under gradually evolving operating conditions can be comprehensively evaluated.

3.4. Network Architecture

The network architecture adopted in this study consists of a feature encoder, a feature disentanglement module, prototype-related mapping modules, and several auxiliary branches. The overall architecture and layer-wise parameter configurations are summarized in Table 4. All modules are implemented in an end-to-end manner, and their parameters are selectively updated at different training stages according to the predefined scheduling strategy.

The feature encoder follows a hierarchical convolutional neural network structure. It is composed of multiple two-dimensional convolutional layers, batch normalization layers, and nonlinear activation functions, which progressively extract discriminative representations from the time-frequency domain. A global average pooling operation is applied at the top of the encoder to produce a fixed-dimensional high-level feature representation. This encoder serves as a shared backbone for subsequent feature disentanglement and is trained in the initialization stage to learn stable and general feature representations.

Based on the encoder output, the model constructs a condition-invariant feature disentangler and a condition-specific feature disentangler. Both disentanglers are implemented using multi-layer fully connected networks that map the global feature representation into lower-dimensional embedding spaces. To enhance the geometric adaptability of the feature space, lightweight adapter modules are introduced at the higher layers of the disentanglers. These adapters are combined with the original feature mappings through residual connections, and the resulting features are subsequently normalized. This design preserves feature dimensional consistency while providing sufficient flexibility for structural adjustment and incremental adaptation in later stages.
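
The residual adapter described above can be illustrated with a dependency-free sketch. It assumes a bottleneck adapter (down-projection, ReLU, up-projection) and L2 normalization of the result; the paper does not specify the adapter's internal layout or the normalization type in this section, so both are assumptions, as are the names `adapter_forward`, `W_down`, and `W_up`.

```python
import math


def matvec(W, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def adapter_forward(h, W_down, W_up):
    """Return normalize(h + W_up @ relu(W_down @ h)); output dimension equals len(h)."""
    z = [max(0.0, v) for v in matvec(W_down, h)]       # bottleneck + ReLU
    delta = matvec(W_up, z)                            # back to the input dimension
    out = [a + b for a, b in zip(h, delta)]            # residual connection
    norm = math.sqrt(sum(v * v for v in out)) or 1.0   # guard against a zero vector
    return [v / norm for v in out]


# With a zero up-projection the adapter reduces to plain normalization of h.
h = [3.0, 4.0]
identity_like = adapter_forward(h, W_down=[[1.0, 0.0]], W_up=[[0.0], [0.0]])
```

Initializing the up-projection at zero is one common way to let an adapter start from the unmodified feature mapping; whether the authors do so is not stated.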

The condition-invariant and condition-specific features are concatenated along the feature dimension and further projected into a unified fused feature space via an independent mapping head. The fused features are used for subsequent prototype construction and geometric constraint computation. In addition, the network includes a discriminator branch for operating-condition recognition and a classification head for fault category prediction. Both branches are implemented using multi-layer fully connected networks and are activated or frozen at different training stages according to the training strategy. A decoder branch is also incorporated to enforce feature reconstruction constraints during training. This branch takes the fused feature representation as input and reconstructs the original time-frequency representation through progressive upsampling and skip connections, thereby enhancing the preservation of structural information in the encoded features from an unsupervised perspective. The decoder is enabled only during training to assist feature learning and representation regularization, and it is excluded from forward computation during testing and inference, thus incurring no additional inference complexity.

At the implementation level, all batch normalization layers are switched to evaluation mode when the corresponding modules are frozen, in order to prevent distribution drift caused by updates to running statistics. Different learning rates can be assigned to different modules during training to accommodate their functional roles at various stages. All implementation details are realized using the PyTorch framework, with unified parameter initialization and training schedules strictly followed to ensure experimental stability and reproducibility.

3.5. Experimental Details

In all experiments conducted in this study, the loss weights involved in model training follow a general principle of "primary supervision first, structural constraints second, and lightweight regularization". Unless otherwise specified, the hyperparameter settings remain consistent across all experiments. To ensure reproducibility, the random seed is fixed throughout all experiments, and the network architecture, training strategy, and hyperparameter configurations are kept identical. All experiments are conducted on a workstation equipped with an Intel Core i5-14600K CPU and an NVIDIA RTX 4070 Ti GPU (16 GB memory). The models are implemented using the PyTorch 2.1 framework and accelerated with CUDA 12.2.

The initialization stage ($t = 0$) is divided into two sub-stages, namely Stage 1 and Stage 2. Stage 1 corresponds to the joint feature learning phase, whose objective is to learn discriminative condition-invariant features and condition-specific features from multi-condition labeled samples, while preserving information completeness during feature disentanglement. In this stage, all network modules are trainable. The classification loss and reconstruction loss serve as the primary supervision terms, with weights set to $\lambda_{cls} = 1.0$ and $\lambda_{re} = 1.0$, respectively. The operating-condition discrimination loss for condition-specific features is assigned a weight of $\lambda_{D_{ds}} = 0.5$ to ensure effective encoding of condition-related information. For the adversarial discrimination loss on condition-invariant features, a soft-start scheduling strategy is adopted to alleviate instability during early training. Its weight increases gradually with training progress: $\lambda_{D_{di}}(e) = \lambda_{D_{di}}^{max} \cdot \frac{1}{1 + \exp\left(-\alpha^{di}\left(\frac{e}{T_{1}} - \beta^{di}\right)\right)}$, where $\lambda_{D_{di}}^{max} = 0.5$, $\alpha^{di} = 10$, $\beta^{di} = 0.4$, $e$ denotes the current training epoch, and $T_{1}$ is the total number of training epochs in Stage 1. A decorrelation regularization term is introduced as a lightweight constraint with weight $\lambda_{decorr} = 0.01$.
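
The soft-start schedule for the adversarial weight can be evaluated directly from the stated formula and constants. The sketch below uses the Stage 1 length of 130 epochs reported in this section; the function name `adversarial_weight` is a hypothetical label.

```python
import math


def adversarial_weight(epoch, total_epochs, lam_max=0.5, alpha=10.0, beta=0.4):
    """Sigmoid soft-start: lam_max / (1 + exp(-alpha * (epoch / total_epochs - beta)))."""
    progress = epoch / total_epochs
    return lam_max / (1.0 + math.exp(-alpha * (progress - beta)))


T1 = 130  # Stage 1 epochs
start = adversarial_weight(0, T1)      # close to zero at the beginning
midpoint = adversarial_weight(52, T1)  # e / T1 = 0.4 = beta, so half of lam_max
end = adversarial_weight(T1, T1)       # approaches lam_max
```

With these constants the weight is about 0.009 at the first epoch, exactly half of its maximum (0.25) at epoch 52, and about 0.499 at the end of Stage 1, so the adversarial term is nearly inactive while the classifier and decoder are still settling.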

Stage 2 corresponds to the prototype structure formation phase, which aims to further refine the prototype geometry. In this stage, the classifier, decoder, and domain discriminators are frozen. The front convolutional layers of the encoder (Conv1–Conv5 in Table 4) and the first fully connected layer of the feature disentanglement modules (FS_FC1) are progressively fixed after Stage 1 training, while only the high-level encoder layers (Conv6–Conv8) and the geometric adjustment modules remain trainable. Training in this stage relies solely on prototype-based Pull-Push constraints to avoid disturbing the established feature semantics. The corresponding loss weights are set as follows: $\lambda_{pull}^{di} = 1.0$, $\lambda_{push}^{di} = 0.3$, $\lambda_{pull}^{ds} = 0.8$, $\lambda_{push}^{ds} = 0.4$, $\lambda_{pull}^{df} = 1.0$, and $\lambda_{push}^{df} = 0.2$. Among them, the Pull term for condition-invariant features primarily enhances intra-class compactness, while the Push term separates different classes. The constraints on condition-specific and fused features are relatively weaker.
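
To make the role of the Pull and Push weights concrete, the following sketch shows one generic way such a prototype constraint can be written: a squared distance to the sample's own prototype (Pull) plus a margin hinge against all other prototypes (Push). The exact loss forms used by DP-CL are defined in the method section, so the distance choice, the `margin` parameter, and the function names here are assumptions for illustration only.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pull_push_loss(feature, label, prototypes, lam_pull=1.0, lam_push=0.3, margin=1.0):
    """Weighted sum of a Pull term (own prototype) and a Push term (other prototypes)."""
    pull = sq_dist(feature, prototypes[label])
    push = sum(
        max(0.0, margin - math.sqrt(sq_dist(feature, proto))) ** 2
        for cls, proto in prototypes.items()
        if cls != label
    )
    return lam_pull * pull + lam_push * push


# Toy 2-D prototypes (illustrative values, not taken from the experiments).
prototypes = {"normal": [0.0, 0.0], "fault_1": [3.0, 0.0]}
on_prototype = pull_push_loss([0.0, 0.0], "normal", prototypes)
near_wrong_class = pull_push_loss([2.5, 0.0], "normal", prototypes)
```

The default weights correspond to the condition-invariant setting ($\lambda_{pull}^{di} = 1.0$, $\lambda_{push}^{di} = 0.3$); the other feature spaces would reuse the same function with their own weights.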

In the incremental learning stage ($t > 0$), new operating conditions are introduced sequentially, and each condition contains only a small number of labeled samples. To prevent catastrophic forgetting, most model parameters are frozen, including the entire encoder (Conv1–Conv8) and the lower layers of the feature disentanglement modules (FS_FC1 and FS_FC2). Only the higher disentanglement layer (FS_FC3), adapter components, and fused feature mapping modules (Proj_Df) remain trainable for adaptation to new operating conditions. Each incremental step is trained for a limited number of epochs using a smaller learning rate, together with an early-stopping strategy to reduce overfitting risk under small-sample conditions. Early stopping is determined based on validation performance, and the running statistics of batch normalization layers are frozen synchronously when the corresponding modules are fixed, in order to avoid feature distribution drift. This stage adopts a combined optimization strategy of prototype constraints and multi-level knowledge distillation, with the following loss weights: $\lambda_{var\text{-}di} = 0.05$, $\lambda_{pull}^{ds} = 1.0$, $\lambda_{push}^{ds} = 0.5$, $\lambda_{pull}^{df} = 1.0$, $\lambda_{push}^{df} = 0.2$, $\lambda_{KD}^{di} = 1.0$, $\lambda_{KD\text{-}proto}^{di} = 0.05$, $\lambda_{KD\text{-}proto}^{ds} = 0.3$, and $\lambda_{KD\text{-}rel}^{ds} = 0.2$.
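
The stage-wise freezing policy can be summarized as a lookup table of trainable modules. The layer names Conv1–Conv8, FS_FC1–FS_FC3, and Proj_Df follow the text; the labels `Adapter`, `Classifier`, `Decoder`, and `Discriminators` are hypothetical shorthand, and treating FS_FC2, FS_FC3, the adapters, and Proj_Df as trainable in Stage 2 is an interpretation of "geometric adjustment modules", not a statement from the paper.

```python
ENCODER = [f"Conv{i}" for i in range(1, 9)]
DISENTANGLER = ["FS_FC1", "FS_FC2", "FS_FC3"]
HEADS = ["Adapter", "Proj_Df", "Classifier", "Decoder", "Discriminators"]
ALL_MODULES = ENCODER + DISENTANGLER + HEADS

TRAINABLE = {
    # Stage 1: joint feature learning, everything is updated.
    "stage1": set(ALL_MODULES),
    # Stage 2: Conv1-Conv5, FS_FC1, classifier, decoder, discriminators are frozen.
    "stage2": {"Conv6", "Conv7", "Conv8", "FS_FC2", "FS_FC3", "Adapter", "Proj_Df"},
    # Incremental stages: whole encoder plus FS_FC1 and FS_FC2 are frozen.
    "incremental": {"FS_FC3", "Adapter", "Proj_Df"},
}


def frozen_modules(stage):
    """Return the modules that must be frozen (and kept in eval mode) for a stage."""
    return [m for m in ALL_MODULES if m not in TRAINABLE[stage]]


frozen_incremental = frozen_modules("incremental")
```

In a PyTorch implementation, each frozen module would additionally have its batch normalization layers switched to evaluation mode, as the text requires, so that running statistics stop updating.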

Regarding training epochs, the total number of epochs in the initialization stage ($t = 0$) is set to 200, with Stage 1 and Stage 2 occupying 130 and 70 epochs, respectively. For the incremental learning stages ($t > 0$), the maximum number of training epochs for each operating-condition increment is set to 50. The actual number of epochs is dynamically determined by the validation-based early-stopping strategy, so as to balance adaptation capability and overfitting risk under limited-sample settings.
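
A validation-based early-stopping rule with a 50-epoch cap can be sketched as follows. The patience value and the use of validation accuracy as the monitored quantity are assumptions; the text states only that stopping is based on validation performance.

```python
class EarlyStopping:
    """Stop when validation accuracy has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_accuracy):
        """Record one epoch; return True when training should stop."""
        if val_accuracy > self.best + self.min_delta:
            self.best = val_accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def run_increment(val_accuracies, max_epochs=50, patience=5):
    """Replay a validation curve; return (epochs actually used, best accuracy)."""
    stopper = EarlyStopping(patience=patience)
    for epoch, acc in enumerate(val_accuracies[:max_epochs], start=1):
        if stopper.step(acc):
            return epoch, stopper.best
    return min(len(val_accuracies), max_epochs), stopper.best


# Illustrative validation curve (made-up numbers): stops after 3 epochs without improvement.
epochs_used, best_acc = run_increment([0.80, 0.85, 0.84, 0.83, 0.85], patience=3)
```

In practice the model weights from the best epoch would also be checkpointed and restored, which this sketch omits.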

For sample selection, the proposed method selects relatively similar and relatively dissimilar operating-condition subsets from historical conditions based on the operating-condition similarity ranking. Although the selection process is parameterized by the ratios $\rho_{top}$ and $\rho_{bot}$ in the method description, the number of available historical operating conditions in the CWRU, HUST, and CHPH-FETB datasets is relatively limited. As a result, the numbers of selected conditions often degenerate to small discrete values in practice. Therefore, a fixed selection scheme is adopted in all experiments: at each incremental stage, $M_{sim}^{t} = 2$ similar operating conditions and $M_{far}^{t} = 2$ dissimilar operating conditions are selected for the distillation process. This setting balances similarity-driven knowledge transfer and necessary long-range contrast constraints under small-scale operating-condition scenarios.
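
The fixed selection scheme can be illustrated by ranking historical conditions against the new one and taking the two ends of the ranking. The sketch assumes cosine similarity between condition prototypes computed from condition-specific features; the similarity measure actually used is defined in the method section, and the names `select_conditions` and `history` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def select_conditions(new_proto, history, m_sim=2, m_far=2):
    """Return (most similar, most dissimilar) historical conditions, kept disjoint."""
    ranked = sorted(history, key=lambda name: cosine(new_proto, history[name]), reverse=True)
    similar = ranked[:m_sim]
    remaining = ranked[m_sim:]  # avoids selecting a condition twice
    dissimilar = remaining[-m_far:] if m_far > 0 else []
    return similar, dissimilar


# Toy 2-D condition prototypes (illustrative values only).
history = {
    "c1": [1.0, 0.0], "c2": [0.9, 0.1], "c3": [0.0, 1.0],
    "c4": [-1.0, 0.0], "c5": [0.5, 0.5],
}
similar, dissimilar = select_conditions([1.0, 0.0], history)
```

When fewer than four historical conditions exist, as in the earliest incremental stages, this sketch fills the similar set first and returns whatever remains as the dissimilar set; how the paper handles that case is not stated here.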

Regarding optimization details, all trainable network parameters are initialized using Xavier initialization and updated using the Adam optimizer with momentum parameters $\beta_{1} = 0.9$ and $\beta_{2} = 0.999$.

4. Experimental Results and Analysis

4.1. Overall Performance Analysis Under OCIL

To systematically evaluate the overall performance of different methods under the Operating-Condition Incremental Learning (OCIL) scenario, this study adopts an “overall trend analysis followed by dataset-specific analysis” strategy for result discussion. First, the diagnostic performance evolution of different methods is examined from a global perspective as new operating conditions are progressively introduced. Subsequently, detailed analyses are conducted for each dataset in the following subsections.

From the overall performance perspective, the classification accuracy trends on the CWRU, HUST, and CHPH-FETB datasets are illustrated in Figure 6, Figure 7, and Figure 8, respectively, while the corresponding forgetting rate variations are shown in Figure 9. Despite the differences among datasets in terms of equipment characteristics, signal acquisition conditions, and operating-condition complexity, consistent performance evolution trends can be observed across all datasets during the OCIL process, providing a reliable basis for the subsequent dataset-specific analysis.

From the perspective of overall classification accuracy, all methods achieve relatively high performance on the CWRU dataset, and the performance gaps among different approaches are comparatively small. This can be attributed to the limited number of operating conditions, stable running states, and relatively high signal-to-noise ratio of the CWRU dataset. In this scenario, the distribution shifts among different operating conditions are mainly caused by load variations, which impose a relatively mild impact on feature representations. As a result, most continual learning methods are able to preserve previously learned knowledge to a certain extent, and their overall performance approaches the upper bound established by offline joint training.

As task difficulty increases, the HUST and CHPH-FETB datasets introduce a wider range of load–speed combinations as well as non-stationary variable-speed operating conditions, leading to increasingly pronounced performance differences among methods. Under multi-stage operating-condition increments, models are required not only to adapt to newly introduced running conditions but also to maintain reliable discrimination for historical conditions. This setting poses greater challenges for continual learning strategies, and differences in accuracy retention and forgetting suppression become more evident.

Across all three datasets, continual learning methods based on knowledge distillation and end-to-end optimization, such as EEIL and LwF, demonstrate relatively stable performance retention during operating-condition increments. By constraining the consistency between the outputs of the current model and the historical model, these methods effectively alleviate the degradation of knowledge learned under previous operating conditions while adapting to new ones. Among them, EEIL typically exhibits superior overall stability under complex operating conditions by jointly optimizing the cross-entropy loss and distillation constraints during incremental training.

The sample-replay-based iCaRL method also shows strong continual learning capability across multiple datasets. By incorporating a small number of historical operating-condition samples as distribution anchors during training, iCaRL can partially mitigate representation drift induced by operating-condition changes, thereby maintaining a more balanced performance between new and old conditions. However, as the number and complexity of operating conditions increase, its performance is still constrained by the size and representativeness of the selected replay samples.

In contrast, methods primarily designed for class imbalance or class-incremental scenarios, such as IL2M and LuCIR, exhibit relatively limited advantages under the OCIL setting. Since the fault category set remains unchanged across all stages in the considered tasks, certain mechanisms in these methods that are tailored to address new-old class bias cannot be fully exploited, resulting in larger performance fluctuations under complex operating conditions.

From the forgetting-rate perspective, Fine-tune consistently exhibits severe forgetting across all datasets as operating conditions are progressively introduced, confirming that models lacking explicit knowledge-preservation mechanisms struggle to cope with continual operating-condition evolution. In comparison, most continual learning methods suppress forgetting to varying degrees, with approaches based on distillation constraints or sample-selection mechanisms demonstrating more stable forgetting control in multi-stage tasks. Joint Oracle maintains the lowest forgetting rate on all datasets and serves as a theoretical performance upper bound, providing a reference for evaluating the long-term effectiveness of continual learning methods.
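
For readers unfamiliar with forgetting measures, the sketch below computes a widely used average-forgetting statistic from a stage-by-task accuracy table: for each earlier task, the best accuracy reached at any previous stage minus the accuracy after the final stage. This is a generic definition offered as an assumption; the forgetting rate reported in Figure 9 follows the paper's own definition, and the numbers in the example are made up for illustration.

```python
def average_forgetting(acc):
    """acc[t][j]: accuracy on task j after training stage t (defined for j <= t)."""
    last = len(acc) - 1
    if last == 0:
        return 0.0  # nothing can be forgotten after a single stage
    drops = []
    for j in range(last):
        best_past = max(acc[t][j] for t in range(j, last))
        drops.append(best_past - acc[last][j])
    return sum(drops) / len(drops)


# Illustrative accuracies only (not experimental results).
toy_acc = [
    [0.95],
    [0.90, 0.92],
    [0.85, 0.88, 0.93],
]
toy_forgetting = average_forgetting(toy_acc)
```

Under this definition a model that never loses accuracy on earlier operating conditions scores zero, which is why a jointly trained reference tends to sit at the bottom of such plots.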

Overall, the experimental results indicate that the performance of continual learning strategies under the OCIL setting is closely related to their design objectives. As operating-condition complexity increases, methods that can effectively balance adaptation to new conditions and preservation of historical knowledge tend to achieve more stable long-term performance. Based on these observations, a more detailed analysis of the experimental results on individual datasets is provided in the following sections.

4.2. Performance Analysis on Different Datasets

4.2.1. Results on the CWRU Dataset

The operating conditions in the CWRU dataset are relatively simple, mainly characterized by rotational speed variations under different load levels. The system operates in a stable regime with a high signal-to-noise ratio, and the distribution differences among operating conditions are relatively limited. Therefore, this dataset serves as a baseline benchmark for evaluating the continual learning capability of different methods under relatively ideal conditions.

From the experimental results, it can be observed that most continual learning methods achieve consistently high classification accuracy on the CWRU dataset, with relatively small performance gaps among different approaches. Even during incremental stages where new operating conditions are gradually introduced, methods based on knowledge distillation, sample selection, or memory mechanisms are generally able to preserve discriminative performance on previously learned conditions. Overall, their performance remains close to the upper bound achieved by joint training (Joint Oracle). This observation indicates that when operating-condition shifts are moderate and signal characteristics remain stable, the difficulty of the continual learning task is relatively low.

Despite the limited overall performance gap, noticeable differences still exist among methods in terms of feature stability and discriminative consistency. The Fine-tune method, which updates the model using only data from the current operating condition without any historical knowledge constraints, exhibits a clear performance degradation on early operating conditions as new conditions are introduced. This behavior reflects the typical forgetting phenomenon that arises when no explicit knowledge retention mechanism is employed. In contrast, the remaining continual learning methods are able to effectively suppress the accumulation of forgetting, maintaining relatively low forgetting rates throughout the incremental process.

It is worth noting that under these relatively favorable experimental conditions, the proposed method maintains stable class discrimination behavior across different operating-condition stages. This property is closely related to the introduced class-prototype constraints, which help stabilize class-level feature representations as operating conditions evolve, thereby preventing unnecessary fluctuations of decision boundaries. Moreover, by explicitly modeling the similarity relationships among different operating conditions, the model can adapt to new conditions in a smoother manner, inheriting existing discriminative structures rather than reconstructing them solely from current data.

Overall, the experimental results on the CWRU dataset confirm the basic effectiveness of various continual learning methods under stable operating conditions. Although the performance differences among methods are not significantly amplified in this dataset, their distinct behaviors during operating-condition increments already begin to emerge, providing a useful reference for further analysis under more complex operating scenarios.

4.2.2. Results on the HUST Dataset

Compared with the CWRU dataset, the HUST gearbox fault dataset exhibits significantly higher complexity in its operating-condition settings. In addition to multiple combinations of load and rotational speed, non-stationary variable-speed operating conditions are introduced in later stages. The multi-stage incremental introduction of operating conditions, together with the alternation between stationary and non-stationary regimes, makes this dataset a critical benchmark for evaluating the adaptability and stability of continual learning methods under complex operating scenarios.

From the overall performance trends, the performance gaps among different methods on the HUST dataset become noticeably larger than those observed on the CWRU dataset. During the initial stage and early incremental stages ($S_{0}$–$S_{2}$), most continual learning methods are still able to maintain relatively stable diagnostic performance, indicating that the models retain a certain degree of generalization capability when adapting to newly introduced operating conditions under fixed-speed and fixed-load settings. However, as the number of operating conditions increases and the differences among them become more pronounced, the ability of the model to balance adaptation to new conditions and retention of historical knowledge gradually emerges as the dominant factor influencing performance.

When the S-shaped variable-speed operating condition is introduced at stage $S_{4}$, the performance differences among methods become even more evident. Under this operating condition, the rotational speed varies continuously over time, resulting in strongly non-stationary signal characteristics and a substantial distribution shift compared with previous stationary conditions. At this stage, the Fine-tune method, which lacks effective knowledge preservation mechanisms, exhibits a pronounced performance degradation, highlighting the limitations of relying solely on current data updates under severe distribution shifts.

In contrast, continual learning methods equipped with explicit knowledge retention mechanisms demonstrate more stable diagnostic performance under variable-speed conditions. Knowledge distillation-based approaches constrain the consistency between the outputs of the current and previous models, thereby mitigating excessive parameter updates under non-stationary conditions. Meanwhile, strategies that leverage historical information provide reference distributions from previously learned operating conditions, helping alleviate representation drift caused by operating-condition changes. These mechanisms play a critical role in maintaining performance during the later stages of multi-stage learning.

Notably, the performance of the proposed method on the HUST dataset is highly consistent with its design objectives for handling operating-condition evolution. By introducing class-prototype constraints, the model is able to maintain a relatively stable class-level discriminative structure across multiple operating conditions. Furthermore, the operating-condition similarity-guided strategy encourages the model to focus on the relationships between new and previously observed conditions during adaptation, thereby avoiding abrupt representation shifts in non-stationary stages. This advantage becomes increasingly evident after the introduction of variable-speed conditions, providing effective support for stable learning under complex operating scenarios.

4.2.3. Results on the CHPH-FETB Dataset

The CHPH-FETB dataset is collected from the CHPH-FETB experimental platform, which focuses on the picking-head drivetrain of a cotton harvester as a specific engineering component. By constructing a dedicated support structure and an independent power input system, the platform enables experimental acquisition of vibration signals under different rotational speed conditions. Compared with general-purpose experimental platforms such as CWRU and HUST, this system exhibits stronger engineering specificity in terms of structural configuration, transmission paths, and internal friction characteristics. However, it still operates under laboratory conditions and inevitably differs from real field operating environments.

Due to experimental constraints, no controllable external load mechanism is introduced in the CHPH-FETB platform. Operating-condition variations are mainly achieved by adjusting the motor rotational speed, while the system load is primarily determined by internal mechanical friction and structural impedance within the picking-head drivetrain. As a result, the load level cannot be precisely specified. Despite these limitations, the dataset exhibits a higher degree of structural complexity and stronger coupling between mechanical components and operating conditions than generic test benches, providing a more challenging evaluation environment for studying the long-term behavior of continual learning methods in complex systems.

On the CHPH-FETB dataset, as operating conditions are gradually introduced over multiple incremental learning stages, the overall diagnostic performance of all methods is lower than on the public datasets, and the performance gaps among different methods become more pronounced. In this scenario, the model is required not only to adapt to continuously changing operating states, but also to maintain reliable discrimination of historical operating conditions over a longer task sequence. This places stricter demands on the stability and robustness of continual learning strategies.

From the perspective of forgetting behavior, the performance differences among methods on the CHPH-FETB dataset are mainly reflected in their ability to preserve knowledge from historical operating conditions. Methods lacking effective constraint mechanisms exhibit evident accumulation of forgetting after multiple incremental stages. In contrast, methods incorporating distillation constraints or explicit utilization of historical information are able to suppress performance degradation to some extent, resulting in a relatively smoother growth of forgetting. In systems with high structural complexity and strong operating-condition coupling, such long-term stability is particularly critical for practical deployment.

On this dataset, the proposed method demonstrates relatively consistent learning behavior throughout the multi-stage operating-condition evolution. By constraining class-level feature representations through the class-prototype mechanism, the model is able to maintain a stable discriminative structure under complex structural configurations and diverse operating conditions. Meanwhile, the operating-condition similarity-guided strategy enables the model to inherit previously learned knowledge more smoothly when adapting to new conditions, rather than inducing drastic representation reconstruction. This property helps suppress the continuous accumulation of forgetting over long task sequences, providing effective support for stable operation in engineering-oriented scenarios.

4.2.4. Summary of Dataset-Specific Results

By jointly considering the experimental results on the CWRU, HUST, and CHPH-FETB datasets, a consistent performance evolution trend of different continual learning methods under the OCIL setting can be observed as operating-condition complexity increases. When operating conditions are relatively stable, performance differences among methods remain limited. However, with the introduction of non-stationary operating conditions and increasing system complexity, the ability of a model to balance adaptation to new conditions and retention of historical knowledge gradually becomes the dominant factor determining long-term performance.

The experimental results indicate that methods capable of maintaining discriminative structure stability at the class-representation level, while effectively exploiting the relationships among operating conditions during system evolution, tend to exhibit more robust long-term behavior under multi-stage learning scenarios. These observations provide a solid basis for further mechanism-level analysis of the applicability of different continual learning strategies and the advantages of the proposed approach.

4.3. Model Adaptation Capability Under Few-Shot Conditions

In the preceding experiments, each newly introduced operating condition provides 50 samples per fault class at each incremental stage, which ensures that all methods are trained under relatively sufficient supervision. However, in practical engineering environments, fault samples are often extremely scarce. Therefore, the ability of a model to rapidly adapt to new operating conditions under few-shot settings is a critical criterion for evaluating the practical applicability of continual learning methods. To this end, few-shot experiments are conducted on the CHPH-FETB dataset.

In the initial stage $S_{0}$, the number of training samples remains unchanged at 200 samples per class. In the subsequent stages $S_{1}$ to $S_{4}$, the number of samples per class for each newly introduced operating condition is reduced to 30 or 10. The continual learning results under these few-shot settings are summarized in Table 5.
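
The few-shot configurations can be reproduced by drawing a fixed number of samples per fault class from each new condition's pool. The sketch below assumes uniform random sampling without replacement and a fixed seed; the sampling procedure used in the experiments is not described, and `few_shot_subset` and the sample identifiers are hypothetical.

```python
import random
from collections import defaultdict


def few_shot_subset(samples, shots_per_class, seed=0):
    """samples: list of (sample_id, fault_label) for one new operating condition."""
    by_class = defaultdict(list)
    for sample_id, label in samples:
        by_class[label].append(sample_id)
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    subset = []
    for label in sorted(by_class):
        ids = by_class[label]
        if len(ids) < shots_per_class:
            raise ValueError(f"class {label!r} has only {len(ids)} samples")
        subset.extend((sid, label) for sid in rng.sample(ids, shots_per_class))
    return subset


# Hypothetical pool with 50 samples per class, reduced to the 10-shot setting.
pool = [(f"{c}_{i}", c) for c in ("normal", "fault_1") for i in range(50)]
ten_shot = few_shot_subset(pool, shots_per_class=10)
```

Calling the same function with `shots_per_class=30` gives the intermediate setting, while $S_{0}$ keeps its full 200 samples per class.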

As shown in Table 5, when the number of newly introduced samples per class is gradually reduced from 50 to 30 and further to 10, the diagnostic performance of all methods exhibits a certain degree of degradation across different incremental stages. This trend reflects the inevitable impact of weakened supervision signals on model parameter updating and feature adaptation. Nevertheless, the extent of performance degradation and the overall stability vary significantly among different methods, indicating notable differences in their sensitivity to few-shot conditions.

From the comparison results, the Fine-tune method suffers the most severe performance degradation as the number of samples decreases, demonstrating its strong dependence on sufficient supervision from new operating conditions and its lack of effective knowledge preservation mechanisms. Knowledge-distillation-based methods, such as LwF and EEIL, alleviate catastrophic forgetting to some extent and enable the model to retain basic discriminative capability under few-shot conditions. However, since these methods mainly rely on consistency constraints at the output or feature level, their ability to characterize the feature distribution of new operating conditions remains constrained by the limited sample size.

Methods incorporating sample selection or memory-related mechanisms, such as iCaRL, IL2M, and LuCIR, exhibit relatively better overall stability. With the assistance of historical samples or statistical information, the degree of performance degradation is partially mitigated. Nevertheless, when the number of samples for newly introduced operating conditions is further reduced, these methods still show limited adaptability to extremely scarce data. This limitation becomes more pronounced under multi-stage incremental learning scenarios, where model updates are simultaneously affected by sample scarcity and distribution shift.

The Joint Oracle method, which serves as an ideal upper-bound reference, maintains high performance across all stages. However, it relies on simultaneous access to all historical and current data, making it impractical for online deployment. Therefore, it is only used to characterize the upper performance limit of continual learning methods.

In contrast, the proposed method demonstrates more stable diagnostic performance under different few-shot settings. Among the methods other than the Joint Oracle, it achieves the highest accuracy in 10 of the 12 few-shot settings in Table 5 (stages $S_{1}$ to $S_{4}$ with 10, 30, or 50 samples per class), and is exceeded only marginally by LwF at $S_{1}$ and by EEIL at $S_{2}$ with 30 samples per class. These results indicate that, even when the number of samples for newly introduced operating conditions is further reduced, the proposed method is still capable of rapidly adapting to new conditions. The key reason lies in the introduction of three types of prototype structures corresponding to condition-invariant features, condition-specific features, and fused features, which establish geometrically anchored representations with explicit semantic constraints in the feature space. During incremental learning, the limited new samples are mainly used to locally adjust the existing prototype structure rather than to relearn the entire decision boundary. As a result, the dependence of the model on the quantity of new samples is reduced.

This prototype-centered structured updating mechanism enables the model to maintain discriminative consistency and knowledge stability under limited supervision, highlighting the strong potential of the proposed method for practical engineering scenarios involving few-shot operating-condition evolution.
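The comparison in Table 5 can be checked with a short script. The sketch below copies the few-shot accuracies of the practical (non-oracle) methods from Table 5 and counts the settings in which each method ranks first; the variable and function names are illustrative only.

```python
# Few-shot accuracies (%) copied from Table 5, ordered as
# S1(10, 30, 50), S2(10, 30, 50), S3(10, 30, 50), S4(10, 30, 50).
# The Joint Oracle is omitted because it is an upper-bound reference.
table5 = {
    "Fine-tune": [72.06, 75.37, 80.08, 58.23, 63.56, 68.44,
                  55.28, 59.26, 62.83, 52.78, 54.55, 58.95],
    "LuCIR": [77.02, 82.52, 85.64, 72.68, 77.90, 81.84,
              73.45, 78.18, 80.51, 68.25, 76.75, 79.10],
    "IL2M": [74.85, 81.00, 84.40, 73.41, 78.69, 81.41,
             74.50, 75.86, 80.69, 71.93, 75.81, 78.63],
    "iCaRL": [78.69, 82.37, 86.37, 73.52, 78.77, 83.76,
              76.19, 78.24, 82.60, 70.59, 75.45, 80.06],
    "EEIL": [78.29, 81.93, 85.57, 75.58, 81.69, 83.85,
             70.58, 78.51, 81.65, 70.34, 75.83, 80.89],
    "LwF": [80.40, 83.74, 87.06, 76.06, 81.13, 84.47,
            73.25, 79.30, 82.38, 72.73, 76.61, 80.09],
    "Ours": [80.67, 83.28, 87.76, 79.50, 81.39, 85.70,
             77.47, 80.57, 85.59, 74.50, 79.56, 83.84],
}
settings = [f"S{s}-{n}" for s in range(1, 5) for n in (10, 30, 50)]


def best_method_per_setting(table):
    """Map each stage/sample-size setting to the method with the highest accuracy."""
    winners = {}
    for idx, name in enumerate(settings):
        winners[name] = max(table, key=lambda m: table[m][idx])
    return winners


winners = best_method_per_setting(table5)
ours_wins = sum(1 for m in winners.values() if m == "Ours")
exceptions = {k: v for k, v in winners.items() if v != "Ours"}
print(ours_wins, exceptions)
```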

4.4. Cross-Condition Inference Capability Under Unseen Operating Conditions

In operating-condition incremental learning scenarios, model performance is typically evaluated after the model has been updated using samples from newly introduced operating conditions. However, in practical engineering applications, diagnostic systems often need to perform preliminary analysis and decision-making on vibration signals collected under new operating conditions before the model has been fully updated. Such early-stage diagnosis is crucial for online condition monitoring and decision support. Therefore, whether a model possesses a certain level of cross-condition inference capability—namely, the ability to produce diagnostically meaningful results under unseen operating conditions—serves as an important supplementary criterion for assessing the practical usefulness of continual learning methods.

Based on this consideration, cross-condition inference experiments under unseen operating conditions are conducted on the HUST and CHPH-FETB datasets. Specifically, at each incremental stage $S_{i}$, when a new operating condition $C_{i}$ is introduced, the model trained up to the previous stage $S_{i-1}$ is directly used to evaluate samples from the new operating condition, without performing any training or adaptation on $C_{i}$. Subsequently, the model is updated following the standard operating-condition incremental learning procedure, and its diagnostic performance on the new operating condition is evaluated again after training. It should be emphasized that during the unseen-condition inference phase, all model parameters, prototype structures, and batch normalization statistics remain frozen. Cross-condition inference is conducted solely based on the class prototypes, condition prototypes, and condition-aware class prototypes constructed from previously learned operating conditions. The experimental results are summarized in Table 6.
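The frozen inference step can be illustrated with nearest-prototype matching. The sketch below is a simplification under stated assumptions: it uses only class prototypes and cosine similarity on L2-normalized vectors, whereas the paper's procedure also involves condition prototypes and condition-aware class prototypes. The function names, labels, and three-dimensional toy vectors are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def nearest_prototype(feature, prototypes):
    """Return (label, cosine similarity) of the closest prototype.

    The prototypes are only read, never modified, which mirrors the frozen
    state of the model during unseen-condition inference.
    """
    f = l2_normalize(feature)
    best_label, best_sim = None, -2.0
    for label, proto in prototypes.items():
        p = l2_normalize(proto)
        sim = sum(a * b for a, b in zip(f, p))
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label, best_sim


# Toy class prototypes learned from earlier conditions (illustrative values).
frozen_class_prototypes = {
    "Normal": [1.0, 0.0, 0.0],
    "Bearing1_IR": [0.0, 1.0, 0.0],
    "Gear_WR": [0.0, 0.0, 1.0],
}
unseen_feature = [0.2, 0.9, 0.1]  # feature of a sample from an unseen condition
label, sim = nearest_prototype(unseen_feature, frozen_class_prototypes)
print(label, round(sim, 3))
```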

The results indicate that, on both datasets, the diagnostic model retains a useful level of discriminative capability under unseen operating conditions across the incremental stages (78.31–82.46% on HUST and 70.28–74.29% on CHPH-FETB before any adaptation), demonstrating its inherent cross-condition generalization ability. Since the model has not yet undergone explicit adaptation using samples from the new operating condition, its diagnostic performance under unseen conditions is lower than that achieved after training in every case. Nevertheless, the predictions produced at this stage still have reference value and can support rapid preliminary assessment during the early phase of operating-condition introduction. This observation suggests that, even without explicit modeling of the new operating condition, the model can leverage the feature structures and prototype distributions learned from historical conditions to map samples from unseen conditions into a relatively consistent discriminative space and perform preliminary classification. Moreover, the performance differences observed across incremental stages and between datasets indicate that the similarity between new and historical operating conditions in terms of operating characteristics and vibration patterns directly affects the reliability and stability of cross-condition inference.

After training with samples from the new operating condition, the diagnostic accuracy on that condition improves by 5.38 to 13.38 percentage points (Table 6). This improvement confirms that samples from the new operating condition effectively participate in the prototype updating process, thereby enhancing the model’s adaptability. By comparing the results before and after training, it can be observed that the proposed method does not rely on complete relearning of new operating conditions. Instead, it achieves rapid alignment with the feature distribution of new conditions through localized adjustments based on the existing prototype structure. This prototype-centered inference and updating mechanism enables the model to provide diagnostically useful results during the early stage of operating-condition evolution, while quickly improving performance once a small number of labeled samples become available. These characteristics highlight the potential advantages of the proposed method in practical engineering applications involving gradually evolving operating conditions.
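The before/after comparison in Table 6 reduces to simple differences. The sketch below copies the Table 6 values and computes the accuracy gain, in percentage points, obtained by training on each new condition; the variable names are illustrative.

```python
# (untrained, after training) accuracy in % for stages S1 (C4) to S4 (C7), from Table 6.
table6 = {
    "HUST": [(78.31, 90.43), (81.72, 91.05), (82.46, 87.84), (78.65, 86.58)],
    "CHPH-FETB": [(71.64, 85.02), (74.29, 87.48), (73.58, 81.57), (70.28, 81.06)],
}

# Gain in percentage points for every dataset and stage.
gains = {
    name: [round(after - before, 2) for before, after in rows]
    for name, rows in table6.items()
}
all_gains = [g for values in gains.values() for g in values]

for name, values in gains.items():
    print(name, values)
print("smallest gain:", min(all_gains), "largest gain:", max(all_gains))
```

The smallest gains occur at $S_{3}$ on both datasets, where the untrained accuracy is already comparatively high on HUST.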

4.5. Ablation Study

To systematically investigate the role and necessity of each key component in the proposed method under the operating condition incremental learning (OCIL) scenario, ablation experiments are conducted on the CHPH-FETB dataset. The proposed framework is built upon the concept of prototype-driven continual learning, which integrates similarity-guided condition selection, multi-level knowledge distillation at both feature and prototype levels, and a two-stage inference strategy based on class and condition prototypes. These components jointly aim to enhance rapid adaptation to new operating conditions while suppressing catastrophic forgetting. To clarify the individual contribution of each module, this section evaluates several ablated variants by selectively removing or modifying specific components, while keeping the network architecture and training protocol unchanged.

The following ablated variants are designed:

Uniform condition replay: The replay mechanism is retained, while the condition-similarity guidance is removed. Instead of similarity-aware selection, historical samples are uniformly sampled across all previously seen operating conditions and fault categories. This variant is used to evaluate the effectiveness of the proposed similarity-guided condition selection strategy during the replay stage.
Without distillation: The distillation constraints are removed, and the model is updated solely using samples from the newly introduced operating condition at each stage. This setting characterizes the lower performance bound in the absence of an explicit knowledge retention mechanism.
First-order inference only: The second-stage (second-order) inference process based on condition prototypes is disabled, and fault recognition is performed solely via nearest-prototype matching with class prototypes. This variant is designed to assess the contribution of the second-stage inference mechanism to cross-condition fault discrimination.
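The contrast between uniform condition replay and similarity-guided selection can be sketched as two ways of splitting a fixed replay budget over historical operating conditions. The allocation rule below (a small floor for every condition, with the remainder shared in proportion to similarity) is an assumption made for illustration and is not the paper's proportion-adaptive rule; the function names, the similarity scores, and the budget of 40 samples are likewise hypothetical.

```python
def uniform_allocation(conditions, budget):
    """Split the replay budget as evenly as possible over all historical conditions."""
    base, extra = divmod(budget, len(conditions))
    return {c: base + (1 if i < extra else 0) for i, c in enumerate(conditions)}


def similarity_allocation(similarity, budget, floor=2):
    """Give every condition `floor` samples, then share the rest by similarity.

    `similarity` maps each historical condition to a positive similarity score
    with respect to the newly introduced condition (hypothetical values).
    """
    remaining = budget - floor * len(similarity)
    if remaining < 0:
        raise ValueError("budget too small for the requested floor")
    total = sum(similarity.values())
    alloc = {c: floor + int(remaining * s / total) for c, s in similarity.items()}
    # Rounding leftovers go to the most similar condition.
    leftover = budget - sum(alloc.values())
    alloc[max(similarity, key=similarity.get)] += leftover
    return alloc


# Hypothetical similarity scores of a new condition to historical conditions C1..C4.
sim_to_new = {"C1": 1, "C2": 2, "C3": 6, "C4": 1}
uniform = uniform_allocation(list(sim_to_new), budget=40)
guided = similarity_allocation(sim_to_new, budget=40)
print(uniform)
print(guided)
```

In this toy case the most similar condition receives more than twice its uniform share, while the distant conditions keep a few samples that act as regularization.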

The classification accuracy (Acc) of the ablation variants across incremental stages ($S_{0}$–$S_{4}$) is illustrated in Figure 10, while the corresponding forgetting rate (FR) is reported in Figure 11.
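For readers who wish to reproduce such curves, the forgetting rate can be computed from a per-stage accuracy record. The sketch below uses a common definition from the continual learning literature (the average gap between the best earlier accuracy on each old condition and its current accuracy); this is an assumption, and the paper's own definition may differ. The accuracy values are hypothetical and are not taken from Figure 10 or Figure 11.

```python
def forgetting_rate(acc_history, stage):
    """Average drop from the best earlier accuracy to the current accuracy.

    acc_history[s][c] is the accuracy (%) on condition c after training stage s.
    Only conditions that were already evaluated before `stage` contribute.
    """
    drops = []
    for cond, current in acc_history[stage].items():
        earlier = [acc_history[s][cond] for s in range(stage) if cond in acc_history[s]]
        if earlier:
            drops.append(max(earlier) - current)
    return sum(drops) / len(drops) if drops else 0.0


# Hypothetical per-condition accuracies after stages 0, 1, and 2.
history = [
    {"C1": 92.0, "C2": 90.0},
    {"C1": 88.0, "C2": 89.0, "C3": 85.0},
    {"C1": 86.0, "C2": 90.0, "C3": 80.0, "C4": 84.0},
]
for stage in range(len(history)):
    print(stage, round(forgetting_rate(history, stage), 2))
```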

From the overall performance trends, the complete method consistently achieves the highest or near-highest diagnostic accuracy across all incremental stages, while maintaining the lowest forgetting rate. This indicates that the proposed components, when jointly applied, effectively balance adaptation to new operating conditions and retention of previously learned knowledge. Comparing the uniform condition replay variant with the full method, it can be observed that the performance difference is relatively small during early stages of continual learning (e.g., $S_{0}$–$S_{2}$). This is because the number of historical operating conditions is limited, and uniform replay in practice degenerates to replaying most available conditions. However, as the number of operating conditions increases (from $S_{3}$ onward), uniform sampling must distribute a fixed replay budget over a larger condition set. As a result, the proportion of replay samples highly relevant to the current condition decreases, leading to a gradual reduction in diagnostic accuracy and an increase in forgetting. In contrast, the proposed similarity-guided replay strategy prioritizes historically related conditions while retaining a small number of distant conditions as regularization, which more effectively preserves prototype structure and decision boundaries in later stages.
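The dilution effect of uniform replay can be made concrete with a small calculation. The sketch below assumes the CHPH-FETB setup of Table 3, in which the initial stage covers three operating conditions and each later stage adds one; it computes the fraction of a fixed replay budget that any single historical condition receives under uniform sampling. The function name is illustrative.

```python
from fractions import Fraction


def uniform_share(stage, initial_conditions=3):
    """Fraction of the replay budget given to one historical condition at `stage`.

    At stage S_i (i >= 1) the historical set holds the initial conditions plus
    the i - 1 conditions added at earlier incremental stages.
    """
    if stage < 1:
        raise ValueError("replay only applies from stage S1 onward")
    historical = initial_conditions + (stage - 1)
    return Fraction(1, historical)


shares = {f"S{i}": uniform_share(i) for i in range(1, 5)}
print(shares)
```

Under these assumptions the share of the single most relevant historical condition halves between $S_{1}$ and $S_{4}$, which is consistent with the late-stage decline described above.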

The without distillation variant exhibits the most severe performance degradation across all incremental stages. Its diagnostic accuracy rapidly declines as new conditions are introduced, accompanied by a significantly higher forgetting rate. This result confirms that updating the model solely based on new-condition samples is insufficient to prevent catastrophic forgetting, highlighting the fundamental role of distillation mechanisms in continual learning.

When comparing the first-order inference only variant with the full method, removing the condition-aware second-stage inference leads to noticeable degradation in both accuracy and forgetting rate, particularly in mid-to-late stages. This suggests that relying solely on class prototypes becomes insufficient under complex and evolving operating conditions. Incorporating condition prototypes in the second-stage inference provides effective contextual constraints when class-level confidence is low, thereby improving the robustness of cross-condition diagnosis.
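The two-stage inference idea can be sketched as follows. The decision rule is an assumption made for illustration: the first stage matches condition-invariant features against class prototypes, and the second stage is triggered only when the similarity margin between the two best classes falls below a threshold, in which case the nearest condition prototype selects a set of condition-aware class prototypes. The threshold value, function names, and toy vectors are hypothetical, and all vectors are assumed to be L2-normalized already.

```python
def dot(a, b):
    """Inner product, equal to cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def two_stage_predict(f_inv, f_spec, f_fused, class_protos, cond_protos,
                      cond_class_protos, margin_threshold=0.1):
    """Stage 1 uses class prototypes; stage 2 runs only when stage 1 is ambiguous."""
    ranked = sorted(((dot(f_inv, p), k) for k, p in class_protos.items()), reverse=True)
    top_sim, top_label = ranked[0]
    margin = top_sim - ranked[1][0] if len(ranked) > 1 else float("inf")
    if margin >= margin_threshold:
        return top_label, "stage1"
    # Low class-level confidence: find the closest known condition, then
    # match the fused feature against that condition's class prototypes.
    cond = max(cond_protos, key=lambda c: dot(f_spec, cond_protos[c]))
    candidates = cond_class_protos[cond]
    label = max(candidates, key=lambda k: dot(f_fused, candidates[k]))
    return label, "stage2"


class_protos = {"IR": [1.0, 0.0], "OR": [0.0, 1.0]}
cond_protos = {"C1": [1.0, 0.0], "C2": [0.0, 1.0]}
cond_class_protos = {
    "C1": {"IR": [1.0, 0.0], "OR": [0.0, 1.0]},
    "C2": {"IR": [0.6, 0.8], "OR": [0.8, -0.6]},
}

confident = two_stage_predict([0.95, 0.3], [0.1, 0.99], [0.8, -0.6],
                              class_protos, cond_protos, cond_class_protos)
ambiguous = two_stage_predict([0.72, 0.69], [0.1, 0.99], [0.8, -0.6],
                              class_protos, cond_protos, cond_class_protos)
print(confident, ambiguous)
```

In the ambiguous case, class prototypes alone would favour "IR" by a narrow margin, whereas the condition-aware second stage selects "OR".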

Overall, the ablation results demonstrate that the key components of the proposed framework play complementary roles at different stages of continual learning. Similarity-guided replay improves knowledge retention efficiency under limited replay budgets, distillation constraints effectively mitigate catastrophic forgetting, and the two-stage inference mechanism based on class and condition prototypes further enhances diagnostic robustness under gradually evolving operating conditions. Their synergistic design enables the proposed method to achieve both stability and adaptability in OCIL scenarios.

5. Conclusions

This study investigates fault diagnosis of cotton harvester picking-head drivetrains in practical scenarios where operating-condition coverage is limited when the model is first built, and fault samples from additional rotational speeds and operating conditions are introduced gradually during subsequent use. Because the initial training stage provides relatively sufficient samples but limited operating-condition diversity, and new-condition samples arrive stage by stage in small quantities, the key challenge lies in achieving stable model updating while avoiding performance degradation. To address this challenge, a continual learning-based fault diagnosis framework is developed in this work.

The proposed framework is built upon the explicit disentanglement of condition-invariant features and condition-specific features. By constructing a multi-level representation system consisting of class prototypes, condition prototypes, and condition-aware class prototypes, and by integrating prototype-driven structured updates with controlled knowledge transfer mechanisms, the framework achieves stable anchoring of fault-discriminative semantics while orderly absorbing operating-condition differences during the cumulative introduction of new-condition samples. As a result, common issues in continual learning, such as performance degradation and discriminative structure drift, are effectively alleviated.

Considering the characteristics of limited new-condition samples and significant distribution differences among operating conditions during continual learning, an operating-condition similarity measurement method based on condition-specific features is further introduced, together with a proportion-adaptive sample selection strategy. This strategy prioritizes the retention of historical knowledge that is highly relevant to the current operating condition during model updates, while maintaining the overall discriminative boundaries of the feature space through contrastive constraints from less-related operating conditions. Compared with continual learning methods that rely solely on full replay or random replay, the proposed mechanism effectively improves long-term stability and robustness without significantly increasing storage or computational overhead.

Experimental results under progressively expanding operating-condition scenarios demonstrate that the proposed method exhibits stable advantages in overall performance retention, cross-stage stability, and resistance to performance degradation. Compared with representative continual learning approaches, the proposed model achieves an average classification accuracy improvement of approximately 3–5% across multiple datasets at the final incremental stage, while reducing the forgetting rate by more than 20% in most operating-condition expansion scenarios. As the number of operating conditions increases, competing methods generally exhibit amplified performance fluctuations or gradual degradation, whereas the proposed approach maintains relatively stable performance evolution and preserves clear class structures during multi-stage model updating.

Overall, the proposed continual learning-based diagnostic framework, through the synergistic mechanism of feature disentanglement, prototype constraints, and similarity-driven knowledge transfer, offers a stable and scalable solution for long-term condition monitoring and fault diagnosis of cotton harvester picking-head drivetrains under gradually expanding operating-condition coverage. Although the current study is primarily validated using vibration data collected under laboratory conditions, the results demonstrate strong potential for addressing performance degradation and knowledge forgetting in scenarios with progressively introduced operating conditions. Future work will further integrate practical sensing constraints and maintenance requirements, optimize model structures and update strategies accordingly, and validate the proposed approach under experimental conditions that are closer to real-world agricultural operations.

Author Contributions

Conceptualization, W.S. and H.J.; Methodology, H.J.; Software, H.J. and X.W.; Data curation, H.J. and H.W.; Formal analysis, H.J. and H.W.; Investigation, H.W.; Visualization, H.J. and X.W.; Validation, H.J. and X.W.; Resources, W.S. and X.W.; Writing-original draft preparation, H.J.; Writing-review and editing, W.S. and H.W.; Supervision, W.S.; Funding acquisition, W.S. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Key Research and Development Projects of Xinjiang Uygur Autonomous Region (Grant Nos. 2020B02014 and 2022B02016).

Data Availability Statement

The data supporting the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to submit the manuscript for publication.

References

1. Tian, L.; Shi, F.; Cui, J.; Luo, H. Analysis of Cotton Production Status and Key Supporting Technologies in Xinjiang. In Cotton Production Trends and Uses; IntechOpen: London, UK, 2025.
2. Shi, M.; Gao, X.; Zhang, W.; Ran, K.; Zhong, L.; Xu, L. Development Status and Prospect of Research on Key Technologies of Cotton Pickers. J. Agric. Mach. 2025, 56, 167–183.
3. Chen, T.; Zhang, H.; Wang, L.; Zhang, L.; Wang, J.; Li, J.; Gu, Y. Optimization and Experiments of Picking Head Transmission System of Horizontal Spindle-Type Cotton Picker. Trans. Chin. Soc. Agric. Eng. (Trans. CSAE) 2020, 36, 18–26.
4. Ma, L.; Li, Z.; Yang, S.; Wang, J. A Review on Vibration Sensor: Key Parameters, Fundamental Principles, and Recent Progress on Industrial Monitoring Applications. Vibration 2025, 8, 56.
5. Roshchupkin, O.; Pavlenko, I. Modern Methods for Diagnosing Faults in Rotor Systems: A Comprehensive Review and Prospects for AI-Based Expert Systems. Appl. Sci. 2025, 15, 5998.
6. Wang, H.; Lao, L.; Zhang, H.; Tang, Z.; Qian, P.; He, Q. Structural Fault Detection and Diagnosis for Combine Harvesters: A Critical Review. Sensors 2025, 25, 3851.
7. Yang, Q.; Wang, L.; Wicker, J.; Dobbie, G. Continual Learning: A Systematic Literature Review. Neural Netw. 2026, 195, 108226.
8. Jiang, M.; Fan, J.; Li, F. Advances in Continual Learning: A Comprehensive Review. Expert Syst. Appl. 2025, 294, 128739.
9. Zhu, H.; Shen, C.; Li, L.; Wang, D.; Huang, W.; Zhu, Z. Reserving Embedding Space for New Fault Types: A New Continual Learning Method for Bearing Fault Diagnosis. Reliab. Eng. Syst. Saf. 2024, 252, 110433.
10. Zhang, Y.; Shen, C.; Yang, H.; Wang, D.; Shi, J.; Huang, W. Weight Alignment Prototype Contrastive Network for Rotating Machinery Continual Fault Diagnosis Under Class Imbalanced Data. IEEE Trans. Instrum. Meas. 2025, 74, 1–11.
11. Shan, H.; Zhang, X.; Liang, W.; Wu, Z.; Shao, H.; Qin, G. A Prototype Learning Framework Based on Continual Learning for Motor Incremental Fault Diagnosis Under Few-Shot Conditions. IEEE Trans. Instrum. Meas. 2025, 74, 1–11.
12. Li, J.; Yue, K.; Chen, Z.; Xia, J.; Li, W.; Zhang, X. An Uncertainty-Aware Continual Learning Framework for Fault Diagnosis of Rotating Machinery with Homogeneous-Heterogeneous Faults. IEEE Trans. Autom. Sci. Eng. 2024, 23, 3284–3298.
13. Tian, J.; Yu, Y.; Karimi, H.R.; Lin, J. A Test-Time Adaptation Method Using Evidential Deep Learning for Online Machinery Fault Diagnosis. Knowl.-Based Syst. 2026, 331, 114831.
14. Tian, J.; Yu, Y.; Karimi, H.R.; Gao, F.; Lin, J. A Continual Test-Time Domain Adaptation Method for Online Machinery Fault Diagnosis under Dynamic Operating Conditions. Neural Netw. 2026, 194, 108192.
15. Shen, C.; He, Z.; Chen, B.; Huang, W.; Li, L.; Wang, D. Dynamic Branch Layer Fusion: A New Continual Learning Method for Rotating Machinery Fault Diagnosis. Knowl.-Based Syst. 2025, 313, 113177.
16. Ding, A.; Qin, Y.; Wang, B.; Cheng, X.; Jia, L. An Elastic Expandable Fault Diagnosis Method of Three-Phase Motors Using Continual Learning for Class-Added Sample Accumulations. IEEE Trans. Ind. Electron. 2024, 71, 7896–7905.
17. He, Z.; Shen, C.; Chen, B.; Shi, J.; Huang, W.; Zhu, Z.; Wang, D. A New Feature Boosting Based Continual Learning Method for Bearing Fault Diagnosis with Incremental Fault Types. Adv. Eng. Inform. 2024, 61, 102469.
18. Ding, A.; Qin, Y.; Wang, B.; Guo, L.; Jia, L.; Cheng, X. Evolvable Graph Neural Network for System-Level Incremental Fault Diagnosis of Train Transmission Systems. Mech. Syst. Signal Process. 2024, 210, 111175.
19. Wang, S.; Lei, Y.; Lu, N.; Yang, B.; Li, X.; Li, N. Graph Continual Learning Network: An Incremental Intelligent Diagnosis Method of Machines for New Fault Detection. IEEE Trans. Autom. Sci. Eng. 2024, 23, 3214–3224.
20. Lin, T.; Song, L.; Cui, L.; Wang, H. Continual Learning for Unknown Domain Fault Diagnosis in Rotating Machinery via Diffusion-Integrated Dynamic Mixture Experts. Eng. Appl. Artif. Intell. 2025, 156, 111056.
21. Liu, C.; Zhang, L.; Zheng, Y.; Jiang, Z.; Zheng, J.; Wu, C. Online Industrial Fault Prognosis in Dynamic Environments via Task-Free Continual Learning. Neurocomputing 2024, 598, 127930.
22. Zheng, X.; Cheng, Z.; Cheng, J.; Hon, C.; Yang, Y. Blaschke Learning Machine: A Novel and Efficient Continual Learning Classifier toward Intelligent Fault Diagnosis. Mech. Syst. Signal Process. 2025, 239, 113350.
23. Wang, H.; Li, J.; Lin, T.; Lu, X.; Song, L. GALMOR: Memory-Constrained Continual Learning With Efficient Replay for Fault Diagnosis of Rotating Machinery. IEEE Trans. Instrum. Meas. 2025, 74, 1–12.
24. Gao, H.; Huo, X.; Zhu, C.; He, C.; Meng, J. Task Similarity-Based Continual Learning for Multi-Phase Environments and Its Application in Few-Shot Fault Diagnosis. Mech. Syst. Signal Process. 2025, 235, 112862.
25. Wu, H.; Zhong, S.; Zhao, M.; Fu, X.; Zhang, Y.; Fu, S. Continual Contrastive Reinforcement Learning: Towards Stronger Agent for Environment-Aware Fault Diagnosis of Aero-Engines through Long-Term Optimization under Highly Imbalance Scenarios. Adv. Eng. Inform. 2025, 65, 103297.
26. Xu, X.; Bao, S.; Liang, P.; Qiao, Z.; He, C.; Shi, P. A Broad Learning Model Guided by Global and Local Receptive Causal Features for Online Incremental Machinery Fault Diagnosis. Expert Syst. Appl. 2024, 246, 123124.
27. Chen, B.; Zhang, X.; Shen, C.; Li, Q.; Song, Z. CoUDA: Continual Unsupervised Domain Adaptation for Industrial Fault Diagnosis Under Dynamic Working Conditions. IEEE Trans. Ind. Inform. 2025, 21, 4072–4082.
28. Zhao, C.; Zio, E.; Shen, W. Domain Generalization for Cross-Domain Fault Diagnosis: An Application-Oriented Perspective and a Benchmark Study. Reliab. Eng. Syst. Saf. 2024, 245, 109964.
29. Jiao, H.; Sun, W.; Wang, H.; Wan, X. Proto-DISFNet: A Prototype-Guided Dual-Feature Transfer Learning Method for Cross-Condition Fault Diagnosis of Cotton Harvester Picking-Head Drivetrains. Agriculture 2026, 16, 87.
30. Li, Z.; Hoiem, D. Learning Without Forgetting. In Proceedings of the Computer Vision–ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; pp. 614–629.
31. Belouadah, E.; Popescu, A. IL2M: Class Incremental Learning with Dual Memory. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 583–592.
32. Castro, F.M.; Marín-Jiménez, M.J.; Guil, N.; Schmid, C.; Alahari, K. End-to-End Incremental Learning. In Proceedings of the Computer Vision–ECCV 2018, Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; Volume 11216, pp. 241–257.
33. Rebuffi, S.A.; Kolesnikov, A.; Sperl, G.; Lampert, C.H. iCaRL: Incremental Classifier and Representation Learning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 5533–5542.
34. Hou, S.; Pan, X.; Loy, C.C.; Wang, Z.; Lin, D. Learning a Unified Classifier Incrementally via Rebalancing. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 831–839.

Figure 1. Schematic illustration of feature disentanglement and prototype construction at the initial stage.

Figure 2. Schematic illustration of prototype updating in the continual learning stage.

Figure 3. Schematic illustration of feature-level knowledge distillation in the continual learning stage.

Figure 4. Prototype-based cross-condition diagnostic inference procedure.

Figure 5. Overall schematic and experimental setup of the CHPH-FETB drivetrain fault-emulation platform. (A) Transmission schematic diagram showing shaft arrangement and fault injection locations; (B1) top-view close-up of key drivetrain components; (B2) rear-view close-up of the power transmission line; (C) full experimental test bench with data acquisition modules. This figure is reproduced from our previous publication [29] under the CC BY license.

Figure 6. Classification accuracy across incremental learning stages on the CWRU dataset.

Figure 7. Classification accuracy across incremental learning stages on the HUST dataset.

Figure 8. Classification accuracy across incremental learning stages on the CHPH-FETB dataset.

Figure 9. Forgetting rate at different incremental learning stages.

Figure 10. Diagnostic accuracy of ablation experiments on the CHPH-FETB dataset.

Figure 11. Forgetting rate of ablation experiments on the CHPH-FETB dataset.

Table 1. Continual learning task setup on the CWRU dataset.

Task Stage	Operating-Condition Increment	Samples per Class (Train/Test)	Fault Category Set
S0	C1: 0 HP @ 1797 rpm	200/100	{N, IR, OR, B}
S0	C2: 1 HP @ 1772 rpm	200/100	{N, IR, OR, B}
S1	+C3: 2 HP @ 1750 rpm	50/100	{N, IR, OR, B}
S2	+C4: 3 HP @ 1730 rpm	50/100	{N, IR, OR, B}

Table 2. Continual learning task setup on the HUST gearbox dataset.

Task Stage	Operating-Condition Increment	Samples per Class (Train/Test)	Fault Category Set
S0	C1: 0.113 Nm @ 30 Hz	200/100	{N, BT, MT}
S0	C2: 0.226 Nm @ 25 Hz	200/100	{N, BT, MT}
S0	C3: 0.339 Nm @ 35 Hz	200/100	{N, BT, MT}
S1	+C4: 0 Nm @ 20 Hz	50/100	{N, BT, MT}
S2	+C5: 0.452 Nm @ 40 Hz	50/100	{N, BT, MT}
S3	+C6: 0.226 Nm @ 35 Hz	50/100	{N, BT, MT}
S4	+C7: 0.339 Nm @ 0–40–0 Hz (variable speed)	50/100	{N, BT, MT}

Table 3. Continual learning task setup on the CHPH-FETB dataset.

Task Stage	Operating-Condition Increment	Samples per Class (Train/Test)	Health State Set
S0	C1: 500 rpm	200/100	{Normal; Bearing 1: IR, OR; Bearing 2: IR, OR; Gear: WR, HB}
S0	C2: 750 rpm	200/100	same as above
S0	C3: 1100 rpm	200/100	same as above
S1	+C4: 800 rpm	50/100	same as above
S2	+C5: 1000 rpm	50/100	same as above
S3	+C6: 1200 rpm	50/100	same as above
S4	+C7: S-curve variable speed 0–1200–0 rpm	50/100	same as above

Table 4. Structure of proposed networks.

Model	Layer	Parameter Settings	Output Shape
CWT	-	Morlet	(B, 1, 200, 2000)
Encoder	Conv1	Conv2d(1→16), BN, ReLU, MaxPool2d(2 × 2)	(B, 16, 100, 1000)
	Conv2	Conv2d(16→32), BN, ReLU, MaxPool2d(2 × 2)	(B, 32, 50, 500)
	Conv3	Conv2d(32→64), BN, ReLU, MaxPool2d(2 × 2)	(B, 64, 25, 250)
	Conv4	Conv2d(64→128), BN, ReLU, MaxPool2d(2 × 2)	(B, 128, 12, 125)
	Conv5	Conv2d(128→256), BN, ReLU, MaxPool2d(2 × 2)	(B, 256, 6, 62)
	Conv6	Conv2d(256→512), BN, ReLU	(B, 512, 6, 62)
	Conv7	Conv2d(512→512), BN, ReLU	(B, 512, 6, 62)
	Conv8	Conv2d(512→512), BN, ReLU, GlobalAvgPool	(B, 512)
Feature Separator (Invariant)	FS_Di_FC1	FC(512→256), BN, ReLU	(B, 256)
	FS_Di_FC2	FC(256→128), BN, ReLU	(B, 128)
	FS_Di_FC3	FC(128→64)	(B, 64)
	Adapter_Di_FC1	FC(64→64), ReLU	(B, 64)
	Adapter_Di_FC2	FC(64→64)	(B, 64)
	Add_Di	ResidualAdd(FS_Di_FC3_out, Adapter_Di_FC2_out)	(B, 64)
	Norm_Di	L2Norm(Add_Di_out)	(B, 64)
Feature Separator (Specific)	FS_Ds_FC1	FC(512→256), BN, ReLU	(B, 256)
	FS_Ds_FC2	FC(256→128), BN, ReLU	(B, 128)
	FS_Ds_FC3	FC(128→64)	(B, 64)
	Adapter_Ds_FC1	FC(64→64), ReLU	(B, 64)
	Adapter_Ds_FC2	FC(64→64)	(B, 64)
	Add_Ds	ResidualAdd(FS_Ds_FC3_out, Adapter_Ds_FC2_out)	(B, 64)
	Norm_Ds	L2Norm(Add_Ds_out)	(B, 64)
Fusion Head	Concat	Concat(Norm_Di_out, Norm_Ds_out)	(B, 128)
	Proj_Df_FC1	FC(128→64), BN, ReLU	(B, 64)
	Proj_Df_FC2	FC(64→64)	(B, 64)
	Norm_Df	L2Norm(Proj_Df_FC2_out)	(B, 64)
Discriminator (Invariant)	D_Di_FC1	SpectralNorm(FC(64→32)), LeakyReLU(0.2)	(B, 32)
	D_Di_FC2	FC(32→16), LeakyReLU(0.2)	(B, 16)
	D_Di_FC3	FC(16→S₀), Softmax	(B, S₀)
Discriminator (Specific)	D_Ds_FC1	SpectralNorm(FC(64→32)), LeakyReLU(0.2)	(B, 32)
	D_Ds_FC2	FC(32→16), LeakyReLU(0.2)	(B, 16)
	D_Ds_FC3	FC(16→S₀), Softmax	(B, S₀)
Classifier	C_FC1	FC(64→128), BN, ReLU, Dropout(0.3)	(B, 128)
	C_FC2	FC(128→64), BN, ReLU, Dropout(0.3)	(B, 64)
	C_FC3	FC(64→32), ReLU	(B, 32)
	C_FC4	FC(32→K)	(B, K)
Decoder	FC1	FC(128→1024), ReLU	(B, 1024)
	FC2	FC(1024→4096), ReLU	(B, 4096)
	FC3	FC(4096→20,000), ReLU	(B, 20,000)
	Reshape	-	(B, 32, 5, 125)
	Deconv1	ConvTranspose2d(32→32), BN, ReLU	(B, 32, 12, 125)
	CatSkip1	Concat(Deconv1_out, Encoder_Conv4)	(B, 160, 12, 125)
	Deconv2	ConvTranspose2d(160→16), BN, ReLU	(B, 16, 25, 250)
	CatSkip2	Concat(Deconv2_out, Encoder_Conv3)	(B, 80, 25, 250)
	Deconv3	ConvTranspose2d(80→8), BN, ReLU	(B, 8, 50, 500)
	Deconv4	ConvTranspose2d(8→4), BN, ReLU	(B, 4, 100, 1000)
	Deconv5	ConvTranspose2d(4→2), BN, ReLU	(B, 2, 200, 2000)
	FinalConv	Conv2d(2→1), Sigmoid	(B, 1, 200, 2000)
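The spatial sizes listed in Table 4 can be verified with a short calculation. The sketch below assumes that the convolutions preserve the spatial size and that each 2 × 2 max-pooling layer halves both dimensions with floor rounding (the usual default); neither padding nor kernel sizes are stated in the table, so both points are assumptions. The function name is illustrative.

```python
def pooled_shapes(height, width, num_pools):
    """Spatial size after each 2 x 2 max-pooling step, using floor division."""
    shapes = []
    for _ in range(num_pools):
        height, width = height // 2, width // 2
        shapes.append((height, width))
    return shapes


# The CWT output is (B, 1, 200, 2000); Conv1 to Conv5 each end with MaxPool2d(2 x 2).
encoder_shapes = pooled_shapes(200, 2000, num_pools=5)
print(encoder_shapes)

# Decoder FC3 outputs 20,000 values, which are reshaped to (B, 32, 5, 125).
decoder_reshape_elements = 32 * 5 * 125
print(decoder_reshape_elements)
```

The odd sizes 25 and 125 are the points at which floor rounding discards a row or column, which explains the 12 and 62 in the Conv4 and Conv5 output shapes.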

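Table 5. Comparison of diagnostic accuracy (%) for continual learning tasks on the CHPH-FETB dataset under few-shot conditions. Values in parentheses give the number of training samples per class for each newly introduced operating condition.

Method	S0	S1 (10)	S1 (30)	S1 (50)	S2 (10)	S2 (30)	S2 (50)	S3 (10)	S3 (30)	S3 (50)	S4 (10)	S4 (30)	S4 (50)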
Fine-tune	90.04	72.06	75.37	80.08	58.23	63.56	68.44	55.28	59.26	62.83	52.78	54.55	58.95
Joint Oracle	90.72	79.99	87.28	89.78	79.90	85.82	89.49	82.76	85.13	89.39	78.71	85.94	88.67
LuCIR	91.01	77.02	82.52	85.64	72.68	77.90	81.84	73.45	78.18	80.51	68.25	76.75	79.10
IL2M	89.47	74.85	81.00	84.40	73.41	78.69	81.41	74.50	75.86	80.69	71.93	75.81	78.63
iCaRL	89.82	78.69	82.37	86.37	73.52	78.77	83.76	76.19	78.24	82.60	70.59	75.45	80.06
EEIL	89.69	78.29	81.93	85.57	75.58	81.69	83.85	70.58	78.51	81.65	70.34	75.83	80.89
LwF	89.07	80.40	83.74	87.06	76.06	81.13	84.47	73.25	79.30	82.38	72.73	76.61	80.09
Ours	90.05	80.67	83.28	87.76	79.50	81.39	85.70	77.47	80.57	85.59	74.50	79.56	83.84

Table 6. Comparison of diagnostic accuracy under unseen operating conditions and after training on new conditions (%).

Dataset	S1 (C4) Untrained	S1 (C4) After Training	S2 (C5) Untrained	S2 (C5) After Training	S3 (C6) Untrained	S3 (C6) After Training	S4 (C7) Untrained	S4 (C7) After Training
HUST	78.31	90.43	81.72	91.05	82.46	87.84	78.65	86.58
CHPH-FETB	71.64	85.02	74.29	87.48	73.58	81.57	70.28	81.06


