Article

Condition-Aware DANN-LSTM for Rolling-Bearing Fault Diagnosis and Remaining Useful Life Prediction Under Operating Condition Shifts

1
Chengyi College, Jimei University, Xiamen 361021, China
2
School of Marine Engineering, Jimei University, Xiamen 361021, China
*
Authors to whom correspondence should be addressed.
Machines 2026, 14(6), 682; https://doi.org/10.3390/machines14060682 (registering DOI)
Submission received: 29 April 2026 / Revised: 1 June 2026 / Accepted: 5 June 2026 / Published: 11 June 2026

Abstract

Rolling element bearing monitoring under operating condition shifts remains difficult because fault signatures are transient, fault data are scarce, and degradation trends may depend on load and speed. This study evaluates a condition-aware DANN-LSTM framework for joint fault diagnosis and RUL prediction. A one-dimensional CNN extracts vibration features, a gradient reversal branch aligns condition-related distributions for fault classification, and an LSTM models chronological degradation features without direct adversarial regularization. The model jointly optimizes classification, condition-discrimination, and RUL losses. Experiments on public bearing datasets show high class-wise identification rates, a validation accuracy of 0.989, and an RUL RMSE of 7.9. Controlled ablation indicates that moderate condition alignment improves transfer classification while preserving useful degradation ordering for RUL prediction. The framework offers a practical data-driven baseline for bearing condition monitoring under controlled condition shifts.

1. Introduction

Rolling element bearings, as key components in rotating machinery, are widely used in important fields such as energy, manufacturing, and transportation. Their operating status is directly related to the safety and stability of production systems [1,2]. As industrial equipment evolves towards higher loads and longer operating cycles, traditional maintenance methods relying on manual experience are increasingly unable to meet high-reliability requirements. Data-driven intelligent diagnostic technologies have therefore become an important research focus [3,4]. Machine-learning-centered data analysis provides a new technical path for assessing equipment health status by modeling and analyzing operational signals [5,6]. Vibration signals generated by rotating machinery during operation contain rich structural and degradation-related information, providing a data foundation for fault identification and life prediction [7,8]. Fault diagnosis and remaining useful life prediction are important components of equipment health management, and their research is of great significance for reducing the risk of unplanned downtime [9,10]. With the advancement of industrial intelligence, building diagnostic and predictive models with high accuracy and strong adaptability has become an important research direction in this field [11,12].
In practical engineering applications, rotating machinery operates in complex environments with highly variable conditions. Data collected under different speeds and loads exhibit significant distributional differences, and this non-stationarity places high demands on model building [13,14]. During normal operation, equipment accumulates a large amount of data, while fault-condition data are relatively scarce, leading to a significant imbalance in sample distribution and consequently affecting model training effectiveness [15,16]. Vibration signals are affected by structural coupling and noise interference during propagation. The superposition of multi-source information makes fault features difficult to separate, increasing the complexity of feature extraction [17,18]. Traditional data acquisition methods are limited by sensor placement and sampling conditions, resulting in missing or redundant information in the acquired data and thus limiting the upper bound of model performance [19,20]. Structural differences between different devices make it difficult for models trained on a single data source to be applied to new application scenarios [21,22]. The degradation process is characterized by nonlinearity, stochasticity, and time-varying operating conditions, which introduces significant uncertainty into remaining useful life prediction and places higher demands on model stability [23,24,25]. For fault types such as imbalance and misalignment, simple frequency-domain analysis based on fast Fourier transform is often sufficient for detection. However, for localized faults such as inner-race cracks or rolling-element spalls in bearings, the vibration signatures are transient and non-stationary. Traditional machine learning methods relying on handcrafted features often fail to capture these transient patterns under variable operating conditions. 
Physics-informed transfer learning requires an accurate prior model of system dynamics, which is difficult to obtain for complex rotating machinery with multiple components and unknown degradation mechanisms.
To address the aforementioned problems, existing research has proposed several improvement methods. Methods based on signal processing and traditional machine learning achieve fault identification through manual feature extraction. These methods rely on prior experience to construct feature indicators, making it difficult to maintain stable performance under complex operating conditions [26,27]. Deep learning methods automatically extract feature representations through convolutional neural networks and can achieve high recognition accuracy on single-condition data. However, their feature representation depends on the distribution of training data, and performance is prone to degradation when dealing with cross-condition data [12,28]. Recurrent neural networks have made progress in remaining useful life prediction by modeling time-dependent relationships in degradation processes, but they are highly sensitive to the quality of input features and may struggle with noise interference and feature drift [29,30]. Transfer learning methods improve model generalization by aligning different data distributions, but they may ignore the temporal structure of degradation information during feature-space alignment, leading to unstable prediction results [31,32]. Physical-information fusion methods enhance model interpretability by introducing mechanistic knowledge, but they struggle to accurately characterize all degradation mechanisms in complex systems, which limits their application scope [33,34]. Existing methods still have shortcomings in terms of cross-condition feature consistency and coordinated degradation-process modeling, making it difficult to balance model generalization ability and prediction stability [35].
The main methodological difficulty is the tension between condition-invariant feature learning and remaining useful life prediction. Bearing degradation depends on load and speed; therefore, a representation that is made fully condition-invariant may lose information needed for lifetime regression. The present framework limits adversarial pressure to the condition-discrimination path used for cross-condition fault classification. The LSTM branch receives chronological CNN features and is optimized by the RUL loss without gradient reversal. The model is therefore presented as an application-oriented integration of CNN feature extraction, condition-adversarial classification, and temporal degradation modeling for localized rolling-bearing faults, rather than as a new network family. The target faults are inner-race, outer-race, and rolling-element defects, whose impulsive and non-stationary vibration signatures are less reliably handled by simple FFT-based or handcrafted-feature approaches under load shifts.

2. Methods

2.1. Overall Model Framework Design

The proposed DANN-LSTM model combines cross-condition feature alignment for fault classification with temporal degradation modeling for RUL prediction. It receives a one-dimensional vibration signal and produces two outputs: a fault category and a remaining life estimate. The network contains a multi-scale CNN feature extractor, a condition-adversarial branch for classification-oriented alignment, and an LSTM regression branch for chronological degradation modeling. The condition discriminator is connected to the feature extractor through a gradient reversal layer, so classification-relevant features become less sensitive to speed/load shifts. This alignment is not imposed as a blanket constraint on the RUL branch. The LSTM receives the shared chronological features and is trained by lifetime regression loss without gradient reversal, allowing load- and time-dependent degradation cues to remain available for RUL estimation.
During training, the model jointly optimizes three types of loss functions: fault classification loss, condition alignment loss, and lifetime prediction loss. The overall objective function is shown in Equation (1).
$$L_{total} = L_{cls} + \lambda_{adv} L_{adv} + \lambda_{rul} L_{rul} \tag{1}$$
In Formula (1), $L_{cls}$, $L_{adv}$, and $L_{rul}$ denote the fault classification, condition-discrimination, and RUL losses, and $\lambda_{adv}$ and $\lambda_{rul}$ are the weights of the latter two terms. The classification loss measures the discrepancy between predicted and true fault labels, the condition-discrimination loss is optimized adversarially through the gradient reversal branch, and the RUL loss constrains the deviation between predicted and true remaining life. During backpropagation, the condition discriminator minimizes its own classification loss, whereas the feature extractor receives the reversed gradient from this branch. The LSTM regression branch propagates the RUL gradient normally. Condition invariance is therefore encouraged for classification-oriented alignment, not as a complete removal of condition-dependent degradation information.
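As a minimal sketch of how the three terms of Formula (1) combine into one scalar, the following plain-Python function may help. The function name, the default weights, and the example loss values are illustrative assumptions and are not taken from the experiments. The sign reversal seen by the feature extractor is applied to gradients by the gradient reversal layer, so it does not appear in this scalar.

```python
def total_loss(l_cls, l_adv, l_rul, lam_adv=0.1, lam_rul=1.0):
    """Formula (1): L_total = L_cls + lam_adv * L_adv + lam_rul * L_rul.

    The default weights are placeholders, not values reported in the paper.
    """
    return l_cls + lam_adv * l_adv + lam_rul * l_rul


# Made-up per-batch loss values, used only to show the weighting.
example = total_loss(l_cls=0.40, l_adv=0.69, l_rul=0.25, lam_adv=0.5, lam_rul=2.0)
```

Raising `lam_adv` strengthens condition alignment relative to the other two tasks, while `lam_rul` controls how strongly the lifetime regression shapes the shared features.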
Figure 1 illustrates the DANN-LSTM workflow. The CNN extractor converts each vibration window into a high-dimensional feature vector. The upper branch uses a gradient reversal layer and a two-output condition discriminator for source/target operating condition labels. The middle branch is the four-class fault classifier, corresponding to normal, inner race, outer race, and rolling element faults. The lower branch sends chronological feature sequences to a two-layer LSTM and a regression layer for RUL prediction. The figure matches the experimental class definition and separates the condition-adversarial path from the LSTM regression path.

2.2. Multi-Scale Feature Extraction Based on Convolutional Neural Networks

The original vibration signal of rotating machinery is a time-domain waveform that contains rich equipment status information. A convolutional neural network is constructed to perform local perception and multi-scale feature extraction on the input signal, mapping the one-dimensional time series to a high-dimensional feature space. The input vibration signal is represented as $x \in \mathbb{R}^{T}$, where $T$ is the number of signal sampling points. The first layer of the network uses one-dimensional convolutional kernels of width $k_1$, which slide along the time dimension to perform convolution operations, generating a primary feature map. This convolution operation is defined as:
$$f_j^{1} = \sigma\left(w_j^{1} * x + b_j^{1}\right) \tag{2}$$
In Formula (2), $*$ represents the convolution operation, $w_j^{1} \in \mathbb{R}^{k_1}$ is the weight vector of the $j$-th convolution kernel, $b_j^{1}$ is the corresponding bias term, $\sigma(\cdot)$ is the non-linear activation function, and $f_j^{1}$ is the $j$-th output feature channel. The primary feature map is extracted in parallel using multiple convolution kernels, $F_1 = [f_1^{1}, f_2^{1}, \ldots, f_{C_1}^{1}]$ with dimension $C_1 \times T_1$, where $C_1$ is the number of kernels in the first layer and $T_1$ is the time length after convolution. The first-layer kernel has a small width and is used to capture local transient impulses in the signal.
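The first-layer operation in Formula (2) can be illustrated with a small dependency-free sketch. It assumes stride 1, no padding, ReLU as the activation, and the cross-correlation form of convolution used by most deep learning libraries; none of these choices is stated in the text. The toy signal and the two kernels are hypothetical.

```python
def conv1d_valid(x, w, b):
    """Slide kernel w over x (stride 1, no padding) and add bias b."""
    k = len(w)
    return [sum(w[i] * x[t + i] for i in range(k)) + b
            for t in range(len(x) - k + 1)]


def relu(values):
    """Element-wise ReLU, used here as the activation sigma (an assumption)."""
    return [max(0.0, v) for v in values]


def first_layer(x, kernels, biases):
    """Return the first feature map as C_1 channels of length T - k_1 + 1."""
    return [relu(conv1d_valid(x, w, b)) for w, b in zip(kernels, biases)]


# Toy signal with a single impulse at index 3 and two hypothetical kernels.
signal = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
kernels = [[1.0, -1.0, 0.0], [0.5, 0.5, 0.5]]
biases = [0.0, 0.0]
feature_map = first_layer(signal, kernels, biases)
```

With $T = 8$ and $k_1 = 3$, each channel has length $T_1 = 6$. The difference-like first kernel responds only where the impulse begins, while the averaging kernel spreads the response over three neighbouring positions.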
A second convolutional layer is stacked on top of the primary features, using convolutional kernels of width $k_2$ ($k_2 > k_1$) to perform secondary feature extraction on $F_1$. This convolutional operation is expressed as:
$$f_l^{2} = \sigma\left(\sum_{j=1}^{C_1} w_{l,j}^{2} * f_j^{1} + b_l^{2}\right) \tag{3}$$
In Formula (3), $w_{l,j}^{2} \in \mathbb{R}^{k_2}$ represents the weight between the $l$-th convolutional kernel and the $j$-th input channel in the second layer, and $b_l^{2}$ represents the bias term. A wider convolutional kernel integrates local features over a wider time window, extracting mid-frequency modulation information from the signal. The feature map output from the second layer, $F_2 = [f_1^{2}, f_2^{2}, \ldots, f_{C_2}^{2}]$, has a dimension of $C_2 \times T_2$, where $C_2$ is the number of convolutional kernels in the second layer. Two consecutive convolutional layers gradually expand the network's receptive field, enabling multi-scale information extraction from high-frequency impulses to low-frequency envelopes.
The third layer uses convolutional kernels of width $k_3$ ($k_3 > k_2$) to perform higher-level feature abstraction on $F_2$, and its operation form is:
$$f_m^{3} = \sigma\left(\sum_{l=1}^{C_2} w_{m,l}^{3} * f_l^{2} + b_m^{3}\right) \tag{4}$$
In Formula (4), $w_{m,l}^{3} \in \mathbb{R}^{k_3}$ is the weight of the third-layer convolutional kernel, $b_m^{3}$ is the bias term, and $f_m^{3}$ is the output feature. This convolutional kernel covers a longer signal segment, capturing low-frequency feature components related to the overall degradation trend of the device. The feature map $F_3$ output by the three-layer convolutional network has a dimension of $C_3 \times T_3$, where $C_3$ is the number of third-layer convolutional kernels. Each convolutional layer is followed by batch normalization and pooling operations. Batch normalization accelerates network convergence, and pooling reduces the feature dimension and enhances local translation invariance. After multi-layer convolution processing, the original vibration signal $x$ is transformed from a time-domain waveform into a high-dimensional feature representation $Z = \mathrm{Flatten}(F_3)$, where $Z \in \mathbb{R}^{D}$ and $D = C_3 \times T_3$. This feature vector $Z$ simultaneously encodes the frequency-domain structure and time-domain evolution information of the signal at different time scales. Figure 2 shows the correspondence between the original vibration signal and the multi-scale feature response generated after passing through the convolutional network.
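The shape bookkeeping behind $Z = \mathrm{Flatten}(F_3)$ can be sketched as follows, using the 1024-point window, $C_3 = 128$, and $T_3 = 16$ reported in this section. The sketch assumes length-preserving convolutions and a pooling size of 4 after each layer. This is only one configuration that reproduces $T_3 = 16$, since the actual kernel and pooling sizes are not given here.

```python
def pooled_length(length, pool_size):
    """Length after non-overlapping pooling (floor division)."""
    return length // pool_size


def flattened_dim(input_length, channels_last, pool_sizes):
    """Return (T_3, D), assuming the convolutions preserve the time length."""
    length = input_length
    for p in pool_sizes:
        length = pooled_length(length, p)
    return length, channels_last * length


def flatten(feature_map):
    """Flatten a C x T feature map (list of channels) into one vector Z."""
    return [v for channel in feature_map for v in channel]


# Hypothetical pooling sizes (4, 4, 4): 1024 -> 256 -> 64 -> 16.
t3, d = flattened_dim(input_length=1024, channels_last=128, pool_sizes=(4, 4, 4))
z = flatten([[0.0] * t3 for _ in range(128)])
```

Under these assumptions the sketch gives $T_3 = 16$ and $D = 2048$, which matches the feature vector length stated for each window.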
For each sliding window of 1024 points, the CNN feature extractor outputs a feature vector of dimension $D = C_3 \times T_3$. Here $C_3 = 128$ is the number of output channels of the third convolutional layer, and $T_3 = 16$ is the compressed time length after the third pooling operation, so each window produces a feature vector of length 2048. To form a chronological sequence for the LSTM, the feature vectors from consecutive sliding windows with 50 percent overlap (a sliding step of 512 points) are concatenated in time order. The sequence length $L$ is fixed at 32 windows. Because each window contributes 512 new points, one sequence contributes 32 × 512 = 16,384 new points of the original vibration signal and spans 31 × 512 + 1024 = 16,896 points from its first to its last sample. Batches are formed by randomly sampling sequences from the training set. Each batch contains 64 sequences, each consisting of 32 time steps of feature vectors. For the PHM2012 and XJTU-SY datasets, each bearing run provides a single continuous degradation sequence. After the chronological split described in Section 3.2, windows from the same run are kept in strict chronological order within each training, validation, and testing segment. No mixing of windows from different runs occurs within a single sequence.
Figure 2 shows the multi-scale feature response map of the vibration signal as four panels stacked vertically. The top panel is the waveform of the original vibration signal, with time (s) on the horizontal axis and amplitude (m/s²) on the vertical axis. The waveform exhibits a non-stationary oscillation pattern containing periodic impact pulses and background noise. The second panel is the output feature map of the first convolutional layer, with time steps on the horizontal axis and feature channel indices on the vertical axis. It is presented as a grayscale heatmap in which the gray level represents feature activation intensity. Multiple vertical bars are distributed along the time direction, and their positions correspond to the occurrence times of impact events in the original signal. The bar width reflects the local transient response captured by the small-width convolution kernel. The third panel is the output feature map of the second convolutional layer, with the same axis labels as the second panel. Compared with the second panel, the heatmap contains fewer but wider vertical bars. The bar edges show a smooth transition, reflecting the integration effect of the larger-width convolution kernel on the mid-frequency modulation information. The fourth panel is the output feature map of the third convolutional layer. Here the vertical bars are further merged into sparse, wide bright areas. The dark areas between them correspond to low-energy periods in the signal, reflecting the low-frequency degradation trend components extracted by the widest convolution kernel. The four panels are strictly aligned on the time axis, and the vertical connecting lines between them indicate the response relationship between the original signal and the feature maps of each layer within the same time window.
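The windowing scheme described above (1024-point windows, a 512-point step, 32 windows per sequence) can be sketched as below. Grouping windows into non-overlapping sequences is an assumption, because the text does not say whether consecutive sequences overlap. The run length in the example is hypothetical.

```python
def window_starts(n_samples, window=1024, step=512):
    """Start indices of all full windows in a run of n_samples points."""
    return list(range(0, n_samples - window + 1, step))


def build_sequences(starts, seq_len=32):
    """Group consecutive windows into chronological, non-overlapping sequences."""
    return [starts[i:i + seq_len]
            for i in range(0, len(starts) - seq_len + 1, seq_len)]


def sequence_span(seq_len=32, window=1024, step=512):
    """Number of raw samples from the first to the last point of one sequence."""
    return (seq_len - 1) * step + window


# Hypothetical run of 33,280 samples: 64 windows, hence two sequences of 32.
starts = window_starts(33280)
sequences = build_sequences(starts)
span = sequence_span()
```

Each window adds 512 new samples, so 32 windows contribute 32 × 512 = 16,384 new samples, while the span including the tail of the last window is 31 × 512 + 1024 = 16,896 samples. A batch of 64 such sequences has shape 64 × 32 × 2048 after feature extraction.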

2.3. Feature Alignment Across Operating Conditions Based on Adversarial Learning

To eliminate the impact of data distribution differences under different working conditions on the generalization performance of the model, a condition discriminant module is embedded between the feature extractor and the fault classifier to construct an adversarial feature alignment mechanism. The condition discriminator receives the high-dimensional feature vector output by the feature extractor as input and determines the operating condition label of the feature through a binary classification network. This network consists of three fully connected layers and a gradient reversal layer (GRL) connected in series. The gradient reversal layer acts as an identity mapping during forward propagation, passing the feature to the condition discriminator; during backward propagation, it automatically inverts the gradient sign, making the optimization direction of the feature extractor opposite to that of the condition discriminator, forcing the feature extractor to learn condition-invariant features. The adversarial loss function of the condition discriminator is defined as cross-entropy, as shown in Formula (5).
$$L_{adv} = -\frac{1}{n_s + n_t} \sum_{i=1}^{n_s + n_t} \left[ d_i \log(\hat{d}_i) + (1 - d_i) \log(1 - \hat{d}_i) \right] \tag{5}$$
In Formula (5), $L_{adv}$ represents the adversarial loss value of the condition discriminator, $n_s$ is the number of samples in the source condition, $n_t$ is the number of samples in the target condition, and $i$ is the sample index. $d_i$ is the true condition label of the $i$-th sample, which is 1 when the sample comes from the source condition and 0 when it comes from the target condition. $\hat{d}_i$ is the predicted probability output by the condition discriminator, that is, the probability that the $i$-th sample is classified as a source condition sample. The overall adversarial objective is formulated as a min-max game. The condition discriminator minimizes the domain classification loss $L_{adv}$ to correctly distinguish source and target operating conditions. The feature extractor aims to maximize $L_{adv}$ through the gradient reversal layer, making the feature distributions from different operating conditions indistinguishable. This optimization is written as $\min_{G_f, G_c} \max_{G_d} E$, where $E$ is the adversarial objective in which $L_{adv}$ enters with a negative sign, $G_f$ is the feature extractor, $G_c$ is the fault classifier, and $G_d$ is the condition discriminator. During joint training, the gradient reversal layer multiplies the gradient of $L_{adv}$ returned by the condition discriminator by the negative coefficient $-\lambda$ before passing it to $G_f$, where $\lambda$ is the hyperparameter that balances the adversarial strength; the gradient to $G_d$ remains unchanged. The condition discriminator is updated with standard gradient descent to minimize $L_{adv}$. The feature extractor is updated with gradient descent on the sum of $L_{cls}$, $L_{adv}$ (with sign inversion), and $L_{rul}$. This mechanism prompts the feature extractor to map the source and target condition data to a common feature space, so that the feature distributions of the two types of data tend to be consistent.
After adversarial training, the overlap of the projection regions of the source and target conditions in the feature space is significantly improved, and the condition discriminator cannot effectively distinguish the source of features, thus completing the cross-condition feature alignment. The trend of feature space distribution alignment before and after is shown in Figure 3.
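The behaviour of the gradient reversal layer can be sketched without any deep learning library. The class below is an illustration and not the authors' implementation: the class and function names are hypothetical, the gradients are made-up numbers, and any task weighting is assumed to be already included in the incoming gradients.

```python
class GradientReversal:
    """Identity in the forward pass; multiplies gradients by -lam going back."""

    def __init__(self, lam):
        self.lam = lam

    def forward(self, features):
        # Features pass through unchanged to the condition discriminator.
        return list(features)

    def backward(self, grad_from_discriminator):
        # The sign flip makes the extractor ascend the discriminator loss.
        return [-self.lam * g for g in grad_from_discriminator]


def feature_gradient(grad_cls, grad_rul, grad_adv, grl):
    """Sum the gradients reaching the feature vector from the three branches."""
    reversed_adv = grl.backward(grad_adv)
    return [c + r + a for c, r, a in zip(grad_cls, grad_rul, reversed_adv)]


grl = GradientReversal(lam=0.5)
z = [0.2, -1.0, 3.0]
g = feature_gradient([0.1, 0.0, -0.2], [0.0, 0.3, 0.1], [1.0, -2.0, 0.4], grl)
```

Only the discriminator branch is routed through `backward`; the classification and RUL gradients are added with their original sign, which mirrors the separation between the adversarial path and the regression path described in the text.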
Figure 3 shows two scatter plots side by side. The left subplot is labeled "Before Alignment," and the right subplot is labeled "After Alignment." Both subplots share the same two-dimensional coordinate system, with the horizontal axis labeled "First Feature Dimension" and the vertical axis labeled "Second Feature Dimension." In the left subplot, source condition sample points are represented by solid red circles, and target condition sample points are represented by solid blue triangles. The two sets of points are clearly separated in the two-dimensional plane, with the source condition points clustered in the upper left region of the coordinate system and the target condition points clustered in the lower right region, with a clear blank space between the two regions. In the right subplot, source and target condition sample points use the same markers as in the left subplot. The two sets of points are interspersed in the two-dimensional plane, with the solid red circles and solid blue triangles evenly mixed in the central region of the coordinate system, forming a single, highly overlapping cluster. Both subplots are accompanied by legends at the bottom explaining the correspondence between the symbols and the domain categories.

2.4. LSTM-Based Degradation Process Modeling and Lifetime Prediction

The feature sequences aligned by the condition-adversarial mechanism still retain the evolutionary information of the equipment's operating state over time. To characterize the performance degradation process of rotating machinery, this paper introduces a Long Short-Term Memory (LSTM) network to perform temporal modeling of the feature sequences. The aligned feature vectors form the sequence input according to the sampling time order, denoted as $f_1, f_2, \ldots, f_T$, where $T$ represents the time step length, $f_t \in \mathbb{R}^{d}$ represents the high-dimensional feature representation at time $t$, and $d$ represents the feature dimension. This sequence serves as the input to an LSTM network, which selectively memorizes and updates historical degradation information through a gating mechanism, thereby establishing the dynamic dependency relationship of the equipment state over time.
The LSTM unit controls the flow of information through the forget gate, input gate, and output gate. Its state update process is described by the following relationship. The forget gate is used to adjust the degree to which the memory of the previous time step is retained for the current time step. Its calculation form is shown in Formula (6):
$$\phi_t = \sigma\left(W_f f_t + U_f h_{t-1} + b_f\right) \tag{6}$$
In Formula (6), $\phi_t$ represents the forget gate vector at time $t$ (written as $\phi_t$ to keep it distinct from the input feature $f_t$), $\sigma(\cdot)$ represents the Sigmoid activation function, $W_f$ and $U_f$ represent the weight matrices of the input feature and the hidden state at the previous time step, respectively, $h_{t-1}$ represents the hidden state at the previous time step, and $b_f$ represents the bias term.
The input gate determines the extent to which new information is written into the memory cell at the current moment, and its expression is shown in Formula (7):
$$i_t = \sigma\left(W_i f_t + U_i h_{t-1} + b_i\right) \tag{7}$$
In Formula (7), $i_t$ represents the input gate vector, $W_i$ and $U_i$ represent the corresponding weight matrices, $b_i$ represents the bias term, and the other symbols have the same meanings as above.
The candidate memory state generates potential state information at the current time, and its calculation is shown in Formula (8):
$$\tilde{c}_t = \tanh\left(W_c f_t + U_c h_{t-1} + b_c\right) \tag{8}$$
In Formula (8), $\tilde{c}_t$ represents the candidate memory state, $\tanh(\cdot)$ represents the hyperbolic tangent activation function, and $W_c$, $U_c$, and $b_c$ represent the corresponding weight matrices and bias term.
Combining the regulatory effects of the forget gate and the input gate, the state update relationship of the memory unit is shown in Formula (9):
$$c_t = \phi_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{9}$$
In Formula (9), $c_t$ represents the current state of the memory unit, $c_{t-1}$ represents the previous state of the memory unit, and $\odot$ represents the element-wise multiplication operation.
The output gate controls the output of the memory unit to the hidden state, and its calculation form is shown in Formula (10):
$$o_t = \sigma\left(W_o f_t + U_o h_{t-1} + b_o\right) \tag{10}$$
In Formula (10), $o_t$ represents the output gate vector, and $W_o$, $U_o$, and $b_o$ represent the corresponding weight matrices and bias term.
The final hidden state is determined by the output gate and the memory cell state, and the update relationship is shown in Formula (11):
$$h_t = o_t \odot \tanh(c_t) \tag{11}$$
In Formula (11), $h_t$ represents the hidden state at time $t$, and this vector conveys the device degradation and evolution information in the time dimension.
As the time step progresses, the hidden state sequence $h_1, h_2, \ldots, h_T$ forms a dynamic representation of the degradation process. To achieve remaining lifetime prediction, the hidden state at the last moment is mapped to the regression space to obtain the estimated remaining lifetime of the equipment. The mapping relationship is shown in Formula (12):
$$\hat{y} = W_r h_T + b_r \tag{12}$$
In Formula (12), $\hat{y}$ represents the predicted remaining lifetime value, $W_r$ represents the regression layer weight matrix, $b_r$ represents the bias term, and $h_T$ represents the hidden state at the final time.
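The gate equations in Formulas (6) to (11) and the regression mapping in Formula (12) can be traced step by step in a minimal single-layer sketch written in plain Python. The random initialization, the parameter dictionary layout, and all function names are illustrative assumptions. The model described here uses a two-layer LSTM, which this sketch does not reproduce.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, U, b, x, h):
    """Compute W x + U h + b for list-based matrices and vectors."""
    return [sum(wi * xi for wi, xi in zip(W[r], x)) +
            sum(ui * hi for ui, hi in zip(U[r], h)) + b[r]
            for r in range(len(b))]


def lstm_step(p, x, h_prev, c_prev):
    """One time step following Formulas (6) to (11)."""
    forget = [sigmoid(v) for v in affine(p["Wf"], p["Uf"], p["bf"], x, h_prev)]
    inp = [sigmoid(v) for v in affine(p["Wi"], p["Ui"], p["bi"], x, h_prev)]
    cand = [math.tanh(v) for v in affine(p["Wc"], p["Uc"], p["bc"], x, h_prev)]
    out = [sigmoid(v) for v in affine(p["Wo"], p["Uo"], p["bo"], x, h_prev)]
    c = [f * cp + i * g for f, cp, i, g in zip(forget, c_prev, inp, cand)]
    h = [o * math.tanh(ci) for o, ci in zip(out, c)]
    return h, c


def predict_rul(p, sequence, hidden):
    """Run the sequence and map the last hidden state as in Formula (12)."""
    h, c = [0.0] * hidden, [0.0] * hidden
    for x in sequence:
        h, c = lstm_step(p, x, h, c)
    return sum(w * hi for w, hi in zip(p["Wr"], h)) + p["br"], h


def init_params(d, hidden, seed=0):
    """Small random weights; a placeholder for trained parameters."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    p = {}
    for gate in "fico":
        p["W" + gate] = mat(hidden, d)
        p["U" + gate] = mat(hidden, hidden)
        p["b" + gate] = [0.0] * hidden
    p["Wr"] = [rng.uniform(-0.1, 0.1) for _ in range(hidden)]
    p["br"] = 0.0
    return p
```

A call such as `predict_rul(init_params(d=4, hidden=3), sequence, hidden=3)` returns the scalar estimate together with the final hidden state. With untrained weights the estimate carries no meaning and only shows the data flow from a feature sequence to a single lifetime value.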
During time series propagation, the LSTM network continuously accumulates historical degradation information through a gating structure, causing the hidden state to exhibit a relatively smooth change trend over the device’s operating time. This results in a time-continuous degradation trajectory in the feature space as learned from the training data. After regression mapping, this trajectory yields a lifetime estimation curve that tends to decrease over time, reflecting the degradation trend present in the linear RUL labeling convention of the PHM2012 dataset. The model does not enforce strict monotonicity or physical consistency constraints; the observed smoothness is influenced by the linear RUL labels used for training. A stable mapping relationship is formed between the degradation state and the predicted lifetime, as shown in Figure 4.
Figure 4 illustrates the degradation state evolution process obtained based on LSTM time series modeling. The horizontal axis represents the device’s operating time t , and the vertical axis represents the health state and lifetime estimate obtained from the hidden state mapping. The overall coordinates, from the upper left to the lower right, correspond to the gradual evolution of the device from the initial operating stage to the failure stage. The smoothly descending solid line in the figure represents the degradation trend curve formed by the LSTM network output. The curve shows a small change in the initial stage, then shows a continuous decline over time, and the rate of decline increases in the later stage, reflecting the non-linear evolution characteristics of device performance degradation. The discrete dots distributed along the curve represent the hidden state outputs at each time step. These states are continuously transmitted in the time dimension and together constitute a complete degradation trajectory. The vertical dashed line at the right end of the curve represents the predicted failure time t f . The horizontal distance between the dashed line and the current time is marked by a double-headed arrow as the remaining lifetime interval, reflecting the mapping relationship from time series characteristics to lifetime prediction results. The shaded area below the curve represents the cumulative change trend of the degradation process over time, making the degradation trajectory visually a continuous state evolution path. The overall layout forms a monotonically changing degradation curve structure from time series input to lifetime prediction output, which intuitively reflects the dynamic modeling effect of the LSTM network on the device degradation process.

2.5. Joint Loss Function and Model Training Strategy

To achieve synergistic optimization of cross-condition feature learning and degradation trend modeling, a joint objective function consisting of fault classification loss, condition alignment loss, and lifetime prediction loss is constructed within a unified network framework, and parameter updates are completed through end-to-end backpropagation. Let the input vibration signal, after passing through the convolutional feature extractor, yield a feature representation $F \in \mathbb{R}^{N \times d}$, where $N$ represents the batch sample size and $d$ represents the feature dimension. The feature vector is then input into the fault classifier, condition discriminator, and LSTM degradation modeling module, respectively, to achieve multi-task joint learning.
The fault classification branch uses Softmax to output class probabilities, and the classification loss is defined as the cross-entropy function, as shown in Formula (13).
$$L_{cls} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \log(\hat{y}_{i,k}) \tag{13}$$
In Formula (13), $L_{cls}$ represents the fault classification loss (same as Formula (1)), $K$ represents the number of fault categories, $y_{i,k}$ represents the true label of the $i$-th sample in the $k$-th class, using one-hot encoding, $\hat{y}_{i,k}$ represents the predicted probability output by the classifier, and $N$ represents the number of samples. This loss constrains the feature extractor to learn discriminative fault representations.
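As a hedged illustration of Formula (13), the sketch below computes the Softmax cross-entropy for a batch whose one-hot labels are given as class indices. The function names are hypothetical, and the logits in the usage note are made up.

```python
import math


def softmax(logits):
    """Numerically stable Softmax over one sample's logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classification_loss(batch_logits, labels):
    """Formula (13) with one-hot labels supplied as class indices."""
    losses = [-math.log(softmax(logits)[k])
              for logits, k in zip(batch_logits, labels)]
    return sum(losses) / len(losses)
```

For the four classes used here (normal, inner race, outer race, rolling element), a classifier that outputs equal logits for every class yields a loss of $\log 4$, which is a convenient reference point for an untrained model.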
The condition alignment branch is connected to the condition discriminator through a gradient reversal layer. Samples from the source operating condition and the target operating condition are simultaneously input into the discriminator during training. The domain discrimination loss is defined as the binary classification cross-entropy, as shown in Formula (14).
$$L_{adv} = -\frac{1}{N} \sum_{i=1}^{N} \left[ d_i \log(\hat{d}_i) + (1 - d_i) \log(1 - \hat{d}_i) \right] \tag{14}$$
In Formula (14), $L_{adv}$ represents the condition alignment loss (same as Formula (1)), $d_i$ represents the domain label of the $i$-th sample, with the source operating condition labeled as 1 and the target operating condition labeled as 0, and $\hat{d}_i$ represents the domain prediction probability output by the condition discriminator. The gradient reversal layer multiplies the gradient by a negative coefficient during the backpropagation stage, so that the feature extractor and the condition discriminator form an adversarial relationship, thereby promoting the convergence of the feature distributions of the source operating condition and the target operating condition in the latent space.
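The binary cross-entropy in Formula (14) can be checked numerically with a short sketch. The clamping constant `eps` is an added safeguard against `log(0)` and is not part of the formula. The probabilities below are made-up examples.

```python
import math


def condition_loss(probs, domain_labels, eps=1e-12):
    """Formula (14): mean binary cross-entropy; label 1 = source, 0 = target."""
    total = 0.0
    for p, d in zip(probs, domain_labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += d * math.log(p) + (1 - d) * math.log(1.0 - p)
    return -total / len(probs)


# A discriminator that cannot tell the conditions apart outputs 0.5 everywhere.
confused = condition_loss([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0])
# A discriminator that separates the conditions well reaches a lower loss.
confident = condition_loss([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0])
```

The value for the confused discriminator equals $\log 2$, so a discriminator loss that stays near this level during training is one rough indication that the features no longer reveal the operating condition.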
The lifetime prediction branch takes the final hidden state that the LSTM produces for the degradation feature sequence and maps it to the remaining lifetime value through a fully connected layer. The regression loss uses the mean squared error function, as shown in Formula (15).
$$L_{rul} = \frac{1}{N}\sum_{i=1}^{N}\left(R_i - \hat{R}_i\right)^{2} \tag{15}$$
In Formula (15), $L_{rul}$ represents the lifetime prediction loss (same as Formula (1)), $R_i$ represents the true remaining lifetime value of the $i$-th sample, $\hat{R}_i$ represents the predicted lifetime output by the model, and $N$ represents the number of samples. This loss constrains the LSTM module to learn the time dependency of the degradation process.
The joint optimization objective function consists of three weighted loss components, as explained in Formula (1).
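As an illustration of how the three loss terms combine, the following minimal sketch evaluates Formulas (13) to (15) on plain Python lists and forms a weighted sum. The function names and the weights `w_cls`, `w_adv`, and `w_rul` are placeholders introduced here for illustration; the actual weights are those defined in Formula (1).

```python
import math


def cross_entropy(y_true, y_prob, eps=1e-12):
    """Formula (13): mean over samples of -sum_k y_ik * log(y_hat_ik)."""
    total = 0.0
    for labels, probs in zip(y_true, y_prob):
        total += -sum(y * math.log(max(p, eps)) for y, p in zip(labels, probs))
    return total / len(y_true)


def binary_cross_entropy(d_true, d_prob, eps=1e-12):
    """Formula (14): source condition labeled 1, target condition labeled 0."""
    total = 0.0
    for d, p in zip(d_true, d_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(d * math.log(p) + (1 - d) * math.log(1.0 - p))
    return total / len(d_true)


def mean_squared_error(r_true, r_pred):
    """Formula (15): mean of squared RUL errors."""
    return sum((r - r_hat) ** 2 for r, r_hat in zip(r_true, r_pred)) / len(r_true)


def joint_loss(l_cls, l_adv, l_rul, w_cls=1.0, w_adv=0.1, w_rul=1.0):
    """Weighted sum of the three terms; the default weights are assumed values."""
    return w_cls * l_cls + w_adv * l_adv + w_rul * l_rul


# Toy batch of two samples, two fault classes, and maximally uncertain outputs.
l_cls = cross_entropy([[1, 0], [0, 1]], [[0.5, 0.5], [0.5, 0.5]])
l_adv = binary_cross_entropy([1, 0], [0.5, 0.5])
l_rul = mean_squared_error([1.0, 0.0], [0.5, 0.5])
total = joint_loss(l_cls, l_adv, l_rul)
```

In this sketch the adversarial term enters the sum with a positive weight; the sign flip seen by the feature extractor comes from the gradient reversal layer during backpropagation, not from the loss value itself.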
The model training employs an end-to-end backpropagation strategy. The raw vibration signal is input into the convolutional feature extractor to generate a feature representation, which the classification and lifetime prediction branches use directly for task learning. The condition discriminator receives the same features through a gradient reversal layer and participates in adversarial training. The gradient reversal layer is placed only in the path from the feature extractor to the condition discriminator, so the regression gradient from the LSTM branch backpropagates through the feature extractor without sign inversion, preserving condition-dependent degradation patterns. During the backpropagation phase, the classification loss and lifetime prediction loss pass their gradients to the feature extractor unchanged, while the gradient of the condition alignment loss is sign-reversed after passing through the gradient reversal layer. These three gradients are summed at the feature extractor and jointly update its parameters. The parameter update process uses an iterative optimization method based on mini-batch samples. The convolutional layer parameters, condition discriminator parameters, and LSTM parameters are updated synchronously within the same training cycle until the overall loss function converges. At each training iteration, the condition discriminator parameters $\theta_d$ are updated by descending the gradient of $L_{adv}$ with respect to $\theta_d$. The feature extractor parameters $\theta_f$ are updated by descending the gradient of $L_{cls} + L_{rul}$ minus $\lambda$ times the gradient of $L_{adv}$ with respect to $\theta_f$. The classifier parameters are updated by descending the gradient of $L_{cls}$. The LSTM parameters are updated by descending the gradient of $L_{rul}$.
This training mechanism achieves synergistic constraints for cross-condition feature alignment and degradation trend modeling within a unified framework, thereby improving the model’s fault diagnosis and remaining lifetime prediction performance under complex operating conditions.
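The update rule for the feature extractor can be sketched with explicit gradient lists. This is a toy illustration rather than the training code: gradients are given as plain lists, `lam` stands for the reversal coefficient $\lambda$, and a plain gradient descent step stands in for the Adam optimizer used in the experiments.

```python
def grl_forward(features):
    """Forward pass of the gradient reversal layer: features pass through unchanged."""
    return list(features)


def grl_backward(grad_from_discriminator, lam):
    """Backward pass: the discriminator gradient is scaled by -lam."""
    return [-lam * g for g in grad_from_discriminator]


def feature_extractor_grad(g_cls, g_rul, g_adv, lam):
    """Gradient of L_cls + L_rul minus lam times the gradient of L_adv."""
    reversed_adv = grl_backward(g_adv, lam)
    return [c + r + a for c, r, a in zip(g_cls, g_rul, reversed_adv)]


def descent_step(params, grads, lr):
    """One plain gradient descent step (a simplification of the Adam update)."""
    return [p - lr * g for p, g in zip(params, grads)]


# Toy values: one feature extractor parameter and lam = 0.5 (assumed).
theta_f = [1.0]
grad_f = feature_extractor_grad(g_cls=[1.0], g_rul=[2.0], g_adv=[4.0], lam=0.5)
theta_f_new = descent_step(theta_f, grad_f, lr=0.1)
```

The discriminator itself would still descend the unreversed gradient of $L_{adv}$; only the path back into the feature extractor sees the sign flip.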

3. Experiment

3.1. Datasets and Experimental Environment

To verify the effectiveness of the proposed DANN-LSTM model in fault diagnosis and remaining life prediction for rolling element bearings, four publicly available bearing datasets were selected: CWRU, PHM2012, XJTU-SY, and IMS. These datasets are widely recognized benchmarks in bearing health monitoring and provide a standard basis for comparison with existing methods. The source and target operating conditions were selected according to different rotational speeds and load levels. The original vibration signals were processed by fixed-length sliding window slicing and amplitude normalization to construct training and testing sample sets containing fault category labels and remaining life labels. Cross-condition transfer experiments and full life cycle degradation modeling experiments were completed based on a unified data partitioning ratio. The dataset composition and operating condition distribution are shown in Table 1.
The table lists five categories of information: source, operating conditions, fault type, number of samples, and RUL range. The RUL range column indicates the minimum and maximum remaining useful life values assigned to samples in each dataset according to the linear degradation labeling rule from the first recorded time point to the failure point. The CWRU dataset covers three constant speed load conditions, corresponding to normal, inner race fault, and rolling element fault categories, respectively. The PHM2012 and XJTU-SY datasets provide bearing vibration data over the entire life cycle, covering the continuous degradation process from the initial state to failure. The IMS dataset includes composite fault types, covering combined failure modes of outer race, inner race, and rolling elements. All samples undergo the same preprocessing procedure to ensure data consistency for subsequent feature extraction and model training. The variable operating condition scenario in this study is simulated through cross-condition transfer tasks between datasets collected under distinct constant speed and load levels. The remaining useful life labeling for the PHM2012 dataset follows a linear degradation assumption from the start of recording to the failure point. Each bearing sample at time t is assigned an RUL value equal to the total runtime from the initial condition to the failure moment minus the elapsed time t. The maximum RUL is 280 min at the first recording point and decreases to 0 at the final point before failure. For the XJTU-SY dataset, the same linear labeling convention is applied with a maximum RUL of 150 min. For the IMS dataset, the maximum RUL is 164 min. The failure moment for each bearing run in these three datasets is defined as the time when the vibration amplitude exceeds a predefined threshold or when the experimental run is terminated due to physical failure.
For the PHM2012 dataset, the end-of-life threshold is 20 g vibration amplitude as defined in the PRONOSTIA experimental protocol. For the XJTU-SY dataset, the threshold is 10 g. For the IMS dataset, the threshold is 15 g. Each sliding window of 1024 points with 50 percent overlap is extracted sequentially along the time axis before the failure point. The remaining useful life label for a given sliding window is computed as the total runtime from the start of the bearing run to the failure point minus the timestamp of the last point in that window. No windows are extracted after the failure point. The CWRU dataset contains vibration recordings of fixed fault severity levels without temporal degradation progression. Therefore, the CWRU dataset is used exclusively for fault classification experiments and is not employed for any remaining useful life prediction task. The RUL prediction results reported in Section 4.2 and Section 4.3 are based solely on the PHM2012, XJTU-SY, and IMS datasets. The cross-condition transfer experiments in Section 4.3 that involve CWRU data (e.g., 0 hp→1 hp, 1 hp→3 hp) only evaluate fault classification accuracy, not RUL prediction error. The RUL RMSE values reported for the proposed model in Section 4.3 (e.g., 7.8, 8.1) correspond to RUL prediction tasks on the PHM2012 dataset only, not on CWRU. A complete dataset protocol is provided as follows. For the CWRU dataset, the selected runs include drive end bearing vibration data at 12 kHz sampling rate. Four health states are labeled as normal, inner race fault, outer race fault, and rolling element fault. Fault diameters are 0.007 inches for all fault types. For each condition, 120 samples are generated with a window length of 1024 points and 50 percent overlap. No RUL labels are assigned to CWRU samples. For the PHM2012 dataset, the selected run is bearing 1 from condition 1 with a sampling rate of 25.6 kHz. The full degradation sequence contains 280 min of operation. 
RUL labels are assigned linearly from 280 to 0 min. A total of 80 windows are extracted chronologically. For the XJTU-SY dataset, the selected run is bearing 2 under operating condition 1 with a sampling rate of 25.6 kHz. The maximum RUL is 150 min. For the IMS dataset, the selected run is bearing 1 from the first test with a sampling rate of 20 kHz. The maximum RUL is 164 min. All vibration signals are normalized using Z-score standardization before windowing. The window size is 1024 points with 50 percent overlap for all datasets. The train-validation-test split follows the chronological method described in Section 3.2 for sequential data and random splitting for independent fault data. The class mapping for CWRU fault diagnosis is: normal = 0, inner race fault = 1, outer race fault = 2, rolling element fault = 3. No cage fault data from the standard CWRU repository was included, so the fault classification experiment covers these four categories only. The PHM2012, XJTU-SY, and IMS datasets provide run-to-failure vibration sequences and are used only for remaining useful life prediction experiments. Cross-condition transfer experiments for fault diagnosis use the CWRU dataset under different load conditions. For example, transfer learning from the 1797 rpm, 0 hp condition to the 1772 rpm, 1 hp condition simulates a change in operating condition. The proposed model learns condition-invariant features from these discrete condition shifts, which is a necessary step before applying it to continuously varying conditions.
The CWRU dataset includes measurements at 1797 rpm with 0 hp load, 1772 rpm with 1 hp load, 1750 rpm with 2 hp load, and 1730 rpm with 3 hp load. The 3 hp condition is used as the target operating condition in the 2 hp→3 hp and 1 hp→3 hp transfer tasks in Section 4.3.
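The window-level labeling rule described above (RUL equals the failure time minus the timestamp of the last point in the window, with no windows taken after the failure point) can be sketched as follows. The function names and the per-point timestamp list `timestamps_min` are illustrative assumptions; how the raw recordings are mapped to timestamps is dataset-specific and is not shown.

```python
def sliding_windows(n_points, window=1024, overlap=0.5):
    """Return (start, stop) index pairs for fixed-length overlapping windows."""
    step = max(1, int(round(window * (1.0 - overlap))))
    return [(s, s + window) for s in range(0, n_points - window + 1, step)]


def label_windows(timestamps_min, failure_time_min, window=1024, overlap=0.5):
    """Attach a linear RUL label (in minutes) to each window before failure.

    timestamps_min[i] is the acquisition time of sample i in minutes
    (a hypothetical input format). Returns (start, stop, rul) tuples.
    """
    labeled = []
    for start, stop in sliding_windows(len(timestamps_min), window, overlap):
        t_last = timestamps_min[stop - 1]  # timestamp of the last point in the window
        if t_last > failure_time_min:
            break  # no windows are extracted after the failure point
        labeled.append((start, stop, failure_time_min - t_last))
    return labeled


# Toy run: 10 points sampled once per minute, failure at minute 9,
# window of 4 points with 50 percent overlap.
toy_times = [float(i) for i in range(10)]
toy_labels = label_windows(toy_times, failure_time_min=9.0, window=4, overlap=0.5)
```

With the settings used in this study, the same call would use `window=1024` and `overlap=0.5`, and `failure_time_min` would be the time at which the amplitude threshold for the dataset is reached.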

3.2. Data Preprocessing and Sample Construction

In the data preprocessing stage, the original vibration signal is segmented by a sliding window to construct the sample set, and Z-score normalization is used to remove scale differences and accelerate model convergence. For RUL prediction, the remaining useful life labels were normalized to the range between 0 and 1 using min-max normalization separately for each dataset. Because the minimum RUL is 0, the normalized RUL value for a sample is the original RUL divided by the maximum RUL of that dataset. The model outputs a normalized RUL prediction, which is then rescaled back to the original unit for error calculation. For each sliding window sample, the remaining useful life label is assigned as the difference between the failure time and the window ending time. The failure time is determined by the vibration amplitude reaching the predefined threshold specified in Section 3.1. To prevent data leakage, the train-validation-test split was performed before sliding window extraction. For the PHM2012 and XJTU-SY full-lifecycle datasets, the original continuous signal was first divided chronologically into three segments: the first 70 percent of the time series for training, the next 15 percent for validation, and the last 15 percent for testing. Then, sliding windows of 1024 points with 50 percent overlap were extracted separately from each segment. This ensures that no window from the training set overlaps in time with any window from the validation or test sets. For the CWRU and IMS datasets, where samples are independent fault recordings without temporal continuity, random splitting of the windowed samples was applied because no chronological dependency exists across different fault runs.
For the CWRU and IMS datasets, the random split used the same 70/15/15 ratio as the chronological split. The parameter settings of the above preprocessing operations are shown in Table 2.
The table lists the key parameters and their values for the three preprocessing steps: signal segmentation, normalization, and dataset partitioning. Window length and overlap rate determine the temporal coverage of each sample and the degree of data augmentation. Z-score normalization rescales each sample to zero mean and unit variance. The allocation into training, validation, and testing sets separates model parameter learning, hyperparameter tuning, and independent performance evaluation. The chronological split ensures that the validation and testing samples always come from later stages of the degradation process than the training samples, thereby avoiding optimistic bias in performance evaluation.
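The order of operations for the full-lifecycle datasets (split first, then window each segment) can be sketched as below, together with Z-score normalization and the RUL rescaling. The helper names are illustrative, and the toy signal and small window size are chosen only to keep the example short.

```python
import statistics


def chronological_split(signal, train_pct=70, val_pct=15):
    """Split a time-ordered sequence into train, validation, and test segments."""
    n = len(signal)
    i_train = n * train_pct // 100
    i_val = n * (train_pct + val_pct) // 100
    return signal[:i_train], signal[i_train:i_val], signal[i_val:]


def extract_windows(segment, window=1024, overlap=0.5):
    """Cut one segment into overlapping windows; windows never cross segments."""
    step = max(1, int(round(window * (1.0 - overlap))))
    return [segment[s:s + window] for s in range(0, len(segment) - window + 1, step)]


def zscore(values):
    """Rescale a sequence to zero mean and unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def normalize_rul(rul, max_rul):
    """Min-max normalization with a minimum of 0: RUL divided by the maximum RUL."""
    return rul / max_rul


def denormalize_rul(rul_norm, max_rul):
    """Map a normalized prediction back to the original unit."""
    return rul_norm * max_rul


# Toy signal of 100 time-ordered points, windows of 10 points with 50 percent overlap.
toy_signal = list(range(100))
train_seg, val_seg, test_seg = chronological_split(toy_signal)
train_windows = extract_windows(train_seg, window=10, overlap=0.5)
```

Because windows are cut inside each segment, the last training window ends before the first validation sample begins, which is the property the chronological protocol relies on.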

3.3. Model Parameter Settings

In the experimental setup phase, specific configurations were made for the feature extractor, condition discriminator, temporal modeling module, and training hyperparameters of the proposed DANN-LSTM model. The feature extractor adopted a three-layer convolutional structure with multi-scale convolutional kernels, the condition discriminator consisted of two fully connected layers, and the temporal modeling module used a two-layer LSTM. During training, the Adam optimizer was used and the initial learning rate and early stopping strategy were set. The specific values of the above parameters are summarized in Table 3.
The detailed architecture of the proposed DANN-LSTM model is presented in Table 4, including input and output dimensions for each layer.
This table lists the parameter names and corresponding values for four categories: feature extractor, condition discriminator, temporal modeling module, and training hyperparameters. Specific values for parameters such as the number of convolutional kernels, kernel size, number of hidden layer nodes, LSTM hidden layer dimension, initial learning rate, batch size, and number of training epochs are presented in the table. Each parameter is accompanied by a brief description indicating its structural characteristics or the basis for its settings. The default hyperparameter values were chosen based on preliminary grid search results. The sensitivity analysis in Section 4.7 confirms that the selected values yield near-optimal performance.

4. Results and Discussion

4.1. Fault Diagnosis Performance Analysis

To evaluate the discriminative ability and learning stability of the proposed model in the fault classification task, the experiment used the CWRU bearing dataset with four health states: normal, inner race fault, outer race fault, and rolling element fault. The vibration signals collected under each working condition were segmented and normalized before being input into the DANN-LSTM model. The model was trained for 50 epochs under a unified optimization framework, and the classification loss and accuracy of each epoch were recorded. On the test set, the normalized confusion matrix and the precision, recall, and F1 score of each class were computed. The results are shown in Figure 5.
Figure 5 reports the fault diagnosis results. Figure 5a gives the normalized confusion matrix for the four CWRU health states: normal, inner race fault, outer race fault, and rolling element fault. Figure 5b reports per-class precision, recall, and F1 score, showing balanced recognition across the four classes. Figure 5c presents the training and validation accuracy/loss curves, confirming stable convergence rather than relying only on endpoint accuracy.
To evaluate the stability of the fault classification performance, the model was trained 10 times with different random seeds. Table 5 reports the mean and standard deviation of precision, recall, and F1 score for each fault class across 10 runs.
The confusion matrix values in Figure 5a are from a single representative run; the standard deviations across runs are all below 0.5 percentage points.
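For reference, per-class precision, recall, and F1 score follow directly from the confusion matrix counts. The sketch below assumes rows index the true class and columns index the predicted class; the two-class matrix is a made-up example, not data from Figure 5.

```python
def per_class_metrics(confusion):
    """Compute precision, recall, and F1 per class.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    results = []
    for c in range(n_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp  # true class c, predicted as another class
        fp = sum(confusion[r][c] for r in range(n_classes)) - tp  # predicted c, true class differs
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2.0 * precision * recall / denom if denom else 0.0
        results.append({"precision": precision, "recall": recall, "f1": f1})
    return results


# Hypothetical two-class confusion matrix used only to exercise the function.
toy_confusion = [[8, 2],
                 [1, 9]]
toy_metrics = per_class_metrics(toy_confusion)
```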
In summary, DANN-LSTM achieves high accuracy, balanced per-class performance, and stable convergence in the fault diagnosis task.

4.2. Performance Analysis of Remaining Life Prediction

To characterize the model’s time series fitting ability and overall error distribution in lifetime prediction tasks, this section constructs a multi-angle experimental visualization around the remaining lifetime prediction process. All remaining useful life prediction results in this section are based on the PHM2012 dataset. The actual degradation trajectory shown in Figure 6a is from a single bearing run in the PHM2012 dataset with a maximum RUL of 280 min. The predicted trajectory of a single device sample is tracked throughout its complete operating cycle, and the statistical characteristics of the prediction errors of all samples are jointly displayed along with the correspondence between the actual and predicted values. The experiment uses the test set samples as the object, comparing the model’s output remaining lifetime sequence with the actual degradation trajectory. Simultaneously, the distribution of prediction errors for all samples is statistically analyzed, and the overall fitting degree is observed through the mapping relationship between the actual and predicted values. Thus, the prediction performance is described from three aspects: time series changes, error statistical characteristics, and regression consistency. The results are shown in Figure 6.
Figure 6a shows the relationship between the actual remaining lifetime curve and the model prediction curve during the equipment’s operating cycle. The horizontal axis represents the time step; the vertical axis represents the remaining lifetime; the blue solid line represents the actual degradation trajectory, and the red dashed line represents the model prediction result. The actual lifetime gradually decreases from 100 to 0, forming an approximately monotonic degradation trend as defined by the linear labeling convention in the PHM2012 dataset. The prediction curve changes along the actual trajectory and reaches the final predicted value of 1.56 at time step 100. The prediction curve fluctuates slightly around the actual curve throughout the cycle, indicating that the model output maintains stable tracking ability during the degradation evolution process. This phenomenon shows that the model maintains a continuous state update mechanism in long-term sequence modeling. The LSTM hidden state retains historical degradation information during sequence propagation, thus ensuring that the predicted trajectory is consistent with the actual degradation trend. Figure 6b shows the statistical distribution of prediction errors for all test samples. The horizontal axis represents the prediction error, and the vertical axis represents the frequency of occurrence. The errors are mainly concentrated in the range of −10 to 10, with an overall root mean square error of 7.9, a maximum negative error of −12.17, and a maximum positive error of 10.86. Over 10 independent runs, the RUL prediction RMSE on the PHM2012 dataset was 7.9 ± 0.4 with a 95 percent confidence interval of [7.5, 8.3]. Beyond RMSE, three additional prognostic metrics were computed to evaluate the RUL predictions. The first metric was the mean absolute error, which was 6.2 ± 0.3 min. 
The second metric was the score function defined in the PHM2012 challenge, which penalizes late predictions more heavily than early predictions. The proposed model achieved a score of 0.42, outperforming the CNN-LSTM baseline which scored 0.68. The third metric was the prediction error percentage at the early stage (first 25 percent of life), middle stage (25 to 75 percent), and late stage (last 25 percent). The errors were 8.1 percent, 6.5 percent, and 5.2 percent of the remaining life respectively, indicating no systematic bias toward overestimation or underestimation. These metrics provide a more complete assessment of prognostic performance than RMSE alone. The paired t-test comparing the proposed model with the CNN-LSTM baseline on the RUL task yielded a p-value less than 0.001. The error distribution forms a concentrated peak near 0. This distribution pattern indicates a high degree of consistency between the model’s predicted and actual values. The error concentration phenomenon reflects that, after feature space alignment, data from different operating conditions form a stable distribution in the same representation space, enabling the regression model to maintain a similar mapping relationship among different samples. Figure 6c shows the scatter distribution between the actual remaining lifetime and the predicted remaining lifetime, where the horizontal axis represents the actual value, the vertical axis represents the predicted value, and the dashed line represents the ideal regression relationship y = x. Most sample points are distributed near the ideal line, with the actual lifetime ranging from 30 to 200. The corresponding predicted values form a dense banded distribution within the same range, and the scatter points deviate from the ideal line by less than 15 overall. This distribution pattern indicates that the model maintains a stable regression relationship at different lifetime stages. 
The condition-adversarial alignment mechanism enables the source and target operating condition features to form a consistent structure in the high-dimensional space, thereby reducing the disturbance caused by the distribution differences between different operating conditions to the regression mapping. Combining the changes in the three sub-figures, it can be observed that the model maintains good performance in terms of time series fitting, error distribution stability, and overall regression consistency.

4.3. Analysis of Cross-Working Condition Generalization Ability

To evaluate the generalization performance of the proposed model in cross-condition transfer tasks, five source-to-target transfer experiments were designed (0 hp→1 hp, 0 hp→2 hp, 1 hp→2 hp, 2 hp→3 hp, 1 hp→3 hp). A CNN-LSTM without domain adaptation was used as the baseline model, and the proposed DANN-LSTM model was used to perform fault classification and remaining lifetime prediction, respectively. Three additional baseline methods were implemented for comparison. The first method was a standalone DANN model containing the same convolutional feature extractor and condition discriminator but without the LSTM temporal modeling module, performing only fault classification. The second method was a domain-regularized autoencoder using maximum mean discrepancy loss for feature alignment followed by a separate LSTM network for remaining useful life prediction. The third method was a recent domain generalization approach based on style transfer and contrastive representation learning. The classification accuracy, root mean square error, and accuracy improvement were recorded for each transfer task. The accuracy comparison results across different condition adaptation methods are presented in Table 6. The overall performance comparison is also shown in Figure 7.
Figure 7 compares the baseline model and the proposed DANN-LSTM model on the five cross-condition transfer tasks. Figure 7a is a bar chart, with the horizontal axis representing the transfer task and the vertical axis representing the classification accuracy (%). The blue bars represent the baseline model, the orange bars represent the proposed model, and the error bars represent the standard deviation of five repeated experiments. In the 0 hp→1 hp task, the baseline model achieved an accuracy of 84.5%, while the proposed model achieved 96.7%, which is 12.2 percentage points higher; in the 0 hp→2 hp task, the baseline achieved 81.2% and the proposed model 95.4% (14.2 percentage points higher); in the 1 hp→2 hp task, 83.3% and 97.2% (13.9 percentage points higher); in the 2 hp→3 hp task, 79.8% and 94.8% (15.0 percentage points higher); and in the 1 hp→3 hp task, 78.6% and 94.1% (15.5 percentage points higher). The proposed model therefore achieves higher accuracy than the baseline model on all transfer tasks. Furthermore, as the difference between the source and target operating conditions increases (e.g., 1 hp→3 hp), the accuracy of the baseline model decreases to 78.6%, while the proposed model maintains an accuracy of 94.1%. This behavior is attributed to the gradient reversal layer in the proposed model, which forces the feature extractor and condition discriminator to train adversarially, aligning the feature distributions of the source and target operating conditions and thereby reducing the interference of working condition differences on classification decisions.
Figure 7b compares the root mean square error (RMSE) of remaining lifetime prediction, with the same bar chart structure as Figure 7a. The RUL prediction RMSE values shown in Figure 7b and discussed in this section are obtained from experiments on the PHM2012 dataset only. No RUL prediction is performed on the CWRU dataset. The RMSEs of the baseline model on the five tasks are 13.2, 14.8, 13.5, 15.2, and 16.1, respectively, while the corresponding values for the proposed model are 7.8, 8.1, 7.5, 8.4, and 8.9. The error reductions of the proposed model are 5.4, 6.7, 6.0, 6.8, and 7.2, respectively. The proposed model achieves lower RMSE across all transfer tasks, especially in the 1 hp→3 hp task with the largest load difference, where the baseline model error rises to 16.1 while that of the proposed model is only 8.9. This is because temporal degradation modeling is more stable when condition-aligned feature sequences are input into the LSTM, which avoids distortion of the degradation trajectory caused by load shift. Figure 7c is a single bar chart showing the accuracy improvement of the proposed model over the standard CNN-LSTM baseline, which is 12.2, 14.2, 13.9, 15.0, and 15.5 percentage points for the five tasks, respectively, with specific values labeled at the top of the bars. The proposed model outperformed the standalone DANN baseline with an average accuracy increase of 6.2 percentage points across the five transfer tasks. Compared to the domain-regularized autoencoder, the average increase was 8.4 percentage points. Compared to the domain generalization method, the average increase was 7.5 percentage points. The improvement over the standard CNN-LSTM baseline tends to grow with the load difference between the source and target operating conditions, reaching a maximum of 15.5 percentage points in the 1 hp→3 hp task.
The proposed DANN-LSTM model outperformed CDAN with an average accuracy increase of 8.8 percentage points across the five transfer tasks. Compared to WDGRL, the average increase was 8.1 percentage points. Compared to CORAL, the average increase was 10.8 percentage points. This trend reflects that the condition-adversarial mechanism plays a more critical role in transfer scenarios with large load differences, because the original feature distribution shift is more severe at this time, and the feature alignment benefits brought by adversarial training are more obvious.
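The per-task gains over the CNN-LSTM baseline quoted above can be rechecked from the reported accuracies and RMSE values. The short sketch below only reproduces that arithmetic; the variable names are illustrative, and the mean gain is a derived quantity computed here.

```python
tasks = ["0 hp->1 hp", "0 hp->2 hp", "1 hp->2 hp", "2 hp->3 hp", "1 hp->3 hp"]

# Values as reported in the text for the CNN-LSTM baseline and the proposed model.
baseline_acc = [84.5, 81.2, 83.3, 79.8, 78.6]
proposed_acc = [96.7, 95.4, 97.2, 94.8, 94.1]
baseline_rmse = [13.2, 14.8, 13.5, 15.2, 16.1]
proposed_rmse = [7.8, 8.1, 7.5, 8.4, 8.9]


def differences(a, b, ndigits=1):
    """Element-wise a - b, rounded to the precision used in the text."""
    return [round(x - y, ndigits) for x, y in zip(a, b)]


acc_gain = differences(proposed_acc, baseline_acc)          # percentage points
rmse_reduction = differences(baseline_rmse, proposed_rmse)  # same unit as the reported RMSE
mean_acc_gain = round(sum(acc_gain) / len(acc_gain), 2)     # derived here for illustration

for task, gain, drop in zip(tasks, acc_gain, rmse_reduction):
    print(f"{task}: +{gain} points accuracy, -{drop} RMSE")
```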
To quantify the statistical significance of the cross-condition transfer improvements, the model was evaluated across 10 independent runs for each transfer task. Table 7 reports the mean accuracy and standard deviation for the proposed DANN-LSTM and the CNN-LSTM baseline, along with paired t-test p-values.
The fault diagnosis accuracy of different methods under cross-condition transfer tasks is summarized in Table 8.
Five additional baseline models were implemented to isolate the contribution of each component. The first baseline was a standalone CNN model with two fully connected layers for fault classification, without any LSTM or condition adaptation. The second baseline was a standalone two-layer LSTM model for RUL prediction, using raw vibration windows as input without CNN feature extraction. The third baseline was a CNN-LSTM model without the domain adversarial module, serving as the non-adaptive version of the proposed model. The fourth baseline was a DANN-CNN model without the LSTM module, performing only fault classification with condition alignment but no temporal modeling. The fifth baseline was a single-task model trained only on the RUL regression loss, without the fault classification or condition alignment losses. All baselines were evaluated on the same cross-condition transfer tasks and RUL prediction tasks as the proposed model. The results of the component-isolation study for fault diagnosis and RUL prediction are presented in Table 9.
A direct comparison with closely related methods was conducted. Four baseline models representing standard DANN and LSTM implementations as well as their domain-adaptive variants were re-implemented under the same experimental settings as the proposed model. The results are presented in Table 10.
The results for DANN and LSTM are from re-implementation under the same dataset splits and evaluation protocol as the proposed model. The CNN-LSTM with condition adaptation uses feature alignment loss without the gradient reversal layer structure. The domain-regularized LSTM adds a maximum mean discrepancy term to the LSTM loss.
In summary, the proposed DANN-LSTM model effectively aligns the feature distribution across operating conditions through adversarial condition adaptation, achieving superior generalization performance compared to the baseline model in both fault diagnosis and remaining life prediction tasks.

4.4. Feature Alignment Effect Analysis

To examine the impact of condition-adversarial training on the feature space structure, this section maps the high-dimensional features of the source and target operating condition samples to a two-dimensional embedding space for visualization after the feature extraction network is trained, thus visually presenting the distribution relationship of data under different operating conditions in the feature space. The experiment uses the feature vector output by the convolutional network as input, constructs a two-dimensional representation using the t-SNE dimensionality reduction method, and projects the source and target operating condition samples before and after adversarial training. The t-SNE parameters were fixed as follows: perplexity was set to 30, learning rate to 200, number of iterations to 1000, and random seed to 42 for reproducibility. These settings were kept identical for both the before-alignment and after-alignment visualizations. Simultaneously, color differentiation is performed based on operating condition labels and fault category labels to observe the aggregation and category distribution structure of samples across operating conditions, thereby obtaining the spatial distribution results before and after feature alignment, as shown in Figure 8.
Figure 8 illustrates how the feature space distribution changes before and after adversarial training. Figure 8a shows the source and target operating condition samples as red and blue scatter points. Two distinct clustering regions appear in the two-dimensional space: the source samples lie mainly in the left region, while the target samples are concentrated in the right region. The average Euclidean distance between the centers of the two groups reaches 12.3, and the cross-condition sample overlap ratio is only 8.6%. This pattern indicates a significant condition shift in the feature space. Without alignment, the feature extractor mainly learns statistical features tied to the working conditions, which produces a clear separation between source and target conditions. Figure 8b again shows the two domains as red and blue scatter points. After adversarial training, the two groups are highly mixed in the embedding space. The distance between the domain centers decreases to 3.1, and the cross-condition sample overlap ratio increases to 67.4%. This change indicates that data from different working conditions become more consistent in the feature space. The gradient reversal layer forces the feature extractor to weaken condition-related information during training, making it difficult for the condition discriminator to identify the source of a sample and thereby promoting alignment of the source and target feature distributions.
Figure 8c uses four colors to distinguish the fault categories. The blue, orange, green, and red scatter points correspond to healthy status, inner race fault, outer race fault, and rolling element fault, respectively. The four sample types form four separate clusters. The average intra-cluster distance remains within 1.8, while the inter-cluster center distance reaches 6.7. This pattern indicates that the feature space retains a clear category structure after condition alignment: the adversarial learning process reduces differences between operating conditions while retaining fault discrimination information. Taken together, the three sub-figures show that adversarial training merges the source and target operating condition samples in the feature space while maintaining a stable category cluster structure.
To evaluate the condition alignment effect quantitatively, beyond t-SNE visualization, three standard condition adaptation metrics were computed on the feature representations before and after adversarial training. The first metric was the domain classification accuracy of a linear classifier trained on the extracted features to predict operating condition labels; a lower accuracy indicates better domain invariance. The second metric was the A-distance, computed as $d_A = 2(1 - 2\epsilon)$, where $\epsilon$ is the domain classification error rate; a smaller value indicates less domain discrepancy. The third metric was the class separation score, defined as the ratio of between-class distance to within-class distance in the feature space; a larger value indicates better preservation of class-discriminative information. All metrics were evaluated on the CWRU dataset for the 0 hp→1 hp transfer task. Before adversarial training, the linear domain classifier achieved an accuracy of 96.3%, the A-distance was 1.92, and the class separation score was 4.5. After adversarial training, the domain classification accuracy dropped to 54.2%, the A-distance decreased to 0.39, and the class separation score remained stable at 4.3. These results confirm that domain-adversarial training reduces domain-specific information while preserving fault class separability.
The overlap ratio reported in Figure 8 was computed as the proportion of sample pairs from different operating conditions whose Euclidean distance in the feature space was below the average intra-condition distance. The center distance was the Euclidean distance between the centroids of the source and target condition samples. Both values were averaged over 10 independent runs with random subsampling of 500 points per class for stability.
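The alignment metrics described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the synthetic Gaussian features, the function names, and the choice to average the two intra-condition distances into a single overlap threshold are assumptions made for illustration.

```python
import numpy as np

def proxy_a_distance(domain_error: float) -> float:
    """A-distance as stated in the text: 2 * (1 - 2 * error)."""
    return 2.0 * (1.0 - 2.0 * domain_error)

def pairwise_distances(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Euclidean distance between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def mean_intra_distance(x: np.ndarray) -> float:
    """Average distance over all distinct sample pairs within one condition."""
    d = pairwise_distances(x, x)
    return float(d[np.triu_indices(len(x), k=1)].mean())

def center_distance(src: np.ndarray, tgt: np.ndarray) -> float:
    """Euclidean distance between the source and target centroids."""
    return float(np.linalg.norm(src.mean(axis=0) - tgt.mean(axis=0)))

def overlap_ratio(src: np.ndarray, tgt: np.ndarray) -> float:
    """Share of cross-condition pairs closer than the average intra-condition distance."""
    # Assumption: the threshold is the mean of the two intra-condition averages.
    threshold = 0.5 * (mean_intra_distance(src) + mean_intra_distance(tgt))
    return float((pairwise_distances(src, tgt) < threshold).mean())

# Synthetic stand-ins for extracted features (hypothetical, 8-dimensional).
rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(200, 8))
target_shifted = rng.normal(6.0, 1.0, size=(200, 8))   # mimics "before alignment"
target_aligned = rng.normal(0.0, 1.0, size=(200, 8))   # mimics "after alignment"

for name, target in (("shifted", target_shifted), ("aligned", target_aligned)):
    print(name,
          round(center_distance(source, target), 2),
          round(overlap_ratio(source, target), 3))
```

With a domain classifier at chance level (error 0.5) the A-distance evaluates to 0, and with a perfect domain classifier (error 0) it evaluates to 2, which is why smaller values indicate better alignment.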

4.5. Overall Stability Analysis of the Model

To evaluate the output consistency and prediction reliability of the constructed model under random initialization and external disturbance conditions, this section conducts multiple independent repeated experiments under the same training configuration, and it gradually introduces noise of different intensities into the input signal to simulate signal disturbances under complex operating conditions. At the same time, the distribution characteristics of the remaining lifetime prediction error at the sample level are statistically analyzed. The overall behavior of the model is observed and quantitatively described from three perspectives: model output stability, disturbance resistance, and prediction error distribution. The results obtained under the above experimental settings are shown in Figure 9.
Figure 9 contains three subplots that characterize model output stability from different perspectives. Figure 9a shows the classification accuracy and RUL prediction error across 10 independent training experiments. The horizontal axis represents the experiment run number, the left vertical axis represents classification accuracy, and the right vertical axis represents root mean square error. The two lines correspond to accuracy and RMSE, respectively. The accuracy values in the 10 experiments range from 98.62% to 98.79%, a maximum difference of 0.17 percentage points. The RMSE values range from 7.83 to 8.02, a fluctuation range of 0.19. The model therefore maintains a highly consistent output level under different random initializations. This indicates that the feature extraction structure and the condition-adversarial alignment mechanism form a stable parameter update path during training, reducing the impact of random weight initialization on the model’s convergence state.
The noise robustness experiment was conducted as follows. Additive white Gaussian noise with zero mean was added to the raw vibration signals. The signal-to-noise ratio was defined as $\mathrm{SNR} = 20 \log_{10}(\mathrm{RMS}_{\mathrm{signal}} / \mathrm{RMS}_{\mathrm{noise}})$ dB, where $\mathrm{RMS}_{\mathrm{signal}}$ and $\mathrm{RMS}_{\mathrm{noise}}$ are the root mean square values of the original signal and the noise. The SNR levels tested were 30 dB, 20 dB, 15 dB, 10 dB, and 5 dB. Noise was added only during testing; the models were trained on clean signals without any noise augmentation. For each SNR level, the same noise realization was applied to all test samples to ensure fair comparison across models. The experiment was repeated 10 times with different random noise seeds, and for each repetition the same set of noise seeds was used for all models. The classification accuracy and RUL RMSE values reported in Figure 9b are the means over these 10 runs. The standard deviations at each SNR level are presented in Table 11.
Figure 9b shows how model performance changes with the signal-to-noise ratio. The horizontal axis represents the signal-to-noise ratio level, and the vertical axes represent accuracy and RMSE, respectively. As the signal-to-noise ratio decreased from 30 dB to 5 dB, the accuracy decreased from 98.75% to 97.68%, an overall change of 1.07 percentage points, while the RMSE increased from 7.82 to 9.12, an increment of 1.30. The changes were gradual, indicating that the model maintained relatively stable diagnostic performance and prediction accuracy as the noise level increased. This behavior is related to the robust representation of local patterns by the convolutional feature extraction network and to the feature distribution constraints introduced by condition-adversarial training, which allow the key degradation information in the feature space to retain a recognizable structure under noise interference. Table 11 reports the mean accuracy and RMSE with 95% confidence intervals over 10 runs at each signal-to-noise ratio level.
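The noise-injection protocol described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the sine test signal, the seed, and the choice to rescale each noise realization so that the realized SNR matches the target exactly are assumptions introduced here.

```python
import numpy as np

def rms(x: np.ndarray) -> float:
    """Root mean square of a signal."""
    return float(np.sqrt(np.mean(np.square(x))))

def add_awgn(signal: np.ndarray, snr_db: float, rng: np.random.Generator) -> np.ndarray:
    """Add white Gaussian noise so that 20*log10(rms(signal)/rms(noise)) equals snr_db."""
    noise = rng.standard_normal(signal.shape)
    target_noise_rms = rms(signal) / (10.0 ** (snr_db / 20.0))
    # Assumption: rescale the realization so the realized SNR hits the target exactly.
    noise = noise * (target_noise_rms / rms(noise))
    return signal + noise

def measured_snr_db(clean: np.ndarray, noisy: np.ndarray) -> float:
    """Recover the SNR in dB from a clean signal and its noisy version."""
    return 20.0 * float(np.log10(rms(clean) / rms(noisy - clean)))

# Hypothetical 1024-point test window (a pure tone standing in for a vibration segment).
t = np.arange(1024) / 1024.0
clean = np.sin(2.0 * np.pi * 40.0 * t)

snr_levels = (30, 20, 15, 10, 5)           # levels listed in the text
rng = np.random.default_rng(42)            # fixed seed so every model sees the same noise
noisy_by_level = {snr: add_awgn(clean, snr, rng) for snr in snr_levels}

for snr, noisy in noisy_by_level.items():
    print(snr, round(measured_snr_db(clean, noisy), 6))
```

Reusing the same seeded generator for every model under comparison reproduces the "same noise seeds for all models" condition stated in the protocol.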
Figure 9c shows the probability density distribution of the RUL prediction residuals, where the horizontal axis represents the prediction residual values and the vertical axis represents the probability density. The residual samples are mainly concentrated in the range of −6 to 6, with a mean close to 0 and a standard deviation of approximately 3.2. The distribution exhibits a symmetrical bell-shaped structure, indicating that the model prediction error mainly fluctuates around the true value without significant systematic bias. This error distribution stems from the LSTM’s continuous state modeling of the degradation time series, which allows the degradation trend information to be stably expressed in the temporal dimension, thereby reducing the impact of single-time observation noise on the prediction results. These results collectively demonstrate that the constructed model exhibits high consistency and stability in terms of repeated training, noise perturbation, and sample error.

4.6. Ablation Study on Loss Weights

The default balance coefficients were $\lambda_{adv} = 1.0$ and $\lambda_{rul} = 1.0$. A series of ablation experiments varied each coefficient independently while keeping the other fixed at 1.0, and the fault classification accuracy and RUL prediction root mean square error were recorded for each setting. The influence of the two coefficients on fault diagnosis accuracy and RUL prediction performance is summarized in Table 12. When $\lambda_{adv}$ was reduced to 0.1, the condition alignment loss contributed minimally, and the classification accuracy decreased from 98.7% to 96.2%. When $\lambda_{adv}$ was increased to 10.0, the accuracy dropped further to 95.8% because excessive alignment pressure weakened fault-discriminative features. For $\lambda_{rul}$, setting it to 0.1 increased the RUL prediction error from 7.9 to 9.4, as the lifetime regression task was underweighted. Setting $\lambda_{rul}$ to 10.0 increased the error to 8.8, indicating that over-emphasizing regression slightly degraded classification performance. The default configuration achieved the best balanced performance, and the default values of 1.0 for both coefficients are recommended for reproducibility.
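The one-at-a-time ablation over the balance coefficients can be sketched as below. The weighted-sum form of the joint loss is an assumption based on the description of the two coefficients as weights for the condition alignment and RUL regression terms; the function names and the example loss values are hypothetical.

```python
DEFAULTS = {"lambda_adv": 1.0, "lambda_rul": 1.0}
SWEEP_VALUES = (0.1, 1.0, 10.0)  # values discussed in the ablation text

def joint_loss(l_cls: float, l_adv: float, l_rul: float,
               lambda_adv: float, lambda_rul: float) -> float:
    """Assumed weighted sum of classification, condition alignment, and RUL losses."""
    return l_cls + lambda_adv * l_adv + lambda_rul * l_rul

def one_at_a_time(defaults: dict, values: tuple) -> list:
    """Default setting plus every setting that changes exactly one coefficient."""
    settings = [dict(defaults)]
    for name in defaults:
        for value in values:
            if value != defaults[name]:
                settings.append({**defaults, name: value})
    return settings

settings = one_at_a_time(DEFAULTS, SWEEP_VALUES)
for setting in settings:
    # Hypothetical per-batch loss values, used only to show how the weights act.
    print(setting, round(joint_loss(0.2, 0.7, 3.0, **setting), 3))
```

The sweep yields five configurations: the default and two perturbed values for each coefficient, which matches the structure of the ablation described above.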

4.7. Hyperparameter Sensitivity Analysis

A sensitivity analysis was conducted on the key hyperparameters listed in Table 13. The learning rate was varied from 0.0001 to 0.01 while keeping all other parameters fixed at the default values. The batch size was varied from 16 to 128. The number of LSTM hidden units was varied from 32 to 256. For each hyperparameter setting, the fault classification accuracy and RUL prediction RMSE were recorded on the validation set. The results are presented in Table 6. The model achieved the best performance at a learning rate of 0.001, batch size of 64, and LSTM hidden units of 128. Deviations from these values led to a decrease in accuracy of up to 2.3 percentage points and an increase in RMSE of up to 1.2. The model showed stable performance within a moderate range of hyperparameter variation, with accuracy fluctuations below 1.5 percentage points when the learning rate was between 0.0005 and 0.002.

4.8. Ablation Study on Component Contributions and Joint Training

Four additional model variants were implemented to isolate the contribution of each component and to compare joint training against separate training. The first variant removed the domain adversarial module entirely, keeping only the CNN feature extractor and the fault classifier. The second variant removed the LSTM module, using only the CNN feature extractor followed by a fully connected regression layer for RUL prediction. The third variant trained the classification task and the condition alignment task separately from the RUL prediction task. In this separate training setup, the feature extractor and condition discriminator were first trained for 30 epochs using only the classification loss and the condition alignment loss. Then the LSTM module was trained for another 30 epochs with the feature extractor frozen, using only the RUL regression loss. The fourth variant removed the classification loss entirely, training only with condition alignment loss and RUL regression loss. All variants were evaluated on the 0 hp→1 hp transfer task for fault classification accuracy and on the PHM2012 dataset for RUL prediction RMSE. The results are presented in Table 14. The full joint training model achieved a classification accuracy of 96.7 percent and an RUL RMSE of 7.9. Removing the domain adversarial module reduced classification accuracy to 84.5 percent while RUL RMSE increased to 9.2. Removing the LSTM module increased RUL RMSE to 12.4. Separate task training produced a classification accuracy of 92.3 percent and an RUL RMSE of 9.6, both worse than joint training. Removing classification loss reduced classification accuracy to 85.1 percent. These results confirm that joint optimization of all three losses with shared feature extraction yields better performance than separate training or removing any component.

4.9. Targeted Ablation on Condition Alignment and RUL Information Preservation

To address the concern that condition-adversarial alignment may remove degradation information needed for RUL prediction, an additional controlled ablation was conducted using the same CNN-LSTM feature extractor and chronological data split. Only the condition-alignment weight was changed; the segmentation, LSTM structure, RUL labels, optimizer, and train/validation/test partitions were kept unchanged. The aim was not simply to compare full modules, but to test whether RUL-relevant temporal ordering survives moderate alignment pressure. The quantitative results of the controlled ablation are presented in Table 15 and allow an evaluation of whether condition alignment preserves the degradation-related information relevant to RUL prediction.
The moderate-alignment setting improves transfer classification and reduces RUL error relative to the no-alignment setting, while the latent degradation-order correlation remains nearly unchanged. Excessive alignment, by contrast, increases RUL error and weakens the degradation-order correlation. The results indicate that the selected alignment weight preserves useful degradation information empirically, whereas overly strong alignment can damage RUL modeling. The interpretation of the model has therefore been moderated: the framework encourages condition-robust classification while retaining degradation cues under the selected training configuration.
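One way to quantify whether degradation ordering survives alignment is a rank correlation between a scalar latent health indicator and the chronological sample index. The sketch below uses a Spearman correlation for this purpose; this is an assumption, since the exact definition of the degradation-order correlation is not given in this section, and the synthetic indicators are purely illustrative.

```python
import numpy as np

def ranks(x: np.ndarray) -> np.ndarray:
    """Ordinal ranks (0-based); ties are broken by position, which is a simplification."""
    order = np.argsort(x, kind="stable")
    out = np.empty(len(x), dtype=float)
    out[order] = np.arange(len(x))
    return out

def spearman(a: np.ndarray, b: np.ndarray) -> float:
    """Spearman rank correlation computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

# Hypothetical latent indicators over 200 chronological samples.
rng = np.random.default_rng(1)
time_index = np.arange(200, dtype=float)
ordered_latent = time_index ** 2 + rng.normal(0.0, 50.0, size=200)  # ordering mostly preserved
scrambled_latent = rng.permutation(ordered_latent)                  # ordering destroyed

print(round(spearman(time_index, ordered_latent), 3))
print(round(spearman(time_index, scrambled_latent), 3))
```

A value near 1 indicates that the latent representation still increases monotonically with operating time, while a value near 0 indicates that the temporal ordering has been lost.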

5. Conclusions

This study presented a condition-aware DANN-LSTM framework for rolling-bearing fault diagnosis and RUL prediction under operating condition shifts. The model uses a CNN to extract vibration features, a gradient-reversal branch to reduce condition-related bias for fault classification, and an LSTM branch to model chronological degradation for RUL estimation. The revised formulation clarifies that condition alignment is not treated as a complete removal of operating condition information; rather, it is a classification-oriented regularization whose influence on RUL modeling is examined through targeted ablation. Experiments on public bearing datasets show stable four-class fault diagnosis, an RUL RMSE of 7.9 on PHM2012, and improved transfer performance under controlled speed/load shifts. The work should therefore be understood as an application-specific integration and validation of established deep-learning components rather than as a new network family. Future work should test the framework on truly non-stationary industrial data with continuously varying loads, multi-fault coupling, and non-linear degradation labels.

Author Contributions

Conceptualization, Y.J.; methodology, Y.J. and R.X.; software, Y.J.; validation, Y.J. and R.X.; formal analysis, Y.J.; investigation, Y.J. and M.P.; data curation, Y.J.; writing—original draft preparation, Y.J.; writing—review and editing, R.X. and M.P.; visualization, Y.J.; supervision, R.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Scientific Research Foundations of Jimei University, Chengyi College (JMT00117).

Data Availability Statement

The data supporting the findings of this study are available from the corresponding author upon reasonable request. Public benchmark datasets used in this work include the Case Western Reserve University (CWRU) Bearing Data Center dataset and the PHM2012 Prognostics and Health Management Challenge dataset.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wei, H.; Zhang, Q.; Gu, Y. Fault diagnosis of rotating machinery: A highly efficient and lightweight framework based on a temporal convolutional network and broad learning system. Sensors 2023, 23, 5642. [Google Scholar] [CrossRef]
  2. Zhang, Q.; Su, N.; Qin, B.; Sun, G.; Jing, X.; Hu, S.; Cai, Y.; Zhou, L. Fault diagnosis for rotating machinery based on dimensionless indices: Current status, development, technologies, and future directions. Electronics 2024, 13, 4931. [Google Scholar] [CrossRef]
  3. Matania, O.; Dattner, I.; Bortman, J.; Kenett, R.S.; Parmet, Y. A systematic literature review of deep learning for vibration-based fault diagnosis of critical rotating machinery: Limitations and challenges. J. Sound Vib. 2024, 590, 118562. [Google Scholar] [CrossRef]
  4. Zhu, Z.; Lei, Y.; Qi, G.; Chai, Y.; Mazur, N.; An, Y.; Huang, X. A review of the application of deep learning in intelligent fault diagnosis of rotating machinery. Measurement 2023, 206, 112346. [Google Scholar] [CrossRef]
  5. Gao, Y.; Ahmad, Z.; Kim, J.M. The Prediction of the Remaining Useful Life of Rotating Machinery Based on an Adaptive Maximum Second-Order Cyclostationarity Blind Deconvolution and a Convolutional LSTM Autoencoder. Sensors 2024, 24, 2382. [Google Scholar] [CrossRef]
  6. Li, L.; Xu, J.; Li, J. Estimating remaining useful life of rotating machinery using relevance vector machine and deep learning network. Eng. Fail. Anal. 2023, 146, 107125. [Google Scholar] [CrossRef]
  7. Bagri, I.; Tahiry, K.; Hraiba, A.; Touil, A.; Mousrij, A. Vibration signal analysis for intelligent rotating machinery diagnosis and prognosis: A comprehensive systematic literature review. Vibration 2024, 7, 1013–1062. [Google Scholar] [CrossRef]
  8. Niu, M.; Ma, S.; Zhu, H.; Xu, K. Fault diagnosis of rotating machinery using a signal processing technique and lightweight model based on mechanical structural characteristics. Measurement 2025, 245, 116505. [Google Scholar] [CrossRef]
  9. Li, W.; Li, T. Comparison of deep learning models for predictive maintenance in industrial manufacturing systems using sensor data. Sci. Rep. 2025, 15, 23545. [Google Scholar] [CrossRef]
  10. Neupane, D.; Bouadjenek, M.R.; Dazeley, R.; Aryal, S. Data-driven machinery fault diagnosis: A comprehensive review. Neurocomputing 2025, 627, 129588. [Google Scholar] [CrossRef]
  11. Khalil, A.F.; Rostam, S. Machine learning-based predictive maintenance for fault detection in rotating machinery: A case study. Eng. Technol. Appl. Sci. Res. 2024, 14, 13181–13189. [Google Scholar] [CrossRef]
  12. Liu, S.; Huang, J.; Han, P.; Fan, Z.; Ma, J. Cross-Domain Fault Diagnosis of Rotating Machinery Under Time-Varying Rotational Speed and Asymmetric Domain Label Condition. Sensors 2025, 25, 2818. [Google Scholar] [CrossRef]
  13. Tang, M.; Liao, Y.; Luo, F.; Li, X. A novel method for fault diagnosis of rotating machinery. Entropy 2022, 24, 681. [Google Scholar] [CrossRef] [PubMed]
  14. Yu, X.; Wang, S.; Xu, H.; Yu, K.; Feng, K.; Zhang, Y.; Liu, X. Intelligent fault diagnosis of rotating machinery under variable working conditions based on deep transfer learning with fusion of local and global time–frequency features. Struct. Health Monit. 2024, 23, 2238–2254. [Google Scholar] [CrossRef]
  15. Guo, Z.; Du, W.; Li, C.; Yu, Y.; Hu, T.; Wang, S.; Liu, Z. Multi-scale wavelet decomposition and feature fusion for rotating machinery fault diagnosis under multi-level class imbalance. Mech. Syst. Signal Process. 2025, 240, 113427. [Google Scholar] [CrossRef]
  16. Shi, M.; Ding, C.; Wang, R.; Shen, C.; Huang, W.; Zhu, Z. Graph embedding deep broad learning system for data imbalance fault diagnosis of rotating machinery. Reliab. Eng. Syst. Saf. 2023, 240, 109601. [Google Scholar] [CrossRef]
  17. Qiu, Z.; Fan, S.; Liang, H.; Liu, J. Multimodal fusion fault diagnosis method under noise interference. Appl. Acoust. 2025, 228, 110301. [Google Scholar] [CrossRef]
  18. Zhang, Y.; Lin, L.; Wang, J.; Zhang, W.; Gao, S.; Zhang, Z. Attention activation network for bearing fault diagnosis under various noise environments. Sci. Rep. 2025, 15, 977. [Google Scholar] [CrossRef]
  19. Lin, T.; Ren, Z.; Huang, K.; Zhu, Y.; Karimi, H.R. A novel multi-sensor information fusion method for fault diagnosis of rotating machinery with missing signals. Adv. Eng. Inform. 2025, 68, 103595. [Google Scholar] [CrossRef]
  20. Xiao, X.; Li, C.; He, H.; Huang, J.; Yu, T. Rotating machinery fault diagnosis method based on multi-level fusion framework of multi-sensor information. Inf. Fusion 2025, 113, 102621. [Google Scholar] [CrossRef]
  21. Misbah, I.; Lee, C.K.; Keung, K.L. Fault diagnosis in rotating machines based on transfer learning: Literature review. Knowl.-Based Syst. 2024, 283, 111158. [Google Scholar] [CrossRef]
  22. Tang, S.; Ma, J.; Yan, Z.; Zhu, Y.; Khoo, B.C. Deep transfer learning strategy in intelligent fault diagnosis of rotating machinery. Eng. Appl. Artif. Intell. 2024, 134, 108678. [Google Scholar] [CrossRef]
  23. He, J.; Ma, Z.; Liu, Y.; Yang, Z. Remaining useful life prediction of rotating machine via long short-term memory network with uncertainty quantification. Eng. Appl. Artif. Intell. 2026, 164, 113280. [Google Scholar] [CrossRef]
  24. Liu, Z.; Lei, Z.; Wen, G.; Wang, E.; Su, Y.; Zhang, Z.; Chen, X. Self-data-driven remaining useful life prediction of rotating machinery under time-varying operating conditions based on a two-stage hybrid state-space model. Reliab. Eng. Syst. Saf. 2025, 268, 111981. [Google Scholar] [CrossRef]
  25. Wang, C.; Jiang, W.; Shi, L.; Zhang, L. Rolling bearing remaining useful life prediction using deep learning based on high-quality representation. Sci. Rep. 2025, 15, 8228. [Google Scholar] [CrossRef]
  26. Sánchez, R.V.; Macancela, J.C.; Ortega, L.R.; Cabrera, D.; García Márquez, F.P.; Cerrada, M. Evaluation of hand-crafted feature extraction for fault diagnosis in rotating machinery: A survey. Sensors 2024, 24, 5400. [Google Scholar] [CrossRef]
  27. Wang, H.; Wang, H.; Tang, X. A review of deep learning in rotating machinery fault diagnosis and its prospects for port applications. Appl. Sci. 2025, 15, 11303. [Google Scholar] [CrossRef]
  28. Xiang, W.; Liu, S.; Li, H.; Cao, X.; Han, B. Cross-domain fault diagnosis of rotating machinery based on class-imbalanced joint condition adaptation network. Knowl.-Based Syst. 2026, 341, 115823. [Google Scholar] [CrossRef]
  29. Bai, R.; Noman, K.; Yang, Y.; Li, Y.; Guo, W. Towards trustworthy remaining useful life prediction through multi-source information fusion and a novel LSTM-DAU model. Reliab. Eng. Syst. Saf. 2024, 245, 110047. [Google Scholar] [CrossRef]
  30. Fu, K.; Mudiyanselage, S.S.D.; Dai, C.; Kim, M. Prognostics of Multisensor Systems with Unknown and Unlabeled Failure Modes via Bayesian Nonparametric Process Mixtures. arXiv 2026, arXiv:2602.19263. [Google Scholar] [CrossRef]
  31. Xu, D.; Xiao, X.; Liu, J.; Sui, S. Spatio-temporal degradation modeling and remaining useful life prediction under multiple operating conditions based on attention mechanism and deep learning. Reliab. Eng. Syst. Saf. 2023, 229, 108886. [Google Scholar] [CrossRef]
  32. Yan, J.; Ye, Z.S.; He, S.; He, Z. A feature disentanglement and unsupervised condition adaptation of remaining useful life prediction for sensor-equipped machines. Reliab. Eng. Syst. Saf. 2024, 242, 109736. [Google Scholar] [CrossRef]
  33. Chao, M.A.; Kulkarni, C.; Goebel, K.; Fink, O. Fusing physics-based and deep learning models for prognostics. Reliab. Eng. Syst. Saf. 2022, 217, 107961. [Google Scholar] [CrossRef]
  34. Wang, C.; Chen, X.; Qiang, X.; Fan, H.; Li, S. Recent advances in mechanism/data-driven fault diagnosis of complex engineering systems with uncertainties. AIMS Math. 2024, 9, 29736–29772. [Google Scholar] [CrossRef]
  35. Zhang, J.; Zhou, Q.; Zhou, W. Domain-Invariant Fault Representation Learning for Rotating Machinery via Causal Excitation and Conditional Alignment. Electronics 2026, 15, 1252. [Google Scholar] [CrossRef]
Figure 1. Architecture of the condition-aware DANN-LSTM framework.
Figure 2. Multi-scale characteristic response diagram of vibration signal.
Figure 3. Schematic diagram of feature spatial distribution alignment.
Figure 4. Degradation state evolution and RUL prediction curve.
Figure 5. Fault diagnosis performance under the corrected four-class bearing-fault setting. (a) Normalized confusion matrix; (b) bar chart of classification performance of each category; (c) training curve.
Figure 6. Analysis chart of remaining life prediction results. (a) RUL-predicted trajectory; (b) histogram of prediction error; (c) scatter plot of actual and predicted values.
Figure 7. Comparison of generalization performance across working conditions. (a) Comparison of fault diagnosis accuracy; (b) comparison of remaining life prediction errors; (c) accuracy improvement rate.
Figure 8. Feature space alignment effect analysis.
Figure 9. Overall stability analysis diagram of the model. (a) Stability of repeated experiments; (b) noise disturbance robustness; (c) RUL-predicted residual distribution.
Table 1. Statistics on data acquisition from multi-source heterogeneous sensors.
Dataset | Operating Condition | Fault Type | Sample Size | RUL Range (min) | Task Type
CWRU | 1797 rpm/0 hp | Normal | 120 | - | Fault classification only
CWRU | 1772 rpm/1 hp | Inner race | 120 | - | Fault classification only
CWRU | 1750 rpm/2 hp | Rolling element | 120 | - | Fault classification only
CWRU | 1750 rpm/2 hp | Outer race | 120 | - | Fault classification only
CWRU | 1730 rpm/3 hp | Inner race | 120 | - | Fault classification only
PHM2012 | 1800 rpm/4000 N | Full life degradation | 80 | 0–280 | RUL prediction only
XJTU-SY | 2100 rpm/12 kN | Outer race | 60 | 0–150 | RUL prediction only
IMS | 2000 rpm/6000 lbs | Inner + Rolling element | 40 | 0–164 | RUL prediction only
Table 2. Data preprocessing parameter configuration.
Step | Parameter | Value | Description
Segmentation | Window length | 1024 pts | Points per sample
Segmentation | Overlap rate | 50% | Overlap ratio of windows
Normalization | Method | Z-score | Zero mean, unit variance
Split | Training ratio | 70% | Training proportion
Split | Validation ratio | 15% | Validation proportion
Split | Testing ratio | 15% | Testing proportion; chronological split for PHM2012 and XJTU-SY
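The preprocessing settings in Table 2 can be sketched as a short NumPy pipeline. This is an illustrative sketch only: the function names are hypothetical, and applying the z-score per window (rather than per recording or per dataset) is an assumption, since the table does not say at which level the normalization is applied.

```python
import numpy as np

def segment(signal: np.ndarray, window: int = 1024, overlap: float = 0.5) -> np.ndarray:
    """Slice a 1-D signal into fixed-length windows with the given overlap ratio."""
    hop = int(window * (1.0 - overlap))          # 512 points for a 50% overlap
    count = 1 + (len(signal) - window) // hop
    return np.stack([signal[i * hop: i * hop + window] for i in range(count)])

def zscore(windows: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize each window to zero mean and unit variance (per-window is an assumption)."""
    mean = windows.mean(axis=1, keepdims=True)
    std = windows.std(axis=1, keepdims=True)
    return (windows - mean) / (std + eps)

def chronological_split(windows: np.ndarray, train: float = 0.70, val: float = 0.15):
    """Split in time order (no shuffling), so test windows come after training windows."""
    n = len(windows)
    i = round(n * train)
    j = round(n * (train + val))
    return windows[:i], windows[i:j], windows[j:]

# Hypothetical raw recording long enough for exactly 100 windows.
rng = np.random.default_rng(0)
raw = rng.normal(0.0, 2.0, size=1024 + 512 * 99)

windows = segment(raw)
normalized = zscore(windows)
train_set, val_set, test_set = chronological_split(normalized)
print(windows.shape, len(train_set), len(val_set), len(test_set))
```

Keeping the split chronological for run-to-failure data avoids leaking late-life windows into training, which matters because overlapping windows share half of their samples with their neighbors.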
Table 3. Model training parameter configuration.
Module | Parameter | Value | Description
Feature extractor | Conv1D blocks | 3 | Multi-scale feature extraction
Feature extractor | Kernel sizes/channels | 64/32/16; 32/64/128 | Local impulses to low-frequency trend features
Domain branch | GRL schedule | lambda_p = 2/(1 + exp(−10p)) − 1 | p is normalized training progress
Joint loss | lambda_adv/lambda_rul | 0.5/1.0 | Weights for condition alignment and RUL regression
Temporal branch | LSTM layers/hidden units | 2/128 | Chronological degradation sequence modelling
Training | Optimizer/LR/batch/epochs | Adam/0.001/64/50 | Early stopping based on validation loss
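The gradient reversal schedule listed in Table 3 can be evaluated directly. The sketch below assumes that the normalized training progress p is the fraction of completed epochs out of the 50 listed in the table; that mapping and the function name are assumptions.

```python
import math

def grl_lambda(p: float) -> float:
    """GRL coefficient from Table 3: 2 / (1 + exp(-10 p)) - 1, with p in [0, 1]."""
    return 2.0 / (1.0 + math.exp(-10.0 * p)) - 1.0

TOTAL_EPOCHS = 50  # epoch budget listed in Table 3

# Assumption: p is the fraction of epochs completed so far.
schedule = [grl_lambda(epoch / TOTAL_EPOCHS) for epoch in range(TOTAL_EPOCHS + 1)]

for epoch in (0, 5, 10, 25, 50):
    print(epoch, round(schedule[epoch], 4))
```

The coefficient starts at 0, so the condition discriminator has no influence on the feature extractor at the beginning of training, and it saturates close to 1 well before training ends.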
Table 4. Detailed model architecture specifications with input and output dimensions.
Layer Type | Input Size | Kernel Size | Stride | Padding | Output Size | Activation | Dropout Rate
Conv1D (Layer 1) | 1024 × 1 | 64 | 2 | same | 512 × 32 | ReLU | 0
BatchNorm1D | 512 × 32 | - | - | - | 512 × 32 | - | 0
MaxPool1D | 512 × 32 | 2 | 2 | 0 | 256 × 32 | - | 0
Conv1D (Layer 2) | 256 × 32 | 32 | 2 | same | 128 × 64 | ReLU | 0
BatchNorm1D | 128 × 64 | - | - | - | 128 × 64 | - | 0
MaxPool1D | 128 × 64 | 2 | 2 | 0 | 64 × 64 | - | 0
Conv1D (Layer 3) | 64 × 64 | 16 | 2 | same | 32 × 128 | ReLU | 0
BatchNorm1D | 32 × 128 | - | - | - | 32 × 128 | - | 0
MaxPool1D | 32 × 128 | 2 | 2 | 0 | 16 × 128 | - | 0
Flatten | 16 × 128 | - | - | - | 2048 | - | 0
Fully Connected (Cond Discriminator) | 2048 | - | - | - | 128 | ReLU | 0.5
Output (Cond Discriminator) | 128 | - | - | - | 2 | Softmax | 0
Fully Connected (Fault Classifier) | 2048 | - | - | - | 128 | ReLU | 0.5
Output (Fault Classifier) | 128 | - | - | - | 4 | Softmax | 0
LSTM (Layer 1) | 32 × 2048 | 128 | - | - | 32 × 128 | Tanh | 0.2
LSTM (Layer 2) | 32 × 128 | 128 | - | - | 32 × 128 | Tanh | 0.2
Fully Connected (RUL Regression) | 128 | - | - | - | 1 | Linear | 0
Note: “-” indicates that the parameter is not applicable to the corresponding layer.
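The feature-map sizes in Table 4 can be checked with a few lines of arithmetic. The sketch assumes the common convention that a "same"-padded convolution with stride s produces ceil(L / s) outputs and that unpadded max pooling produces floor((L − k) / s) + 1 outputs; the helper names are hypothetical.

```python
def conv_same_length(length: int, stride: int) -> int:
    """Output length of a 'same'-padded 1-D convolution: ceil(length / stride)."""
    return -(-length // stride)

def pool_length(length: int, kernel: int, stride: int) -> int:
    """Output length of an unpadded 1-D max pooling layer."""
    return (length - kernel) // stride + 1

length = 1024                      # input window length from Table 4
trace = [length]
channels = 1
for out_channels in (32, 64, 128):             # channel widths of the three Conv1D layers
    length = conv_same_length(length, stride=2)
    trace.append(length)
    length = pool_length(length, kernel=2, stride=2)
    trace.append(length)
    channels = out_channels

flattened = length * channels
print(trace, channels, flattened)
```

Each convolution and each pooling layer halves the temporal length, so three blocks reduce 1024 points to 16 time steps with 128 channels, giving the 2048-dimensional flattened vector used by both classifier heads.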
Table 5. Fault classification performance statistics over 10 independent runs on the CWRU dataset (mean ± standard deviation).
Fault Class | Precision (%) | Recall (%) | F1 Score (%)
Normal | 98.3 ± 0.2 | 98.2 ± 0.3 | 98.2 ± 0.2
Inner race | 98.0 ± 0.3 | 97.9 ± 0.2 | 97.9 ± 0.3
Outer race | 98.5 ± 0.2 | 98.4 ± 0.3 | 98.4 ± 0.2
Rolling element | 98.4 ± 0.3 | 98.7 ± 0.2 | 98.5 ± 0.3
Table 6. Fault diagnosis accuracy comparison across different condition adaptation methods for cross-condition transfer tasks.
Transfer Task | CNN-LSTM (Baseline) | CDAN | WDGRL | CORAL | Proposed DANN-LSTM
0 hp→1 hp | 84.5% | 88.3% | 89.1% | 86.7% | 96.7%
0 hp→2 hp | 81.2% | 85.6% | 86.2% | 83.9% | 95.4%
1 hp→2 hp | 83.3% | 87.9% | 88.5% | 85.1% | 97.2%
2 hp→3 hp | 79.8% | 84.2% | 84.9% | 82.3% | 94.8%
1 hp→3 hp | 78.6% | 83.1% | 83.8% | 81.2% | 94.1%
Table 7. Cross-condition transfer classification accuracy (mean ± standard deviation over 10 runs) and paired t-test p-values against the CNN-LSTM baseline.
| Transfer Task | Proposed DANN-LSTM | CNN-LSTM Baseline | p-Value |
|---|---|---|---|
| 0 hp→1 hp | 96.7 ± 0.4% | 84.5 ± 1.2% | <0.001 |
| 0 hp→2 hp | 95.4 ± 0.5% | 81.2 ± 1.3% | <0.001 |
| 1 hp→2 hp | 97.2 ± 0.3% | 83.3 ± 1.1% | <0.001 |
| 2 hp→3 hp | 94.8 ± 0.6% | 79.8 ± 1.4% | <0.001 |
| 1 hp→3 hp | 94.1 ± 0.7% | 78.6 ± 1.5% | <0.001 |
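The paired t-test behind the p-values in Table 7 can be sketched as follows. This is a minimal illustration of the standard test statistic, assuming that the runs of the two models are paired (for example, by sharing the same data split or seed, which the table does not specify). The per-run accuracies below are hypothetical placeholders, not values from the paper.

```python
import math

def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    # Unbiased sample variance of the paired differences
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    std_err = math.sqrt(var_d / n)  # zero if all differences are identical
    return mean_d / std_err, n - 1

# Hypothetical per-run accuracies (%), NOT taken from the paper
proposed = [96.5, 96.9, 96.7]
baseline = [84.0, 85.0, 84.5]
t_stat, dof = paired_t_statistic(proposed, baseline)
print(f"t = {t_stat:.2f}, dof = {dof}")
```

The two-sided p-value then follows from Student's t distribution with the returned degrees of freedom; in practice a library routine such as `scipy.stats.ttest_rel` returns both the statistic and the p-value.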
Table 8. Fault diagnosis accuracy comparison across different baseline methods for cross-condition transfer tasks.
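| Transfer Task | CNN-LSTM | Standalone DANN | Domain-Regularized AE | Domain Generalization | Proposed |
|---|---|---|---|---|---|
| 0 hp→1 hp | 84.5% | 91.2% | 89.7% | 90.1% | 96.7% |
| 0 hp→2 hp | 81.2% | 88.5% | 86.3% | 87.4% | 95.4% |
| 1 hp→2 hp | 83.3% | 90.8% | 88.9% | 89.5% | 97.2% |
| 2 hp→3 hp | 79.8% | 87.1% | 84.6% | 85.2% | 94.8% |
| 1 hp→3 hp | 78.6% | 86.4% | 83.2% | 84.0% | 94.1% |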
Table 9. Performance comparison with component-isolated baselines on the 0 hp→1 hp transfer task (classification accuracy) and PHM2012 RUL prediction (RMSE).
| Baseline Model | Fault Classification Accuracy (0 hp→1 hp) | RUL Prediction RMSE (PHM2012) |
|---|---|---|
| CNN only | 82.1% | Not applicable |
| LSTM only | Not applicable | 11.3 |
| CNN-LSTM without DANN | 84.5% | 9.2 |
| DANN-CNN without LSTM | 91.2% | Not applicable |
| Single-task (RUL only) | Not applicable | 9.8 |
| Proposed DANN-LSTM | 96.7% | 7.9 |
Table 10. Direct comparison with closely related DANN-based and LSTM-based methods on the 0 hp→1 hp transfer task and PHM2012 RUL prediction. All comparison methods were re-implemented under identical experimental settings.
| Method | Classification Accuracy (%) | RUL RMSE |
|---|---|---|
| DANN (standard implementation) | 87.3 | Not applicable |
| LSTM (standard implementation) | Not applicable | 11.2 |
| CNN-LSTM with domain adaptation | 90.5 | 9.1 |
| Domain-regularized LSTM | 89.8 | 8.7 |
| Proposed DANN-LSTM | 96.7 | 7.9 |
Table 11. Noise robustness test results with 95% confidence intervals over 10 runs.
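| SNR (dB) | Accuracy (%) | 95% CI for Accuracy | RMSE | 95% CI for RMSE |
|---|---|---|---|---|
| 30 | 98.6 | [98.2, 99.0] | 7.9 | [7.6, 8.2] |
| 20 | 98.4 | [98.0, 98.8] | 8.1 | [7.8, 8.4] |
| 15 | 98.1 | [97.7, 98.5] | 8.3 | [8.0, 8.6] |
| 10 | 97.8 | [97.3, 98.3] | 8.6 | [8.2, 9.0] |
| 5 | 97.5 | [96.9, 98.1] | 9.0 | [8.5, 9.5] |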
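A noise robustness test of the kind summarized in Table 11 requires corrupting a signal at a prescribed signal-to-noise ratio. The sketch below shows one common way to do this, assuming additive white Gaussian noise (the table itself does not state the noise model). The function names and the synthetic sine wave are illustrative stand-ins for real vibration segments.

```python
import math
import random

def add_noise_at_snr(signal, snr_db, seed=0):
    """Return signal plus Gaussian noise scaled so the realized SNR equals snr_db."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = sum(v * v for v in noise) / len(noise)
    # SNR(dB) = 10 * log10(P_signal / P_noise), solved for the noise power
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [x + scale * v for x, v in zip(signal, noise)]

def measured_snr_db(clean, noisy):
    """SNR in dB of the noisy signal relative to the clean one."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)

# Synthetic 1024-sample sine wave standing in for a vibration segment
clean = [math.sin(2 * math.pi * 50 * t / 1024) for t in range(1024)]
noisy = add_noise_at_snr(clean, snr_db=10)
print(round(measured_snr_db(clean, noisy), 3))
```

Scaling the sampled noise to the exact target power, rather than relying on its expected variance, keeps the realized SNR fixed from run to run, so differences between runs reflect the model rather than the noise level.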
Table 12. Ablation study results (default: λ_adv = 1.0, λ_rul = 1.0).
| λ_adv | λ_rul | Classification Accuracy (%) | RUL Prediction RMSE |
|---|---|---|---|
| 0.1 | 1.0 | 96.2 | 8.1 |
| 1.0 | 1.0 | 98.7 | 7.9 |
| 10.0 | 1.0 | 95.8 | 8.3 |
| 1.0 | 0.1 | 98.5 | 9.4 |
| 1.0 | 10.0 | 97.9 | 8.8 |
Table 13. Hyperparameter sensitivity analysis results.
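| Learning Rate | Accuracy (%) | RMSE | Batch Size | Accuracy (%) | RMSE | Number of LSTM Hidden Units | Accuracy (%) | RMSE |
|---|---|---|---|---|---|---|---|---|
| 0.0001 | 96.8 | 8.9 | 16 | 96.5 | 8.6 | 32 | 96.2 | 8.7 |
| 0.0005 | 98.1 | 8.2 | 32 | 97.8 | 8.3 | 64 | 97.9 | 8.2 |
| 0.001 | 98.7 | 7.9 | 64 | 98.7 | 7.9 | 128 | 98.7 | 7.9 |
| 0.002 | 98.2 | 8.1 | 128 | 98.0 | 8.2 | 256 | 98.1 | 8.1 |
| 0.01 | 96.4 | 9.2 | - | - | - | - | - | - |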
Note: “-” indicates that the corresponding parameter was not evaluated or is not applicable in that experiment.
Table 14. Performance comparison of component-isolated variants and joint training strategy.
| Model Variant | Classification Accuracy (%) | RUL RMSE |
|---|---|---|
| Without domain adversarial module | 84.5 | 9.2 |
| Without LSTM module | 96.3 | 12.4 |
| Separate task training (classification + alignment first, then RUL) | 92.3 | 9.6 |
| Without classification loss | 85.1 | 8.2 |
| Full joint training (proposed) | 96.7 | 7.9 |
Table 15. Controlled ablation using the same feature extractor to examine whether condition-dependent degradation information is preserved for RUL prediction.
| Variant | Condition-Alignment Setting | Fault Accuracy (%) | RUL RMSE | Latent Degradation-Order Correlation |
|---|---|---|---|---|
| CNN-LSTM, no condition alignment | λ_adv = 0 | 84.5 | 9.2 | 0.84 |
| DANN-LSTM, moderate alignment | λ_adv = 1.0 | 96.7 | 7.9 | 0.87 |
| DANN-LSTM, excessive alignment | λ_adv = 10.0 | 95.8 | 8.8 | 0.61 |
Ji, Y.; Xia, R.; Peng, M. Condition-Aware DANN-LSTM for Rolling-Bearing Fault Diagnosis and Remaining Useful Life Prediction Under Operating Condition Shifts. Machines 2026, 14, 682. https://doi.org/10.3390/machines14060682