2.1. Overall Model Framework Design
The proposed DANN-LSTM model combines cross-condition feature alignment for fault classification with temporal degradation modeling for RUL prediction. It receives a one-dimensional vibration signal and produces two outputs: a fault category and a remaining life estimate. The network contains a multi-scale CNN feature extractor, a condition-adversarial branch for classification-oriented alignment, and an LSTM regression branch for chronological degradation modeling. The condition discriminator is connected to the feature extractor through a gradient reversal layer, so classification-relevant features become less sensitive to speed/load shifts. This alignment is not imposed as a blanket constraint on the RUL branch. The LSTM receives the shared chronological features and is trained by lifetime regression loss without gradient reversal, allowing load- and time-dependent degradation cues to remain available for RUL estimation.
During training, the model jointly optimizes three types of loss functions: fault classification loss, condition alignment loss, and lifetime prediction loss. The overall objective function is shown in Equation (1).
In Formula (1), the classification loss measures the discrepancy between predicted and true fault labels, the condition-discrimination loss is optimized adversarially through the gradient reversal branch, and the RUL loss constrains the deviation between predicted and true remaining life. During backpropagation, the condition discriminator minimizes its own classification loss, whereas the feature extractor receives the reversed gradient from this branch. The LSTM regression branch propagates the RUL gradient normally. Condition invariance is therefore encouraged for classification-oriented alignment, not as a complete removal of condition-dependent degradation information.
Figure 1 illustrates the DANN-LSTM workflow. The CNN extractor converts each vibration window into a high-dimensional feature vector. The upper branch uses a gradient reversal layer and a two-output condition discriminator for source/target operating condition labels. The middle branch is the four-class fault classifier, corresponding to normal, inner race, outer race, and rolling element faults. The lower branch sends chronological feature sequences to a two-layer LSTM and a regression layer for RUL prediction. The figure is consistent with the four-class experimental definition and keeps the condition-adversarial path separate from the LSTM regression path.
2.2. Multi-Scale Feature Extraction Based on Convolutional Neural Networks
The original vibration signal of rotating machinery is a time-domain waveform that carries rich information about equipment status. A convolutional neural network is constructed to perform local perception and multi-scale feature extraction on the input signal, mapping the one-dimensional time series to a high-dimensional feature space. The input vibration signal is represented as $x \in \mathbb{R}^{N}$, where $N$ is the number of signal sampling points. The first layer of the network uses one-dimensional convolutional kernels of width $k_1$, which slide along the time dimension to generate a primary feature map. This convolution operation is defined as:
$$h_j^{(1)} = f\left(w_j^{(1)} * x + b_j^{(1)}\right), \quad j = 1, \dots, C_1 \qquad (2)$$
In Formula (2), $*$ represents the convolution operation, $w_j^{(1)}$ is the weight vector of the $j$-th convolution kernel, $b_j^{(1)}$ is the corresponding bias term, $f(\cdot)$ is the non-linear activation function, and $h_j^{(1)}$ is the $j$-th feature channel of the output. The primary feature map is extracted in parallel using multiple convolution kernels and has dimension $C_1 \times T_1$, where $C_1$ is the number of kernels in the first layer and $T_1$ is the time length after convolution. The first-layer kernel has a small width and is used to capture local transient impulses in the signal.
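As an illustration of the first-layer operation in Formula (2), the following minimal sketch applies a single kernel to a short signal in pure Python. The ReLU activation, the kernel values, and the function name `conv1d_valid` are assumptions made for illustration; the text does not specify them. Like most deep learning libraries, the sketch computes a sliding dot product without padding, so an input of length N and a kernel of width k give N - k + 1 outputs.

```python
def conv1d_valid(x, w, b):
    """One output channel: activation(w * x + b), stride 1, no padding."""
    k = len(w)
    out = []
    for t in range(len(x) - k + 1):
        s = sum(w[i] * x[t + i] for i in range(k)) + b
        out.append(max(0.0, s))  # ReLU is an assumed activation
    return out


# Toy signal with two isolated impulses of opposite sign.
signal = [0.0, 0.0, 1.0, 0.0, 0.0, -1.0, 0.0, 0.0]
# Hypothetical narrow kernel that responds to sharp local peaks.
kernel = [-1.0, 2.0, -1.0]
feature = conv1d_valid(signal, kernel, 0.0)
```

The narrow kernel responds only in the immediate neighborhood of each impulse, which is the behavior the text attributes to the small-width first layer.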
A second convolutional layer is stacked on top of the primary features, using convolutional kernels of width $k_2$ ($k_2 > k_1$) to perform secondary feature extraction. This convolutional operation is expressed as:
$$h_j^{(2)} = f\left(\sum_{i=1}^{C_1} w_{j,i}^{(2)} * h_i^{(1)} + b_j^{(2)}\right), \quad j = 1, \dots, C_2 \qquad (3)$$
In Formula (3), $w_{j,i}^{(2)}$ represents the weight between the $j$-th convolutional kernel and the $i$-th input channel in the second layer, and $b_j^{(2)}$ represents the bias term. A wider convolutional kernel integrates local features over a wider time window, extracting mid-frequency modulation information from the signal. The feature map output from the second layer has a dimension of $C_2 \times T_2$, where $C_2$ is the number of convolutional kernels in the second layer. Two consecutive convolutional layers gradually expand the network's receptive field, enabling multi-scale information extraction from high-frequency impulses to low-frequency envelopes.
The third layer uses convolutional kernels of width $k_3$ ($k_3 > k_2$) to perform higher-level feature abstraction, and its operation form is:
$$h_j^{(3)} = f\left(\sum_{i=1}^{C_2} w_{j,i}^{(3)} * h_i^{(2)} + b_j^{(3)}\right), \quad j = 1, \dots, C_3 \qquad (4)$$
In Formula (4), $w_{j,i}^{(3)}$ is the weight of the third-layer convolutional kernel, and $h_j^{(3)}$ is the output feature. This convolutional kernel covers a longer signal segment, capturing low-frequency feature components related to the overall degradation trend of the device. The feature map output by the three-layer convolutional network has a dimension of $C_3 \times T_3$, where $C_3$ is the number of third-layer convolutional kernels. Each convolutional layer is followed by batch normalization and pooling operations. Batch normalization accelerates network convergence, and pooling reduces the feature dimension and enhances local translation invariance. After multi-layer convolution processing, the original vibration signal $x$ is transformed from a time-domain waveform into a high-dimensional feature representation $z \in \mathbb{R}^{D}$, where $D = C_3 \times T_3$. This feature vector $z$ simultaneously encodes the frequency domain structure and time-domain evolution information of the signal at different time scales.
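To make the effect of stacking convolution and pooling concrete, the sketch below tracks the time length and the receptive field through three stages. The kernel widths (3, 9, 27), the pooling factor of 4, and the use of length-preserving padding are hypothetical values chosen only so that a 1024-point window is compressed to 16 time steps, the figure reported later in this section; the actual layer settings are not given in the text.

```python
def stage_length(t_in, pool):
    """Time length after a length-preserving convolution and a pooling step."""
    return t_in // pool


def receptive_field(kernels, pools):
    """Input samples seen by one output step (conv stride 1, pool stride = pool size)."""
    rf, jump = 1, 1
    for k, p in zip(kernels, pools):
        rf += (k - 1) * jump  # convolution widens the field
        rf += (p - 1) * jump  # pooling window widens it further
        jump *= p             # pooling stride increases the step between outputs
    return rf


KERNELS = (3, 9, 27)  # hypothetical widths with k1 < k2 < k3
POOLS = (4, 4, 4)     # hypothetical pooling factors

lengths = []
t = 1024
for p in POOLS:
    t = stage_length(t, p)
    lengths.append(t)

rf_total = receptive_field(KERNELS, POOLS)
feature_dim = 128 * lengths[-1]  # C3 = 128 channels, as stated in the text
```

Under these assumptions each output step of the third stage summarizes roughly half of the 1024-point window, which is consistent with the claim that deeper layers capture slower signal components.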
Figure 2 shows the correspondence between the original vibration signal and the multi-scale feature response generated after passing through the convolutional network.
For each sliding window of 1024 points, the CNN feature extractor outputs a feature vector of dimension $D = C_3 \times T_3$. Here $C_3 = 128$ is the number of output channels of the third convolutional layer, and $T_3 = 16$ is the compressed time length after the third pooling operation. Thus, each window produces a feature vector of length 2048. To form a chronological sequence for the LSTM, the feature vectors from consecutive sliding windows with 50 percent overlap are arranged in time order. The sliding step between consecutive windows is 512 points. The sequence length $L$ is fixed at 32 windows, which together span $(32 - 1) \times 512 + 1024 = 16{,}896$ points of the original vibration signal once the 50 percent overlap is taken into account. Batches are formed by randomly sampling sequences from the training set. Each batch contains 64 sequences, each consisting of 32 time steps of feature vectors. For the PHM2012 and XJTU-SY datasets, each bearing run provides a single continuous degradation sequence. After the chronological split described in Section 3.2, windows from the same run are kept in strict chronological order within each training, validation, and testing segment. No mixing of windows from different runs occurs within a single sequence.
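The windowing arithmetic can be checked with a short sketch. The window length, step, and sequence length are taken from the text; the function names and the choice to advance each sequence by one window are assumptions made for illustration, since the text does not state how far apart neighboring sequences start.

```python
WINDOW, STEP, SEQ_LEN = 1024, 512, 32


def window_starts(n_samples, window=WINDOW, step=STEP):
    """Start indices of all full windows in one bearing run."""
    return list(range(0, n_samples - window + 1, step))


def sequence_span(seq_len=SEQ_LEN, window=WINDOW, step=STEP):
    """Raw signal points covered by seq_len overlapping windows."""
    return (seq_len - 1) * step + window


def build_sequences(starts, seq_len=SEQ_LEN):
    """Chronological groups of consecutive windows (assumed stride: one window)."""
    return [starts[i:i + seq_len] for i in range(len(starts) - seq_len + 1)]


starts = window_starts(20000)  # hypothetical run length in samples
sequences = build_sequences(starts)
span = sequence_span()
```

Because every sequence is sliced from the window list of a single run, windows from different runs never appear in the same sequence.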
Figure 2 shows the multi-scale feature response map of the vibration signal as four panels arranged vertically from top to bottom. The top panel is the waveform of the original vibration signal, with time (seconds) on the horizontal axis and amplitude (m/s²) on the vertical axis. The waveform exhibits a non-stationary oscillation pattern containing periodic impact pulses and background noise. The second panel is the output feature map of the first convolution layer, with time steps on the horizontal axis and feature channel indices on the vertical axis. It is presented as a grayscale heatmap in which the gray level represents feature activation intensity. Multiple vertical bars are distributed along the time direction, and their positions correspond to the occurrence times of impact events in the original signal. The bar width reflects the local transient response captured by the small-width convolution kernel. The third panel is the output feature map of the second convolution layer, with the same axis labels as the second panel. Compared with the second panel, the heatmap contains fewer and wider vertical bars. The bar edges show a smooth transition, reflecting the integration effect of the wider convolution kernel on the mid-frequency modulation information. The fourth panel is the output feature map of the third convolutional layer. In this heatmap, the vertical bars are further merged into sparse, wide bright areas. The dark areas between the bright areas correspond to low-energy periods in the signal, reflecting the low-frequency degradation trend components extracted by the widest convolutional kernel.
The four panels are strictly aligned on the time axis, and the vertical connecting lines between them indicate the response relationship between the original signal and the feature maps of each layer within the same time window.
2.3. Feature Alignment Across Operating Conditions Based on Adversarial Learning
To reduce the impact of data distribution differences under different working conditions on the generalization performance of the model, a condition discriminator module is attached to the output of the feature extractor, alongside the fault classifier, to construct an adversarial feature alignment mechanism. The condition discriminator receives the high-dimensional feature vector output by the feature extractor as input and determines the operating condition label of the feature through a binary classification network. This network consists of a gradient reversal layer (GRL) followed by three fully connected layers. The gradient reversal layer acts as an identity mapping during forward propagation, passing the feature to the condition discriminator; during backward propagation, it inverts the gradient sign, making the optimization direction of the feature extractor opposite to that of the condition discriminator and pushing the feature extractor toward condition-invariant features. The adversarial loss function of the condition discriminator is defined as cross-entropy, as shown in Formula (5).
$$\mathcal{L}_{d} = -\frac{1}{N_s + N_t} \sum_{i=1}^{N_s + N_t} \left[ d_i \log \hat{d}_i + (1 - d_i) \log\left(1 - \hat{d}_i\right) \right] \qquad (5)$$
In Formula (5), $\mathcal{L}_{d}$ represents the adversarial loss value of the condition discriminator, $N_s$ is the number of samples in the source condition, $N_t$ is the number of samples in the target condition, and $i$ is the sample index. $d_i$ is the true condition label of the $i$-th sample, which is 1 when the sample comes from the source condition and 0 when it comes from the target condition. $\hat{d}_i$ is the predicted probability output by the condition discriminator, representing $P(d_i = 1 \mid z_i)$, which is the probability that the $i$-th sample is classified as a source condition sample.
The overall adversarial objective is formulated as a min-max game. The condition discriminator minimizes the domain classification loss $\mathcal{L}_{d}$ to correctly distinguish source and target operating conditions. The feature extractor aims to maximize $\mathcal{L}_{d}$ through the gradient reversal layer, making the feature distributions from different operating conditions indistinguishable. This optimization is written as $\min_{G_f, G_c} \max_{G_d} \left[ \mathcal{L}_{cls}(G_f, G_c) - \lambda \mathcal{L}_{d}(G_f, G_d) \right]$, where $G_f$ is the feature extractor, $G_c$ is the fault classifier, and $G_d$ is the condition discriminator. The gradient reversal layer multiplies the gradient from $\mathcal{L}_{d}$ by a negative coefficient $-\lambda$ during backpropagation to $G_f$, while the gradient to $G_d$ remains unchanged; $\lambda$ is the hyperparameter for balancing the adversarial strength. The condition discriminator is updated with standard gradient descent to minimize $\mathcal{L}_{d}$. The feature extractor is updated with gradient descent on the sum of $\mathcal{L}_{cls}$, $\mathcal{L}_{d}$ (with sign inversion), and $\mathcal{L}_{RUL}$.
This mechanism prompts the feature extractor to map the source and target condition data to a common feature space, so that the feature distributions of the two types of data tend to be consistent. After adversarial training, the projection regions of the source and target conditions overlap more strongly in the feature space, and the condition discriminator can no longer reliably distinguish the source of the features, which completes the cross-condition feature alignment. The feature space distribution before and after alignment is shown in Figure 3.
Figure 3 shows two scatter plots side by side. The left subplot is labeled “Before Alignment,” and the right subplot is labeled “After Alignment.” Both subplots share the same two-dimensional coordinate system, with the horizontal axis labeled “First Feature Dimension” and the vertical axis labeled “Second Feature Dimension.” In the left subplot, source condition sample points are represented by solid red circles, and target condition sample points are represented by solid blue triangles. The two sets of points are clearly separated in the two-dimensional plane, with the source condition points clustered in the upper left region of the coordinate system and the target condition points clustered in the lower right region, with a clear blank space between the two regions. In the right subplot, source and target condition sample points keep the same markers as in the left subplot. The two sets of points are interspersed in the two-dimensional plane, with the solid red circles and solid blue triangles evenly mixed in the central region of the coordinate system, forming a single, highly overlapping cluster. Both subplots are accompanied by legends at the bottom explaining the correspondence between the symbols and the domain categories.
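The behavior of the gradient reversal layer can be illustrated with a small numerical sketch in pure Python. It assumes a toy discriminator with one scalar feature per sample and a single weight `v`; this setup, the value of `lam`, and all function names are hypothetical and are not the network used in the paper. The sketch computes the binary cross-entropy domain loss, the gradient that the discriminator branch sends back toward the features, and the sign-reversed gradient that the feature extractor would receive.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def domain_bce(probs, labels, eps=1e-12):
    """Mean binary cross-entropy; label 1 = source condition, 0 = target."""
    total = 0.0
    for p, d in zip(probs, labels):
        total += d * math.log(p + eps) + (1 - d) * math.log(1.0 - p + eps)
    return -total / len(probs)


def grl_backward(grad_from_discriminator, lam):
    """Backward pass of a gradient reversal layer: scale by -lam."""
    return [-lam * g for g in grad_from_discriminator]


# Toy discriminator: logit = v * z for a scalar feature z (assumption).
v = 2.0
features = [1.0, -1.0]
labels = [1, 0]
probs = [sigmoid(v * z) for z in features]
loss = domain_bce(probs, labels)

# For mean BCE with a sigmoid output, d(loss)/d(z_i) = (p_i - d_i) * v / n.
grad_z = [(p - d) * v / len(features) for p, d in zip(probs, labels)]
grad_to_extractor = grl_backward(grad_z, lam=0.5)
```

The discriminator itself still descends the unreversed gradient, so the two components pull in opposite directions on the same loss.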
2.4. LSTM-Based Degradation Process Modeling and Lifetime Prediction
The feature sequences aligned by the condition-adversarial mechanism still retain the evolutionary information of the equipment's operating state over time. To characterize the performance degradation process of rotating machinery, this paper introduces a Long Short-Term Memory (LSTM) network to perform temporal modeling of the feature sequences. The aligned feature vectors form the sequence input according to the sampling time order, denoted as $Z = \{z_1, z_2, \dots, z_L\}$, where $L$ represents the number of time steps, $z_t \in \mathbb{R}^{D}$ represents the high-dimensional feature representation at time $t$, and $D$ represents the feature dimension. This sequence serves as the input to an LSTM network, which selectively memorizes and updates historical degradation information through a gating mechanism, thereby establishing the dynamic dependency relationship of the equipment state over time.
The LSTM unit controls the flow of information through the forget gate, input gate, and output gate. Its state update process is described by the following relationships. The forget gate adjusts how much of the memory from the previous time step is retained at the current time step. Its calculation form is shown in Formula (6):
$$f_t = \sigma\left(W_f z_t + U_f h_{t-1} + b_f\right) \qquad (6)$$
In Formula (6), $f_t$ represents the forget gate vector at time $t$, $\sigma$ represents the Sigmoid activation function, $W_f$ and $U_f$ represent the weight matrices of the input feature and the hidden state at the previous time step, respectively, $h_{t-1}$ represents the hidden state at the previous time step, and $b_f$ represents the bias term.
The input gate determines the extent to which new information is written into the memory cell at the current moment, and its expression is shown in Formula (7):
$$i_t = \sigma\left(W_i z_t + U_i h_{t-1} + b_i\right) \qquad (7)$$
In Formula (7), $i_t$ represents the input gate vector, $W_i$ and $U_i$ represent the corresponding weight matrices, $b_i$ represents the bias term, and the meanings of the other symbols are the same as above.
The candidate memory unit generates potential state information at the current time, and its calculation method is shown in Formula (8):
$$\tilde{c}_t = \tanh\left(W_c z_t + U_c h_{t-1} + b_c\right) \qquad (8)$$
In Formula (8), $\tilde{c}_t$ represents the candidate memory state, $\tanh$ represents the hyperbolic tangent activation function, and $W_c$, $U_c$, and $b_c$ represent the corresponding weight matrices and bias term, respectively.
Combining the regulatory effects of the forget gate and the input gate, the state update relationship of the memory unit is shown in Formula (9):
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \qquad (9)$$
In Formula (9), $c_t$ represents the current state of the memory unit, $c_{t-1}$ represents the previous state of the memory unit, and $\odot$ represents the element-wise multiplication operation.
The output gate controls the output of the memory unit to the hidden state, and its calculation form is shown in Formula (10):
$$o_t = \sigma\left(W_o z_t + U_o h_{t-1} + b_o\right) \qquad (10)$$
In Formula (10), $o_t$ represents the output gate vector, and $W_o$, $U_o$, and $b_o$ represent the corresponding weight matrices and bias term, respectively.
The final hidden state is determined by the output gate and the memory cell state, and the update relationship is shown in Formula (11):
$$h_t = o_t \odot \tanh\left(c_t\right) \qquad (11)$$
In Formula (11), $h_t$ represents the hidden state at time $t$, and this vector conveys the device degradation and evolution information in the time dimension.
As the time step progresses, the hidden state sequence $\{h_1, h_2, \dots, h_L\}$ forms a dynamic representation of the degradation process. To achieve remaining lifetime prediction, the hidden state at the last time step is mapped to the regression space to obtain the estimated remaining lifetime of the equipment. The mapping relationship is shown in Formula (12):
$$\hat{r} = W_r h_L + b_r \qquad (12)$$
In Formula (12), $\hat{r}$ represents the predicted remaining lifetime value, $W_r$ represents the regression layer weight matrix, $b_r$ represents the bias term, and $h_L$ represents the hidden state at the final time step.
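The gate updates and the final regression mapping can be traced with a scalar sketch. It follows the standard LSTM cell equations, which Formulas (6) to (12) are assumed to match, and uses one-dimensional inputs and states so that it runs without any numerical library. The parameter names in the dictionary and the example values are hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def lstm_step(z_t, h_prev, c_prev, p):
    """One scalar LSTM step: forget, input, candidate, cell, output, hidden."""
    f_t = sigmoid(p["W_f"] * z_t + p["U_f"] * h_prev + p["b_f"])
    i_t = sigmoid(p["W_i"] * z_t + p["U_i"] * h_prev + p["b_i"])
    c_tilde = math.tanh(p["W_c"] * z_t + p["U_c"] * h_prev + p["b_c"])
    c_t = f_t * c_prev + i_t * c_tilde
    o_t = sigmoid(p["W_o"] * z_t + p["U_o"] * h_prev + p["b_o"])
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t


def predict_rul(sequence, p, w_r, b_r):
    """Run the cell over the sequence and regress RUL from the last hidden state."""
    h, c = 0.0, 0.0
    for z_t in sequence:
        h, c = lstm_step(z_t, h, c, p)
    return w_r * h + b_r


KEYS = [g + s for g in ("W_", "U_", "b_") for s in ("f", "i", "c", "o")]
zero_params = {k: 0.0 for k in KEYS}
demo_params = {k: 0.5 for k in KEYS}  # arbitrary illustrative values
rul_demo = predict_rul([0.2, 0.4, 0.9], demo_params, w_r=10.0, b_r=1.0)
```

Because the hidden state is the product of a sigmoid gate and a hyperbolic tangent, its magnitude stays below 1, so the scale of the predicted lifetime is set by the regression layer.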
During time series propagation, the LSTM network continuously accumulates historical degradation information through a gating structure, causing the hidden state to exhibit a relatively smooth change trend over the device's operating time. This results in a time-continuous degradation trajectory in the feature space as learned from the training data. After regression mapping, this trajectory yields a lifetime estimation curve that tends to decrease over time, reflecting the degradation trend present in the linear RUL labeling convention of the PHM2012 dataset. The model does not enforce strict monotonicity or physical consistency constraints; the observed smoothness is influenced by the linear RUL labels used for training. A stable mapping relationship is formed between the degradation state and the predicted lifetime, as shown in Figure 4.
Figure 4 illustrates the degradation state evolution process obtained from LSTM time series modeling. The horizontal axis represents the device's operating time $t$, and the vertical axis represents the health state and lifetime estimate obtained from the hidden state mapping. From the upper left to the lower right, the plot corresponds to the gradual evolution of the device from the initial operating stage to the failure stage. The smoothly descending solid line represents the degradation trend curve formed by the LSTM network output. The curve changes little in the initial stage, then declines continuously over time, and the rate of decline increases in the later stage, reflecting the non-linear evolution of device performance degradation. The discrete dots distributed along the curve represent the hidden state outputs at each time step. These states are passed on along the time dimension and together constitute a complete degradation trajectory. The vertical dashed line at the right end of the curve represents the predicted failure time. The horizontal distance between the dashed line and the current time is marked by a double-headed arrow as the remaining lifetime interval, reflecting the mapping from time series characteristics to lifetime prediction results. The shaded area below the curve represents the cumulative change of the degradation process over time, so that the degradation trajectory appears as a continuous state evolution path. Overall, the figure presents a monotonically changing degradation curve from time series input to lifetime prediction output, which illustrates how the LSTM network models the device degradation process.
2.5. Joint Loss Function and Model Training Strategy
To achieve synergistic optimization of cross-condition feature learning and degradation trend modeling, a joint objective function consisting of fault classification loss, condition alignment loss, and lifetime prediction loss is constructed within a unified network framework, and parameter updates are completed through end-to-end backpropagation. Let the input vibration signal, after passing through the convolutional feature extractor, yield a feature representation $F \in \mathbb{R}^{B \times D}$, where $B$ represents the batch sample size and $D$ represents the feature dimension. The feature vector is then input into the fault classifier, condition discriminator, and LSTM degradation modeling module, respectively, to achieve multi-task joint learning.
The fault classification branch uses Softmax to output class probabilities, and the classification loss is defined as the cross-entropy function, as shown in Formula (13).
$$\mathcal{L}_{cls} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \log \hat{y}_{i,k} \qquad (13)$$
In Formula (13), $\mathcal{L}_{cls}$ represents the fault classification loss (same as Formula (1)), $K$ represents the number of fault categories, $y_{i,k}$ represents the true label of the $i$-th sample in the $k$-th class, using one-hot encoding, $\hat{y}_{i,k}$ represents the predicted probability output by the classifier, and $N$ represents the number of samples. This loss constrains the feature extractor to learn discriminative fault representations.
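A minimal sketch of the Softmax cross-entropy in Formula (13) for the four fault classes is given below. The class order, the logit values, and the function names are illustrative assumptions. With one-hot labels, the inner sum over classes reduces to the negative log-probability of the true class, which is what the code computes.

```python
import math

CLASSES = ["normal", "inner race", "outer race", "rolling element"]


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(batch_logits, true_indices):
    """Mean of -log p(true class) over the batch."""
    loss = 0.0
    for logits, k in zip(batch_logits, true_indices):
        loss -= math.log(softmax(logits)[k])
    return loss / len(true_indices)


# Hypothetical classifier outputs for two vibration windows.
logits = [[4.0, 0.5, 0.2, 0.1], [0.3, 0.2, 3.5, 0.4]]
labels = [CLASSES.index("normal"), CLASSES.index("outer race")]
loss_cls = cross_entropy(logits, labels)
```

A classifier that assigns equal probability to all four classes gives a loss of log 4, which is a useful reference value when monitoring training.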
The condition alignment branch is connected to the condition discriminator through a gradient reversal layer. Samples from the source operating condition and the target operating condition are simultaneously input into the discriminator during training. The domain discrimination loss is defined as the binary classification cross-entropy, as shown in Formula (14).
$$\mathcal{L}_{d} = -\frac{1}{N} \sum_{i=1}^{N} \left[ d_i \log \hat{d}_i + (1 - d_i) \log\left(1 - \hat{d}_i\right) \right] \qquad (14)$$
In Formula (14), $\mathcal{L}_{d}$ represents the condition alignment loss (same as Formula (1)), $N$ represents the total number of source and target samples, $d_i$ represents the domain label of the $i$-th sample, with the source operating condition labeled as 1 and the target operating condition labeled as 0, and $\hat{d}_i$ represents the domain prediction probability output by the condition discriminator. The gradient reversal layer multiplies the gradient by a negative coefficient during the backpropagation stage, so that the feature extractor and the condition discriminator form an adversarial relationship, thereby promoting the convergence of the feature distributions of the source operating condition and the target operating condition in the latent space.
The lifetime prediction branch uses the final hidden state of the degradation state sequence output by the LSTM and maps it to the remaining lifetime value through a fully connected layer. The regression loss uses the mean squared error function, as shown in Formula (15).
$$\mathcal{L}_{RUL} = \frac{1}{N} \sum_{i=1}^{N} \left( r_i - \hat{r}_i \right)^2 \qquad (15)$$
In Formula (15), $\mathcal{L}_{RUL}$ represents the lifetime prediction loss (same as Formula (1)), $r_i$ represents the true remaining lifetime value of the $i$-th sample, $\hat{r}_i$ represents the predicted lifetime output by the model, and $N$ represents the number of samples. This loss constrains the LSTM module to learn the time dependency of the degradation process.
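The regression loss can be sketched together with linearly decreasing RUL labels of the kind mentioned for PHM2012. Measuring lifetime in window steps, so that the last window of a run has a label of 0, is an assumption made for illustration; the text does not state the units or any normalization, and the function names are hypothetical.

```python
def linear_rul_labels(n_steps):
    """Linearly decreasing RUL, in window steps, ending at 0 at failure."""
    return [float(n_steps - 1 - t) for t in range(n_steps)]


def mse(predicted, true):
    """Mean squared error between predicted and true RUL values."""
    return sum((p - r) ** 2 for p, r in zip(predicted, true)) / len(true)


true_rul = linear_rul_labels(5)            # [4.0, 3.0, 2.0, 1.0, 0.0]
predicted_rul = [4.5, 3.0, 1.5, 1.0, 0.0]  # hypothetical model outputs
loss_rul = mse(predicted_rul, true_rul)
```

Because the error is squared, a few large deviations dominate the loss, so early-life windows with large label values can weigh heavily unless the labels are rescaled.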
The joint optimization objective function consists of three weighted loss components, as explained in Formula (1).
The model is trained end to end by backpropagation. The raw vibration signal is input into the convolutional feature extractor to generate a feature representation, which the classification and lifetime prediction branches use directly for task learning. The gradient reversal layer is placed only in the path from the feature extractor to the condition discriminator, so the condition discriminator receives the feature vector through this layer and participates in adversarial training. The regression gradient from the LSTM branch backpropagates through the feature extractor without sign inversion, preserving condition-dependent degradation patterns. During the backpropagation phase, the classification loss and lifetime prediction loss pass their gradients to the feature extractor unchanged, while the gradient of the condition alignment loss is sign-reversed by the gradient reversal layer. These three gradients are summed at the feature extractor and jointly update its parameters. The parameter update process uses an iterative optimization method based on mini-batch samples. The convolutional layer parameters, condition discriminator parameters, and LSTM parameters are updated synchronously within the same training cycle until the overall loss function converges. At each training iteration, the condition discriminator parameters $\theta_d$ are updated by descending the gradient of $\mathcal{L}_{d}$ with respect to $\theta_d$. The feature extractor parameters $\theta_f$ are updated by descending the gradient of $\mathcal{L}_{cls} + \mathcal{L}_{RUL}$ minus $\lambda$ times the gradient of $\mathcal{L}_{d}$ with respect to $\theta_f$. The classifier parameters are updated by descending the gradient of $\mathcal{L}_{cls}$. The LSTM parameters are updated by descending the gradient of $\mathcal{L}_{RUL}$.
This training mechanism achieves synergistic constraints for cross-condition feature alignment and degradation trend modeling within a unified framework, thereby improving the model’s fault diagnosis and remaining lifetime prediction performance under complex operating conditions.
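The per-component update rules of this section can be summarized in a short sketch that combines already-computed gradients. The gradient values, the learning rate, and the function names are hypothetical; in practice an automatic differentiation framework would produce the gradients, and the plain gradient descent step shown here stands in for whatever optimizer is actually used.

```python
def extractor_gradient(g_cls, g_rul, g_dom, lam):
    """Gradient reaching the feature extractor: classification and RUL
    gradients pass unchanged, the domain gradient is scaled by -lam."""
    return [gc + gr - lam * gd for gc, gr, gd in zip(g_cls, g_rul, g_dom)]


def sgd_step(params, grads, lr):
    """Plain gradient descent update (assumed optimizer)."""
    return [p - lr * g for p, g in zip(params, grads)]


# Hypothetical gradients with respect to two extractor parameters.
g_cls = [0.5, -0.25]
g_rul = [0.25, 0.5]
g_dom = [1.0, -0.5]
lam, lr = 0.5, 0.5

theta_f = sgd_step([1.0, 1.0], extractor_gradient(g_cls, g_rul, g_dom, lam), lr)

# The discriminator descends its own loss without sign reversal.
theta_d = sgd_step([0.0], [0.5], lr)
```

Setting `lam` to 0 removes the adversarial term, in which case the extractor is trained by the classification and RUL gradients alone.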