To address the above challenges, this paper proposes a shapelet-based deep learning framework that integrates local pattern matching and multi-scale spatiotemporal modeling to achieve robust bearing fault diagnosis. Rather than treating noise suppression as an independent preprocessing task, the proposed method embeds adaptive shapelet learning into the front-end representation stage, enabling the model to explicitly capture physically meaningful transient patterns that are insensitive to amplitude variations and impulsive disturbances. These local representations are further refined through gated convolutional feature extraction and bidirectional temporal modeling to enhance stability under complex industrial noise conditions.
3.1. Adaptive Multi-Scale Shapelet Extraction
Shapelets are discriminative subsequences that capture essential local patterns for distinguishing between different fault classes. Unlike traditional black-box feature extraction methods, shapelets provide interpretable representations that can be directly related to physical fault characteristics. The architecture of the adaptive multi-scale shapelet extraction module is shown in
Figure 2.
The shapelet lengths are selected based on the characteristic periodicity of bearing fault impulses. At a sampling rate of 12 kHz and a shaft speed of 1797 RPM, one revolution corresponds to approximately 400 samples. As a representative example, the ball-pass frequency outer race (BPFO) impulse period is about 80 samples. Therefore, the chosen length range spans from a fraction of the BPFO period up to several periods. Short shapelets (16–64 samples) capture localized impulsive transients and their immediate decay, whereas longer shapelets (128–256 samples) provide sufficient context to represent repeated impact patterns over multiple BPFO cycles. This multi-scale design avoids committing to a single fault period and improves robustness when the effective impulse spacing varies with operating conditions and noise contamination.
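The length-selection arithmetic above can be checked in a few lines (a sketch: the sampling rate, shaft speed, and 80-sample BPFO period come from the text, while the specific scale lists are illustrative):

```python
# Samples per revolution at the stated operating point:
# 12 kHz sampling and 1797 RPM shaft speed (values from the text).
fs = 12_000                              # sampling rate [Hz]
rpm = 1797                               # shaft speed [rev/min]
samples_per_rev = fs / (rpm / 60)        # ~400 samples per revolution

bpfo_period = 80                         # ~80 samples per BPFO impulse (from the text)

# Illustrative multi-scale shapelet lengths spanning a fraction of the
# BPFO period up to several periods.
short_scales = [16, 32, 64]              # single impulses and their decay
long_scales = [128, 256]                 # multiple BPFO cycles of context

print(round(samples_per_rev))            # 401
```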
The quality of the initial shapelets affects the convergence and final performance of the model. Random initialization often leads to suboptimal solutions due to the non-convex nature of the optimization landscape. To address this issue, the K-means++ algorithm is employed for shapelet initialization. Using a sliding-window approach, subsequences of multiple lengths are extracted from the input signals and clustered with K-means++, and the resulting cluster centers serve as the initial shapelet patterns. This initialization ensures that the initial shapelet candidates are diverse and reflect actual signal patterns, which allows the subsequent gradient-based optimization to converge more rapidly.
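A minimal NumPy sketch of this initialization (the window stride, subsequence length, cluster count, and synthetic test signal are illustrative; only the K-means++ seeding step is shown, with the subsequent Lloyd iterations omitted):

```python
import numpy as np

def extract_subsequences(signal, length, stride=8):
    """Sliding-window subsequences of a 1-D signal."""
    n = (len(signal) - length) // stride + 1
    return np.stack([signal[i * stride : i * stride + length] for i in range(n)])

def kmeanspp_init(subseqs, k, rng):
    """K-means++ seeding: pick diverse subsequences as initial shapelets."""
    centers = [subseqs[rng.integers(len(subseqs))]]
    for _ in range(k - 1):
        # Squared distance of every candidate to its nearest chosen center;
        # far-away candidates are sampled with higher probability.
        d2 = np.min([np.sum((subseqs - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(subseqs[rng.choice(len(subseqs), p=d2 / d2.sum())])
    return np.stack(centers)

rng = np.random.default_rng(0)
sig = np.sin(np.linspace(0, 40 * np.pi, 4096)) + 0.1 * rng.standard_normal(4096)
subs = extract_subsequences(sig, length=64)
init_shapelets = kmeanspp_init(subs, k=8, rng=rng)
print(init_shapelets.shape)  # (8, 64)
```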
In this module, normalized Euclidean distance with local Z-score normalization is employed to achieve scale-invariant pattern matching. For each location in the input signal, the local mean and standard deviation are computed within a sliding window. The normalized local subsequences are then compared against the shapelet template using a correlation-based distance formula:
$$d_i = \sqrt{2L\left(1 - \rho_i\right)} \qquad (1)$$

In Equation (1), $L$ represents the length of the shapelet, and $\rho_i$ represents the Pearson correlation coefficient between the normalized subsequence at position $i$ and the shapelet $s$. The shapelet $s$ is pre-normalized to zero mean and unit variance, so the correlation simplifies to an inner product form:

$$\rho_i = \frac{1}{L}\sum_{j=1}^{L} s_j\,\hat{x}_{i+j-1} \qquad (2)$$

where $s_j$ is the $j$-th element of the shapelet, $\hat{x}_{i+j-1}$ is the locally normalized input, and the sum $\sum_{j=1}^{L} s_j\,\hat{x}_{i+j-1}$ corresponds to the convolution output at position $i$.
This formula can be efficiently calculated by performing convolution operations using shapelets as convolution kernels. This correlation-based distance measurement ensures robustness against amplitude scaling and offset variations, which is particularly effective for fault mode matching under different operating conditions of CNC machines.
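A NumPy sketch of this convolution-based matching (assuming the $d_i = \sqrt{2L(1-\rho_i)}$ relation between normalized Euclidean distance and Pearson correlation; the test signal, with an exact copy of the template embedded at position 100, is synthetic):

```python
import numpy as np

def shapelet_match(signal, shapelet):
    """Sliding Pearson correlation (and normalized distance) between a
    z-normalized shapelet and locally z-normalized windows of the signal."""
    L = len(shapelet)
    s = (shapelet - shapelet.mean()) / shapelet.std()   # pre-normalized template
    ones = np.ones(L)
    # Local sums via convolution with a ones kernel, shared across shapelets.
    win_sum = np.convolve(signal, ones, mode="valid")
    win_sq = np.convolve(signal ** 2, ones, mode="valid")
    mu = win_sum / L
    sigma = np.sqrt(np.maximum(win_sq / L - mu ** 2, 1e-12))
    # One correlation pass of the raw signal against the template.
    xcorr = np.correlate(signal, s, mode="valid")
    rho = (xcorr - mu * s.sum()) / (L * sigma)          # windowed Pearson correlation
    dist = np.sqrt(np.maximum(2 * L * (1 - rho), 0.0))  # normalized Euclidean distance
    return rho, dist

rng = np.random.default_rng(1)
sig = rng.standard_normal(512)
shp = sig[100:164].copy()               # an exact copy embedded at position 100
rho, dist = shapelet_match(sig, shp)
print(int(np.argmax(rho)))              # 100: the embedded copy is the best match
```

Because the template has zero mean and unit variance, the cross-correlation term is the only per-shapelet pass; the local mean and variance come from the shared ones-kernel convolutions.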
Bearing fault characteristics exhibit different properties at different time scales: short-term patterns capture impulsive fault signatures, while long-term patterns reveal periodic fault behavior. To capture these multi-scale characteristics, we adopt shapelets of different lengths. Distance sequences are calculated at each scale, unified to the same length through adaptive pooling, and finally concatenated to form the shapelet representation.
Distinctive features may manifest at different scales for different fault types. To let the model focus on the effective scales for each fault type, learnable scale weights are introduced. These weights are optimized together with the other model parameters through backpropagation, allowing the model to adaptively emphasize the scales that are discriminative for the classification task.
To unify the multi-scale distance sequences before fusion, adaptive average pooling is applied. Given $S$ different shapelet lengths ($l_1, l_2, \ldots, l_S$), the distance sequences at each scale $s$ are first aligned to a common length through adaptive pooling. The learnable scale weights are computed using the softmax function to ensure they sum to one:

$$\alpha_s = \frac{\exp(w_s)}{\sum_{s'=1}^{S} \exp(w_{s'})} \qquad (3)$$

where $w_s$ are learnable parameters initialized to equal values, ensuring all scales contribute equally at the beginning of training. The scale weights $\alpha_s$ in Equation (3) are global trainable parameters shared across all samples and remain fixed during inference. Fault-type adaptation is achieved by the input-dependent multi-scale shapelet responses and the subsequent gated CNN fusion in Section 3.2 (Equation (8)), rather than by dynamically adjusting $\alpha_s$ per sample. These weights are jointly optimized with the other model parameters through backpropagation. The weighted multi-scale features are then concatenated along the channel dimension:

$$D = \mathrm{concat}\left(\alpha_1 D_1, \alpha_2 D_2, \ldots, \alpha_S D_S\right) \qquad (4)$$
This weighted concatenation mechanism allows the model to adaptively emphasize the scales that are most discriminative for each fault type.
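A sketch of this weighted fusion under assumed sizes (three scales, eight shapelets per scale, distance sequences already pooled to a common length; the softmax over learnable logits follows Equation (3)):

```python
import numpy as np

def softmax(w):
    e = np.exp(w - w.max())
    return e / e.sum()

rng = np.random.default_rng(0)
S, K, T = 3, 8, 32                     # scales, shapelets per scale, pooled length
pooled = [rng.random((K, T)) for _ in range(S)]  # distance maps after pooling

w = np.zeros(S)                        # learnable logits, equal at initialization
alpha = softmax(w)                     # scale weights summing to one

# Weight each scale, then concatenate along the channel dimension.
fused = np.concatenate([alpha[s] * pooled[s] for s in range(S)], axis=0)
print(alpha, fused.shape)              # equal 1/3 weights, shape (24, 32)
```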
Then the distance sequence is converted into similarity scores using an exponential kernel as follows:

$$\mathrm{sim}_i = \exp(-\gamma\, d_i) \qquad (5)$$

In this equation, $\gamma$ represents a scaling parameter that controls the sensitivity to changes in distance. This transformation maps the distance values to similarity scores in $(0, 1]$, with higher values indicating a closer match between the signal and the shapelet pattern.
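The exponential-kernel mapping is a one-liner; $\gamma = 0.5$ here is an assumed value of the scaling parameter:

```python
import numpy as np

gamma = 0.5                               # assumed sensitivity parameter
d = np.array([0.0, 0.5, 1.0, 2.0, 4.0])  # example distance values
sim = np.exp(-gamma * d)                  # similarity scores in (0, 1]
print(sim.round(3))                       # monotonically decreasing, sim = 1 at d = 0
```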
Time Complexity of Shapelet Matching. As shown in Equation (1), we utilize convolutional operations for efficient sliding-window matching. For an input signal of length $N$ with $K$ shapelets at each of $S$ different scales (shapelet lengths $l_1, \ldots, l_S$), the naive sliding-window Euclidean distance computation would require $O(N K S \bar{l})$ multiplications per sample, where $\bar{l}$ is the average shapelet length. Our method reformulates the normalized distance computation using convolution operations. The key insight is that the squared normalized Euclidean distance can be decomposed as in Equation (1). The critical efficiency gains are:

The local sums $\sum x$ and $\sum x^2$ (required for the local mean and variance) are computed via convolution with a ones kernel, which is shared across all shapelets. This reduces the normalization overhead from $O(N K S \bar{l})$ to $O(N S \bar{l})$.

Each shapelet-signal correlation is computed via a single conv1d operation with complexity $O(N l_s)$, but modern deep learning frameworks execute this as a highly parallelized matrix operation rather than sequential window enumeration.
3.2. Gated Parallel CNN
The gated parallel CNN module is designed to extract multi-scale spatial features from the shapelet feature maps while adaptively suppressing noise interference. This module adopts a gating mechanism that selectively filters out irrelevant information, thus enhancing the model’s robustness in noisy environments.
The key of the gating mechanism is the combination of a convolution path used for feature extraction and a gating path that generates values between 0 and 1 to control the flow of information. The gated convolution operation can be described as follows:

$$Y = (W * X + b) \odot \sigma(W_g * X + b_g) \qquad (6)$$

In this equation, $W$ and $b$ represent the main convolution kernels and biases, $W_g$ and $b_g$ are the gate convolution kernels and biases, and $\sigma$ denotes the sigmoid activation function. Produced by the sigmoid-activated gating path, values in the range $(0, 1)$ act as soft gates. This mechanism helps suppress noise-related features while retaining fault-related information.
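A single-channel NumPy sketch of the gated convolution (the kernel size of 5 and the random parameters are illustrative; real layers would have multiple channels):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1d(x, w, b):
    """'Valid' 1-D convolution in the cross-correlation form used by DL frameworks."""
    return np.correlate(x, w, mode="valid") + b

rng = np.random.default_rng(0)
x = rng.standard_normal(128)                          # one input channel, toy size
w_main, b_main = rng.standard_normal(5) * 0.1, 0.0    # main-path parameters
w_gate, b_gate = rng.standard_normal(5) * 0.1, 0.0    # gating-path parameters

features = conv1d(x, w_main, b_main)                  # feature-extraction path
gates = sigmoid(conv1d(x, w_gate, b_gate))            # soft gates in (0, 1)
y = features * gates                                  # element-wise gated output
print(y.shape)                                        # (124,)
```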
We use multiple parallel branches with different convolution kernel sizes, each capturing patterns at a different time scale. The outputs of these branches are concatenated to form the final feature map. To enable adaptive fusion of these multi-scale features based on input characteristics, we introduce learnable gate weights computed through global average pooling (GAP). Given $B$ parallel branches with feature outputs $F_1, F_2, \ldots, F_B$, we first concatenate them and apply global average pooling to obtain a compact representation:

$$z = \mathrm{GAP}\left(\mathrm{concat}(F_1, F_2, \ldots, F_B)\right) \qquad (7)$$

where $z$ contains the pooled features from all branches. The gate weights are then computed through a fully connected layer followed by sigmoid activation:

$$g = \sigma(W z + b) \qquad (8)$$

where $W$ and $b$ are learnable parameters, and $\sigma$ denotes the sigmoid activation function. The gate values $g$ act as soft attention weights that control the contribution of each branch. The final fused features are obtained by applying the gate weights element-wise and concatenating:

$$F = \mathrm{concat}\left(g_1 \odot F_1, g_2 \odot F_2, \ldots, g_B \odot F_B\right) \qquad (9)$$
This gated fusion mechanism enables the model to dynamically adjust the contribution of each scale based on the input signal characteristics. In noisy environments, scales that are more susceptible to noise contamination are automatically down-weighted, thereby enhancing the model’s robustness against different noise types.
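The GAP-based gated fusion can be sketched as follows (three branches with four channels each and a randomly initialized fully connected layer; all sizes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
B, C, T = 3, 4, 32                       # branches, channels per branch, time steps
branches = [rng.standard_normal((C, T)) for _ in range(B)]

concat = np.concatenate(branches, axis=0)  # (B*C, T)
z = concat.mean(axis=1)                    # global average pooling -> (B*C,)

W = rng.standard_normal((B, B * C)) * 0.1  # FC layer: one gate value per branch
b = np.zeros(B)
g = sigmoid(W @ z + b)                     # soft attention weight per branch

fused = np.concatenate([g[i] * branches[i] for i in range(B)], axis=0)
print(fused.shape)                         # (12, 32)
```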
3.3. Residual-Enhanced Bidirectional LSTM Network
After convolutional feature extraction, temporal dependencies are modeled using a BiLSTM with residual connections. This bidirectional design enables the model to capture patterns from both directions, which is useful for periodic fault signatures and repeated impacts.
The bidirectional architecture processes the sequence in both forward and backward directions simultaneously, and the hidden states are concatenated at each time step:

$$h_t = \left[\overrightarrow{h}_t;\, \overleftarrow{h}_t\right] \qquad (10)$$

where $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ are the forward and backward hidden states, respectively. This bidirectional design enables the model to capture patterns dependent on both past and future context, which is particularly beneficial for identifying fault signatures that manifest as symmetric or periodic patterns.
Since the input dimension typically differs from the BiLSTM output dimension, we introduce a linear projection layer to enable the residual connection. The enhanced output with residual connection and layer normalization is computed as:

$$Y = \mathrm{LayerNorm}\left(H + X W_p\right) \qquad (11)$$

where $W_p$ is the projection matrix that aligns the input dimension with the BiLSTM output, $H$ is the BiLSTM output, and $X$ is the input sequence. This residual design facilitates gradient flow during training and helps preserve information from earlier processing stages, improving the overall stability and convergence of the deep network.
3.4. Composite Loss Function
Bearing fault signals are often similar across different fault categories. The cross-entropy loss function only focuses on the classification results and does not distinguish similar categories in the feature space. Therefore, a contrastive loss is added to enhance feature discriminability.
Since the model employs shapelet extraction, features at different scales should represent the same fault type, but they may learn inconsistent patterns. Therefore, a consistency loss is added to maintain feature consistency across different scales.
Weighted cross-entropy loss is the main classification loss. Classes are weighted according to their inverse frequencies. This helps handle cases where certain fault-type samples are scarce.
Supervised contrastive loss pulls features of the same class together and pushes features of different classes apart, using temperature-scaled cosine similarity. This makes the learned features more distinguishable, especially in noisy environments.
Multi-scale consistency loss encourages the similarity of features at adjacent scales. This loss term employs cosine similarity between features at different scales. This can prevent the model from learning contradictory patterns at different scales.
The total loss is a weighted combination of these three components:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{CE}} + \lambda_1 \mathcal{L}_{\mathrm{SC}} + \lambda_2 \mathcal{L}_{\mathrm{MC}} \qquad (12)$$

In this equation, $\mathcal{L}_{\mathrm{CE}}$ represents the weighted cross-entropy loss, $\mathcal{L}_{\mathrm{SC}}$ represents the supervised contrastive loss, and $\mathcal{L}_{\mathrm{MC}}$ represents the multi-scale consistency loss, while $\lambda_1$ and $\lambda_2$ are hyperparameters that balance the contributions of the loss components.
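A numeric sketch of the weighted combination; the component loss values and $\lambda_1$ are placeholders, while $\lambda_2 = 0.02$ matches the consistency-loss weight discussed below:

```python
# Placeholder component losses for illustration; lambda_1 = 0.1 is a
# hypothetical value, lambda_2 = 0.02 matches the weight stated in the text.
L_ce, L_sc, L_mc = 0.85, 2.10, 0.40      # cross-entropy, contrastive, consistency
lambda_1, lambda_2 = 0.1, 0.02

L_total = L_ce + lambda_1 * L_sc + lambda_2 * L_mc
print(round(L_total, 3))                 # 1.068
```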
The composite loss function involves two key hyperparameters: $\lambda_1$ for the supervised contrastive loss and $\lambda_2$ for the multi-scale consistency loss. We conducted a grid search on the validation set to determine the optimal values, with results shown in Table 1.
Based on these results, we selected the values of $\lambda_1$ and $\lambda_2$ for the following reasons:

Contrastive loss weight ($\lambda_1$): This value provides sufficient regularization to enhance inter-class separability without dominating the cross-entropy loss. When $\lambda_1$ is too large, training becomes unstable due to competing gradients between the classification and contrastive objectives; when $\lambda_1$ is too small, the contrastive regularization effect is negligible, resulting in less discriminative feature representations under noise.

Consistency loss weight ($\lambda_2 = 0.02$): The multi-scale consistency loss encourages coherent representations across different temporal scales. A moderate weight of 0.02 enforces this constraint without over-regularizing the model. Higher values tend to force excessive similarity between scales, reducing the model’s ability to capture scale-specific fault patterns.