1. Introduction
Rolling bearings are subject to cumulative wear under demanding operating conditions, resulting in progressive performance degradation over time and posing risks to the safety and operational efficiency of mechanical systems [1,2]. As a key component in rotating machinery, rolling bearings are extensively employed across diverse sectors such as industrial manufacturing, transportation, energy production, automotive engineering, and aerospace. Consequently, timely and accurate detection of bearing faults is essential to ensure the safe and reliable operation of mechanical systems [3,4].
Vibration signal analysis is a widely adopted method for detecting faults in rolling bearings. Conventional techniques include the short-time Fourier transform (STFT) [5], wavelet transform (WT) [6], empirical mode decomposition (EMD) [7], and related methods such as the Hilbert–Huang transform [8]. These approaches enable a quantitative assessment of equipment operating states by incorporating established bearing fault indicators, including root mean square, kurtosis, peak value, and crest factor [9]. Chen et al. [10] proposed a method based on multiscale improved envelope spectrum entropy (MIESE), which combines multiscale decomposition with entropy calculation and uses JADE to eliminate feature redundancy, achieving both high efficiency and feature quality under complex operating conditions. For interpretable multiclass classification, Hou et al. [11] extended the binary optimized weights spectrum (OWS) to handle multiple fault types, leveraging a softmax classifier to generate physically interpretable weighted spectra that enable trustworthy fault identification. However, the establishment and evaluation of fault features are significantly influenced by human factors.
With advancements in pattern recognition technology, increasingly intelligent approaches have been adopted for fault diagnosis. Methods such as support vector machines (SVMs) [12], hidden Markov models (HMMs) [13], back propagation (BP) neural networks [14], and k-nearest neighbors (KNN) [15] have been successfully applied, significantly improving the accuracy of diagnostic outcomes. However, traditional pattern recognition methods rely heavily on manual feature engineering, which is inherently limited in extracting deep nonlinear features and lacks adaptive learning capabilities when processing large-scale datasets. As a result, their effectiveness in fault classification and prediction across varying operating conditions remains constrained.
Deep learning (DL) leverages multi-layer convolutional networks to automatically extract deep hierarchical features by progressively increasing network depth. This architecture enables end-to-end feature extraction, dimensionality reduction, and state recognition without human intervention, effectively capturing the complex nonlinear relationships between input signals and system states. It excels at autonomously identifying fault characteristics from raw data, thereby minimizing uncertainties introduced by manual processing. As a result, deep learning is particularly well-suited for fault diagnosis tasks involving high-dimensional, nonlinear, and heterogeneous data. Owing to its superior performance, an increasing number of diagnostic approaches have adopted deep learning models, among which VGG [16], ResNet [17], DenseNet [18], and ConvNeXt [19] are widely employed. Two-dimensional convolutional networks require transforming raw signals into image representations via the continuous wavelet transform (CWT), symmetrized dot pattern (SDP), Gramian angular field (GAF), or short-time Fourier transform (STFT) [20], a process that introduces additional preprocessing steps and increases computational cost. In contrast, 1D-CNNs offer a more direct and efficient alternative. By performing fault diagnosis directly on the original one-dimensional signals, 1D-CNNs preserve data integrity and avoid potential information loss during conversion, making them increasingly popular in practical applications [21,22,23].
Nevertheless, deep learning-based fault diagnosis methods still face significant challenges under high noise levels and varying operating conditions. Guo et al. [24] reported on the CWRU dataset that ResNet achieved a diagnostic accuracy of 99.822% at a signal-to-noise ratio (SNR) of 2 dB and maintained a relatively high accuracy of 81.489% even when the SNR dropped to −8 dB, with DenseNet exhibiting similar robustness. In contrast, ConvNeXt reached 99.822% accuracy at an SNR of 2 dB but showed a more pronounced decline, dropping to only 62.970% at an SNR of −8 dB. Ling et al. [25] observed in experiments on the Southeast University datasets that ResNet18 was highly sensitive to noise: its peak recognition accuracy reached 87% at an SNR of 6 dB yet decreased sharply with increasing noise, falling to just 26% at −2 dB. AlexNet, LeNet5, and MobileNet demonstrated comparable performance trends. Ni et al. [26] further investigated model accuracy across diverse operating conditions and found substantial variation in diagnostic performance under different noise levels, ranging from a maximum of 94.38% to a minimum of 57.80%. In contrast, transfer accuracy under varying load and rotational speed conditions remained consistently above 87%. To address these limitations, feature fusion algorithms have been proposed to enhance the robustness and diagnostic accuracy of deep models by integrating multi-dimensional and multi-modal information. For example, Min et al. [27] proposed a class-imbalanced fault diagnosis method for machinery based on heterogeneous data fusion and a support tensor machine (STM), which effectively preserves the structural information of multi-source signals and improves diagnostic accuracy for bearing faults. Junior et al. [28] developed a multi-head 1D-CNN with a parallel architecture, achieving effective detection and classification of six types of motor faults. Wang et al. [29] implemented a 1D-CNN framework that fuses features from raw vibration and acoustic signals, demonstrating lower loss and higher accuracy than single-modal approaches across various SNR conditions. Zhang et al. [30] enhanced the input representation of a parallel ResNet network through a dual-input strategy that combines the short-time Fourier transform (STFT) and the continuous wavelet transform (CWT), thereby capturing diverse and complementary time–frequency characteristics more effectively.
Despite these advantages, the hidden layers in deep network architectures are often of considerable scale, and the incorporation of feature fusion algorithms leads to a sharp increase in the number of trainable parameters. This significantly increases computational load, training time, and model complexity. Training such networks from scratch typically involves random initialization of weights and biases, introducing uncertainty that may impair training efficiency and adversely affect model convergence. Consequently, feature fusion-based approaches tend to be computationally intensive, labor-intensive, and challenging to implement effectively.
The transfer learning (TL) strategy involves leveraging a pre-trained deep model, initially trained on large-scale sample data, and transferring knowledge acquired from previously solved, related tasks to improve performance on the current target problem [31]. This approach not only enhances training efficiency but also increases the practicality and adaptability of deep learning methods [32]. In the field of fault diagnosis, a widely adopted transfer learning paradigm is to first pre-train a model on large natural image datasets such as ImageNet, then fine-tune or freeze the convolutional layers, and finally retrain the fully connected layers using domain-specific mechanical vibration data. This methodology has demonstrated significant effectiveness in improving diagnostic accuracy and generalization under limited labeled data conditions [33,34].
This study proposes a fault diagnosis method based on a parallel heterogeneous deep network (PHDN). The dual-channel network simultaneously accepts the raw vibration signal and its time–frequency representation as inputs, effectively integrating visual features extracted from time–frequency images with temporal characteristics inherent in the raw signal. The parallel branches employ distinct network architectures to process one-dimensional vibration signals and two-dimensional time–frequency representations, respectively, ensuring optimal alignment between network design and data characteristics while enabling effective extraction of complementary fault features. This integration enhances the accuracy and robustness of bearing fault state monitoring. Comprehensive experimental evaluations on two datasets—including performance comparison, noise robustness analysis, and model transferability studies—demonstrate the superior effectiveness of the proposed PHDN in fault diagnosis. The main contributions of this work are summarized as follows:
A novel dual-branch parallel heterogeneous network is proposed, in which ConvNeXt is employed to process time–frequency images generated via continuous wavelet transform (CWT), while an improved 1D-ResNet processes the raw signal. This architecture enables automatic extraction of complex, high-dimensional fault features from both vibration signal segments and their corresponding time–frequency representations.
To support the dual-branch input structure, a hybrid diagnostic dataset is constructed by aligning raw signal segments, their time–frequency images, and fault labels. Signal segments are partitioned according to the fault cycle and converted into time–frequency images using continuous wavelet transform (CWT) based on the Morse wavelet.
A local transfer learning strategy is adopted to accelerate training convergence. Specifically, the ConvNeXt branch initializes its weights from a model pre-trained on the ImageNet dataset, while the 1D-ResNet branch is trained from scratch using the default parameter initialization scheme provided by the timm library.
The remainder of this paper is organized as follows. Section 2 presents the construction of the proposed parallel heterogeneous deep network. Section 3 outlines the implementation details of the fault diagnosis framework. Section 4 evaluates the proposed method through comprehensive experiments on two benchmark datasets, including comparative analysis with alternative approaches. Section 5 concludes the study and discusses potential future directions.
3. Proposed Methodology
3.1. Vibration Signal Segmentation
In mechanical fault diagnosis, each signal segment must capture all key characteristics of a complete fault cycle. For rolling bearings, common faults such as inner and outer race defects generate periodic impact signals that are synchronized with the bearing’s fundamental frequency. The minimum length of a vibration signal segment is determined by both the rotational speed and the sampling rate, and it is crucial that each segment contains at least one full fault cycle to preserve diagnostic integrity. This minimum length is defined as follows:
$$L_{\min} = \frac{f_s}{f_r} = \frac{60 f_s}{n}$$
where $f_s$ denotes the sampling frequency (in Hz), and $f_r$ represents the fundamental rotational frequency of the bearing (in Hz), calculated as $n/60$, with $n$ being the equipment’s rotational speed in revolutions per minute (r/min).
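To assess the consistency between different segments, the Pearson correlation coefficient is computed between any two signal segments $S_i(t)$ and $S_j(t)$:
$$r = \frac{\sum_{t}\left(S_i(t)-\bar{S}_i\right)\left(S_j(t)-\bar{S}_j\right)}{\sqrt{\sum_{t}\left(S_i(t)-\bar{S}_i\right)^2}\,\sqrt{\sum_{t}\left(S_j(t)-\bar{S}_j\right)^2}}$$
where $\bar{S}_i$ and $\bar{S}_j$ are the means of the segments $S_i(t)$ and $S_j(t)$, respectively, and the sums run over all samples of a segment. A correlation coefficient $r$ close to 1 indicates high similarity, reflecting strong repeatability in fault patterns across segments.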
The original vibration signal is partitioned into non-overlapping segments using a fixed window length $L$. Segments shorter than $L$ are discarded to avoid incomplete representations. Within the interval $[L_{\min}, L_{\max}]$, the average correlation among segments is evaluated to identify the optimal segmentation length that maximizes temporal coherence. To balance feature completeness and sufficient sample size, $L_{\min}$ ensures coverage of at least one full fault cycle, while $L_{\max}$ considers computational efficiency and noise sensitivity. The search range for $L$ is set to $(1.2$–$2.5)\,L_{\min}$, with candidate lengths evaluated at intervals of $0.1\,L_{\min}$. This strategy prevents information loss from short segments while avoiding excessive data redundancy from overly long segments.
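The length search described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the minimum length is taken as one shaft revolution ($f_s/f_r$, rounded up), and the candidate with the highest mean pairwise Pearson correlation is selected.

```python
import numpy as np


def min_segment_length(fs, rpm):
    """Samples per shaft revolution: L_min = fs / fr, with fr = rpm / 60."""
    fr = rpm / 60.0
    return int(np.ceil(fs / fr))


def mean_pairwise_correlation(signal, length):
    """Average Pearson correlation over all pairs of non-overlapping segments."""
    signal = np.asarray(signal, dtype=float)
    n_seg = len(signal) // length          # a trailing short segment is discarded
    if n_seg < 2:
        raise ValueError("need at least two full segments")
    segments = signal[: n_seg * length].reshape(n_seg, length)
    corr = np.corrcoef(segments)           # rows are treated as variables
    upper = np.triu_indices(n_seg, k=1)    # each unordered pair counted once
    return float(np.mean(corr[upper]))


def select_segment_length(signal, fs, rpm):
    """Scan L = 1.2*L_min ... 2.5*L_min in steps of 0.1*L_min."""
    l_min = min_segment_length(fs, rpm)
    factors = np.arange(12, 26) / 10.0     # 1.2, 1.3, ..., 2.5
    candidates = [int(round(f * l_min)) for f in factors]
    scores = [mean_pairwise_correlation(signal, c) for c in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]
```

For illustrative values of $f_s$ = 12 kHz and $n$ = 1800 r/min, `min_segment_length` gives 400 samples, so the candidate lengths run from 480 to 1000 samples in steps of 40.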
3.2. Continuous Wavelet Transform (CWT)
The continuous wavelet transform (CWT) is a powerful time–frequency analysis tool that decomposes a signal into scaled and translated versions of a wavelet, enabling detailed characterization of transient features. Mathematically, the CWT is defined as [39]:
$$W_x(a,b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\, \psi^{*}\!\left(\frac{t-b}{a}\right) \mathrm{d}t$$
where $x(t)$ is the input signal, $a > 0$ is the scale parameter, $b \in \mathbb{R}$ is the translation parameter, $\psi$ is the wavelet function, and $*$ denotes complex conjugation. The key aspect of the CWT is the selection of an appropriate wavelet function. Generalized Morse wavelets constitute a family of exactly analytic wavelets. They are useful for analyzing modulated signals, which are signals with time-varying amplitude and frequency. The frequency-domain representation of the Morse wavelet is given by [40]
$$\Psi_{\beta,\gamma}(\omega) = U(\omega)\,\alpha_{\beta,\gamma}\,\omega^{\beta} e^{-\omega^{\gamma}}$$
where $\alpha_{\beta,\gamma}$ is a normalization constant, $U(\omega)$ is the unit step function, and $\beta$ and $\gamma$ are two parameters controlling the wavelet form. The parameter $\gamma$ determines the symmetry of the Morse wavelet, while $P^2 = \beta\gamma$ defines the time-bandwidth product. By adjusting the time-bandwidth product and symmetry parameters of a Morse wavelet, it is possible to generate analytic wavelets with diverse characteristics and dynamic behaviors. This inherent customizability enables the method to operate without reliance on fixed waveform templates, allowing for flexible adaptation to the features of non-stationary signals such as mechanical vibrations. It can accurately extract transient components and effectively suppress noise interference, thus serving as a powerful and efficient tool for the analysis of complex time-varying signals. In this study, the complex Morse wavelet was selected as the basis function for the wavelet transform, with the specific parameter settings of $\gamma = 3$ and the time-bandwidth product $P^2 = 60$, which corresponds to $\beta = 20$.
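To make the frequency-domain definition concrete, the following NumPy sketch evaluates a Morse-wavelet CWT by multiplying the signal spectrum with the scaled wavelet spectrum. It is an illustration under stated assumptions, not the paper's pipeline: the function name `morse_cwt` is hypothetical, the normalization constant is chosen so that the wavelet spectrum peaks at 2, the $1/\sqrt{a}$ factor is omitted (amplitude normalization, so a unit-amplitude sinusoid yields unit magnitude), and each analysis frequency is mapped to a scale through the wavelet's peak frequency $(\beta/\gamma)^{1/\gamma}$.

```python
import numpy as np


def morse_cwt(x, fs, freqs, gamma=3.0, p2=60.0):
    """CWT with a generalized Morse wavelet, evaluated in the frequency domain.

    x     : 1-D real signal
    fs    : sampling frequency in Hz
    freqs : analysis frequencies in Hz (one output row per frequency)
    gamma : symmetry parameter; p2 : time-bandwidth product (beta = p2 / gamma)
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    beta = p2 / gamma
    omega = 2.0 * np.pi * np.fft.fftfreq(n, d=1.0 / fs)   # rad/s for each FFT bin
    spectrum = np.fft.fft(x)
    omega_peak = (beta / gamma) ** (1.0 / gamma)          # peak of the mother wavelet
    # log of the normalization constant 2 * (e * gamma / beta) ** (beta / gamma),
    # kept in log form to avoid overflow for large beta (assumed convention)
    log_norm = np.log(2.0) + (beta / gamma) * (1.0 - np.log(beta / gamma))

    coeffs = np.empty((len(freqs), n), dtype=complex)
    for k, f in enumerate(freqs):
        scale = omega_peak / (2.0 * np.pi * f)            # scale whose peak sits at f
        w = scale * omega
        psi = np.zeros(n)
        pos = w > 0                                       # unit step U(omega)
        psi[pos] = np.exp(log_norm + beta * np.log(w[pos]) - w[pos] ** gamma)
        coeffs[k] = np.fft.ifft(spectrum * psi)           # psi is real, so conj(psi) = psi
    return coeffs
```

The magnitude `np.abs(coeffs)` gives the scalogram; converting it to a three-channel image (for example with a colormap) is a separate step that is not shown here.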
3.3. Transfer Learning Strategies
Convolutional neural networks (CNNs) have become essential tools in signal processing due to their success in image analysis. However, training CNNs from scratch demands extensive computational resources and long training times. Transfer learning addresses these challenges by initializing the network with pre-trained weights, allowing knowledge transfer from large-scale source datasets to target tasks with limited data. This approach significantly accelerates convergence and improves model generalization.
For PHDN, transfer learning is applied exclusively to the ConvNeXt branch, where weights are initialized from a ConvNeXt-T model pre-trained on ImageNet. As illustrated in Figure 1, the lower layers of ConvNeXt are frozen to retain their ability to extract generic spatial features. Only the higher-level layers are fine-tuned to adapt to the specific texture and pattern characteristics of time–frequency maps derived from mechanical vibration signals. In contrast, the 1D-ResNet branch, which is designed for one-dimensional raw signals and lacks an equivalent pre-trained counterpart, is trained from scratch to learn temporal fault dynamics effectively. This asymmetric training strategy promotes diverse feature representations across branches while accelerating overall convergence. This scheme is specifically designed as a feasible solution for scenarios with limited hardware resources: it avoids the high computational cost of training both branches from scratch or customizing pre-trained models for 1D signals, while still achieving effective performance.
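The freezing scheme can be sketched in PyTorch as follows. This is a hedged illustration, not the authors' code: `freeze_low_level`, `build_convnext_branch`, and `ToyBackbone` are hypothetical names, the frozen part is assumed to be the stem plus the first two stages (as stated in the workflow of Section 3.4), and the backbone is assumed to expose `stem` and `stages` attributes in the way timm's ConvNeXt models do.

```python
import torch.nn as nn


def freeze_low_level(backbone: nn.Module, n_frozen_stages: int = 2) -> nn.Module:
    """Disable gradients for the stem and the first `n_frozen_stages` stages."""
    frozen = [backbone.stem] + list(backbone.stages)[:n_frozen_stages]
    for module in frozen:
        for param in module.parameters():
            param.requires_grad = False
    return backbone


def build_convnext_branch(pretrained: bool = True) -> nn.Module:
    """Create a ConvNeXt-T feature extractor with its lower layers frozen."""
    import timm  # imported lazily so the helper above works without timm

    # num_classes=0 removes the ImageNet classification head
    backbone = timm.create_model("convnext_tiny", pretrained=pretrained, num_classes=0)
    return freeze_low_level(backbone, n_frozen_stages=2)


class ToyBackbone(nn.Module):
    """Hypothetical stand-in with the same stem/stages layout, for a quick check."""

    def __init__(self):
        super().__init__()
        self.stem = nn.Conv2d(3, 8, kernel_size=4, stride=4)
        self.stages = nn.Sequential(
            *[nn.Conv2d(8, 8, kernel_size=3, padding=1) for _ in range(4)]
        )


def count_trainable(model: nn.Module) -> int:
    """Number of parameters that will still receive gradient updates."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```

When building the optimizer, passing only the parameters with `requires_grad=True` keeps the frozen layers out of the update step and reduces optimizer memory.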
3.4. Fault Diagnosis Process
The proposed methodology integrates continuous wavelet transform (CWT), a parallel heterogeneous deep network (PHDN), and transfer learning (TL) to achieve high-precision fault recognition in rotating machinery. The overall workflow is illustrated in Figure 6 and consists of the following steps:
Data Acquisition: Publicly available bearing datasets are used in this study, consisting of one-dimensional vibration signals collected via sensors. In practical deployment, similar sensors should be installed on machinery to enable real-time data acquisition.
Time–Frequency Image Generation: Since the ConvNeXt branch processes three-channel 2D images, the raw vibration signals are first segmented. Each segment is then transformed into a time–frequency image using continuous wavelet transform (CWT).
Data Pair Construction and Partitioning: Given the dual-input nature of the PHDN, hybrid data pairs are constructed by associating signal segments, their corresponding time–frequency images, and class labels. In each pair, the signal segment is fed into the 1D-ResNet branch, the image into the ConvNeXt branch, and the output is supervised by the label. These pairs are randomly divided into training, validation, and test sets to ensure unbiased evaluation.
Pre-trained Model Initialization: In the ConvNeXt branch, pre-trained weights from the ConvNeXt-T model (loaded via the timm library) are adopted. The input layer and the first two stages of ConvNeXt blocks are frozen to preserve learned low-level image features. The remaining ConvNeXt blocks keep their pre-trained weights but remain trainable for fine-tuning, while the entire 1D-ResNet branch and the classification module are initialized using the default initialization scheme in timm. This setup ensures a stable starting point for joint fine-tuning.
Model Training: The network is trained on the training set until convergence, monitored using the validation set. Upon convergence, all model parameters are saved for inference.
Performance Evaluation: The trained model is evaluated on the test set to assess its accuracy, robustness, and generalization performance in fault diagnosis tasks.
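The data pair construction and random partitioning step in the workflow above can be sketched as follows. The helper names and the 70/15/15 split ratio are assumptions made for illustration; the text states only that the pairs are divided randomly into training, validation, and test sets.

```python
import numpy as np


def build_pairs(segments, images, labels):
    """Associate each signal segment with its time-frequency image and label."""
    if not (len(segments) == len(images) == len(labels)):
        raise ValueError("segments, images, and labels must have equal length")
    return [
        {"signal": s, "image": im, "label": int(y)}
        for s, im, y in zip(segments, images, labels)
    ]


def random_split(n_samples, ratios=(0.7, 0.15, 0.15), seed=0):
    """Return shuffled, disjoint index arrays for train, validation, and test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = round(ratios[0] * n_samples)
    n_val = round(ratios[1] * n_samples)
    # whatever remains after train and validation goes to the test set
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]
```

Splitting by index keeps each segment, its image, and its label together, so both branches always receive views of the same sample.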
5. Conclusions
To address the challenges of rolling bearing fault diagnosis under complex operating conditions, this paper proposes a parallel heterogeneous deep network (PHDN) to enhance the accuracy and robustness of fault diagnosis. Experimental results demonstrate that the proposed model achieves superior performance on both the CWRU and KAIST datasets. Under varying signal-to-noise ratio (SNR) conditions, the model maintains a recognition accuracy above 90% when the SNR is greater than −5 dB. It improves the diagnostic accuracy by 3.8% over the standalone model. In low-sample scenarios using 100%, 50%, 30%, and 15% of the training data, PHDN improves the diagnostic accuracy by 3.24% over the standalone model under the minimum sample size condition, significantly outperforming the compared methods. Under variable load conditions, PHDN achieves a diagnostic accuracy improvement of at least 2.11 percentage points over the top-performing competitor, CWT-ConvNeXt, and successful cross-load model transfer is achieved.
The proposed PHDN effectively enhances fault diagnosis accuracy and robustness through cross-modal feature learning, particularly in challenging environments involving noise interference, limited training samples, and load variations. The key innovation lies in its ability to leverage the complementary strengths of multimodal data via a parallel heterogeneous architecture and cross-modal feature fusion mechanism. This approach not only improves diagnostic accuracy but also strengthens the model’s adaptability and generalization capability under diverse operational conditions, demonstrating significant engineering application potential. Despite these advancements, this study has certain limitations. The model evaluation was conducted exclusively on two laboratory-based benchmark datasets characterized by controlled experimental conditions and isolated fault modes, which may not adequately represent the complexity of real-world industrial environments, including load and speed fluctuations, multi-source noise interference, and the coupled effects of compound faults. Consequently, the generalizability of the findings to practical industrial applications warrants further validation. Regarding transfer learning, the adopted strategy is not claimed to be optimal; rather, it is a practical, hardware-friendly solution that balances computational efficiency and diagnostic performance. Future work will focus on optimizing the network architecture to reduce computational overhead and improve real-time inference performance. Additionally, cross-device and cross-domain transfer learning strategies will be investigated to further validate and enhance the model’s generalization across different systems and environments.