Hybrid Multi-Scale CNN and Transformer Model for Motor Fault Detection

Kumar, Prashant

doi:10.3390/machines14010113


Prashant Kumar

Department of AI and Big Data, Woosong University, Daejeon 34606, Republic of Korea

Machines 2026, 14(1), 113; https://doi.org/10.3390/machines14010113

Submission received: 19 December 2025 / Revised: 6 January 2026 / Accepted: 16 January 2026 / Published: 19 January 2026


Abstract

Electric motors are the workhorses of industry owing to their precise speed and torque control. Despite their ruggedness, faults are inevitable due to wear and tear, prolonged usage, and other factors. Bearing faults are among the most frequently occurring faults in electric motors, and detecting them at an early stage is crucial for avoiding complete shutdown. Deep learning has gained significant attention in the fault detection domain owing to its ability to learn features directly from raw or minimally processed signals. This paper proposes a hybrid multi-scale convolutional neural network and Transformer model for bearing fault detection. The model combines multi-scale convolutional front-ends for fine-grained feature extraction with Transformer encoder blocks for capturing long-range temporal dependencies. The proposed method was tested on the Case Western Reserve University (CWRU) bearing dataset, where it achieved an average accuracy of more than 98%.

Keywords: convolutional neural network (CNN); bearing faults; electric motor; transformer

1. Introduction

Electric motors (EMs) drive modern industries owing to their superior and smooth operational performance. EMs such as induction motors (IMs) are widely used owing to their excellent torque characteristics and ruggedness. They are robust machines, but faults are inevitable. Bearing faults (BFs) are among the most frequently occurring faults in EMs [1], so it is vital to detect them effectively and in a timely manner to avoid complete shutdown. Artificial intelligence (AI) has gained significant attention in the fault domain in recent years owing to advantages such as higher accuracy and better adaptability. These advantages translate directly into improved reliability, reduced downtime, and lower maintenance costs. In addition, the availability of large datasets and affordable sensing has made data-driven approaches more practical and effective for fault monitoring. Early intelligent fault detection (FD) systems relied on conventional machine learning (ML) algorithms such as the support vector machine (SVM), k-nearest neighbor (kNN), decision trees (DTs), random forest (RF), and shallow neural networks, combined with feature engineering [2]. Feature engineering comprises the extraction of features from vibration, current, and acoustic signals and the selection of the optimal features. These pipelines also require expert-designed signal processing techniques and often degrade when operating conditions change. With the growing availability of computational resources, deep learning (DL) has also attracted researchers in the fault domain [3,4]. DL methods learn hierarchical representations directly from raw or minimally processed signals and images, leading to end-to-end FD frameworks.

DL models, particularly convolutional neural networks (CNNs), auto-encoders, recurrent neural networks (RNNs), and long short-term memory (LSTM) networks, have been used effectively in the FD domain. CNNs can automatically extract discriminative fault features from raw time series, spectrograms, or images [5,6,7,8], which avoids the labor-intensive manual feature design required by classical ML. DL can integrate multiple sensor modalities (vibration, current, temperature, and images) and capture nonlinear, time-dependent behavior using CNNs, RNNs, LSTMs, and hybrid models, which is challenging for shallow ML models without extensive feature engineering [9,10].

The authors of [11] developed a lightweight feature fusion CNN model for bearing fault diagnosis using the Paderborn University (PU) dataset. A bearing FD technique combining a CNN and LSTM with the fast Fourier transform and singular value decomposition was developed in [12]. The authors of [13] developed an interpretable CNN model for bearing fault diagnosis using learnable Gaussian/Sinc filters; the results demonstrated that this approach performed on par with baseline models. The authors of [14] proposed combining a CNN and empirical mode decomposition with an SVM for bearing FD, with the gray wolf optimizer (GWO) used to optimize the SVM parameters. The authors of [15] developed a bearing FD approach that feeds acoustic and vibration data to an optimized CNN model. A bearing FD method combining a CNN and SVM with the help of a scattergram was developed in [16]. The authors of [17] proposed a bearing FD method using an LSTM-CNN model and an unsupervised domain-share CNN, in which features are learned from the time and frequency domains with a contrastive learning loss. The authors of [18] proposed a multi-fault diagnosis method using wavelet-assisted stacked image fusion with a dual-branch CNN model. The authors of [19] proposed a motor bearing FD technique based on the wavelet packet transform, empirical mode decomposition, and a CNN-LSTM model. The authors of [20] proposed a bearing FD method based on a 2D CNN and a hybrid kernel fuzzy SVM. The original bearing vibration signals are first transformed into 2D grayscale images, which are then fed into the 2D CNN for dimensionality reduction and feature extraction. The resulting feature vectors are then sent to the hybrid kernel fuzzy SVM for FD.

The authors of [21] developed a bearing FD approach combining CNN, LSTM, and gated recurrent unit (GRU) models, which was evaluated on a public-domain bearing dataset. The authors of [22] developed an enhanced bearing FD approach combining a CNN and an LSTM network, also based on a public-domain bearing dataset. The authors of [23] proposed a hybrid LSTM-RF model for bearing FD optimized using GWO: the GWO algorithm optimizes feature selection from the LSTM outputs, and the most pertinent features are fed into the RF classifier for fault classification. The authors of [24] proposed a bearing FD method using an Elman neural network (ENN)-based LSTM. The fast Fourier transform (FFT) is applied to the raw accelerometer vibration data to extract frequency-domain features. While the LSTM captures temporal dependencies in the data, the ENN is used to detect bearing faults under different operating conditions. The authors of [25] proposed a multi-size wide-kernel CNN model for bearing FD using vibration data. The wide-kernel design enhances the network's ability to differentiate between healthy and defective bearings by gathering local and global features more efficiently. The authors of [26] proposed a bearing FD approach using a multi-objective optimized deep auto-encoder (AE), in which a multi-objective particle swarm optimization method determines the optimal network structure and hyperparameters. The authors of [27] developed a bearing FD method based on an adaptive denoising autoencoder model. The encoder uses convolutional layers to learn data representations, and the decoder uses deconvolutional layers to reconstruct the data. Adaptive shrinkage units in both the encoder and decoder mimic denoising operations, which eliminate interfering components while preserving subtle fault characteristics.

In recent times, Transformers and self-attention mechanisms have also been used in the FD domain. Time-series and vision Transformers have shown good performance on benchmark datasets. This is due to their ability to explicitly model global temporal correlations, which helps in achieving high accuracy across variable operating conditions. Hybrid CNN–attention models that combine convolutional front-ends with multi-head self-attention have also been developed for FD [28,29]. Despite this progress, the majority of existing techniques rely on either a CNN or a Transformer in isolation. Relatively few works tightly couple a multi-scale CNN front-end with a time-series Transformer. The rich multi-scale structure of motor vibration signals (e.g., the co-existence of high-frequency bearing impulses and low-frequency load modulations) may not be fully exploited by existing Transformer techniques, which usually employ simple linear projections or single-scale convolutions for embedding. Although multi-scale CNN models exist, they rarely combine multi-scale temporal features for IMs with global self-attention.

This paper proposes a hybrid multi-scale CNN combined with a time-series Transformer (HMSCT) model for bearing FD to combine the advantages of both CNN and Transformer models. The CNN branches capture complementary local patterns at various temporal scales, while the Transformer models long-range dependencies across the refined feature sequence. This hybrid architecture goes beyond the CNN-only or Transformer-only structures found in most previous work by utilizing the advantages of both paradigms: local precision from CNNs and global context modeling from self-attention. The proposed framework was evaluated on a bearing dataset to demonstrate its effectiveness. The main contributions of this paper are listed below:

A hybrid multi-scale CNN combined with a time-series Transformer (HMSCT) model for bearing FD has been developed.
The proposed structure offers the advantages of both CNN and Transformer architectures.
The model captures both long-range dependencies and complementary local patterns at various temporal scales.

This paper is divided into multiple sections. Section 2 discusses details of the proposed methodology. Section 3 includes details of the experimental setup. The results and discussion are elaborated in Section 4. Section 5 provides the limitations and future scope of the proposed work. Section 6 concludes the proposed work.

2. Proposed Methodology

2.1. Convolutional Neural Networks (CNNs)

CNNs have gained significant attention in recent years owing to their ability to learn features automatically [30,31]. They are widely used in vibration-based FD because they can learn shift-invariant local features directly from raw time-series data. A 1-D convolution slides a kernel along the temporal axis. For an input signal, the convolutional filtering is defined as

$$h_{t} = \sum_{i=0}^{k-1} \sum_{c=1}^{C} w_{i,c}\, x_{t+i,c} + b \tag{1}$$

where $x$ denotes the input sequence, $C$ the number of channels, $k$ the kernel size, $w_{i,c}$ the learnable filter weights, and $b$ the bias term. This is followed by a nonlinear activation such as the rectified linear unit (ReLU), which enables the network to approximate nonlinear mappings:

$$z_{t} = \max(0, h_{t}) \tag{2}$$
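As a minimal sketch of Equations (1) and (2), the loop below evaluates the convolution index by index in plain Python. The function names, the toy impulse signal, and the difference-like kernel are illustrative assumptions, not values from the paper.

```python
def conv1d_valid(x, w, b=0.0):
    """Eq. (1): x is a [T][C] sequence, w is a [k][C] kernel; returns T - k + 1 values."""
    T, k, C = len(x), len(w), len(w[0])
    h = []
    for t in range(T - k + 1):
        acc = b
        for i in range(k):          # position inside the kernel
            for c in range(C):      # input channel
                acc += w[i][c] * x[t + i][c]
        h.append(acc)
    return h


def relu(h):
    """Eq. (2): element-wise max(0, h_t)."""
    return [max(0.0, v) for v in h]


# Hypothetical single-channel signal containing one impulse.
x = [[0.0], [0.0], [1.0], [0.0], [0.0]]
# Hypothetical kernel of size k = 3 that responds to a change in level.
w = [[1.0], [0.0], [-1.0]]

h = conv1d_valid(x, w)   # [-1.0, 0.0, 1.0]
z = relu(h)              # [0.0, 0.0, 1.0]
```

The negative response is clipped by the ReLU, so only the positions where the kernel pattern matches the signal remain active.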

EM vibration signals contain multiple frequency components owing to the rotating elements. These components span different temporal scales, and a single filter cannot efficiently capture the full range of fault-related patterns. The proposed HMSCT model therefore uses a multi-scale CNN that applies parallel convolutional filters with different receptive fields. For kernel sizes $s \in \{3, 5, 7\}$, the branch outputs are

$$z^{(s)} = \sigma\left(x * w^{(s)} + b^{(s)}\right) \tag{3}$$

where $\sigma$ denotes the nonlinear activation and $*$ the convolution of Equation (1). The smallest kernel captures fine detail, the medium kernel captures mid-range periodicity, and the largest kernel captures longer-range harmonic structure. Concatenating the branch outputs forms a multi-resolution feature space that is essential for efficient FD:

$$Z = \mathrm{Concat}\left(z^{(3)}, z^{(5)}, z^{(7)}\right) \tag{4}$$

Pooling operations (e.g., max pooling) highlight the most prominent fault signatures, reduce noise, and compress the temporal resolution. Batch normalization stabilizes training by normalizing the feature distributions at each layer.
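The multi-scale branch structure of Equations (3) and (4), followed by max pooling, can be sketched as below. The sketch assumes zero ("same") padding so that all branches keep the input length and can be concatenated per time step, one filter per branch, and fixed box kernels of ones in place of learned weights; none of these choices is stated in the paper.

```python
def conv1d_same(x, w, b=0.0):
    """Single-channel convolution with an odd-length kernel, zero-padded to keep len(x)."""
    pad = len(w) // 2
    xp = [0.0] * pad + list(x) + [0.0] * pad
    return [b + sum(w[i] * xp[t + i] for i in range(len(w))) for t in range(len(x))]


def relu(v):
    return [max(0.0, a) for a in v]


def multi_scale_features(x, kernels):
    """Eqs. (3)-(4): one branch per kernel, concatenated along the channel axis."""
    branches = [relu(conv1d_same(x, w)) for w in kernels.values()]
    return [list(step) for step in zip(*branches)]   # shape [T][number of branches]


def max_pool(Z, size=2):
    """Non-overlapping max pooling along time, applied to each channel separately."""
    return [[max(col) for col in zip(*Z[t:t + size])]
            for t in range(0, len(Z) - size + 1, size)]


# Hypothetical box kernels of sizes 3, 5 and 7 (a trained model would learn these).
kernels = {s: [1.0] * s for s in (3, 5, 7)}
x = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]   # toy signal with one impulse

Z = multi_scale_features(x, kernels)   # 8 time steps x 3 channels
pooled = max_pool(Z, size=2)           # 4 time steps x 3 channels
```

In this toy case the impulse activates the size-3 branch at three time steps, the size-5 branch at five, and the size-7 branch at seven, which illustrates how the receptive field widens with kernel size.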

2.2. Transformer and Attention Mechanisms

Transformers have gained significant attention since their introduction in 2017 [32]. They were originally proposed for natural language processing but have recently achieved strong performance in time-series analysis. In contrast to CNNs, which concentrate on local receptive fields, Transformers employ self-attention, which enables every time index to attend to every other index. This global context modeling is essential for capturing long-term temporal relationships. Given the input sequence $X$, the Transformer computes

$$Q = X W_{Q} \tag{5}$$

$$K = X W_{K} \tag{6}$$

$$V = X W_{V} \tag{7}$$

where $W_{Q}$, $W_{K}$, and $W_{V} \in \mathbb{R}^{d \times d_{k}}$ are learnable matrices. The attention score between time steps $i$ and $j$ is computed as

$$\alpha_{i,j} = \frac{Q_{i} \cdot K_{j}}{\sqrt{d_{k}}} \tag{8}$$

Scaled dot-product attention is then formulated as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V \tag{9}$$
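A minimal plain-Python sketch of Equations (8) and (9) is given below. The matrices `Q`, `K`, and `V` are small hand-picked examples chosen for illustration; in the model they would come from the learned projections in Equations (5) to (7).

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]


def attention(Q, K, V):
    """Eqs. (8)-(9): returns the attended output and the attention weights."""
    d_k = len(K[0])
    # Eq. (8): scaled dot product between every query row and every key row.
    scores = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d_k) for kj in K]
              for qi in Q]
    weights = [softmax(row) for row in scores]
    # Eq. (9): each output row is a weighted average of the value rows.
    out = [[sum(w * v[c] for w, v in zip(wrow, V)) for c in range(len(V[0]))]
           for wrow in weights]
    return out, weights


# Hypothetical toy inputs: 2 queries, 3 keys/values, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

out, weights = attention(Q, K, V)
```

Each row of `weights` sums to one, and the first query places more weight on the first key (which it matches) than on the second key (which is orthogonal to it).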

The softmax converts the similarity scores into a probability distribution that indicates the importance of each position. Multi-head attention enhances representational power by projecting the input into $H$ different subspaces:

$$\mathrm{head}_{h} = \mathrm{Attention}\left(Q W_{Q}^{h}, K W_{K}^{h}, V W_{V}^{h}\right) \tag{10}$$

The head outputs are then aggregated:

$$\mathrm{MHA}(X) = \mathrm{Concat}\left(\mathrm{head}_{1}, \dots, \mathrm{head}_{H}\right) W_{O} \tag{11}$$

Each head captures different temporal relationships; for example, different heads can focus on repetitive bearing impacts, frequency-modulated harmonics, and long-term degradation patterns. The Transformer encoder block consists of multi-head self-attention and a feed-forward network, together with residual connections and layer normalization. This structure ensures stable gradient flow via residual learning, improved generalization, and nonlinear transformation via the feed-forward network. Moreover, the Transformer can model long-range dependencies without recurrence, making it highly suitable for vibration diagnostics where patterns span multiple temporal scales.
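The multi-head aggregation of Equations (10) and (11), wrapped in a residual connection and layer normalization, might look as follows. This is a sketch under several assumptions: each head projects the input directly with its own matrices (the common formulation), the weights are random rather than trained, normalization is applied after the residual sum, and the sizes `T`, `d`, and `H` are arbitrary.

```python
import math
import random


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]


def attention(Q, K, V):
    d_k = len(K[0])
    scores = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d_k) for kj in K]
              for qi in Q]
    return matmul([softmax(r) for r in scores], V)


def rand_matrix(rows, cols, rng):
    """Random stand-in for learned weights (illustrative only)."""
    return [[rng.gauss(0.0, 1.0 / math.sqrt(rows)) for _ in range(cols)]
            for _ in range(rows)]


def multi_head_attention(X, heads, W_O):
    """Eqs. (10)-(11); heads is a list of (W_Q, W_K, W_V), each of shape d x d_k."""
    outs = [attention(matmul(X, Wq), matmul(X, Wk), matmul(X, Wv))
            for Wq, Wk, Wv in heads]
    concat = [sum((o[t] for o in outs), []) for t in range(len(X))]   # T x (H * d_k)
    return matmul(concat, W_O)


def layer_norm(row, eps=1e-6):
    mu = sum(row) / len(row)
    var = sum((v - mu) ** 2 for v in row) / len(row)
    return [(v - mu) / math.sqrt(var + eps) for v in row]


def attention_sublayer(X, heads, W_O):
    """Residual connection followed by layer normalization (ordering is an assumption)."""
    A = multi_head_attention(X, heads, W_O)
    return [layer_norm([x + a for x, a in zip(xr, ar)]) for xr, ar in zip(X, A)]


rng = random.Random(0)
T, d, H = 6, 8, 2            # hypothetical sequence length, model width, head count
d_k = d // H
X = rand_matrix(T, d, rng)
heads = [tuple(rand_matrix(d, d_k, rng) for _ in range(3)) for _ in range(H)]
W_O = rand_matrix(H * d_k, d, rng)

Y = attention_sublayer(X, heads, W_O)   # same shape as X: T x d
```

The output keeps the input shape, which is what allows encoder blocks to be stacked.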

2.3. Hybrid Multi-Scale CNN Transformer (HMSCT) Model

The proposed HMSCT model combines the strengths of CNNs and Transformers in a unified architecture capable of handling the complex nature of EM vibration signals. These signals exhibit highly complex temporal patterns that originate from different mechanical components and span a wide spectrum of frequencies. Conventional CNN models have trouble learning long-range dependencies, but they are good at capturing localized features such as rapid impulses or temporally localized oscillations. Transformers, on the other hand, are very good at modeling global temporal dependencies, but they lack the inductive bias needed to extract fine-grained local structures effectively. By combining a temporal Transformer encoder with a multi-scale CNN feature extractor, the HMSCT fills this gap.

In the first stage of the HMSCT, known as multi-scale convolution, the input vibration segment is processed by several parallel 1-D CNN filters with different kernel sizes (e.g., 3, 5, 7). This architecture allows the model to capture fault signatures at different temporal scales. Larger kernels identify wider harmonic motions, whereas smaller kernels identify brief transient events, such as the impact from a localized bearing defect. A multi-resolution feature representation is created by concatenating the outputs of the convolutional branches. Pooling and batch normalization refine these features by reducing noise, compressing temporal fluctuations, and stabilizing the training process.

In the second stage, a linear projection layer is applied to match the dimensionality of the fused multi-scale CNN output to the Transformer input requirements. This produces a structured feature embedding in which each time step carries the rich local context extracted by the CNN front-end. The third stage consists of one or several Transformer encoder blocks. Within each block, multi-head self-attention calculates pairwise correlations between all time steps in the sequence, enabling the model to capture global vibration characteristics such as long-range periodicity, repetitive fault impulses, and modulation patterns caused by different loads. By dynamically assigning importance to different temporal positions, the attention mechanism allows the Transformer to suppress irrelevant noise and concentrate on the informative parts. The feed-forward network in each encoder block applies a further nonlinear transformation to the features, while layer normalization and residual connections support steady gradient propagation and reliable optimization.

Finally, a global average pooling layer summarizes the long-term behavior of the vibration sequence by aggregating the temporal features into a fixed-length feature vector. A fully connected softmax classifier then produces the predicted motor health state, such as normal operation or a particular bearing defect. The HMSCT thus combines the long-range dependency modeling of Transformers with the localized feature learning capacity of CNNs, which is intended to provide good generalization across motor speeds and load conditions, high precision, and robustness to noise. Its hybrid design makes it well suited to practical industrial applications. The overall block diagram of the proposed methodology is given in Figure 1, and the architecture of the proposed HMSCT model is given in Figure 2.
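To make the data flow through the stages concrete, the helper below traces the tensor shape of one input segment through the pipeline. It is only a bookkeeping sketch: the window length, filter count, pooling factor, model width, and number of classes are hypothetical defaults, since the paper does not list these hyperparameters, and zero ("same") padding is assumed in the convolutional branches.

```python
def hmsct_shapes(window_len, n_filters=32, kernel_sizes=(3, 5, 7),
                 pool=4, d_model=64, n_classes=4):
    """Return (stage name, output shape) pairs for one input segment.

    All default values are illustrative assumptions, not the paper's settings.
    """
    channels = n_filters * len(kernel_sizes)   # branches concatenated on the channel axis
    pooled_len = window_len // pool            # pooling shortens the sequence
    return [
        ("input segment", (window_len, 1)),
        ("multi-scale conv + concat", (window_len, channels)),
        ("max pooling", (pooled_len, channels)),
        ("linear projection", (pooled_len, d_model)),
        ("Transformer encoder", (pooled_len, d_model)),   # attention preserves the shape
        ("global average pooling", (d_model,)),
        ("softmax classifier", (n_classes,)),
    ]


for name, shape in hmsct_shapes(1024):
    print(f"{name:28s} {shape}")
```

The trace shows that the Transformer operates on the pooled sequence rather than on the raw samples, and that global average pooling removes the time axis before classification.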

3. Experimental Setup

The proposed architecture was evaluated on the Case Western Reserve University (CWRU) bearing dataset [33]. The vibration data sampled at 12,000 samples/second were used for the analysis. Four bearing states were considered: outer race fault, inner race fault, ball defect, and normal. The test setup (Figure 3) comprises a 2 hp motor (left), a torque transducer/encoder (center), a dynamometer (right), and control electronics [33]. Electro-discharge machining (EDM) was used to introduce defects into the motor bearings. Faults with diameters ranging from 0.007 inches to 0.040 inches were introduced separately at the inner raceway, rolling element (ball), and outer raceway. After the defective bearings were installed in the test motor, vibration data were acquired for motor loads of 0 to 3 horsepower (motor speeds of 1797 to 1720 RPM). The drive-end accelerometer data were considered for the analysis. The complete analysis was carried out on a computer with an AMD Ryzen AI processor, 24 GB of RAM, and an RTX 5050 GPU.

4. Results and Discussion

4.1. Results

The proposed HMSCT model was evaluated on the CWRU bearing dataset. The dataset contains vibration signals collected from the drive end under different operating conditions and fault types, including normal (N), fault in the inner race (FIR), fault in the outer race (FOR), and fault in the bearing ball (FBB) with varying fault severities. Raw signals were segmented using a sliding window approach with a fixed window length and overlap to generate sufficient training samples. To ensure that all fault classes were fairly represented, the dataset was split into training and testing sets using a stratified split. The HMSCT model was trained using the Adam optimizer with categorical cross-entropy loss and label smoothing. Early stopping and learning rate scheduling were used to ensure steady convergence and avoid overfitting. The training and validation accuracy and loss curves acquired during model training are shown in Figure 4 and Figure 5, respectively. The model achieved an average accuracy of more than 98%. In the early epochs, the HMSCT model shows quick convergence and good classification accuracy. The validation accuracy closely matches the training accuracy, which indicates strong generalization ability. The steady and smooth validation curves support the efficacy of the proposed regularization technique, which incorporates dropout, L2 weight regularization, label smoothing, and adaptive learning rate reduction. The HMSCT model performs consistently during training, in contrast to many deep CNN-based models that exhibit oscillatory validation behavior on the CWRU dataset.
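The segmentation and splitting steps described above can be sketched as follows. The window length, step, test fraction, and helper names are hypothetical, because the paper does not report the values it used.

```python
import random


def sliding_windows(signal, window, step):
    """Cut a 1-D signal into fixed-length segments; overlap = window - step samples."""
    return [signal[i:i + window] for i in range(0, len(signal) - window + 1, step)]


def stratified_split(labels, test_fraction, seed=0):
    """Split sample indices so that each class keeps the same test proportion."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test += idxs[:n_test]
        train += idxs[n_test:]
    return sorted(train), sorted(test)


# Toy signal of 10 samples, window of 4 with 50% overlap (step of 2).
windows = sliding_windows(list(range(10)), window=4, step=2)

# Toy label list with two balanced classes and a hypothetical 20% test share.
labels = ["N"] * 10 + ["FIR"] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
```

When windows overlap, neighbouring segments share samples, so splitting after segmentation can place near-identical segments in both sets; the paper does not state how this was handled.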

Figure 6 displays the normalized confusion matrix for the test dataset. The HMSCT model achieves almost flawless classification performance for every bearing state. Normal operating conditions are detected with very high precision, and the fault classes (inner race, outer race, and ball defects) show very little misclassification. Owing to overlapping frequency components in the CWRU dataset, there is a slight degree of confusion between some fault types with comparable vibration characteristics. Nevertheless, the overall low misclassification rate demonstrates the model's capacity to extract discriminative fault features. To obtain the best result, the designed framework was trained and tested several times. Additionally, performance indicators such as precision (p), recall (r), and F1 score (F1) were computed for a more detailed analysis. The consistently high values of p, r, and F1 in Table 1 indicate that the model performs well.
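For reference, per-class precision, recall, and F1 score can be derived from a confusion matrix as sketched below. The counts in `cm` are made up for illustration and are not the values behind Figure 6 or Table 1.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of samples of true class i predicted as class j."""
    n = len(cm)
    metrics = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp   # predicted c, true class differs
        fn = sum(cm[c]) - tp                        # true c, predicted otherwise
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        metrics.append((p, r, f1))
    return metrics


def normalize_rows(cm):
    """Row-normalized confusion matrix: each row sums to 1 (per-class recall on the diagonal)."""
    return [[v / sum(row) if sum(row) else 0.0 for v in row] for row in cm]


# Hypothetical counts for four classes (rows: true class, columns: predicted class).
cm = [
    [50, 0, 0, 0],
    [0, 48, 2, 0],
    [0, 1, 49, 0],
    [0, 0, 0, 50],
]

metrics = per_class_metrics(cm)
cm_norm = normalize_rows(cm)
```

In this toy matrix the second class has recall 48/50 = 0.96 and precision 48/49, because one sample of the third class is predicted as the second.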

The high diagonal dominance of the confusion matrix shows that the HMSCT model successfully separates the bearing states in the learned feature space. To further evaluate the classification capacity of the proposed model, a multi-class Receiver Operating Characteristic (ROC) curve (Figure 7) was created for every bearing state. The consistently high area under the curve (AUC) values for all classes indicate excellent sensitivity and specificity. This ROC performance demonstrates the robustness of the HMSCT model in differentiating between healthy and defective bearing states, even under noisy training conditions. This is especially crucial for real-world prognostics and health management (PHM) systems, where false alarms and missed detections must be minimized. The proposed HMSCT model is also compared with other methods. Table 2 compares the accuracy of the proposed HMSCT model with that of other established models that employ different approaches, including a hierarchical CNN (HCNN) [34], an adaptive deep CNN (ADCNN) [35], a deep belief network (DBN) [36], multidomain feature extraction with an SVM (MFE-SVM) [37], a multi-channel CNN (MCNN) [38], and an autoencoder CNN (AE-CNN) [39]. Table 2 shows that the proposed model achieves the best performance among the compared methods.
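A multi-class ROC analysis is commonly computed one class against the rest. The sketch below assumes that scheme and computes each AUC through its rank interpretation, namely the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. The probabilities and labels are invented for illustration, and the paper does not specify how its curves were produced.

```python
def binary_auc(scores, is_positive):
    """AUC as P(score of a positive > score of a negative); ties count as 0.5."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(probs, labels, n_classes):
    """One AUC per class, treating that class as positive and all others as negative."""
    return [binary_auc([row[c] for row in probs], [y == c for y in labels])
            for c in range(n_classes)]


# Hypothetical softmax outputs for five samples and three classes.
probs = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.3, 0.4],
    [0.1, 0.2, 0.7],
]
labels = [0, 0, 1, 1, 2]

aucs = one_vs_rest_auc(probs, labels, n_classes=3)
```

The pairwise comparison is quadratic in the number of samples, which is acceptable for a small illustration; a sort-based computation would be preferable on a full test set.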

4.2. Discussion

The experimental findings demonstrate the high efficacy of the proposed HMSCT model for detecting bearing faults on the CWRU dataset. The combination of Transformer encoders and multi-scale CNNs allows the model to capture both long-range temporal patterns and localized fault impulses. Its robustness is further enhanced by noise-aware training, and the regularization strategy supports stable training and robust generalization. From the perspective of training behavior, the learning curves in Figure 4 and Figure 5 show close alignment between the training and validation accuracy and quick convergence during the first epochs. This behavior demonstrates strong generalization capacity and indicates that the proposed regularization strategy, which combines dropout, L2 weight regularization, label smoothing, early stopping, and adaptive learning rate scheduling, effectively reduces overfitting. The HMSCT model's discriminative ability is further supported by the normalized confusion matrix in Figure 6. All bearing health states are classified accurately. In particular, the accurate classification of the FIR and normal conditions demonstrates the model's capacity to distinguish between healthy and defective operating conditions. The precision, recall, and F1 score values in Table 1 offer a thorough evaluation of its performance. The consistently high values across all fault categories show balanced classification behavior and confirm that the model does not favor dominant classes at the expense of minority fault types. This is especially important for real-world condition monitoring systems, since failing to detect uncommon but crucial fault events could have severe operational consequences.

Although the experimental evaluation is carried out on the CWRU bearing dataset, the proposed HMSCT architecture is not dataset-specific and is intended to generalize to various fault diagnostic scenarios. Because the model works directly on raw vibration signals without manually designed features, it can be adapted to other datasets with different sampling frequencies, noise levels, and operating conditions. The multi-scale CNN module captures fault characteristics over several frequency bands, while the Transformer encoder models the long-range temporal dependencies typical of rotating machinery signals. When combined with strong regularization and noise-aware training, the HMSCT can be extended to other bearing and motor fault diagnosis tasks with minimal alterations. Overall, the experimental results show that the proposed HMSCT model is accurate, resilient, stable, and interpretable in terms of fault separability. These qualities support its usefulness for real-world industrial condition monitoring and predictive maintenance applications, where accurate defect identification under various noise levels and operating conditions is crucial.

5. Limitations and Future Work

Despite its excellent performance, the proposed HMSCT model has a number of limitations that require further research. The current investigation focuses on the CWRU bearing dataset, which is extensively used but was obtained under controlled laboratory conditions. The CWRU dataset does not adequately capture the diverse noise sources, varying speeds, and fluctuating loads that are common in real industrial contexts. To evaluate the HMSCT model's capacity for generalization in practical situations, further research should test it on more varied and realistic datasets, such as those with variable-speed and variable-load settings. Because it includes Transformer encoder blocks, the HMSCT architecture has higher computational complexity than CNN-only models. Although the multi-scale CNN front-end shortens the sequence and thereby reduces this overhead, further optimization is necessary for real-time or resource-constrained applications. Future studies could investigate sparse attention, lightweight attention mechanisms, or model compression methods such as knowledge distillation and pruning. Furthermore, the model's decision-making process is still mostly opaque, even though attention mechanisms offer some interpretability. To increase transparency and reliability, particularly for safety-critical industrial applications, future research could use explainable AI techniques such as saliency-based approaches or attention visualization. Lastly, the current approach assumes a fixed window length for input segmentation. To further increase sensitivity to defects occurring at various temporal scales, adaptive or multi-resolution windowing techniques should be investigated. In summary, although the HMSCT model exhibits high diagnostic performance and robustness, future research should concentrate on increasing its computational efficiency, extending the model to prognostic tasks, improving its interpretability, and validating its performance under realistic operating conditions.
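The effect of sequence shortening on attention cost can be illustrated with simple arithmetic: the attention score matrix has one entry per pair of time steps, so its size grows with the square of the sequence length. The window length of 1024 and the pooling factor of 4 below are hypothetical values chosen for illustration.

```python
def attention_score_entries(seq_len, n_heads=1):
    """Number of pairwise attention scores computed per encoder block."""
    return n_heads * seq_len * seq_len


raw = attention_score_entries(1024)          # attention over every raw sample
pooled = attention_score_entries(1024 // 4)  # after a hypothetical pooling factor of 4
reduction = raw / pooled                     # 16.0: a factor-4 pooling cuts scores 16-fold
```

Under these assumptions, pooling by a factor of p reduces the number of attention scores by a factor of p squared, which is why the convolutional front-end lowers the Transformer overhead without removing it.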

6. Conclusions

This study presented a hybrid multi-scale CNN Transformer (HMSCT) model for vibration-based bearing fault diagnosis, aiming to overcome the limitations of conventional CNN-only and Transformer-only approaches. By combining multi-scale convolutional feature extraction with Transformer-based global temporal modeling, the proposed approach captures both the long-range temporal dependencies present in vibration signals and localized fault-induced impulses. Because the multi-scale CNN front-end extracts discriminative features across several temporal resolutions, the model is sensitive to a variety of fault characteristics, from short-duration impact events to long-term harmonic patterns. The Transformer encoder further improves the feature representation by using multi-head self-attention to learn global relationships across temporal segments, which enables the model to focus dynamically on informative parts of the vibration signal. This combination produces a strong and expressive representation that is well suited to diagnosing bearing faults. According to the experimental evaluation on the CWRU bearing dataset, the HMSCT model shows high classification accuracy, steady training behavior, and strong generalization performance across the bearing fault categories. To sum up, the proposed HMSCT architecture presents a viable path for predictive maintenance and advanced condition monitoring systems. Its hybrid design, noise resistance, and robust diagnostic performance make it a candidate for practical industrial applications.

Funding

This research was funded by Woosong University Academic Research 2025.

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

The author declares no conflicts of interest.

References

Pei, T.; Zhang, H.; Hua, W.; Zhang, F. Comprehensive Review of Bearing Currents in Electrical Machines: Mechanisms, Impacts, and Mitigation Techniques. Energies 2025, 18, 517. [Google Scholar] [CrossRef]
Kumar, P.; Hati, A.S. Review on Machine Learning Algorithm Based Fault Detection in Induction Motors. Arch. Comput. Methods Eng. 2021, 28, 1929–1940. [Google Scholar] [CrossRef]
Peng, D.; Yazdanianasr, M.; Mauricio, A.; Verwimp, T.; Desmet, W.; Gryllias, K. Physics-Driven Cross Domain Digital Twin Framework for Bearing Fault Diagnosis in Non-Stationary Conditions. Mech. Syst. Signal Process. 2025, 228, 112266. [Google Scholar] [CrossRef]
Peng, D.; Desmet, W.; Gryllias, K. Reconstruction-Based Deep Unsupervised Adaptive Threshold Support Vector Data Description for Wind Turbine Anomaly Detection. Reliab. Eng. Syst. Saf. 2025, 260, 110995. [Google Scholar] [CrossRef]
Zhao, J.; Wang, W.; Huang, J.; Ma, X. A Comprehensive Review of Deep Learning-Based Fault Diagnosis Approaches for Rolling Bearings: Advancements and Challenges. AIP Adv. 2025, 15, 020702. [Google Scholar] [CrossRef]
Zhu, Z.; Lei, Y.; Qi, G.; Chai, Y.; Mazur, N.; An, Y.; Huang, X. A Review of the Application of Deep Learning in Intelligent Fault Diagnosis of Rotating Machinery. Measurement 2023, 206, 112346. [Google Scholar] [CrossRef]
Wang, L.; Wu, M. Research on Bearing Fault Diagnosis Based on Machine Learning and SHAP Interpretability Analysis. Sci. Rep. 2025, 15, 41242. [Google Scholar] [CrossRef]
Kumar, P. Transfer Learning for Induction Motor Health Monitoring: A Brief Review. Energies 2025, 18, 3823. [Google Scholar] [CrossRef]
Han, Y.; Zhang, F.; Li, Z.; Wang, Q.; Li, C.; Lai, P.; Li, T.; Teng, F.; Jin, Z. MT-ConvFormer: A Multitask Bearing Fault Diagnosis Method Using a Combination of CNN and Transformer. IEEE Trans. Instrum. Meas. 2025, 74, 3501816. [Google Scholar] [CrossRef]
Fang, B.; Hu, Y.; Zheng, G.; Zhang, X.; Xie, L. Multiscale Time-Frequency CNN-Transformer Model for Bearing Fault Diagnosis: A Comprehensive Feature Extraction Approach. Eng. Fail. Anal. 2026, 184, 110312. [Google Scholar] [CrossRef]
Li, H.; Wang, G.; Shi, N.; Li, Y.; Hao, W.; Pang, C. A Lightweight Multi-Angle Feature Fusion CNN for Bearing Fault Diagnosis. Electronics 2025, 14, 2774. [Google Scholar] [CrossRef]
Xu, M.; Yu, Q.; Chen, S.; Lin, J. Rolling Bearing Fault Diagnosis Based on CNN-LSTM with FFT and SVD. Information 2024, 15, 399.
Biswas, S.; Mamun, A.A.; Islam, M.S.; Bappy, M.M. Interpretable CNN Models for Computationally Efficient Bearing Fault Diagnosis Using Learnable Gaussian/Sinc Filters. Manuf. Lett. 2025, 44, 110–120.
Shi, L.; Liu, W.; You, D.; Yang, S. Rolling Bearing Fault Diagnosis Based on CEEMDAN and CNN-SVM. Appl. Sci. 2024, 14, 5847.
Saufi, M.S.R.M.; Isham, M.F.; Talib, M.H.A.; Zain, M.Z.M. Extremely Low-Speed Bearing Fault Diagnosis Based on Raw Signal Fusion and DE-1D-CNN Network. J. Vib. Eng. Technol. 2024, 12, 5935–5951.
Mitra, S.; Koley, C. Real-Time Robust Bearing Fault Detection Using Scattergram-Driven Hybrid CNN-SVM. Electr. Eng. 2024, 106, 3615–3625.
Zhang, Y.; Hua, J.; Zhang, D.; He, J.; Fang, X. Train Bearing Fault Diagnosis Based on Time–Frequency Signal Contrastive Domain Share CNN. IEEE Sens. J. 2024, 24, 33669–33681.
Mishra, R.K.; Choudhary, A.; Fatima, S.; Mohanty, A.R.; Panigrahi, B.K. Multi-Fault Diagnosis with Wavelet Assisted Stacked Image Fusion and Dual Branch CNN. Appl. Soft Comput. 2025, 176, 113183.
Shang, X.; Li, W.; Yuan, F.; Zhi, H.; Gao, Z.; Guo, M.; Xin, B. Research on Fault Diagnosis of UAV Rotor Motor Bearings Based on WPT-CEEMD-CNN-LSTM. Machines 2025, 13, 287.
Zhang, Q.; Ju, Z. Rolling Bearing Fault Diagnosis Based on 2D CNN and Hybrid Kernel Fuzzy SVM. Adv. Theory Simul. 2025, 8, 2400793.
Han, K.; Wang, W.; Guo, J. Research on a Bearing Fault Diagnosis Method Based on a CNN-LSTM-GRU Model. Machines 2024, 12, 927.
Wei, L.; Peng, X.; Cao, Y. Enhanced Fault Diagnosis of Rolling Bearings Using an Improved Inception-LSTM Network. Nondestruct. Test. Eval. 2025, 40, 3274–3293.
Djaballah, S.; Saidi, L.; Meftah, K.; Hechifa, A.; Bajaj, M.; Zaitsev, I. A Hybrid LSTM Random Forest Model with Grey Wolf Optimization for Enhanced Detection of Multiple Bearing Faults. Sci. Rep. 2024, 14, 23997.
Salunkhe, V.G.; Khot, S.M.; Yelve, N.P.; Jagadeesha, T.; Desavale, R.G. Rolling Element Bearing Fault Diagnosis by the Implementation of Elman Neural Networks With Long Short-Term Memory Strategy. J. Tribol. 2025, 147, 084301.
Kumar, P.; Raouf, I.; Song, J.; Prince; Kim, H.S. Multi-Size Wide Kernel Convolutional Neural Network for Bearing Fault Diagnosis. Adv. Eng. Softw. 2024, 198, 103799.
Chang, X.; Yang, S.; Li, S.; Gu, X. Rolling Element Bearing Fault Diagnosis Based on Multi-Objective Optimized Deep Auto-Encoder. Meas. Sci. Technol. 2024, 35, 096007.
Lu, H.; Zhou, K.; He, L. Bearing Fault Vibration Signal Denoising Based on Adaptive Denoising Autoencoder. Electronics 2024, 13, 2403.
Hou, P.; Zhang, J.; Jiang, Z.; Tang, Y.; Lin, Y. A Bearing Fault Diagnosis Method Based on Dilated Convolution and Multi-Head Self-Attention Mechanism. Appl. Sci. 2023, 13, 12770.
Wang, C.; Wang, M. A Fault Diagnosis Method for Rotating Machinery Based on Spatiotemporal Feature Fusion. J. Mech. Sci. Technol. 2025, 39, 4389–4405.
Prince; Yoon, B.; Kumar, P. Enhanced Fault Diagnosis of Drive-Fed Induction Motors Using a Multi-Scale Wide-Kernel CNN. Mathematics 2025, 13, 2963.
Raouf, I.; Kumar, P.; Kim, H.S. Deep Learning-Based Fault Diagnosis of Servo Motor Bearing Using the Attention-Guided Feature Aggregation Network. Expert Syst. Appl. 2024, 258, 125137.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
Welcome to the Case Western Reserve University Bearing Data Center Website|Case School of Engineering. Available online: https://engineering.case.edu/bearingdatacenter/welcome (accessed on 11 December 2025).
Lu, C.; Wang, Z.; Zhou, B. Intelligent Fault Diagnosis of Rolling Bearing Using Hierarchical Convolutional Network Based Health State Classification. Adv. Eng. Inform. 2017, 32, 139–151.
Guo, X.; Chen, L.; Shen, C. Hierarchical Adaptive Deep Convolution Neural Network and Its Application to Bearing Fault Diagnosis. Measurement 2016, 93, 490–502.
Gan, M.; Wang, C.; Zhu, C. Construction of Hierarchical Diagnosis Network Based on Deep Learning and Its Application in the Fault Pattern Recognition of Rolling Element Bearings. Mech. Syst. Signal Process. 2016, 72–73, 92–104.
Pacheco-Chérrez, J.; Fortoul-Díaz, J.A.; Cortés-Santacruz, F.; María Aloso-Valerdi, L.; Ibarra-Zarate, D.I. Bearing Fault Detection with Vibration and Acoustic Signals: Comparison among Different Machine Leaning Classification Methods. Eng. Fail. Anal. 2022, 139, 106515.
Wang, M.-H.; Lu, S.-D.; Hsieh, C.-C.; Hung, C.-C. Fault Detection of Wind Turbine Blades Using Multi-Channel CNN. Sustainability 2022, 14, 1781.
Liu, X.; Zhou, Q.; Zhao, J.; Shen, H.; Xiong, X. Fault Diagnosis of Rotating Machinery under Noisy Environment Conditions Based on a 1-D Convolutional Autoencoder and 1-D Convolutional Neural Network. Sensors 2019, 19, 972.

Figure 1. A block diagram of the proposed methodology.

Figure 2. The architecture of the proposed HMSCT model.

Figure 3. Test setup for emulating bearing faults and data acquisition.

Figure 4. A training and validation accuracy curve for the proposed HMSCT model.

Figure 5. A training and validation loss curve for the proposed HMSCT model.

Figure 6. The normalized confusion matrix for the proposed HMSCT model.

Figure 7. The ROC curve for the proposed model.

Table 1. Performance indices (%) of the proposed HMSCT model.

Metric \ State	FIR	FOR	FBB	N
Precision (p)	100	97.57	99.17	100
Recall (r)	100	99.18	97.53	100
F1-score (F1)	100	98.37	98.34	100
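
The F1 row of Table 1 can be cross-checked against the p and r rows, assuming that p and r denote per-class precision and recall and that F1 is their harmonic mean (the standard definition; the table itself does not restate it). The sketch below is only an illustrative consistency check: the function name `f1_score`, the dictionary layout, and the tolerance are assumptions introduced here, not part of the reported method.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, in the same units as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values (%) copied from Table 1 as (p, r, reported F1).
# The dictionary layout is an illustrative choice, not the author's data format.
table1 = {
    "FIR": (100.0, 100.0, 100.0),
    "FOR": (97.57, 99.18, 98.37),
    "FBB": (99.17, 97.53, 98.34),
    "N": (100.0, 100.0, 100.0),
}


def check_table(rows, tol=0.005):
    """Return, per state, whether the recomputed F1 matches the reported one."""
    return {
        state: abs(round(f1_score(p, r), 2) - reported) <= tol
        for state, (p, r, reported) in rows.items()
    }


if __name__ == "__main__":
    for state, ok in check_table(table1).items():
        p, r, reported = table1[state]
        print(f"{state}: recomputed={f1_score(p, r):.2f} reported={reported:.2f} consistent={ok}")
```

Under these assumptions, the recomputed values agree with the reported F1 entries to two decimal places for all four states.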

Table 2. A comparison of the proposed method with other methods.

Methods	Mean Accuracy (%)
HMSCT	99.15
HCNN	92.60
ADCNN	98.1
DBN	99.03
MFE-SVM	96.66
MCNN	87.8
AE-CNN	92.24
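
To make the comparison in Table 2 easier to read, the margin of HMSCT over each baseline can be computed directly from the reported mean accuracies. The following is a minimal sketch; the names `accuracy`, `margins_over`, and `ranked` are hypothetical helpers introduced for illustration. The margins are plain arithmetic on the reported figures and say nothing about statistical significance or about whether the methods were evaluated under the same protocol.

```python
# Mean accuracies (%) copied from Table 2.
accuracy = {
    "HMSCT": 99.15,
    "HCNN": 92.60,
    "ADCNN": 98.1,
    "DBN": 99.03,
    "MFE-SVM": 96.66,
    "MCNN": 87.8,
    "AE-CNN": 92.24,
}


def margins_over(scores, proposed="HMSCT"):
    """Percentage-point margin of the proposed method over every other method."""
    ref = scores[proposed]
    return {
        name: round(ref - acc, 2)
        for name, acc in scores.items()
        if name != proposed
    }


# Methods sorted from highest to lowest reported mean accuracy.
ranked = sorted(accuracy, key=accuracy.get, reverse=True)

if __name__ == "__main__":
    margins = margins_over(accuracy)
    for name in ranked[1:]:
        print(f"HMSCT vs {name}: +{margins[name]:.2f} percentage points")
```

On the reported figures, the closest baseline is DBN and the largest margin is over MCNN.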


