Article

Analysis of Gearbox Bearing Fault Diagnosis Method Based on 2D Image Transformation and 2D-RoPE Encoding

1 College of Biomedical Engineering, Sichuan University, Chengdu 610065, China
2 Institute of Regulatory Science for Medical Devices, Sichuan University, Chengdu 610065, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(13), 7260; https://doi.org/10.3390/app15137260
Submission received: 26 May 2025 / Revised: 22 June 2025 / Accepted: 25 June 2025 / Published: 27 June 2025

Abstract

The stability of gearbox bearings is crucial to the operational efficiency and safety of industrial equipment, as their faults can lead to downtime, economic losses, and safety risks. Traditional models face difficulties in handling complex industrial time-series data due to insufficient feature extraction capabilities and poor training stability. Although transformers show advantages in fault diagnosis, their ability to model local dependencies is limited. To improve feature extraction from time-series data and enhance model robustness, this paper proposes an innovative method based on the Vision Transformer (ViT). Time-series data were converted into two-dimensional images using polar coordinate transformation and Gramian matrices to enhance classification stability. A lightweight front-end encoder and depthwise feature extractor, combined with multi-scale depthwise separable convolution modules, were designed to enhance fine-grained features, while two-dimensional rotary position encoding preserved temporal information and captured temporal dependencies. The constructed RoPE-DWTrans model implemented a unified feature extraction process, significantly improving cross-dataset adaptability and model performance. Experimental results demonstrated that the RoPE-DWTrans model achieved excellent classification performance on the combined MCC5 and HUST gearbox datasets. In the fault category diagnosis task, classification accuracy reached 0.953, with precision at 0.959, recall at 0.973, and an F1 score of 0.961; in the fault category and severity diagnosis task, classification accuracy reached 0.923, with precision at 0.932, recall at 0.928, and an F1 score of 0.928. Compared with existing methods, the proposed model showed significant advantages in robustness and generalization ability, validating its effectiveness and application potential in industrial fault diagnosis.

1. Introduction

As a core transmission component of large-scale industrial machinery, gearboxes operate under complex and variable environments with heavy loads, making them prone to various faults. Common fault types include surface fatigue phenomena, surface deterioration, cracks, permanent deformation, and overheating and surface damage caused by inadequate lubrication [1,2]. These faults not only severely affect the operational efficiency and lifespan of gearboxes but can also lead to equipment downtime, resulting in significant economic losses and posing safety risks to operators [3]. Studies have shown that the breakdown of the lubricating oil film is the primary cause of sliding wear, corrosion, scratches, and surface fatigue. Therefore, the proper selection of lubricant types, together with timely replacement and maintenance of the lubrication system, significantly reduces frictional wear and delays fatigue crack propagation, thereby extending the service life of gearboxes and their bearings [1,2]. Against this background, fault type prediction technology for gearbox bearings has become an important research direction in the field of industrial equipment health management. By deeply analyzing bearing operation data, potential fault types can be identified, improving the scientific basis and efficiency of equipment maintenance [4,5]. The application of fault prediction technology helps reduce unplanned downtime, lower maintenance costs, and improve production stability and economic benefits [6,7,8].
As industrial equipment operating conditions have become increasingly complex, deep learning-based fault prediction methods, known for their high accuracy and compatibility with diverse data types, have emerged as a core research direction, driving many researchers to adopt classic neural network models for fault diagnosis tasks [9,10]. Xu et al. [11] employed a deep learning approach based on the InceptionV3 model [12], incorporating contrast-limited adaptive histogram equalization preprocessing and the Squeeze and Excitation (SE) channel attention mechanism [13], combined with support vector machine [14] classification, to achieve high-precision detection of motor faults. Tung et al. [15] proposed a dual-pipeline deep learning model combining 1D convolutional neural networks (CNNs) [16] and recurrent neural networks (RNNs) [17], which extract spatial and temporal features of induction motor signals through a multi-head mechanism, effectively avoiding issues such as the CNN neglecting long-term dependencies. Lv et al. [18] proposed a high-performance rolling bearing fault diagnosis method using adaptive feature pattern decomposition and a transformer [19]. To address the issue of traditional feature pattern decomposition parameters being easily affected, they introduced the Scorpion Optimization Algorithm [20] to adaptively optimize key parameters. Luo et al. [21] proposed a transformer framework based on the Fast Fourier Transform (FFT) [22] for mechanical fault diagnosis. The FFT-Trans innovatively extended the transformer global information exchange mechanism from the time domain to the frequency domain, employing a Global Frequency Encoding Layer [23] to mine potential fault features. You et al. [24] proposed a deep learning method guided by sound-vibration physical-information fusion constraints, designing a lightweight transformer model. By constructing a bearing fault dynamics model and particle filter calibration parameters, they achieved the weighted fusion of multi-physical information from sound and vibration.
However, traditional models still face numerous challenges when handling time-series data, particularly limited feature extraction ability and training instability, which lead to suboptimal performance when dealing with complex and diverse industrial data. This is especially true for deep networks, where gradient issues can cause training failure or performance degradation [25,26]. Although transformers show certain advantages in fault diagnosis tasks, they have limitations in modeling local dependencies and struggle to efficiently learn local features [27,28]. Therefore, to better handle time-series data and improve model performance, converting time-series data into 2D images has become particularly necessary. This transformation effectively improves the feature extraction ability for time-series data and increases adaptability across different scenarios, thus addressing the stability issues of traditional models in industrial applications. Research has shown that converting time-series signals into 2D images is an effective strategy that significantly improves the generalization performance of deep learning methods [29,30,31,32,33]. Image-based fault diagnosis methods typically involve converting time-series signals into image forms and utilizing deep learning models to extract relevant fault features from them. Common signal transformation methods include the symmetric dot pattern [34], Gramian Angular Field (GAF) [35], and Markov Transition Field [36], which employ various mathematical transformations to convert time-series data into 2D images, enabling effective feature extraction and fault diagnosis. Tang [37] proposed a bearing fault diagnosis method based on a minimal unscented Kalman filter-assisted [38] deep belief network, converting multi-sensor signals into 2D feature maps using Gramian angular summation fields and combining adaptive noise scaling and minimal unscented transformation for dynamic parameter adjustment. Wang et al. [39] proposed an integrated deep learning network. The 1D channel processed raw one-dimensional data using Long Short-Term Memory (LSTM) [40] and multi-head self-attention mechanisms, while the 2D channel obtained two-dimensional images through the recurrence attention map method and extracted features. The feature information from both channels was combined through feature fusion methods. To address the limitations of traditional methods in bearing fault classification and to better extract features, this paper proposes a method based on the ViT [41] model that converts time-series data into 2D images for deep learning, aiming to enhance classification accuracy and model robustness.
This paper innovatively proposes a method combining polar coordinate transformation with the Gramian matrix to convert time-series data into two-dimensional images, thereby fully extracting data features and enhancing the stability of information representation, which significantly improves the reliability and robustness of fault classification. The designed lightweight depthwise feature extractor (DWFE) employs multi-scale depthwise separable convolution modules, which not only strengthen the capability to capture fine-grained features but also effectively reduce computational complexity through modular design. For the first time, two-dimensional rotary position encoding (2D-RoPE) [42] is introduced to preserve temporal dependencies during feature extraction, compensating for the shortcomings of traditional transformer models in modeling local dependencies, and a unified feature extraction RoPE-DWTrans model is constructed to address insufficient cross-dataset adaptability at the overall architecture level. Compared with methods based on InceptionV3 and SE channel attention mechanisms [11], this work combines a lightweight design with multi-scale feature enhancement techniques, significantly improving classification accuracy while effectively reducing computational resource consumption. Compared to methods modeled directly in the frequency domain [21,43], the proposed approach demonstrates stronger adaptability to complex working conditions and better generalization across data distributions through a unified feature extraction and two-dimensional image conversion process. By introducing 2D-RoPE, a close integration of time and frequency domains is achieved, showing higher efficiency and robustness in local feature modeling and global information fusion. Experimental results show that the proposed method exhibits excellent robustness and generalization capability on the combined gearbox datasets of MCC5 and HUST, validating its effectiveness and reliability in practical industrial scenarios.
The main contributions of this paper are as follows:
1. It explores the impact of different time segment extraction strategies on data feature extraction, revealing the balance between sampling intervals and segment lengths in feature information redundancy and model training effectiveness and providing a reference for selecting sampling strategies in practical engineering.
2. A lightweight front-end encoder, the DWFE, is designed, which enhances the fine-grained representation and global pattern description of the data at the feature level by incorporating multi-scale depthwise separable convolution modules, providing richer features for input into the ViT model.
3. An optimized self-attention module is proposed, which embeds relative position encoding into the query and key vectors through 2D-RoPE, allowing for more effective capture of relative positional relationships. At the same time, the ReZero mechanism is incorporated, enhancing the stability and convergence speed of training and thus improving the performance of the self-attention mechanism in visual tasks.
4. A general model is constructed that, through a unified feature extraction and processing workflow, is able to adapt to feature differences across different datasets, enabling cross-dataset fault diagnosis and prediction.

2. Materials and Methods

2.1. Materials

Dataset

(1)
MCC5 gearbox dataset
The experimental setup of the MCC5 gearbox dataset [44] included a 2.2 kW three-phase asynchronous motor as the main power source. The torque on the gearbox input shaft was measured using an S2001 precision torque sensor with an overall accuracy of ±0.5% F.S. A two-stage parallel gearbox and a magnetic powder brake were used to simulate realistic operating conditions. Data acquisition was performed using the CMS-ONE-DAQ 16 system, featuring a 24-bit analog-to-digital conversion accuracy. Meanwhile, TES 001 V model three-axis vibration acceleration sensors measured the vibration signals of the motor and the gearbox intermediate shaft at a sampling frequency of 12.8 kHz. The experimental environment temperature was strictly controlled within 20 °C. The recorded data included the motor output shaft key-phasor signal, input shaft torque signal, and three-axis vibration signals from both the motor and the gearbox, which were used to analyze and diagnose various fault modes of the 36-tooth gear on the intermediate shaft and its adjacent supporting bearings. The detailed specifications of the MCC5 gearbox dataset rolling bearing and gear were as follows: the rolling bearing model was ER16K, with an inner diameter of 1 inch, an outer diameter of 2.0472 inches, a width of 0.749 inches, ball diameter of 0.3125 inches, 9 balls in total, and a pitch diameter of 1.516 inches. The gear parameters included a module of 1.5, a tooth width of 10 mm, and 36 teeth, as shown in Figure 1.
This dataset contained vibration signals collected from a two-stage parallel gearbox under various working conditions. It included multiple states such as healthy, single-fault, and compound fault scenarios, making it suitable for fault diagnosis research under variable operating conditions. The recorded signals included tri-axial vibration acceleration signals from the motor and the gearbox intermediate shaft bearing seat, torque signals, and motor output shaft key-phase signals, with a sampling frequency of 12.8 kHz. The dataset covered 12 operating conditions, encompassing constant and time-varying speed and load scenarios. The rotational speed ranged from 1000 to 3000 rpm, and the load ranged from 10 to 20 Nm. Fault types included single faults such as “cracks” and “tooth fracture,” as well as compound faults like “teeth fracture with inner bearing fault” and “teeth fracture with outer bearing fault.” Each fault was further categorized into three severity levels: light, medium, and high. The dataset covered various operating conditions and provided abundant experimental data for different fault types.
(2)
HUST gearbox dataset
The experimental setup for the HUST gearbox dataset [45] utilized the Spectra-Quest mechanical fault simulator. The test bench was composed sequentially of a speed controller, motor, accelerometer, gearbox, and data acquisition board. The gearbox model used was the Hub City M2, with a transmission ratio of 1.5:1. Both the gear and pinion were made of forged steel, with pitch angles of 56°19′ and 33°41′, a pressure angle of 20°, and backlash tolerance controlled between 0.001 and 0.005 inches. The gear and pinion had 27 and 18 teeth, respectively, with a pinion pitch diameter of 1.125 inches and a gear pitch circle diameter of 1.6875 inches. The bearing configuration included one NSK 6202 deep-groove ball bearing for the pinion and two NSK 6205 deep-groove ball bearings for the gear. The experiment was conducted under 30 different operating conditions, covering 5 load levels (0 to 0.452 Nm) and 6 speed levels (20 to 40 Hz and a time-varying speed from 0 to 40 to 0 Hz). The data sampling frequency was 25.6 kHz, with each sample containing 262,144 data points, corresponding to approximately 10.2 s of vibration signal. The setup and acquisition parameters ensured high precision and reliability of the data.
The dataset included three typical operating conditions: normal state, tooth fracture, and tooth missing. Tooth fracture referred to partial breakage of a single gear tooth, resulting in discontinuity in the gear meshing surface and affecting transmission smoothness and stability. Tooth missing referred to the complete absence of a gear tooth, which could cause power interruptions or intermittent abnormalities during transmission. The gears and bearings used in the experiment were carefully selected based on the key performance indicators and operating conditions of actual industrial gearboxes, aiming to realistically simulate fault characteristics and mechanical behaviors under real-world conditions. Figure 2 illustrates the visualization of both the MCC5 and HUST gearbox datasets.

2.2. Methods

2.2.1. Overview

The proposed method is schematically illustrated in Figure 3, providing a clear overview of its fundamental framework and operational flow. It highlights the key stages, including data preprocessing, feature extraction, and classification. By transforming time-series data into two-dimensional images, the method integrates deep learning models and feature fusion techniques, effectively improving fault classification accuracy and enhancing model robustness.
The model first preprocesses the input raw time-series signals. Specifically, the signal values are normalized to the range [0, 1] and segmented and averaged using Piecewise Aggregate Approximation (PAA), reducing the impact of redundant information. Then, the model uses the GAF method to convert the time-series data into a two-dimensional image representation, including the Gramian Angular Summation Field (GASF) and Gramian Angular Difference Field (GADF) forms, to capture global features and patterns in the time-series data. The preprocessed data take the form of multi-channel images, laying the foundation for subsequent feature extraction. The preprocessed image data are first input into a lightweight convolutional module for feature extraction. This module consists of depthwise separable convolutions and multi-scale convolution kernels (3 × 3 and 5 × 5), which effectively reduce the number of parameters while extracting feature patterns at different receptive fields. Additionally, the H-swish activation function is used to enhance the model’s nonlinear expression capability, as shown in Figure 4. After the convolutional module, the data are mapped to a high-dimensional linear feature space through the Patch Embedding module, and class embedding and position encoding are introduced to enable the model to perceive category information and spatial relationships. The embedded features then enter the feature learning stage, where a self-attention mechanism based on Rotary Positional Embedding is used to capture long-range dependencies between sequence blocks. The multi-head attention mechanism computes information interactions in different subspaces in parallel, with each attention head independently capturing interaction patterns between different features. Furthermore, dynamic weight adjustment is optimized through the ReZero mechanism, which automatically adjusts the focus on features at different stages and captures important regions of the signal, thus enhancing the model’s ability to extract key features. By combining the lightweight CNN and the improved ViT architecture, the model maintains low computational complexity while extracting deep features from time-series data. Experiments demonstrate that the proposed method not only significantly improves the accuracy of gearbox fault diagnosis but also provides an effective solution for the intelligent analysis of time-series signals. The model pseudocode is shown in Algorithm 1.

2.2.2. Data Preprocessing

To standardize the processing workflow for different datasets, this paper applies consistent data preprocessing methods to the MCC5 and HUST gearbox datasets. Each sample in the MCC5 gearbox dataset contains 768,000 data points (corresponding to a sampling duration of 60 s at a sampling rate of 12.8 kHz), while each sample in the HUST gearbox dataset contains 256,001 data points (at a sampling rate of 25.6 kHz). During the data preprocessing phase, to explore the impact of different sampling intervals on fault signal feature extraction, two time-segment extraction strategies were designed: one extracted a 6 s segment every 4 s, and the other extracted a segment of the same length every 10 s. Based on the original 60 s duration of the signal (corresponding to 768,000 points), the 4 s extraction interval strategy could extract 14 sub-segments from each sample, and the 10 s extraction interval strategy could extract 6 sub-segments. In actual statistics, the MCC5 gearbox dataset (with 240 original CSV files) generates approximately 3360 samples using the 4 s extraction interval strategy and approximately 1440 samples using the 10 s extraction interval strategy; the HUST gearbox dataset (with 90 original TXT files) generates approximately 180 and 90 samples, respectively. The total numbers of samples generated by the two strategies are 3540 and 1530. By inputting the data constructed with different sampling intervals into the fault diagnosis model and performing experimental comparisons, this paper further analyzes the impact of sampling frequency and segment interval on the model’s feature extraction capability and fault identification performance, thereby providing a theoretical basis and practical reference for the selection of sampling strategies in time-series modeling in practical engineering, as shown in Table 1 and Figure 5.
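To make the two extraction strategies concrete, the short NumPy sketch below reproduces the segment counts quoted above for a 60 s MCC5 record; the helper name segment_starts is illustrative and not part of the released code.

```python
import numpy as np

def segment_starts(total_s: float, seg_s: float, step_s: float) -> list:
    """Start times (in seconds) of all full-length segments of length seg_s
    taken every step_s seconds from a record of total_s seconds."""
    starts = np.arange(0.0, total_s - seg_s + 1e-9, step_s)
    return starts.tolist()

fs = 12_800  # MCC5 sampling rate (Hz)
# Strategy 1: 6 s segment every 4 s -> 14 sub-segments per 60 s record
print(len(segment_starts(60, 6, 4)))   # 14
# Strategy 2: 6 s segment every 10 s -> 6 sub-segments per 60 s record
print(len(segment_starts(60, 6, 10)))  # 6
# Each 6 s segment holds 6 * 12,800 = 76,800 samples per channel
print(int(6 * fs))                     # 76800
```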
Algorithm 1 Applying proposed RoPE-DWTrans model to classification
Input: a set of dataset samples S = {(X1, label1, type1), …, (Xn, labeln, typen)}, where X represents the features of a GAF image, label is its class label (either 14 or 23 classes), and type is the image type (with two possible types). S is split into a training set (trainX, trainlabel, traintype), a validation set (valX, vallabel, valtype), and a testing set (testX, testlabel, testtype) in a ratio of 7.6:1.2:1.2. The number of learning epochs is denoted as M, and num denotes the number of hidden layers.
Output: the optimal Model and its classification statistics.
  • Load the Dataset
  • Begin
  • Initialize all weights and biases
  • For m = 1, 2, …, M do
  • Extract features through DWCNN model (DWFE) → XDWFE
  • Input XDWFE to Patch Embedding (patch size = 16, channels = 16)
  • for k in range num
  •    Ln1 = RMSNorm (XPE)
  •    Ln2 = 2D RoPE Attention (Ln1)
  •    Ln3 = RMSNorm (Ln2)
  •    Ln4 = MLP (Ln3)
  • Logit layers = Linear (Ln4)
  • return Logits
  • Model Fit (AdamW, (trainX, trainlabel, traintype)) → M(m)
  • Model Evaluate (M(m), (valX, vallabel, valtype)) → Racc(m)
  • End For
  • Save the optimal model which has max Racc in M epochs
  • End
  • Load the testing set
  • Load the optimal model in terms of classification performances
As shown in Figure 6, during the signal preprocessing phase, the data of each channel in each segment are first scaled to the range [0, 1] using the Min–Max normalization method. Then, the time-series data of each channel are reduced to a length of 256 using the PAA method. Next, the GASF and GADF methods are used to convert the data into 512 × 512 images. Finally, the GASF and GADF images corresponding to each channel are concatenated along the channel dimension, forming a 16 × 512 × 512 three-dimensional tensor, which is saved as an .npy file and used as input data for the subsequent deep learning model. The data preprocessing pseudocode is shown in Algorithm 2.
Algorithm 2 Data preprocessing
Input: Raw vibration data files in CSV format; N is the number of data rows in each CSV file.
Output: GAF image features in NPY format.
  • Global parameters initialization: scaler, paa, gasf, gadf, counter;
  • function get_gasf (X, sample_hz = 12,800, label = 0)
  •   X ← X [1:N]
  •   Xnorm ← scaler.fit_transform (X)
  •   sample_split ← [[i, i + 6] for i ∈ {0, 4, …, 52}]
  •   for sample ∈ sample_split do
  •     counter ← counter + 1
  •     (start, end) ← (sample [0] × sample_hz, sample [1] × sample_hz)
  •     segment ← Xnorm [start:end]
  •     for ch ∈ {0, 1, …, 7} do
  •       segment_ch ← segment[:, ch].reshape(1, −1)
  •       segment_red ← paa.fit_transform(segment_ch)
  •       _gasf ← gasf.fit_transform(segment_red)
  •       _gadf ← gadf.fit_transform(segment_red)
  •       gaf_features.append(concat(_gasf, _gadf, axis = 0))
  •     gaf_feature ← concat(gaf_features, axis = 0)
  •     np.save(npy_file, gaf_feature)
  • function get_files()
  •   return [f for f ∈ os.listdir(directory) if f.endswith(end_s)]
  • function read_csv_as_floats(file_path)
  •   with open(file_path, ‘r’) as file:
  •     for row ∈ reader do
  •        row ← row[:8]
  •        if len(row) < 8 then
  •          row ← row + [0.0] × (8 − len(row))
  •        data.append([float(x) for x ∈ row])
  •   return np.array(data)
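As a concrete, self-contained illustration of the preprocessing chain described above (Min–Max scaling, PAA reduction, and GASF/GADF imaging), the NumPy sketch below processes one synthetic 8-channel segment. It is a simplified re-implementation rather than the authors’ released code, and here the GAF image side simply equals the PAA output length (a GAF of a length-L series is L × L).

```python
import numpy as np

def minmax_scale(x: np.ndarray) -> np.ndarray:
    """Scale a 1-D signal to [0, 1]."""
    return (x - x.min()) / (x.max() - x.min() + 1e-12)

def paa(x: np.ndarray, out_len: int) -> np.ndarray:
    """Piecewise Aggregate Approximation: mean over out_len equal chunks."""
    idx = np.array_split(np.arange(len(x)), out_len)
    return np.array([x[i].mean() for i in idx])

def gaf(x01: np.ndarray):
    """Return (GASF, GADF) for a signal already scaled to [0, 1]."""
    phi = np.arccos(np.clip(x01, 0.0, 1.0))          # polar-coordinate angle
    gasf = np.cos(phi[:, None] + phi[None, :])       # cos(phi_i + phi_j)
    gadf = np.sin(phi[:, None] - phi[None, :])       # sin(phi_i - phi_j)
    return gasf, gadf

# One 6 s, 8-channel segment at 12.8 kHz -> (8, 76800); stack GASF/GADF per channel
segment = np.random.randn(8, 6 * 12_800)             # placeholder signal
imgs = []
for ch in segment:
    x = paa(minmax_scale(ch), 256)
    gasf, gadf = gaf(x)
    imgs.extend([gasf, gadf])
tensor = np.stack(imgs)                               # (16, 256, 256)
np.save("segment_gaf.npy", tensor.astype(np.float32))
```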

2.2.3. Depthwise Feature Extractor

To further enhance the model’s ability to model key fault features in time-series signals from complex operating conditions, the DWFE is introduced before the ViT encoding module in the transformer backbone structure. The inputs to the DWFE are the GAF images generated by merging the GASF- and GADF-processed operating condition data from the MCC and HUST gearbox datasets. The DWFE utilizes depthwise convolution [46] to perform local feature extraction and high-frequency enhancement on the GAF images, effectively capturing fine-grained feature representations of the input signals at different spatial scales. It also employs a channel dimensionality reduction mechanism to achieve feature compression and redundancy removal. The output of this module is used as the input to the ViT, significantly improving the transformer encoder’s global modeling capability for complex time-series information. On this basis, the ViT encoder further models the long-term dependencies in the input sequence using its global self-attention mechanism, enabling the collaborative learning of local and global features and enhancing the model’s ability to discern the operational state and potential fault behaviors of the equipment. The derivations of the following formulas follow [46].
As shown in Figure 7, the first layer of the CNN encoder module uses a 3 × 3 convolution to capture the local spatial features of the GAF image X ∈ RW×H×C, followed by a batch normalization layer to stabilize the training process.
$X_{gaf} = \mathrm{Hswish}(\mathrm{BN}(\mathrm{Conv}_{3\times3}(X_{gaf})))$

$\mathrm{Hswish}(x) = x \cdot \frac{\mathrm{ReLU6}(x+3)}{6}$

$\mathrm{ReLU6}(x) = \min(\max(0, x), 6)$
Hswish [47], a lightweight improvement based on the Swish function, is able to better capture the nonlinear characteristics of complex data compared to traditional activation functions. Hswish gradually approaches zero for smaller input values instead of being directly truncated to zero. This property makes it smoother than ReLU and more effective at handling gradient information.
To achieve effective multi-scale feature extraction while maintaining low computational overhead, a modular structure based on DW Conv was designed in this paper. Convolution kernel sizes of 3 × 3 and 5 × 5 were used to construct two feature extraction modules to capture feature information at different scales. The 3 × 3 convolution kernel excelled at extracting fine-grained local features, such as minor fluctuations and initial anomalies, while the 5 × 5 convolution kernel was better suited for capturing broad fault evolution trends and multidimensional interactive characteristics. The combination of the two enabled collaborative learning of local and global features.
$\mathrm{Block}_{3\times3}(X_i) = \mathrm{Conv}_{1\times1}(\mathrm{DWConv}_{3\times3}(\mathrm{Conv}_{1\times1}(X_i)))$

$\mathrm{Block}_{5\times5}(X_i) = \mathrm{Conv}_{1\times1}(\mathrm{DWConv}_{5\times5}(\mathrm{Conv}_{1\times1}(X_i)))$

$X_2 = \mathrm{Block}_{3\times3}(X_{gaf})$

$X_{out} = \mathrm{Block}_{5\times5}(X_2)$
Specifically, each module extracted spatial dimension features through 3 × 3 or 5 × 5 depthwise convolution, combined with batch normalization and ReLU to enhance the model’s stability and nonlinear representation capability. To further improve feature extraction efficiency and information fusion capability, 1 × 1 pointwise convolution was introduced before and after the depthwise convolution operation to perform feature channel upscaling and downscaling. This design not only effectively reduces the number of parameters and computational overhead but also significantly enhances the perception of features at different scales.
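A minimal PyTorch sketch of such a multi-scale depthwise separable front end is given below. The channel widths and the 16-channel input are illustrative assumptions, and the module is a simplified stand-in for the DWFE rather than its exact implementation.

```python
import torch
import torch.nn as nn

class DWSeparableBlock(nn.Module):
    """1x1 expand -> kxk depthwise -> 1x1 project, with BN and activations."""
    def __init__(self, in_ch, mid_ch, out_ch, k):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),           # pointwise up
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, k, padding=k // 2,
                      groups=mid_ch, bias=False),               # depthwise kxk
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),           # pointwise down
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        return self.block(x)

class DWFE(nn.Module):
    """Front end: 3x3 conv + Hardswish stem, then 3x3 and 5x5 depthwise blocks."""
    def __init__(self, in_ch=16, width=32):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, width, 3, padding=1, bias=False),
            nn.BatchNorm2d(width), nn.Hardswish(),
        )
        self.block3 = DWSeparableBlock(width, width * 2, width, k=3)
        self.block5 = DWSeparableBlock(width, width * 2, in_ch, k=5)

    def forward(self, x):                   # x: (B, 16, H, W) GAF tensor
        return self.block5(self.block3(self.stem(x)))

x = torch.randn(2, 16, 256, 256)
print(DWFE()(x).shape)                      # torch.Size([2, 16, 256, 256])
```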

2.2.4. Self-Attention with 2D RoPE

A.
Preliminaries
In the ViT, the input RGB image is assumed to be square, with both its width and height equal to $w$. The image is uniformly divided into fixed-size patches of $p \times p$.

$n_p = \frac{w}{p} \times \frac{w}{p}$

$n_p$ represents the total number of image patches obtained after dividing the entire image into patches of size $p \times p$. Each $p \times p \times c$ patch is flattened into a $c \cdot p^2$-dimensional vector and projected linearly into $\mathbb{R}^h$, resulting in an $h$-dimensional Patch Token vector.

In the ViT, an additional CLS Token is introduced. A zero-initialized $h$-dimensional classification token is prepended to the Patch Token sequence, extending the shape of the input matrix to $(1 + n_p) \times h$. A zero-initialized positional encoding is added to the matrix, keeping its dimensions unchanged at $(1 + n_p) \times h$.
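The following PyTorch sketch shows one conventional way to realize this patch embedding with a prepended CLS token and zero-initialized positional encoding; the image size, patch size, channel count, and embedding dimension are placeholder values rather than the settings used in the paper.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split a (B, C, w, w) image into p x p patches, project to h dims,
    then prepend a CLS token and add a learnable positional encoding."""
    def __init__(self, w=256, p=16, c=16, h=256):
        super().__init__()
        self.n_p = (w // p) ** 2
        self.proj = nn.Conv2d(c, h, kernel_size=p, stride=p)    # patchify + linear projection
        self.cls = nn.Parameter(torch.zeros(1, 1, h))           # zero-initialized CLS token
        self.pos = nn.Parameter(torch.zeros(1, 1 + self.n_p, h))

    def forward(self, x):
        tokens = self.proj(x).flatten(2).transpose(1, 2)        # (B, n_p, h)
        cls = self.cls.expand(x.size(0), -1, -1)                # (B, 1, h)
        return torch.cat([cls, tokens], dim=1) + self.pos       # (B, 1 + n_p, h)

x = torch.randn(2, 16, 256, 256)
print(PatchEmbed()(x).shape)   # torch.Size([2, 257, 256])
```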
The input $X \in \mathbb{R}^{(1+n_p) \times h}$ is mapped to query, key, and value vectors through linear transformations.

$Q = X W_Q \in \mathbb{R}^{(1+n_p) \times d_k}$

$K = X W_K \in \mathbb{R}^{(1+n_p) \times d_k}$

$V = X W_V \in \mathbb{R}^{(1+n_p) \times d_k}$

$W_Q, W_K, W_V \in \mathbb{R}^{h \times d_k}$ are learnable parameter matrices.

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$
SoftMax normalizes the matrix row-wise, where $d_k$ is the dimension of the key vectors. The above attention operation is repeated independently multiple times. Denoting each independent attention operation by $\mathrm{head}_i$, with $n$ attention heads in total, the output of the multi-head attention mechanism is as follows:

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_n) \cdot W_O$

$\mathrm{head}_i = \mathrm{Attention}(Q W_Q^{i}, K W_K^{i}, V W_V^{i})$

$W_Q^{i}, W_K^{i}, W_V^{i}$ are also learnable parameter matrices, used for the per-head projections in multi-head attention, and $W_O$ fuses the concatenated head outputs.
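For reference, a compact PyTorch implementation of the multi-head self-attention defined by these formulas is sketched below; the embedding dimension and head count are illustrative values only.

```python
import math
import torch
import torch.nn as nn

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for tensors shaped (..., seq, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ v

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, h=256, n_heads=8):
        super().__init__()
        self.n_heads, self.d_k = n_heads, h // n_heads
        self.w_q = nn.Linear(h, h)   # per-head projections fused into one matrix
        self.w_k = nn.Linear(h, h)
        self.w_v = nn.Linear(h, h)
        self.w_o = nn.Linear(h, h)   # output fusion W_O

    def forward(self, x):            # x: (B, 1 + n_p, h)
        B, N, _ = x.shape
        split = lambda t: t.view(B, N, self.n_heads, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        out = attention(q, k, v)                       # (B, heads, N, d_k)
        out = out.transpose(1, 2).reshape(B, N, -1)    # Concat(head_1, ..., head_n)
        return self.w_o(out)

x = torch.randn(2, 257, 256)
print(MultiHeadSelfAttention()(x).shape)   # torch.Size([2, 257, 256])
```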
B.
Two-dimensional Rotary Position Embedding
RoPE [48] introduces the multiplication mechanism of Euler’s formula $e^{i\theta}$ to embed relative position encoding directly into the query vector $q_n$ and the key vector $k_m$. Specifically, RoPE does not simply adjust the attention matrix after the query–key similarity calculation; instead, it acts directly on the query and key vectors, allowing the position encoding to take effect in the core of the similarity computation. The derivations of the following formulas follow [42]. Assuming the $n$-th query vector and the $m$-th key vector are $q_n, k_m \in \mathbb{R}^{1 \times d_{head}}$, RoPE is applied as

$q'_n = q_n e^{i n \theta}$

$k'_m = k_m e^{i m \theta}$
For images, a common method of extending one-dimensional position embeddings to two-dimensional embeddings is to apply embedding operations separately on each axis. This approach captures the positional relationships in the two-dimensional space by applying position encoding separately to the x-axis and y-axis. At the same time, representing the token’s positional information as a two-dimensional vector helps the model more accurately capture the spatial correlations of the input data.
$\mathbf{p}_n = (p_n^x, p_n^y)$

$p_n^x \in \{0, 1, \ldots, W\}$

$p_n^y \in \{0, 1, \ldots, H\}$

The range of the position indices $(p_n^x, p_n^y)$ in the spatial dimension needs to be scaled to ensure the validity of the frequencies. Therefore, the frequency $\theta_t$ in RoPE is scaled down accordingly by the square root. This scaling operation effectively adjusts the frequency scale to accommodate positional information in the two-dimensional space, thereby enhancing the model’s expressive power and generalization performance when processing two-dimensional inputs.

$\theta_t = 100^{-t/(d_{head}/4)}, \quad t \in \{0, 1, \ldots, d_{head}/4\}$
The axial frequency evenly divides the original embedding dimension into two parts, with one part used to represent the position on the x-axis and the other part used to represent the position on the y-axis.
$\mathbf{R}(n, 2t) = e^{i \theta_t p_n^x}$

$\mathbf{R}(n, 2t+1) = e^{i \theta_t p_n^y}$

Here, $\mathrm{Re}[\cdot]$ denotes taking the real part of the attention score to avoid the influence of the imaginary part, and $e^{i(\cdot)}$ represents the relative position rotation embedding introduced through the complex exponential form.
Furthermore, by introducing mixed frequencies, the limitations of axial frequencies are overcome.
$\mathbf{R}(n, t) = e^{i(\theta_t^x p_n^x + \theta_t^y p_n^y)}$
In the process of introducing relative positional information, the relative position is represented by the position difference $\mathbf{p}_n - \mathbf{p}_m$.

$\Delta\mathbf{p} = \mathbf{p}_n - \mathbf{p}_m = (p_n^x - p_m^x,\; p_n^y - p_m^y)$

$\mathbf{p}_m = (p_m^x, p_m^y)$

$p_n^x - p_m^x$ and $p_n^y - p_m^y$ represent the differences between positions $n$ and $m$ along the x-axis and y-axis, respectively. These differences are substituted into the rotation formula to model the relative positional information between the query vector and the key vector.

$\mathbf{R}(n, m, t) = e^{i(\theta_t^x (p_n^x - p_m^x) + \theta_t^y (p_n^y - p_m^y))}$
By applying the rotation encoding directly to the dot product between the query vector and the key vector, relative positional information is effectively introduced.
$q_n, k_m \in \mathbb{C}^{d_{head}}$

In this formula, $q_n \cdot k_m^{*}$ represents the dot product in the complex domain, where $k_m^{*}$ is the complex conjugate of the key vector and $\mathbb{C}$ denotes the field of complex numbers.
According to the relative position encoding principle of RoPE, the rotated positional information is integrated into the calculation of attention scores in a specific manner.
$A_{n,m} = \mathrm{Re}\!\left[ q_n \cdot k_m^{*} \cdot e^{i(\theta_t^x (p_n^x - p_m^x) + \theta_t^y (p_n^y - p_m^y))} \right]$

$\theta_t^x$ and $\theta_t^y$ control the frequency of positional differences along the x-axis and y-axis, respectively, with $t \in \{0, 1, \ldots, d_{head}/2\}$, where $d_{head}$ is the dimension of each head. By learning these parameters, the model is able to adapt to positional information along different axes, thereby improving its ability to model relative positional relationships in the input sequence.

By applying the SoftMax operation to each row of $A_{n,m}$, normalization is performed to obtain the final attention distribution $a_{n,m}$.

$a_{n,m} = \mathrm{Softmax}(A_{n,m}) = \frac{\exp(A_{n,m})}{\sum_{m'=1}^{M} \exp(A_{n,m'})}$
The normalized weights are used to weight the corresponding value vectors $v_m$. The 2D RoPE pseudocode is shown in Algorithm 3.

$z_n = \sum_{m=1}^{N} a_{n,m} \cdot v_m$

$\mathrm{MultiHeadOutput} = \mathrm{Concat}(z_n^{1}, z_n^{2}, \ldots, z_n^{n}) \cdot W_O$
Algorithm 3 The 2D relative position encoding
Input: Query vector qₙ, Key vector kₘ ∈ ℝ^{1 × d_head}, Position (xₙ, yₙ), (xₘ, yₘ)
Output: Attention score with 2D relative position encoding
  • θ_x ← [θ_{x, 1}, …, θ_{x, d_head/2}]
  • θ_y ← [θ_{y, 1}, …, θ_{y, d_head/2}]
  • function Apply_rotary_embedding (v, Δx, Δy, θ_x, θ_y):
  •    v_complex ← view_as_complex(v.reshape(−1, 2))
  •    freqs ← concatenate([θ_x, θ_y], dim = 0)
  •    rotation_angles ← Δx·θ_x + Δy·θ_y
  •    rotation_factors ← exp(i·rotation_angles)
  •    v_rotated ← v_complex ⊙ rotation_factors
  • return view_as_real(v_rotated).flatten()
  • function Compute_attention (qₙ, kₘ, (xₙ, yₙ), (xₘ, yₘ)):
  •   Δx ← xₙ − xₘ
  •   Δy ← yₙ − yₘ
  •   q_rotated ← Apply_rotary_embedding (qₙ, Δx, Δy, θ_x, θ_y)
  •   k_rotated ← Apply_rotary_embedding (kₘ, −Δx, −Δy, θ_x, θ_y)
  •   attention_score ← Re(q_rotated · k_rotated*)
  • return attention_score
  • function Multihead_attention (Q, K, V, pos_x, pos_y):
  •   for h in 1 to num_heads:
  •    for n in 1 to N:
  •      scores ← []
  •      for m in 1 to M:
  •        scores.append(Compute_attention (Q[h, n], K[h, m], (pos_x[n], pos_y[n]), (pos_x[m], pos_y[m])))
  •      attn_weights ← Softmax(scores)
  •      head_out ← ∑(attn_weights [m]·V [h, m] for m in 1 to M)
  •    outputs.append(head_out)
  • return concatenate(outputs, dim = −1)
The outputs of all heads were concatenated and processed through a linear transformation layer to generate the final output. The complete structure diagram is shown in Figure 8.
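The sketch below illustrates one way to realize axial-frequency 2D RoPE in PyTorch by rotating query and key vectors pairwise as complex numbers. The grid size and head dimension are placeholders, the CLS token (which carries no spatial position) is omitted for brevity, and the code is a simplified reading of the formulas above rather than the authors’ implementation.

```python
import torch

def axial_2d_rope_angles(H, W, d_head):
    """Rotation angle per token and per complex pair (axial frequencies):
    the first d_head/4 pairs rotate with the x coordinate, the rest with y."""
    t = torch.arange(d_head // 4, dtype=torch.float32)
    theta = 100.0 ** (-t / (d_head / 4))                  # theta_t, as in the text
    y, x = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    ang_x = x.reshape(-1, 1) * theta                      # (H*W, d_head/4)
    ang_y = y.reshape(-1, 1) * theta                      # (H*W, d_head/4)
    return torch.cat([ang_x, ang_y], dim=-1)              # (H*W, d_head/2)

def apply_2d_rope(v, angles):
    """Rotate q or k (shape (..., N, d_head)) by e^{i * angles} pairwise."""
    vc = torch.view_as_complex(v.float().reshape(*v.shape[:-1], -1, 2))
    rot = torch.polar(torch.ones_like(angles), angles)    # e^{i*angle}, (N, d/2)
    return torch.view_as_real(vc * rot).flatten(-2)

H = W = 16; d_head = 32
angles = axial_2d_rope_angles(H, W, d_head)               # (256, 16)
q = torch.randn(2, 8, H * W, d_head)                      # (B, heads, N, d_head)
k = torch.randn(2, 8, H * W, d_head)
q_rot, k_rot = apply_2d_rope(q, angles), apply_2d_rope(k, angles)
# The real dot product equals Re[q'_n . conj(k'_m)], so the scores depend only
# on the relative offsets (x_n - x_m, y_n - y_m).
scores = q_rot @ k_rot.transpose(-2, -1) / d_head ** 0.5
print(scores.shape)                                       # torch.Size([2, 8, 256, 256])
```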
C.
Residual with Zero initialization
The proposed framework uses the ReZero mechanism [49] to address the gradient vanishing and explosion problems at different layers. Traditional residual connections propagate information by directly adding the input x to the output after nonlinear transformation. In contrast, the ReZero method introduces a learnable scaling factor, allowing the network to rely solely on the input x during the early stages of training, thus avoiding excessive interference from initial residual information.
The residual connection formula for ReZero is as follows: for an input x, the output after processing by a sublayer is denoted as F(x).
$X_{\mathrm{Output}} = X_{\mathrm{Input}} + \alpha \times F(X_{\mathrm{Input}})$
α is a learnable scaling factor, initially set to 0 and gradually optimized during training. X is the input tensor, and F(⋅) represents a layer. This method is particularly suitable for deep neural networks, especially in the early stages of the network, as it effectively improves training stability and convergence speed.
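A minimal PyTorch sketch of a ReZero-style residual wrapper, assuming a single learnable scalar per block:

```python
import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    """x + alpha * F(x), with alpha a learnable scalar initialized to zero,
    so the block starts as the identity and the residual branch fades in."""
    def __init__(self, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return x + self.alpha * self.sublayer(x)

block = ReZeroBlock(nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256)))
x = torch.randn(2, 257, 256)
print(torch.allclose(block(x), x))   # True at initialization (alpha = 0)
```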

3. Results

3.1. Training Environment and Parameter Settings

The experimental platform ran on the Windows 10 operating system, equipped with 64 GB of RAM and an RTX 3090 GPU. The code was written in Python 3.12 and implemented with the PyTorch 2.6 framework to diagnose gearbox faults in the MCC5 and HUST gearbox datasets. During training, the CrossEntropyLoss loss function and the AdamW optimizer were used; the specific configuration of the key hyperparameters is detailed in Table 2.
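The following PyTorch skeleton illustrates this training configuration (CrossEntropyLoss with the AdamW optimizer). The stand-in model, synthetic batch, learning rate, and weight decay are placeholders only, since the settings actually used are those listed in Table 2.

```python
import torch
import torch.nn as nn

# Stand-in model and synthetic batch; hyperparameter values are placeholders.
model = nn.Sequential(nn.Flatten(), nn.Linear(16 * 64 * 64, 14))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)

for epoch in range(2):                                   # M epochs in Algorithm 1
    images = torch.randn(8, 16, 64, 64)                  # batch of GAF tensors
    labels = torch.randint(0, 14, (8,))
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss = {loss.item():.4f}")
```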

3.2. Evaluation Criteria

In the experiments, several common evaluation metrics were selected to comprehensively assess the performance of the proposed model in gearbox bearing fault diagnosis, including accuracy, precision, recall, and F1-score. Accuracy was used to measure the proportion of correctly predicted fault categories. Precision evaluated the proportion of actual faults among all samples predicted as faults. Recall reflected the proportion of successfully identified faults in all actual fault samples. The F1-score is a composite metric that combines the performance of precision and recall.
$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$

$\mathrm{Precision} = \frac{TP}{TP + FP}$

$\mathrm{Recall} = \frac{TP}{TP + FN}$

$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
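These metrics can be computed with scikit-learn as sketched below on a toy prediction vector; macro averaging across fault classes is an assumption here, since the averaging mode is not stated in the text.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])   # toy labels for illustration
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 2])

# Macro averaging treats each fault class equally in the multi-class setting.
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
print("f1-score :", f1_score(y_true, y_pred, average="macro"))
```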

3.3. Performance

3.3.1. Classification Performance

This paper designed experiments based on the MCC and HUST gearbox datasets to analyze 14 class and 23 class classifications. In the 14-class experiment, the HUST gearbox dataset was divided into three categories: normal gears, broken gears, and missing gears. The MCC dataset included one healthy state and six fault types: gear tooth missing, gear wear, tooth cracking, tooth fracture, tooth fracture with an inner bearing ring fault, and tooth fracture with an outer bearing ring fault. To further explore the mechanisms of bearing faults under different operating conditions and their signal characteristics, as well as the impact of fault severity on signal performance, the gear pitting category was used as an example. It was subdivided into four subcategories based on varying speed and torque conditions and fault severity levels, resulting in a total of 11 fault categories. These 11 MCC fault categories comprehensively covered all characteristics of the dataset, including different fault types, fault causes, and three fault severity levels: light, moderate, and severe. In the 23-class experiment, the classification method of the HUST gearbox dataset remained consistent with that of the 14-class experiment, dividing it into normal gears, broken gears, and missing gears. The MCC dataset included one healthy state and one missing gear state, with six typical fault types further subdivided into 20 categories based on light, moderate, and severe fault severity levels. This refined classification approach provided a more comprehensive representation of the fault characteristics and severity distribution in the dataset, offering critical insights for studying gear fault features and developing diagnostic methods. The method is shown in Figure 9.
In this paper, a combined dataset comprising the MCC dataset and the HUST gearbox dataset was used as experimental data. These two datasets were independent of each other, with no overlap or interaction between them. The dataset was divided into a training set and a test set at a ratio of 8.8:1.2, ensuring no duplicate samples existed between the two sets to guarantee the fairness and reliability of model evaluation. The detailed configuration of the key hyperparameters for the experiments is presented in Table 2. The experimental results are shown in Table 3 and Table 4.
This paper conducted a comparative experiment between the proposed method and the baseline ViT model. As shown in the tables and figures, the proposed method outperformed the baseline ViT model in four key metrics, indicating a significant performance improvement. The results are shown in Table 5 and Figure 10.
Additionally, we recorded the training fitting time for the gearbox fault diagnosis task. After introducing 2D RoPE, the model’s training process showed significant acceleration. As shown in Figure 11, with the inclusion of 2D RoPE, the model converged more quickly, reducing the time and number of iterations required for training. Accuracy increased rapidly in the early stages of training. This indicates that 2D RoPE played a crucial role in enhancing the model’s training efficiency and optimization process, significantly accelerating the model’s adaptation to the training data and achieving better performance in a shorter amount of time.

3.3.2. Feature Extraction Performance Evaluation Under Multiple Time Windows

To systematically evaluate the performance of different filters in feature extraction, this paper conducted comparative experiments with 4 s and 10 s data extraction windows, exploring the impact of different filters on feature extraction effectiveness. The experimental results are detailed in Table 6.

3.3.3. Classification of Gearbox Fault Types and Severity Levels

This paper further explored the accuracy of classifying the severity of each gearbox fault in the 23-class classification task. Accurately classifying gearbox fault types and their severity is crucial for improving fault diagnosis accuracy. The fault types are shown in Table 7. By accurately identifying the nature and development stage of the faults, engineers are able to take targeted maintenance actions in a timely manner, effectively avoiding the risks of over-maintenance or maintenance delays, optimizing equipment maintenance strategies, and reducing unnecessary costs. The experimental results are shown in Table 8 and Figure 12, Figure 13 and Figure 14.
Compared to the ViT, the proposed model demonstrated significant performance advantages in the industrial gearbox fault diagnosis task. Within the 4 s feature extraction window, the proposed model improved the classification accuracy for Fault 1 and Fault 2 by 3.7% and 3.3%, respectively, and increased the accuracy for Fault 4 and Fault 6 by 3.5% and 5.4%, respectively. For the 10 s extraction window, the proposed model achieved the same classification accuracy as the ViT method for Fault 1, both reaching 100%, but outperformed it by improving the accuracy for Fault 2 and Fault 6 by 4.2% and 5.0%, respectively. These results indicate that the proposed method was effective in distinguishing between different fault types and their severity, providing strong technical support for equipment fault diagnosis in practical applications. This research not only improved fault diagnosis accuracy but also provided a theoretical basis for the construction and optimization of intelligent maintenance systems.

3.3.4. Comparison of Different Modules

To further validate the generality and effectiveness of the RoPE-DWTrans method, we conducted comparative experiments with multiple mainstream deep learning models. During the model training process, the RoPE-DWTrans hyperparameter settings listed in Table 2 were strictly followed. The experiments evaluated the classification of one healthy class and five fault classes from the MCC5 gearbox dataset, and the detailed classification results of each model are presented in Table 9. The experimental results demonstrated that the RoPE-DWTrans exhibited superior classification capability in multi-class fault diagnosis tasks.

4. Discussion

To effectively achieve bearing fault diagnosis, this paper proposed an innovative method based on the ViT model. First, the polar coordinate transformation and Gramian matrix methods were used to convert time-series data into two-dimensional images, fully extracting feature information, improving data representation efficiency, and enhancing classification stability. Subsequently, a lightweight front-end encoder, the DWFE, was designed. This encoder refined the granularity of feature representation by integrating multi-scale depthwise separable convolution modules, while effectively preserving global structural information, enabling more comprehensive feature representation and more precise feature analysis. Additionally, the self-attention mechanism incorporated the 2D-RoPE, which further enhanced the model’s ability to capture temporal dependencies. The introduction of the ReZero mechanism significantly improved the stability and convergence speed of the training process, thus enhancing the performance of the self-attention mechanism in visual tasks.
This paper conducted a comprehensive validation of the proposed method using a combined dataset based on the MCC5 and HUST gearbox datasets, with the experimental results presented in Table 3 and Table 4. In the 14-class fault classification task, the RoPE-DWTrans model exhibited excellent performance, achieving a classification accuracy of 0.953, a precision of 0.959, a recall of 0.973, and an F1 score of 0.961, significantly outperforming other comparative models. Among these, the classification accuracy of ConvNeXtV2 reached 0.940, slightly lower than that of the RoPE-DWTrans, while the accuracy of InceptionV3 was only 0.859, indicating relatively weaker performance. As the task complexity increased, with fault categories expanding from 14 to 23 classes, the performance of all models declined, with the classification accuracy of the five comparative models decreasing by 10.36%, 4.58%, 14.67%, 3.72%, and 3.15%, respectively. For instance, the classification accuracy of GRU + ShuffleNet and InceptionV3 dropped significantly to 0.831 and 0.733, respectively, while SeNetTrans and ConvNeXtV2 showed classification accuracies of 0.895 and 0.905, indicating notable declines compared to the 14-class task. In contrast, the RoPE-DWTrans model maintained high stability in the 23-class task, achieving a classification accuracy of 0.923, a precision of 0.932, a recall of 0.928, and an F1 score of 0.928, with the smallest performance decline among all models. These results demonstrate that the RoPE-DWTrans model effectively enhances feature representation by transforming time-series data into two-dimensional image forms. Further ablation experiment results (Table 5) validated the effectiveness of the proposed improvements. By progressively removing key modules of the model and analyzing their impact on performance, the results showed that the multi-scale convolution module and 2D-RoPE encoding play pivotal roles in improving classification performance, providing essential support for practical applications in real-world industrial scenarios.
The experimental results in Table 6 clearly demonstrate the impact of different feature extraction window lengths on the performance of multi-class classification tasks. In the 14-class classification task, when the feature extraction window was set to 4 s, the model achieved a classification accuracy of 0.953 and a precision of 0.959. When the window length was extended to 10 s extraction, the classification accuracy and precision improved to 0.964 and 0.976, representing an increase of approximately 1.15% in classification accuracy and 1.77% in precision compared to the 4 s extraction window. In the 23-class classification task, the classification accuracy and precision for the 4 s extraction window were 0.923 and 0.932, respectively, while extending the window length to 10 s extraction improved these metrics to 0.950 and 0.949, corresponding to an increase of approximately 2.93% in classification accuracy and 1.82% in precision. Moreover, the 10 s extraction window significantly optimized training efficiency. In the 14-class task, the training time for the 10 s extraction window was 33 s, reducing by approximately 35.39% compared to the 4 s extraction window. In the 23-class task, the training time for the 10 s extraction window was 95 s, representing a reduction of about 22.13% compared to the 4 s extraction window. These results indicate that extending the feature extraction window appropriately not only captures the critical features of time-series data more comprehensively and reduces redundant information but also significantly shortens training time, thereby enhancing both model performance and computational efficiency. This finding provides an important basis for optimizing feature extraction strategies in multi-class classification tasks. It highlights that selecting an appropriate time window length during model design is crucial for balancing classification performance and computational efficiency. By considering the complexity of classification tasks and computational resource constraints, the model’s applicability and robustness in practical industrial scenarios can be further improved.
The experiment further explored bearing fault diagnosis performance under different proportions of training data. The experimental results are detailed in Table 10. Through experiments with varying training data proportions, we analyzed the trends in model performance. Additionally, we revealed the model’s dependence on data volume, providing a basis for selecting models in specific scenarios and balancing training costs with performance improvements.
The experimental results in Table 10 show that as the proportion of training data decreased from 88.7% to 56.5%, the model performance declined significantly. In the 14-class classification task, the classification accuracy of the 4 s extraction window dropped from 0.953 to 0.828, a decrease of approximately 13.13%, while the 10 s extraction window dropped from 0.964 to 0.839, a decrease of about 12.94%. In the 23-class classification task, the classification accuracy of the 4 s extraction window decreased by approximately 13.13%, and the 10 s extraction window decreased by about 14.53%. Overall, as the amount of training data was reduced, the model’s performance, including its classification accuracy and F1 score, exhibited a notable decline, particularly in tasks with higher complexity. This result highlights the critical impact of training data quantity on model performance. Notably, the 10 s extraction window demonstrated greater robustness under small-sample conditions. For example, at a training data proportion of 56.5%, the classification accuracy and F1 score of the 10 s extraction window improved by approximately 1.33% and 1.82%, respectively, compared to the 4 s extraction window. These findings indicate that a longer time window can capture key features of time-series data more comprehensively and mitigate the performance degradation caused by insufficient training data. This emphasizes the research value and application potential of small-sample learning methods in industrial scenarios. In practical industrial applications, data collection often faces high costs and limited conditions, resulting in insufficient training data. Small-sample learning methods reduce reliance on large amounts of labeled data, ensuring the stability and accuracy of the model under data-scarce conditions. Combined with optimized feature extraction strategies, small-sample learning networks not only improve the generalization ability of models but also enable more efficient and universal solutions for the fault diagnosis and stability assessment of industrial equipment. Future research will focus on further improving and applying small-sample learning methods to enhance model performance under data-scarce conditions. For instance, Zhao et al. [53] proposed a novel method that achieved remarkable results on small datasets with limited operational conditions, demonstrating the potential of small-sample learning in industrial fault diagnosis. Incorporating such methods is expected to further enhance the robustness and practicality of models, providing more reliable technical support for industrial equipment health management and driving the development of related fields.

5. Conclusions

This paper proposed a new industrial fault diagnosis method based on the ViT model and advanced data preprocessing techniques. A lightweight front-end encoder, the DWFE, was designed, which enhanced the fine-grained representation of 2D image features through multi-scale depthwise separable convolutions. Subsequently, a self-attention optimization scheme incorporating 2D-RoPE and the ReZero mechanism was proposed, significantly improving the model’s ability to capture relative positions and temporal dependencies, while also accelerating training convergence. This study systematically explored the impact of different time-period extraction strategies (sampling intervals and segment lengths) on feature redundancy and training efficiency, providing quantitative evidence for sampling design in engineering practice. The method achieved seamless adaptation across multiple datasets, such as the MCC5 and HUST gearbox datasets, and validated its generalization performance and practical value in complex industrial scenarios.

Author Contributions

Conceptualization, X.L. and M.W.; methodology, X.L. and M.W.; software, X.L. and Z.Z.; formal analysis, X.L. and Z.Z.; investigation, X.L. and M.W.; resources, M.W.; data curation, X.L.; writing—original draft preparation, X.L.; writing—review and editing, X.L. and M.W.; visualization, X.L. and Z.Z.; supervision, M.W.; project administration, M.W.; funding acquisition, M.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (Project No. 2022YFC2407600, Project No. 2022YFC3601000).

Data Availability Statement

Data were derived from public domain resources. The data presented in this paper are available from the MCC5-THU Gearbox Datasets at https://github.com/liuzy0708/MCC5-THU-Gearbox-Benchmark-Datasets (accessed on 2 May 2025) and the HUST Gearbox Dataset at https://github.com/CHAOZHAO-1/HUSTgearbox-dataset (accessed on 2 May 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CNN: Convolutional neural network
RNN: Recurrent neural network
SE: Squeeze and excitation
FFT: Fast Fourier Transform
GAF: Gramian Angular Field
LSTM: Long Short-Term Memory
ViT: Vision Transformer
DWFE: Depthwise feature extractor
2D-RoPE: Two-dimensional rotary position encoding
PAA: Piecewise Aggregate Approximation
GASF: Gramian Angular Summation Field
GADF: Gramian Angular Difference Field

References

  1. Mikić, D.; Desnica, E.; Kiss, I.; Mikić, V. Reliability analysis of rolling ball bearings considering the bearing radial clearance and operating temperature. Adv. Eng. Lett. 2022, 1, 16–22.
  2. Vasic, M.; Stojanovic, B.; Blagojevic, M. Fault analysis of gearboxes in open pit mine. Appl. Eng. Lett. 2020, 5, 50–61.
  3. Molęda, M.; Małysiak, M.; Sunderam, V.; Ding, W.; Mrozek, D. From corrective to predictive maintenance—A review of maintenance approaches for the power industry. Sensors 2023, 23, 5970.
  4. Shao, Z.; Zhang, T.; Kosasih, B. Compound Faults Diagnosis in Wind Turbine Gearbox Based on Deep Learning Methods: A Review. In Proceedings of the 2024 Global Reliability and Prognostics and Health Management Conference (PHM-Beijing), Beijing, China, 11–13 October 2024.
  5. Seo, M.; Yun, W. Gearbox Condition Monitoring and Diagnosis of Unlabeled Vibration Signals Using a Supervised Learning Classifier. Machines 2024, 12, 127.
  6. Mohad, F.; Gomes, L.; Tortorella, G.; Lermen, F.H. Operational excellence in total productive maintenance: Statistical reliability as support for planned maintenance pillar. Int. J. Qual. Reliab. Manag. 2025, 42, 1274–1296.
  7. Khalil, A.; Rostam, S. Machine learning-based predictive maintenance for fault detection in rotating machinery: A case study. Eng. Technol. Appl. Sci. Res. 2024, 14, 13181–13189.
  8. Chukwunweike, J.; Anang, A.; Dike, J.; Adeniran, A.A. Enhancing manufacturing efficiency and quality through automation and deep learning: Addressing redundancy, defects, vibration analysis, and material strength optimization. World J. Adv. Res. Rev. 2024, 23, 1272–1295.
  9. Li, X.; Wang, Y.; Yao, J.; Li, M.; Gao, Z. Multi-sensor fusion fault diagnosis method of wind turbine bearing based on adaptive convergent viewable neural networks. Reliab. Eng. Syst. Saf. 2024, 245, 109980.
  10. Mian, Z.; Deng, X.; Dong, X.; Tian, Y.; Cao, T.; Chen, K.; Al Jaber, T. A literature review of fault diagnosis based on ensemble learning. Eng. Appl. Artif. Intell. 2024, 127, 107357.
  11. Xu, L.; Teoh, S.; Ibrahim, H. A deep learning approach for electric motor fault diagnosis based on modified InceptionV3. Sci. Rep. 2024, 14, 12344.
  12. Szegedy, C.; Vanhoucke, V.; Ioffe, S. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
  13. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
  14. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
  15. Vo, T.; Liu, M.; Tran, M. Harnessing attention mechanisms in a comprehensive deep learning approach for induction motor fault diagnosis using raw electrical signals. Eng. Appl. Artif. Intell. 2024, 129, 107643.
  16. O’Shea, K.; Nash, R. An introduction to convolutional neural networks. arXiv 2015, arXiv:1511.08458.
  17. Medsker, L.; Jain, L. Recurrent neural networks. Des. Appl. 2001, 5, 64–67.
  18. Lv, J.; Xiao, Q.; Zhai, X. A high-performance rolling bearing fault diagnosis method based on adaptive feature mode decomposition and Transformer. Appl. Acoust. 2024, 224, 110156.
  19. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30.
  20. Singh, A.; Mousavi, S.; Gaurav, K. SHS: Scorpion Hunting Strategy Swarm Algorithm. arXiv 2024, arXiv:2407.14202. [Google Scholar]
  21. Luo, X.; Wang, H.; Han, T.; Zhang, Y. FFT-trans: Enhancing robustness in mechanical fault diagnosis with Fourier transform-based transformer under noisy conditions. IEEE Trans. Instrum. Meas. 2024, 73, 2515112. [Google Scholar] [CrossRef]
  22. Duhamel, P.; Vetterli, M. Fast Fourier transforms: A tutorial review and a state of the art. Signal Process. 1990, 19, 259–299. [Google Scholar] [CrossRef]
  23. Xie, S.; Zhou, S.; Sakurada, K.; Ishikawa, R.; Onishi, M.; Oishi, T. G2fR: Frequency Regularization in Grid-Based Feature Encoding Neural Radiance Fields. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; pp. 186–203. [Google Scholar]
  24. You, K.; Wang, P.; Huang, P.; Gu, Y. A sound-vibration physical-information fusion constraint-guided deep learning method for rolling bearing fault diagnosis. Reliab. Eng. Syst. Saf. 2025, 253, 110556. [Google Scholar]
  25. Liu, M.; Chen, L.; Du, X. Activated gradients for deep neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2021, 34, 2156–2168. [Google Scholar] [CrossRef] [PubMed]
  26. Sun, Y.; Lao, D. Surprising instabilities in training deep networks and a theoretical analysis. Adv. Neural Inf. Process. Syst. 2022, 35, 19567–19578. [Google Scholar]
  27. Zeng, A.; Chen, M.; Zhang, L. Are transformers effective for time series forecasting. Proc. AAAI Conf. Artif. Intell. 2023, 37, 11121–11128. [Google Scholar] [CrossRef]
  28. Zhao, B.; Xing, H.; Wang, X.; Song, F.; Xiao, Z. Rethinking attention mechanism in time series classification. Inf. Sci. 2023, 627, 97–114. [Google Scholar] [CrossRef]
  29. Garcia, G.; Michau, G.; Ducoffe, M.; Gupta, J.S.; Fink, O. Temporal signals to images: Monitoring the condition of industrial assets with deep learning image processing algorithms. Proc. Inst. Mech. Eng. Part O J. Risk Reliab. 2022, 236, 617–627. [Google Scholar] [CrossRef]
  30. Yi, K.; Zhang, Q.; Cao, L. A survey on deep learning-based time series analysis with frequency transformation. arXiv 2023, arXiv:2302.02173. [Google Scholar]
  31. Qiu, S.; Cui, X.; Ping, Z.; Shan, N.; Li, Z.; Bao, X.; Xu, X. Deep learning techniques in intelligent fault diagnosis and prognosis for industrial systems: A review. Sensors 2023, 23, 1305. [Google Scholar] [CrossRef]
  32. Wu, G.; Ji, X.; Yang, G.; Jia, Y.; Cao, C. Signal-to-image: Rolling bearing fault diagnosis using ResNet family deep-learning models. Processes 2023, 11, 1527. [Google Scholar] [CrossRef]
  33. Li, Z.; Fan, R.; Tu, J. Tdanet: A novel temporal denoise convolutional neural network with attention for fault diagnosis. arXiv 2024, arXiv:2403.19943. [Google Scholar]
  34. Sun, Y.; Li, S.; Wang, Y.; Wang, X. Fault diagnosis of rolling bearing based on empirical mode decomposition and improved manhattan distance in symmetrized dot pattern image. Mech. Syst. Signal Process. 2021, 159, 107817. [Google Scholar] [CrossRef]
  35. Zhou, Y.; Long, X.; Sun, M.; Chen, Z. Bearing fault diagnosis based on Gramian angular field and DenseNet. Math. Biosci. Eng. 2022, 19, 14086–14101. [Google Scholar] [CrossRef] [PubMed]
  36. Wang, M.; Wang, W.; Zhang, X.; Iu, H.H.-C. A new fault diagnosis of rolling bearing based on Markov transition field and CNN. Entropy 2022, 24, 751. [Google Scholar] [CrossRef] [PubMed]
  37. Tang, H.; Tang, Y.; Su, Y.; Feng, W.; Wang, B.; Chen, P.; Zuo, D. Feature extraction of multi-sensors for early bearing fault diagnosis using deep learning based on minimum unscented kalman filter. Eng. Appl. Artif. Intell. 2024, 127, 107138. [Google Scholar] [CrossRef]
  38. Julier, S.; Uhlmann, J. New extension of the Kalman filter to nonlinear systems. Signal Process. Sens. Fusion Target Recognit. VI 1997, 3068, 182–193. [Google Scholar]
  39. Wang, L.; Zhao, W. An ensemble deep learning network based on 2D convolutional neural network and 1D LSTM with self-attention for bearing fault diagnosis. Appl. Soft Comput. 2025, 172, 112889. [Google Scholar] [CrossRef]
  40. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  41. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  42. Heo, B.; Park, S.; Han, D.; Yun, S. Rotary position embedding for vision transformer. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; pp. 289–305. [Google Scholar]
  43. Sun, S.; Xia, X.; Zhou, H. A graph representation learning-based method for fault diagnosis of rotating machinery under time-varying speed conditions. Nonlinear Dyn. 2025, 113, 17449–17475. [Google Scholar] [CrossRef]
  44. Chen, S.; Liu, Z.; He, X.; Zou, D.; Zhou, D. Multi-mode Fault Diagnosis Datasets of Gearbox Under Variable Working Conditions. Data Brief 2024, 54, 110453. [Google Scholar] [CrossRef]
  45. Zhao, C.; Zio, E.; Shen, W. Domain generalization for cross-domain fault diagnosis: An application-oriented perspective and a benchmark study. Reliab. Eng. Syst. Saf. 2024, 245, 109964. [Google Scholar] [CrossRef]
  46. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  47. Ramachandran, P.; Zoph, B.; Le, Q. Searching for activation functions. arXiv 2017, arXiv:1710.05941. [Google Scholar]
  48. Su, J.; Ahmed, M.; Lu, Y. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 2024, 568, 127063. [Google Scholar] [CrossRef]
  49. Bachlechner, T.; Majumder, B.; Mao, H.; Cottrell, G.; McAuley, J. Rezero is all you need: Fast convergence at large depth. Uncertain. Artif. Intell. 2021, 161, 1352–1361. [Google Scholar]
  50. Li, T.; Zhao, Z.; Sun, C.; Yan, R.; Chen, X. Multireceptive field graph convolutional networks for machine fault diagnosis. IEEE Trans. Ind. Electron. 2020, 3, 12739–12749. [Google Scholar] [CrossRef]
  51. Wang, Z.; Wu, Z.; Li, X.; Shao, H.; Han, T.; Xie, M. Attention-aware temporal–spatial graph neural network with multi-sensor information fusion for fault diagnosis. Knowl.-Based Syst. 2023, 278, 110891. [Google Scholar] [CrossRef]
  52. Jiang, Z.; Zheng, W.; Men, D. Research on gearbox fault diagnosis method under variable working conditions based on HHO-MLP neural network. Manuf. Technol. Mach. Tool 2025, 2, 29–35. [Google Scholar]
  53. Zhao, X.; Zhu, X.; Liu, J.; Hu, Y.; Gao, T.; Zhao, L.; Yao, J.; Liu, Z. Model-assisted multi-source fusion hypergraph convolutional neural networks for intelligent few-shot fault diagnosis to electro-hydrostatic actuator. Inf. Fusion 2024, 104, 102186. [Google Scholar] [CrossRef]
Figure 1. Test rig of MCC5 gearbox dataset. Adapted from [44].
Figure 2. Visualization diagram of the dataset. (a) MCC5 gearbox dataset of “health speed circulation”, “gear wear H speed circulation”, and “gear pitting H speed circulation”; (b) HUST gearbox dataset of “Normal”, “Missing tooth”, and “Broken tooth”.
Figure 3. Schematic illustration of the proposed method.
Figure 4. The RoPE-DWTrans model.
Figure 5. Time segment extraction.
Figure 6. Data preprocessing.
Figure 7. The DWFE model.
Figure 8. Optimized self-attention model structure.
Figure 9. Fault categories and grouping: (a) 14-class classification experiments and (b) 23-class classification experiments.
Figure 10. Performance comparison chart on precision and recall. (a) Evaluated metrics (4 s extraction); (b) evaluated metrics (10 s extraction).
Figure 11. Accuracy improvement with different models in gearbox fault diagnosis. (a) Accuracy over epochs for 23-class 4 s extraction. (b) Accuracy over epochs for 23-class 10 s extraction. (c) Accuracy over epochs for 14-class 4 s extraction of Dataset B. (d) Accuracy over epochs for 14-class 10 s extraction.
Figure 12. Evaluated metrics’ improvement with different models in gearbox fault diagnosis (4 s extraction).
Figure 13. Evaluated metrics’ improvement with different models in gearbox fault diagnosis (10 s extraction).
Figure 14. Classification results of gearbox fault types and severity levels. (a) 23c4s ViT, (b) 23c4s RoPE-DWTrans, (c) 23c10s ViT, and (d) 23c10s RoPE-DWTrans.
Table 1. Comparison of sample counts generated by different sampling strategies.
Dataset | MCC5 | HUST | Total
4 s extraction | 3360 | 180 | 3540
10 s extraction | 1440 | 90 | 1530
Table 2. Detailed configuration of key hyperparameters.
Stage | Parameter | Setting
Data preprocessing | Sample split | [i, i + 6], i ∈ {0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52}; [i, i + 6], i ∈ {0, 10, 20, 30, 40, 50}
Data preprocessing | Window size for PAA dimensionality reduction | 150
Data preprocessing | Number of signal channels | 8
Data preprocessing | Output GASF/GADF image resolution | 512
Data preprocessing | Shape of GAF (merges all channels of GASF/GADF) | (16, 512, 512)
Data preprocessing | Patch size | 16
Network | Kernel size of depthwise convolution | 3 × 3, 5 × 5
Network | Kernel number of depthwise convolution | 32
Network | Number of attention heads | {1, 2, 3, 4}
Network | Numerical stability parameter for LayerNorm | 1 × 10^−12
Network | Number of hidden layers in the transformer module | 6
Network | FFN hidden size | 2048
Others | Dropout | 0.1
Others | Loss function | CrossEntropyLoss
Others | Batch size | 16
Others | Epochs | 256
Others | Learning rate | 2 × 10^−6
Others | Optimizer | AdamW
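To make the preprocessing settings in Table 2 concrete, the following NumPy sketch applies PAA with a window of 150, rescales each channel to [−1, 1], and builds GASF/GADF images, stacking eight channels into the (16, 512, 512) GAF tensor listed in the table. The helper names (paa, gaf_images) and the min-max rescaling step are illustrative assumptions rather than the authors' exact code.

```python
# Minimal sketch of the GAF preprocessing implied by Table 2 (illustrative only).
import numpy as np


def paa(signal: np.ndarray, window: int = 150) -> np.ndarray:
    """Piecewise Aggregate Approximation: average consecutive windows."""
    n = len(signal) // window * window
    return signal[:n].reshape(-1, window).mean(axis=1)


def gaf_images(signal: np.ndarray, window: int = 150):
    """Return (GASF, GADF) images for a single 1D channel."""
    x = paa(signal, window)
    x = 2 * (x - x.min()) / (x.max() - x.min() + 1e-12) - 1  # rescale so arccos is defined
    phi = np.arccos(x)                                       # polar-angle encoding
    gasf = np.cos(phi[:, None] + phi[None, :])               # Gramian angular summation field
    gadf = np.sin(phi[:, None] - phi[None, :])               # Gramian angular difference field
    return gasf, gadf


# Eight channels -> stacked (16, 512, 512) tensor (GASF + GADF per channel),
# assuming the PAA output length equals the 512-pixel image resolution.
channels = [np.random.randn(150 * 512) for _ in range(8)]
stack = np.concatenate([np.stack(gaf_images(c)) for c in channels], axis=0)
print(stack.shape)  # (16, 512, 512)
```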
Table 3. Results of four metrics using different classifiers on the 14-class dataset (4 s extraction).
Model | Accuracy | Precision | Recall | F1 Score
GRU + ShuffleNet | 0.927 | 0.932 | 0.925 | 0.919
SeNetTrans | 0.938 | 0.944 | 0.954 | 0.943
InceptionV3 | 0.859 | 0.841 | 0.858 | 0.843
ConvNeXtV2 | 0.940 | 0.953 | 0.959 | 0.946
RoPE-DWTrans (Ours) | 0.953 | 0.959 | 0.973 | 0.961
Table 4. Results of four metrics using different classifiers on the 23-class dataset (4 s extraction).
Model | Accuracy | Precision | Recall | F1 Score
GRU + ShuffleNet | 0.831 | 0.834 | 0.831 | 0.827
SeNetTrans | 0.895 | 0.899 | 0.892 | 0.893
InceptionV3 | 0.733 | 0.754 | 0.729 | 0.729
ConvNeXtV2 | 0.905 | 0.919 | 0.908 | 0.911
RoPE-DWTrans (Ours) | 0.923 | 0.932 | 0.928 | 0.928
Table 5. Ablation experiment of RoPE-DWTrans.
Model | Metric | 4 s Extraction, 14 Class | 4 s Extraction, 23 Class | 10 s Extraction, 14 Class | 10 s Extraction, 23 Class
ViT | Accuracy | 0.929 | 0.893 | 0.947 | 0.922
ViT | F1 score | 0.932 | 0.897 | 0.962 | 0.912
DWFE + ViT | Accuracy | 0.937 | 0.905 | 0.964 | 0.933
DWFE + ViT | F1 score | 0.948 | 0.906 | 0.967 | 0.943
RoPE-DWTrans (Ours) | Accuracy | 0.953 | 0.923 | 0.964 | 0.950
RoPE-DWTrans (Ours) | F1 score | 0.961 | 0.928 | 0.967 | 0.951
Table 6. Performance of different feature extraction strategies.
Feature Extraction | Dataset | Accuracy | Precision | Recall | F1 Score | Training Time
4 s extraction | 14 class | 0.953 | 0.959 | 0.973 | 0.961 | 51 s
4 s extraction | 23 class | 0.923 | 0.932 | 0.928 | 0.928 | 122 s
10 s extraction | 14 class | 0.964 | 0.976 | 0.971 | 0.967 | 33 s
10 s extraction | 23 class | 0.950 | 0.949 | 0.966 | 0.951 | 95 s
Table 7. Gearbox fault types.
Fault | Fault Category
Fault 1 | Gear pitting
Fault 2 | Gear wear
Fault 3 | Teeth break and bearing inner
Fault 4 | Teeth break and bearing outer
Fault 5 | Teeth break
Fault 6 | Teeth crack
Table 8. Classification results of gearbox fault types and severity levels (accuracy).
Feature Extraction | Model | Fault 1 (Class 0–2) | Fault 2 (Class 3–5) | Fault 3 (Class 8–10) | Fault 4 (Class 11–13) | Fault 5 (Class 14–16) | Fault 6 (Class 17–19)
4 s extraction | ViT | 0.833 | 0.854 | 0.897 | 0.923 | 0.873 | 0.915
4 s extraction | Ours | 0.870 | 0.887 | 0.870 | 0.958 | 0.943 | 0.969
10 s extraction | ViT | 1.000 | 0.958 | 0.933 | 0.933 | 0.857 | 0.833
10 s extraction | Ours | 1.000 | 1.000 | 1.000 | 0.933 | 0.928 | 0.883
Table 9. Comparison of classification performance on MCC5 gearbox datasets.
Method | Accuracy | F1 Score
FFT-SGCN [43] | 0.962 | 0.967
Multireceptive-GCN [50] | 0.971 | 0.968
Attention-TSGNN [51] | 0.975 | 0.976
HHO-MLP [52] | 0.975 | -
Ours | 0.981 | 0.978
Table 10. Performance corresponding to different percentages of training data (accuracy and F1 score).
Extraction | Class | Accuracy (56.5%) | F1 Score (56.5%) | Accuracy (68.2%) | F1 Score (68.2%) | Accuracy (88.7%) | F1 Score (88.7%)
4 s | 14 | 0.828 | 0.831 | 0.862 | 0.862 | 0.953 | 0.961
4 s | 23 | 0.802 | 0.812 | 0.843 | 0.854 | 0.923 | 0.928
10 s | 14 | 0.839 | 0.846 | 0.897 | 0.890 | 0.964 | 0.967
10 s | 23 | 0.813 | 0.828 | 0.883 | 0.886 | 0.950 | 0.951
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
