Fault Diagnosis of Wind Turbine Drivetrains Using XGBoost-Assisted Discriminative Frequency Band Identification and a CNN–Transformer Network

Huang, Chiheng; Yang, Wenxian; Graja, Oussama; Duan, Fang; Wei, Zeqi; Zhang, Liuyang

doi:10.3390/app152312726

by Chiheng Huang ¹, Wenxian Yang ¹,*, Oussama Graja ¹, Fang Duan ², Zeqi Wei ³ and Liuyang Zhang ³

¹ School of Computing and Engineering, University of Huddersfield, Huddersfield HD1 3DH, UK

² Department of Electronic and Electrical Engineering, University of Bath, Bath BA2 7AY, UK

³ School of Mechanical Engineering, Xi’an Jiaotong University, Xi’an 710049, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(23), 12726; https://doi.org/10.3390/app152312726

Submission received: 30 October 2025 / Revised: 23 November 2025 / Accepted: 27 November 2025 / Published: 1 December 2025

(This article belongs to the Special Issue Vibration Control of On- and Off-Shore Wind Turbines)

Abstract

Traditional wind turbine drivetrain health assessment generally depends on feature extraction guided by expert experience and prior knowledge. However, the effectiveness of this approach is often limited when such knowledge is insufficient or when fault features are obscured by high levels of ambient noise. In response to these issues, this study proposes a new data-driven framework that combines intelligent frequency band identification with a deep learning architecture. In the proposed approach, vibration signals from the bearings are transformed into their spectral representation, and the frequency spectrum is divided into multiple frequency bands. The relative importance of each band is evaluated and ranked using XGBoost, enabling the selection of the most informative features and significant dimensionality reduction. A hybrid CNN–Transformer model is then employed to combine local feature extraction with global attention mechanisms for accurate fault classification. Experimental evaluations using two open-source datasets indicate that the proposed framework achieves high classification accuracy and rapid convergence, offering a robust and computationally efficient solution for wind turbine drivetrain fault diagnosis.

Keywords: wind turbine; fault diagnosis; XGBoost; drive train; deep learning

1. Introduction

Wind turbines serve as essential components in modern renewable energy infrastructure, especially in offshore and coastal regions where wind resources are plentiful. However, their drivetrains, comprising gearboxes, generators, and other rotating elements, operate under highly dynamic loads and harsh marine conditions that accelerate wear and promote fault development [1]. Faults in wind turbine drivetrains are generally grouped into three categories: (a) electrical faults, including generator winding failures and insulation breakdowns; (b) mechanical faults, which are the most frequent in practice and typically arise in gearboxes, bearings, and shafts; (c) environmental and control-related faults, such as those caused by wet and corrosive offshore environments or control system malfunctions [2].

Among these, mechanical faults in drivetrain rotating components are especially critical because they often escalate into severe damage and are associated with long downtime and high maintenance costs if not identified at an early stage [3]. As these faults typically manifest through changes in dynamic behaviour, they can be detected most effectively through vibration analysis [4]. Therefore, vibration signals remain one of the most informative indicators of such problems due to their sensitivity to structural and dynamic changes [5].

Traditional vibration-based condition monitoring methods typically rely on signal processing techniques, and the main purpose of signal processing is to cancel the noise contained in the vibration signals and then extract the fault-related features based on prior knowledge. For example, the study in [6] employed a hybrid strategy combining variational mode decomposition (VMD), correlation evaluation, and wavelet-threshold filtering to suppress noise present in the vibration data. In [7], the proposed method extracts instantaneous frequency features from vibration signals to detect different fault types of rolling bearings when the rotational speed varies over time. Usually, such signal processing workflows are not only laborious and time-consuming but also heavily reliant on the operator’s expertise and experience.

In recent years, artificial intelligence (AI) techniques have been increasingly integrated into machine condition monitoring. Compared with traditional vibration-based condition monitoring methods, which rely on professional knowledge for feature extraction and fault diagnosis, AI-assisted approaches can automatically learn patterns or features from vibration signals and distinguish between different types of faults [8]. A variety of machine learning algorithms, such as Support Vector Machines (SVMs), Logistic Regression, and K-Nearest Neighbours (KNNs), have been widely used for fault detection based on features extracted from vibration signals [9]. The decision tree, for example, has been one of the most widely used traditional machine learning models since the 1980s [10]. In [11], the authors applied decision tree algorithms to wind turbine structure condition monitoring, leveraging their fast learning, ease of interpretation, ability to support clear fault tracing, and strong performance when both error rate and training speed are considered. A further application is presented in [12], where a decision tree is used for planetary gearbox condition monitoring, highlighting its capability for monitoring motor-driven systems using statistical features. Recent research has also demonstrated the effectiveness of decision trees for lightweight fault classification on edge devices [13], where a fine decision tree classifier was deployed on a microcontroller to enable immediate detection of abnormalities in rotating machinery using extracted features. However, despite their popularity and advantages, decision tree-assisted condition monitoring methods have some obvious drawbacks. 
For example, decision trees tend to overfit easily during training, resulting in poor generalization [14], and this is especially problematic in the cases where noisy data is used, because these kinds of models are excessively sensitive to slight changes in training sets [15]. To address these weaknesses, Extreme Gradient Boosting (XGBoost) has been proposed as an enhanced tree-based ensemble learning method. In [16], the XGBoost model is used in conjunction with the Mel Frequency Spectral Coefficient features extracted from vibration data for the classification of roller bearing faults. In [17], the performance of the XGBoost model was assessed against that of other classic machine learning models and showed the highest accuracy for bearing faults.

In addition to performance, interpretability is another key advantage that decision tree-based models have in condition monitoring applications [18]. These tree-based models offer transparent and rule-based reasoning by constructing explicit paths from input features to output targets [19]. This makes them particularly useful in applications where interpretability and fault traceability are important. The most common scenario in which the interpretability of tree-based models is exploited is feature selection, where the importance of different features is calculated and ranked so that only the features of high importance are used for further analysis. In [20], the authors applied XGBoost with feature importance ranking to reduce redundant sensor data and improve fault classification performance. In [21], the authors used an XGBoost model to rank the importance of more than 300 statistical features extracted from vibration and cutting force signals and selected only 14 features for the training stage, significantly improving the training efficiency and prediction accuracy. However, tree-based models, including XGBoost, still have a key drawback: they rely heavily on hand-crafted indicators, such as statistical metrics, and there is no guarantee that these indicators can effectively distinguish the health status and fault type of the machine.

More recently, deep learning (DL) models have been increasingly used owing to their ability to learn complex patterns from raw or processed vibration signals. For example, Convolutional Neural Networks (CNNs) are typically efficient in capturing spatial patterns across different vibration signal domains [22]; Recurrent Neural Networks (RNNs) are well suited for handling sequential inputs, such as time-domain vibration signals [23]. Transformers, with their multi-head attention mechanisms, have proven effective for handling complex signals and capturing dependencies from large data segments [24]. To overcome the reliance on hand-crafted features, recent studies have increasingly turned to DL approaches that are capable of identifying meaningful patterns directly from complex or noisy data with high generalizability [25]. Among these architectures, CNNs are particularly effective at identifying informative structures within raw signal data. One-dimensional (1D) CNNs usually operate directly on raw time-series data. For example, in [26], a 1D CNN was proposed to process raw vibration and phase current signals for detecting faults in bearings. In [27], a 1D CNN was used as the feature extractor for vibration signals in a Zero-Shot Learning (ZSL) framework, where Semantic Feature Space mapping was adopted for intelligent fault detection of unseen faults. In contrast, 2D CNNs are usually applied to transformed representations of raw signals, such as frequency maps, time-frequency maps, or images generated by self-defined approaches from raw signals. In [28], a 2D CNN was employed to extract grayscale images from enhanced frequency maps of vibration signals, achieving high classification accuracy using undersampled signals. On the other hand, the Transformer model, which was originally proposed in [29], has also demonstrated strong performance in capturing long-range dependencies through self-attention mechanisms. 
In [30], a transformer-based model was proposed to estimate the remaining useful life of the bearing lubricant using the frequency domain of the vibration signal. In [31], a transformer model was used for rolling bearing fault diagnosis, using time–frequency representations of wavelet transforms to capture non-stationary signal patterns and long-range dependencies.

By integrating the strength of CNNs in capturing local patterns with the ability of Transformer-based architectures to model global dependencies, recent studies have proposed hybrid CNN–Transformer models. In [32], a CNN–Transformer multitask model was proposed for simultaneously performing bearing fault diagnosis and severity assessment, utilizing the local feature extraction ability of CNNs and the global sequence modelling ability of the Transformer to improve DL model robustness. In [33], the authors developed a CNN–Transformer model to enable fault identification in rotating equipment across a range of operating states. Similarly, CNNs were used for multi-scale feature extraction, and Transformer blocks were introduced to link the relationship between fault patterns and fault types.

While DL models and their combined architectures have achieved popularity in vibration-based condition monitoring, they also present limitations. To extract meaningful patterns from a complex and noisy raw vibration signal or its representations, these models are usually highly complex, involving a large number of parameters and layers, which makes them difficult to design, train, and fine-tune [34]. In addition, the scarcity of useful data, together with the reliance of these models on large training sets and computationally intensive operations, can pose significant challenges in terms of resource availability, resource consumption, and training time. These factors may restrict their real-world applicability, especially on real-time or resource-limited platforms such as edge devices [35]. Therefore, it is necessary to find a practical way to utilise the powerful capabilities of DL models while keeping them structurally simple and computationally efficient, and to enhance their interpretability. This motivates the research reported below.

The primary contribution of this paper is the introduction of a novel vibration-based fault classification framework, which integrates interpretable spectral analysis and XGBoost-based frequency band selection, followed by a CNN–Transformer hybrid model for efficient and accurate fault recognition. In addition, the proposed method introduces two key innovations that differentiate it from existing hybrid frequency band selection and deep learning approaches. First, the framework adopts a fully data-driven strategy to automatically identify the most discriminative frequency bands in the spectral domain, allowing the model to concentrate on the specific portions of the spectrum that contribute most to classification. This avoids the need for extensive manual feature engineering and enhances the interpretability of the selected frequency regions. Second, unlike previous studies that compute a large set of statistical features and then apply interpretable models for feature ranking, our approach directly applies XGBoost to the raw FFT magnitude spectrum to locate informative frequency bands. This direct frequency-domain selection preserves the physical meaning of the features and avoids information loss from feature compression.

The capability of the proposed approach is examined using two well-established publicly available datasets, the Case Western Reserve University (CWRU) bearing dataset and the Beijing Jiaotong University (BJTU) planetary gearbox dataset, demonstrating its robustness across different fault types and mechanical systems. Furthermore, the proposed frequency band selection strategy significantly reduces the input feature dimensionality for deep learning model training, thereby improving computational efficiency without compromising classification performance.

The remainder of this paper is organised as follows: Section 2 describes the proposed fault classification framework, covering the frequency band selection using XGBoost and the CNN–Transformer classifier; Section 3 describes the two open-access datasets used in this study; Section 4 presents and discusses the training results and performance analysis of the proposed method on both datasets; and Section 5 summarises the findings, discusses the limitations of the proposed approach, and highlights potential improvements for further research.

2. Methodology

As shown in Figure 1, the proposed framework integrates XGBoost-assisted frequency band selection with a DL model to achieve high classification accuracy while maintaining computational efficiency. This method is designed to leverage the interpretability of tree-based models and the powerful pattern recognition capability of hybrid DL models. In the DL architecture, both the CNN and Transformer blocks can be stacked to form multi-layer structures, where the depth of each component is determined based on the complexity of the vibration signals being analysed and the concrete requirements of feature extraction. The proposed approach is structured into three core phases: (1) raw signal preprocessing, in which vibration signals are divided into multiple segments and transformed into frequency spectra; (2) XGBoost-assisted frequency band selection, in which the frequency spectrum of each segment is divided into several frequency bands and XGBoost is employed to evaluate and rank the importance of each band, with only the bands of high importance being retained and concatenated; and (3) DL-based fault classification, in which the prepared frequency features are used as input to a CNN–Transformer model to perform fault classification.

2.1. Raw Signal Segmentation and Preprocessing

In the initial step of the method, the original vibration data are segmented using a fixed-length window to facilitate frequency-domain transformation. The window slides along the signal, and its stride can be adjusted to suit the application. Normally, a smaller stride increases the number of samples extracted from the raw signal, but it also raises the computational demand because more data are used for training. In this study, the stride is initially set equal to the window length to create non-overlapping segments, and it can be adjusted depending on the sampling rate and data length of the signals in the dataset. The non-overlapping setup avoids duplicate content between samples and reduces computational cost. The data in a segment can be expressed by the following equation:

x_{n} = x [(n - 1) S + 1 : (n - 1) S + N_{w}],

(1)

where $x$ is the full vibration signal as a one-dimensional array, $x_{n}$ stands for the n-th segment, $N_{w}$ represents the window length, $S$ denotes the stride between two consecutive segments, and $n = 1, 2, 3, \dots$ is the segment index.
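As an illustration of Equation (1), the sliding-window segmentation might be sketched as follows. This is a minimal sketch rather than the authors' implementation: the function name `segment_signal` is hypothetical, indices are 0-based instead of the 1-based notation of Equation (1), and dropping a trailing partial window is an assumption.

```python
def segment_signal(x, window_length, stride):
    """Split a 1-D sequence into fixed-length windows (Equation (1), 0-based)."""
    if window_length <= 0 or stride <= 0:
        raise ValueError("window_length and stride must be positive")
    segments = []
    start = 0
    # Assumption: only complete windows are kept; a trailing partial window is dropped.
    while start + window_length <= len(x):
        segments.append(list(x[start:start + window_length]))
        start += stride
    return segments


signal = list(range(10))
# Stride equal to the window length gives non-overlapping segments.
non_overlapping = segment_signal(signal, window_length=4, stride=4)
# A smaller stride gives overlapping segments and therefore more samples.
overlapping = segment_signal(signal, window_length=4, stride=2)
```

With a ten-point toy signal, halving the stride doubles the number of segments, which reflects the trade-off between sample count and computational demand noted above.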

Each $x_{n}$ is then transformed into the frequency domain using the Discrete Fourier Transform (DFT), which is implemented efficiently using the Fast Fourier Transform (FFT) algorithm. The DFT of the n-th segment is computed as [36]:

X_{n} [f] = \sum_{k = 0}^{N_{w} - 1} x_{n} [k] \cdot e^{- j 2 π f k / N_{w}},

(2)

where $X_{n} [f]$ represents the complex DFT coefficient at frequency bin $f$, and $x_{n} [k]$ is the k-th data sample within segment $x_{n}$.

To convert the complex output into real-valued features, the magnitude spectrum is computed as:

| X_{n} [f] | = \sqrt{{(\operatorname{Re} X_{n} [f])}^{2} + {(\operatorname{Im} X_{n} [f])}^{2}},

(3)

where $\operatorname{Re} X_{n} [f]$ and $\operatorname{Im} X_{n} [f]$ are the real and imaginary parts of $X_{n} [f]$, respectively.

This stage transforms the segmented raw vibration signals into a frequency-domain representation that is used as input for the frequency band selection stage described in the next section.
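Equations (2) and (3) can be checked numerically with a direct, unoptimised DFT. The sketch below is for illustration only: it evaluates the sum in Equation (2) term by term (an FFT routine would be used in practice), the function name `magnitude_spectrum` is hypothetical, and keeping only the first $N_{w} / 2$ bins is an assumption made for consistency with Equation (4).

```python
import cmath
import math


def magnitude_spectrum(segment):
    """Naive O(N^2) DFT of one segment; returns |X[f]| for the first N // 2 bins."""
    n = len(segment)
    magnitudes = []
    for f in range(n // 2):
        # Equation (2): sum of x[k] * exp(-j * 2 * pi * f * k / N)
        coeff = sum(segment[k] * cmath.exp(-2j * math.pi * f * k / n)
                    for k in range(n))
        # Equation (3): magnitude from the real and imaginary parts
        magnitudes.append(math.hypot(coeff.real, coeff.imag))
    return magnitudes


n_points = 16
# A cosine completing exactly 3 cycles in the window concentrates its energy in bin 3.
tone = [math.cos(2 * math.pi * 3 * k / n_points) for k in range(n_points)]
spectrum = magnitude_spectrum(tone)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
```

For a real-valued signal the spectrum is conjugate-symmetric, so the second half of the bins carries no additional magnitude information.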

2.2. XGBoost-Assisted Identification of Discriminative Frequency Bands

Following the transformation of each segmented vibration signal into the frequency spectrum, the resulting magnitude spectra are typically high-dimensional, consisting of magnitude values across frequencies. However, the frequency components in the spectrum do not contribute equally to fault classification. To identify the most discriminative frequency components for classification, a frequency band identification strategy based on the XGBoost algorithm is proposed. Herein, it is necessary to note that XGBoost is not a dimensionality-reduction method, nor is it used for conventional feature selection in this study. In the literature, gain-based importance from XGBoost has been used primarily to rank manually extracted statistical features. In this study, however, XGBoost is employed solely to obtain an importance ranking of frequency bands, allowing us to identify the most discriminative bands within the spectra. Subsequent machine-learning models can then focus on the signal characteristics within these selected bands rather than analyzing the entire spectrum. In principle, this should help improve fault-detection accuracy.

To simplify the calculation, the one-dimensional frequency spectrum is divided into multiple parts first, referred to as frequency bands. Each frequency band contains the same number of consecutive frequency components. To organize the frequency spectrum into frequency bands, the following equations are used:

N_{bins} = \frac{N_{w}}{2},

(4)

N_{bands} = \left\lfloor \frac{N_{bins} - M}{S_{bands}} \right\rfloor + 1,

(5)

where $N_{bins}$ is the total number of frequency components in the spectrum of each signal segment, $M$ denotes the number of frequency components included in each band, $N_{bands}$ indicates the total number of frequency bands created, and $S_{bands}$ is the stride used when sliding the window along the spectrum.

Equation (5) serves two purposes. When $S_{bands} = M$, it gives the maximum number of non-overlapping bands in the spectrum. When $S_{bands} < M$, it gives the total number of band importance scores that will be computed during the frequency band selection stage.
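The two uses of Equation (5) can be illustrated with a short calculation. In this sketch the window length of 2000 points matches the value used later for the CWRU data, whereas the band width of 100 components is a purely hypothetical choice made for illustration.

```python
def count_bands(window_length, band_width, band_stride):
    """Number of frequency bands according to Equations (4) and (5)."""
    n_bins = window_length // 2                       # Equation (4)
    if not 0 < band_width <= n_bins or band_stride <= 0:
        raise ValueError("invalid band width or stride")
    return (n_bins - band_width) // band_stride + 1   # Equation (5)


# Stride equal to the band width: maximum number of non-overlapping bands.
n_non_overlapping = count_bands(window_length=2000, band_width=100, band_stride=100)
# Stride of one component: number of band importance scores to be computed.
n_scores = count_bands(window_length=2000, band_width=100, band_stride=1)
```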

XGBoost is an efficient gradient-boosting–based ensemble method that builds a powerful classifier by iteratively combining many weak learners, typically decision trees. Each tree is trained to correct the residual errors made by the previous trees, and the entire process minimizes a regularized objective function to ensure generalization and prevent over-fitting. The basic framework of XGBoost algorithm can be expressed by the following equation [37]:

{\hat{y}}_{i} = ϕ (x_{i}) = \sum_{k = 1}^{K} f_{k} (x_{i}), f_{k} \in F,

(6)

where $\hat{y}_{i}$ denotes the predicted output for the i-th sample $x_{i}$, $F$ is the function space that represents all regression functions, $f_{k}$ is an individual weak learner, and $K$ is the number of weak learners added sequentially.

To formalize the training objective, the total loss function minimized by the XGBoost algorithm combines both the training loss and a regularization term that penalizes model complexity. This is defined as:

L = \sum_{i = 1}^{n} ℓ (y_{i}, {\hat{y}}_{i}) + \sum_{k = 1}^{K} Ω (f_{k}),

(7)

where $L$ represents the total loss, $ℓ (y_{i}, \hat{y}_{i})$ is the loss between the actual value $y_{i}$ and the predicted value $\hat{y}_{i}$, and $Ω (f_{k})$ indicates the regularization term for each weak learner, which quantifies the complexity of the model.

During training, the loss $L$ at the t-th iteration can be expressed as:

L^{(t)} = \sum_{i = 1}^{n} ℓ (y_{i}, {\hat{y}}_{i}^{(t - 1)} + f_{t} (x_{i})) + Ω (f_{t}),

(8)

where $\hat{y}_{i}^{(t - 1)}$ is the prediction accumulated over the first $t - 1$ trees, and $f_{t}$ is the tree added at the current step $t$.

In multi-class settings, the XGBoost model is trained using the categorical cross-entropy objective, also known as the multi-class logarithmic loss, which quantifies the classification error during training. It is defined as [38]:

L (Y, P) = - \frac{1}{N_{x}} \sum_{i = 1}^{N_{x}} \sum_{c = 1}^{C} y_{i, c} \log p_{i, c},

(9)

p_{i, c} = \frac{e^{s_{i, c}}}{\sum_{c^{'} = 1}^{C} e^{s_{i, c^{'}}}},

(10)

where $N_{x}$ indicates the size of the dataset used for training, $C$ indicates the number of categories, and $y_{i, c}$ is a binary label indicating whether sample $i$ is associated with class $c$. $p_{i, c}$ is the predicted probability that sample $i$ belongs to class $c$, calculated using the Softmax function in Equation (10), where $s_{i, c}$ denotes the raw score output by the model for class $c$.
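Equations (9) and (10) can be made concrete with a few lines of plain Python. The sketch below is illustrative rather than a description of XGBoost internals: the function names are hypothetical, and the subtraction of the maximum score inside the Softmax is a standard numerical safeguard that leaves the result of Equation (10) unchanged.

```python
import math


def softmax(scores):
    """Equation (10): convert raw class scores into probabilities."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multiclass_log_loss(one_hot_labels, raw_scores):
    """Equation (9): mean categorical cross-entropy over all samples."""
    total = 0.0
    for y_row, s_row in zip(one_hot_labels, raw_scores):
        p_row = softmax(s_row)
        # Only the true class (y = 1) contributes to the inner sum.
        total -= sum(y * math.log(p) for y, p in zip(y_row, p_row) if y)
    return total / len(one_hot_labels)


labels = [[1, 0, 0], [0, 1, 0]]
uninformative_scores = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
confident_scores = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
loss_uninformative = multiclass_log_loss(labels, uninformative_scores)
loss_confident = multiclass_log_loss(labels, confident_scores)
```

With equal scores for three classes every probability is 1/3, so the loss equals ln 3; scores that favour the correct class drive the loss towards zero.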

To identify those high-importance frequency bands in the spectrum, we leverage the gain-based feature importance measure from the XGBoost framework. In this study, each input sample consists of the magnitudes of frequency components. During training, XGBoost evaluates potential splits in decision trees using the expected improvement in the loss function. This improvement is referred to as “gain”. The gain for a split at a decision node is calculated as [37]:

Gain = \frac{1}{2} [\frac{G_{L}^{2}}{H_{L} + λ} + \frac{G_{R}^{2}}{H_{R} + λ} - \frac{{(G_{L} + G_{R})}^{2}}{H_{L} + H_{R} + λ}] - γ,

(11)

where $G_{L}$ and $G_{R}$ denote the aggregated first-order gradient terms for the left and right branches, respectively, and $H_{L}$ and $H_{R}$ represent the corresponding sums of the second-order gradients. $λ$ is the regularization term on leaf weights, and $γ$ represents the complexity penalty for adding a new leaf.
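A direct transcription of Equation (11) shows how the regularization terms reduce the gain of a candidate split. The function name `split_gain` and the gradient sums used below are hypothetical values chosen so that the arithmetic can be followed by hand.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma):
    """Equation (11): loss reduction obtained by splitting a node in two."""
    def score(g, h):
        # Structure score of a leaf with gradient sum g and Hessian sum h.
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    children = score(g_left, h_left) + score(g_right, h_right)
    return 0.5 * (children - parent) - gamma


# Two branches with opposite gradient sums: the split separates them cleanly.
gain_unregularised = split_gain(-4.0, 2.0, 4.0, 2.0, reg_lambda=0.0, gamma=0.0)
gain_regularised = split_gain(-4.0, 2.0, 4.0, 2.0, reg_lambda=2.0, gamma=1.0)
```

In this toy case the unregularised gain is 8, and it falls to 3 once $λ = 2$ and $γ = 1$ are applied.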

By summing the gain across all nodes and all trees where a frequency component is used for splitting, XGBoost computes a total importance score for each frequency component. However, rather than focus on individual components, this study emphasizes the identification of the most important frequency bands for fault classification. To achieve this, importance scores are aggregated within fixed-width frequency bands. This approach reflects the fact that fault-induced spectral patterns often span across multiple adjacent frequency components, making it more meaningful to evaluate importance at the frequency band level. Therefore, the gain score of each fixed-length band can be computed using the following equation:

S_{m} = \sum_{j = m}^{m + N_{1} - 1} I (j), \quad \text{for } m = 0, 1, \dots, N_{bands}^{'} - N_{1},

(12)

where $S_{m}$ is the total importance score of the frequency band starting at component $m$, $N_{bands}^{'}$ is the calculated number of bands when $S_{bands} = 1$, $N_{1}$ denotes the number of consecutive frequency components in each frequency band, and $I (j)$ is the importance score of the j-th frequency component.

Once the importance scores of all frequency bands are computed, the bands are ranked accordingly. To prevent redundancy, only non-overlapping bands are selected during ranking. The number of frequency bands selected for DL model training is flexible: bands are added until their cumulative importance accounts for at least 75% of the total importance score, ensuring that the majority of useful spectral information is preserved. This stage is designed to identify the most discriminative frequency bands for classification, ensuring that the DL model focuses on learning the most relevant frequency components while achieving a substantial reduction in dimensionality.
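The band scoring and selection procedure can be sketched as a greedy search. This is one possible reading of the text rather than the authors' code: the function names are hypothetical, every start position that yields a full-width band is scored, the "total importance score" is taken to be the sum of all component importances, and the toy importance values are invented for illustration.

```python
def band_scores(importance, band_width):
    """S_m of Equation (12): summed importance of the band starting at component m."""
    return [sum(importance[m:m + band_width])
            for m in range(len(importance) - band_width + 1)]


def select_bands(importance, band_width, coverage=0.75):
    """Greedily pick non-overlapping bands until the coverage target is reached."""
    total = sum(importance)
    if total <= 0:
        raise ValueError("importance scores must sum to a positive value")
    scores = band_scores(importance, band_width)
    ranked = sorted(range(len(scores)), key=lambda m: scores[m], reverse=True)
    selected, covered = [], 0.0
    for m in ranked:
        if covered >= coverage * total:
            break
        # Skip candidates that share components with an already selected band.
        if any(abs(m - s) < band_width for s in selected):
            continue
        selected.append(m)
        covered += scores[m]
    return sorted(selected), covered / total


# Toy component importances (hypothetical), e.g. gain scores from a trained model.
importance = [0, 0, 5, 5, 0, 0, 1, 1, 0, 4]
starts, share = select_bands(importance, band_width=2)
```

Here the bands starting at components 2 and 8 are kept and together account for 87.5% of the total importance, so the search stops before the weaker band around component 6 is considered.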

2.3. Deep Learning Architecture

Following the band selection process by the XGBoost algorithm, only the most discriminative frequency bands are retained while the others are removed. These frequency bands are merged to form a single feature vector representing the spectral content of each time window. This dimensionally reduced dataset is then used to train a DL classifier based on a hybrid CNN–Transformer architecture. As shown in Figure 1, the model has two blocks: a CNN block that learns localised features and a Transformer block that captures long-range dependencies. In the CNN block, each input vector, which consists of the magnitudes of the frequency components in the selected frequency bands, is processed by a one-dimensional convolutional layer and then standardised using batch normalisation before a ReLU activation function is applied. The operation performed by the CNN block can be expressed by the following equation [39]:

z_{i} = MaxPool (σ (BN (\sum_{j} x_{j} * k_{i, j} + b_{i}))),

(13)

where $x_{j}$ is the j-th input channel, $*$ represents the one-dimensional convolution operator in which the kernel $k_{i, j}$ is applied to the input feature, $b_{i}$ denotes the learnable bias term, $BN$ indicates batch normalization, and $σ$ is the ReLU activation function. Since, in the implementation, the input consists of a single-channel FFT vector, the expression can be simplified to:

z_{i} = MaxPool (σ (BN (x * k_{i} + b_{i}))) .

(14)
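The order of operations in Equation (14) can be traced with a small dependency-free sketch. Several simplifications are assumed here and are not taken from the paper: a single output channel, batch normalization without learnable scale and shift, non-overlapping pooling windows, and a sliding dot product (cross-correlation, which is what common DL libraries compute for a convolutional layer). All function names and the toy batch are hypothetical.

```python
import math


def conv1d(x, kernel, bias):
    """Valid 1-D sliding dot product of one input vector with one kernel."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k)) + bias
            for i in range(len(x) - k + 1)]


def batch_norm(feature_maps, eps=1e-5):
    """Normalise one channel with the mean and variance of the whole batch."""
    values = [v for fmap in feature_maps for v in fmap]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [[(v - mean) / math.sqrt(var + eps) for v in fmap]
            for fmap in feature_maps]


def relu(fmap):
    return [max(0.0, v) for v in fmap]


def max_pool(fmap, size):
    """Maximum over consecutive non-overlapping windows."""
    return [max(fmap[i:i + size]) for i in range(0, len(fmap) - size + 1, size)]


def cnn_block(batch, kernel, bias, pool_size):
    """Equation (14): MaxPool(ReLU(BN(x * k + b))) for a single channel."""
    conv = [conv1d(x, kernel, bias) for x in batch]
    return [max_pool(relu(fmap), pool_size) for fmap in batch_norm(conv)]


# Toy batch: one rising and one falling vector; the kernel responds to the slope.
batch = [[0, 1, 2, 3, 4], [4, 3, 2, 1, 0]]
pooled = cnn_block(batch, kernel=[1, 0, -1], bias=0.0, pool_size=3)
```

The rising vector produces negative responses that ReLU removes, whereas the falling vector survives normalization and pooling with a value close to 1.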

The resulting output feature maps are then reshaped with learnable positional encodings before being passed into the Transformer block. The Transformer module operates on this sequence to capture relationships across the full input, which is based on the self-attention mechanism, and it allows each position in the input to compute a weighted representation of all other positions [29].

The scaled dot-product attention is defined as:

Attention (Q, K, V) = softmax (\frac{Q K^{⊤}}{\sqrt{d_{k}}}) V,

(15)

where $Q = X W^{Q}$, $K = X W^{K}$, and $V = X W^{V}$ are respectively the query, key, and value matrices projected from the input $X$, and $d_{k}$ is the dimensionality of the key vectors.
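Equation (15) can be followed step by step on a two-position example. The sketch below assumes that $Q$, $K$, and $V$ have already been projected, so the weight matrices $W^{Q}$, $W^{K}$, and $W^{V}$ are omitted; the helper names and the toy matrices are hypothetical.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def softmax(row):
    shift = max(row)
    exps = [math.exp(v - shift) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Equation (15): softmax(Q K^T / sqrt(d_k)) V, returning output and weights."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Two sequence positions with two-dimensional queries, keys, and values.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
output, weights = attention(q, k, v)
```

Each row of `weights` sums to one, and each output row is a weighted average of the value rows that leans towards the position whose key best matches the query.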

Rather than relying on a single set of linear projections, the model benefits from learning multiple representations of the input using multi-head attention, in which multiple self-attention operations are computed in parallel. The multi-head attention mechanism is defined as:

MultiHead (Q, K, V) = Concat ({head}_{1}, \dots, {head}_{h}) W^{O},

(16)

where each attention operator is computed independently as:

{head}_{i} = Attention (Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}),

(17)

After processing through the Transformer module, the output is reshaped and forwarded to a fully connected layer for class score prediction. The model is trained with a loss function against one-hot encoded labels, and class prediction is performed by selecting the index of the maximum output score. In practice, both the CNN and Transformer components can be deepened by stacking multiple instances of their respective blocks, allowing the model to better capture important information from the input features. However, in this study, to maintain a simple and light-weight architecture, a single CNN block and two stacked Transformer blocks are used. This design leverages the strengths of DL algorithms while keeping the computational demand as low as possible, making it suitable for practical applications. For clarity, the pseudocode describing the overall workflow of the proposed method is presented in Algorithm 1.

Algorithm 1 XGBoost–Based Frequency Band Selection and CNN–Transformer Classification

Input: Single-channel vibration signals $x^{(c)} (t)$ for $c = 1, \dots, 5$ representing the five gear conditions; window length $N_{w}$; stride $S$; band width $M$; band stride $S_{bands}$; number of selected bands $K$; number of folds $k$.
Output: Trained classifier $f_{θ}$; fold-wise test metrics; mean and standard deviation of accuracy; confusion matrix and learning curves of the best fold.

1: Segment $x^{(c)} (t)$ into windows $x_{n}$ of length $N_{w}$ with stride $S$.
2: Compute the magnitude spectrum $m_{n}$ of each segment via FFT.
3: Divide each $m_{n}$ into frequency bands of width $M$ (using stride $S_{bands}$) and form feature vectors ${\tilde{z}}_{n}$ with labels $y_{n}$.
4: Perform stratified 80/20 split: $(\tilde{Z}, y) \to ({\tilde{Z}}_{temp}, y_{temp})$ and $({\tilde{Z}}_{test}, y_{test})$.
5: Train XGBoost on $({\tilde{Z}}_{temp}, y_{temp})$ to obtain component-wise importance scores $I (j)$.
6: Compute band importance scores $S_{m}$ by summing $I (j)$ within each band; rank all $S_{m}$ and select the top $K$ non-overlapping bands.
7: Retain only the selected frequency components to obtain ${\tilde{Z}}_{temp}^{*}$ and ${\tilde{Z}}_{test}^{*}$.
8: Normalize each reduced feature vector ${\tilde{z}}_{n}^{*}$.
9: Split $({\tilde{Z}}_{temp}^{*}, y_{temp})$ into $k$ stratified folds $\{D_{1}, \dots, D_{k}\}$.
10: For each fold $i$: train the CNN–Transformer model $f_{θ^{(i)}}$, select the parameters with the highest validation accuracy, and evaluate the model on $({\tilde{Z}}_{test}^{*}, y_{test})$ to obtain fold-wise performance.
11: Compute the mean and standard deviation of test accuracy ($μ_{acc}, σ_{acc}$) and report the confusion matrix and learning curves of the best fold $i^{*}$.

3. Open-Access Dataset

This section describes the two publicly available datasets used to evaluate the effectiveness of the proposed method: the CWRU bearing dataset [40] and the BJTU planetary gearbox dataset [41]. The former is a widely adopted benchmark in the literature for rotating machinery fault diagnosis, while the latter is a more recently published resource that, to the best of the authors’ knowledge, is one of the few open-access datasets available for planetary gearbox fault classification.

These datasets were chosen because they represent two critical drivetrain components commonly found in wind turbines: bearings and planetary gearboxes. The bearing data offer a relatively simple vibration scenario, suitable for assessing the baseline behaviour of the proposed method, whereas the planetary gearbox data capture more complex, modulated vibration responses typical of gear stages in wind turbine drivetrains. By evaluating the model on both types of components, the study examines not only its performance under increasing signal complexity but also its adaptability to different mechanical fault mechanisms when using vibration signals, thereby supporting its potential applicability to wind-turbine condition monitoring.

3.1. CWRU Bearing Dataset

This dataset was created by Case Western Reserve University on the test rig shown in Figure 2. It is a publicly available dataset in the area of rotating equipment fault detection and provides bearing vibration signals recorded under multiple machine operating states [40]. Since its release, the CWRU dataset has been widely used in the literature due to its consistency, accessibility, and relevance to real-world fault scenarios. Compared to the vibration signals obtained from more complex rotating machinery, bearing vibration signals are relatively clean and easy to interpret due to the system structure and controlled test conditions. This makes them suitable for an initial test of the proposed method. Following this initial test, the proposed method is further evaluated on data from a more complex rotating system to test its efficiency and generality.

The experimental arrangement associated with the CWRU bearing dataset comprises a 2-horsepower induction motor coupled to a dynamometer via a shaft and torque transducer [40]. The vibration signals were collected by two accelerometers mounted at the drive end and the fan end, respectively. The load levels ranged from 0 to 3 horsepower and the motor speed was within the range of 1720 to 1797 RPM.

The segmented vibration signals and their frequency spectra are given in Figure 3.

In this study, the vibration data were obtained using an accelerometer installed close to the motor’s drive end and sampled at 48 kHz. Since the duration of the recordings varies across fault conditions, the drive-end signal in each condition is truncated to 480,000 samples to ensure that an equal amount of data is used for all classes. This truncated signal is then used for the subsequent segmentation and frequency-domain processing steps.

The motor operated at approximately 1730 RPM under the maximum load condition of 3 horsepower. The dataset includes three fault types, namely inner race fault, rolling element fault, and outer race fault, with each type represented by three fault sizes of 7 mils, 14 mils, and 21 mils in diameter (1 mil = 0.001 inch). For the outer race fault, the defect location was set at the 6 o’clock position. In total, nine faulty conditions were included, along with one set of data from a healthy bearing to serve as the baseline. The mapping between fault index labels and corresponding fault conditions used in this study is listed in Table 1. From each data file, 480,000 data points were extracted, and the signals were then divided into overlapping segments using a window size of 2000 points and a stride of 1000 points. Spectral analysis was then applied to each segment, and 1000 frequency components were retained as input features. Figure 3 provides a representative time-domain segment and the corresponding frequency spectrum under each bearing condition.

3.2. BJTU Planetary Gearbox Dataset

While most publicly available datasets in the field of machinery fault diagnosis are bearing-related, it is essential to evaluate the proposed method using vibration signals from more complex rotating equipment. For this purpose, the planetary gearbox dataset developed by Beijing Jiaotong University (BJTU) was selected. Published in 2024 [41], the BJTU planetary gearbox dataset provides several advantages for DL research. For example, it offers long-duration recordings for each condition, includes multi-sensor measurements, spans a wide range of operating speeds, and captures each condition both before and after reinstallation. These features make the BJTU dataset a suitable benchmark for evaluating the proposed method under more sophisticated mechanical conditions.

The experimental setup used to generate the dataset consists of an electric motor, a planetary gearbox, a fixed-shaft gearbox, and a load device, as shown in Figure 4. The planetary gearbox contains four planet gears rotating around a central sun gear, which serves as the primary fault target in this study. A vibration sensor is mounted on the gearbox casing, and an encoder records the rotational velocity of the motor during operation. All signals were collected using a sampling frequency of 48 kHz. Each scenario in Table 2 was tested at eight speed settings, from 1200 rpm to 3300 rpm, in increments of 300 rpm, and the analysis in this study uses only the data acquired at 1200 rpm. Likewise, some examples of segmented vibration signals and their frequency spectra are illustrated in Figure 5 to ease understanding.

It is worth noting that although vibration signals in two orthogonal directions were collected during the test, only the signals collected in the vertical direction were used for analysis in this study. From each test, 2,880,000 data points were collected, which is sufficient for frequency-domain analysis. The signals were divided into overlapping segments of 2000 data points with a stride of 1000 data points, and spectral analysis was applied to each segment to obtain frequency spectra for subsequent classification.
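The segmentation and spectral step used for both datasets can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the helper names `segment_signal`, `count_segments`, and `magnitude_spectrum` are hypothetical, a naive DFT stands in for an FFT routine, and keeping bins 0 to N/2 - 1 is an assumption chosen to match the 1000 components retained per 2000-point window.

```python
import cmath
import math


def segment_signal(signal, window=2000, stride=1000):
    """Split a 1-D sequence into overlapping windows (50% overlap by default)."""
    last_start = len(signal) - window
    return [signal[s:s + window] for s in range(0, last_start + 1, stride)]


def count_segments(n_samples, window=2000, stride=1000):
    """Number of full windows that fit into n_samples."""
    if n_samples < window:
        return 0
    return (n_samples - window) // stride + 1


def magnitude_spectrum(segment):
    """Single-sided magnitude spectrum via a naive DFT.

    Assumption: bins 0 .. N/2 - 1 are kept, so a 2000-point window yields
    1000 components. A real pipeline would call an FFT routine instead.
    """
    n = len(segment)
    spectrum = []
    for k in range(n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(segment))
        spectrum.append(abs(acc) / n)
    return spectrum


# Small synthetic check with the same 24 Hz resolution as
# fs = 48 kHz and a 2000-point window (4800 / 200 = 48000 / 2000).
fs, window, stride = 4800, 200, 100
tone_hz = 240.0
signal = [math.sin(2 * math.pi * tone_hz * t / fs) for t in range(1000)]

segments = segment_signal(signal, window, stride)
spec = magnitude_spectrum(segments[0])
resolution_hz = fs / window
peak_bin = max(range(len(spec)), key=spec.__getitem__)
print(len(segments), len(spec), resolution_hz, peak_bin * resolution_hz)
```

With the settings stated in the text, `count_segments(480_000)` gives 479 segments per CWRU condition and `count_segments(2_880_000)` gives 2879 segments per BJTU test, and the frequency resolution is 48,000 / 2000 = 24 Hz per bin.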

4. Results and Discussion

During model training, the Cross-Entropy (CE) loss is adopted, as it is the standard objective for multi-class classification. It is defined as [38]:

$$L_{CE} = - \sum_{c = 1}^{C} y_{c} \log p_{c}, \qquad (18)$$

where $y_{c}$ is the one-hot ground-truth label for class c, and $p_{c}$ is the softmax predicted probability of that class.
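As a worked illustration of the CE loss, the short sketch below evaluates it for a single sample. The logits and label are hypothetical values chosen only for demonstration, and the function names are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(one_hot, probs, eps=1e-12):
    """L_CE = -sum_c y_c * log(p_c); eps guards against log(0)."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(one_hot, probs))


# Hypothetical 3-class example: the true class is index 1
logits = [1.0, 3.0, 0.5]
y = [0, 1, 0]
p = softmax(logits)
loss = cross_entropy(y, p)
print(round(loss, 4))
```

Because the label is one-hot, the sum reduces to the negative log-probability assigned to the true class, so the loss approaches zero as that probability approaches one.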

The classification accuracy is computed using the following equation, where $N_{correct}$ is the number of evaluated samples whose predicted class matches the actual label and $N$ is the total number of evaluated samples:

$$Accuracy = \frac{N_{correct}}{N} \times 100\%. \qquad (19)$$
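The accuracy measure, together with the confusion matrix reported later for the best fold, can be computed from predicted and true labels as in the following sketch. The label lists are hypothetical and the function names are illustrative only.

```python
def accuracy_percent(y_true, y_pred):
    """Accuracy = N_correct / N * 100, for equally long label lists."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    n_correct = sum(t == p for t, p in zip(y_true, y_pred))
    return n_correct / len(y_true) * 100.0


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns the predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Hypothetical labels for a 3-class problem
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 0, 1, 2, 2, 2, 2, 1]

acc = accuracy_percent(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, n_classes=3)
print(acc)
for row in cm:
    print(row)
```

The diagonal of the matrix counts the correctly classified samples, so its sum divided by the number of samples reproduces the accuracy.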

All experiments were conducted using the laboratory PC at the Centre for Efficiency and Performance Engineering (CEPE), University of Huddersfield. The system was equipped with an Intel(R) Core(TM) i7-14700 CPU (Intel, Santa Clara, CA, USA), 32 GB of RAM, and an NVIDIA GeForce RTX 4060 GPU (NVIDIA, Santa Clara, CA, USA). The development environment was based on PyTorch version 2.5.1 and Python version 3.12.7. All model training tasks were accelerated using GPU computation via the CUDA toolkit (version 12.4). In this study, the same XGBoost model configuration was applied to both datasets for feature selection purposes. The model was trained using 100 decision trees for multi-class classification. During training, performance was evaluated using the multi-class logarithmic loss (log-loss) metric (Equation (9)). The importance of individual frequency bands was calculated using the gain importance, which is defined in Equation (11).

For both datasets, the initial learning rate (LR) was set separately for each case, and an LR decay factor of 0.90 was applied using a performance-based scheduling strategy. To ensure a sufficient number of samples for both training and evaluation, the full dataset was first split into a development set (80%) and an independent test set (20%) using stratified sampling. The development set was then further divided using 5-fold stratified cross-validation, where in each fold, four partitions were used for training and one for validation. During each fold, the model checkpoint that achieved the highest validation accuracy was retained and subsequently evaluated on the fixed 20% test set. The final results were reported as the mean and standard deviation of the test accuracy across the five folds. For each case, the model performance curves during training were plotted using the best-fold model, and the confusion matrix was also generated using this best-fold model on the independent test set.
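The data-splitting protocol can be sketched in plain Python as follows. This is an illustrative outline only: the function names, the random seed, the round-robin fold assignment, the synthetic labels, and the placeholder accuracies are all assumptions, and the training step is omitted. Whether the sample or the population standard deviation was reported is not stated in the text; the sample form is used here.

```python
import random
import statistics
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (dev_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    dev_idx, test_idx = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        dev_idx.extend(members[n_test:])
    return sorted(dev_idx), sorted(test_idx)


def stratified_folds(indices, labels, k=5, seed=0):
    """Deal each class's indices round-robin into k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    return folds


# Synthetic labels: 10 classes with 50 segments each (not the real data)
labels = [c for c in range(10) for _ in range(50)]
dev_idx, test_idx = stratified_split(labels)
folds = stratified_folds(dev_idx, labels)

for i, val_idx in enumerate(folds):
    train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
    # Train on train_idx, keep the checkpoint with the best accuracy on
    # val_idx, then score that checkpoint on the fixed test_idx (omitted).

# Placeholder per-fold test accuracies, only to show the summary step
fold_test_acc = [98.0, 99.0, 100.0, 99.0, 99.0]
mean_acc = statistics.mean(fold_test_acc)
std_acc = statistics.stdev(fold_test_acc)
print(len(dev_idx), len(test_idx), [len(f) for f in folds], mean_acc, round(std_acc, 4))
```

Keeping the test indices fixed across folds means that differences between fold results reflect the training and validation partition only, not a change in the evaluation data.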

4.1. Training Results

4.1.1. Training Results on CWRU Bearing Dataset

For the tests performed on the CWRU bearing dataset, the k-fold cross-validation strategy was adopted. This approach ensures that all samples in the development set are used for both training and validation across different folds, offers a more dependable assessment of model performance, and mitigates the risk of bias arising from a single fixed train–validation split. The model architecture and hyperparameter settings for Case 1 are shown in Table 3.

Owing to the proposed discriminative frequency band selection strategy, the model exhibited fast convergence during training. Therefore, the training process was carried out for a total of 50 epochs, with a fixed LR of 0.0008. The model architecture employed in this study, as illustrated in Figure 1, includes one convolutional block and two transformer encoder layers. A single CNN block was used because the signals in the CWRU dataset present distinct vibration patterns that are easy to interpret.

The training and validation accuracy and loss curves, as well as the corresponding confusion matrix, are shown in Figure 6. Figure 6a shows both classification accuracy and loss over 50 training epochs. The model exhibited rapid convergence within the first few epochs, indicating the effectiveness of the proposed frequency band selection strategy. Both the training accuracy and validation accuracy stabilized above 99% after 10 epochs. Meanwhile, both the training and validation loss values decreased rapidly during early training and remained low throughout, indicating good convergence and minimal over-fitting. Figure 6b shows the final classification results based on the confusion matrix of the best-fold model on the independent test set. No misclassification occurred, indicating that the model successfully recognised every fault class. The mean classification accuracy across the five folds reached 99.94%, demonstrating highly reliable fault recognition performance. The detailed k-fold accuracy statistics over the independent test set are summarised in Table 4, which demonstrates the reliability and accuracy of the proposed method in diagnosing the health status of ball bearings.

4.1.2. Training Results on BJTU Wind Turbine Planetary Gearbox Dataset

For the experiments conducted on the BJTU planetary gearbox dataset, the same cross-validation framework described earlier was applied to ensure robust and unbiased performance evaluation. This approach is also suitable for this dataset, which contains a large number of samples across multiple operating conditions, enabling reliable model optimisation while avoiding a fixed split that might bias the evaluation. In this case, a total of 200 epochs was used during training, and the LR was initialized at 0.00002. As illustrated in Figure 1, the model architecture consisted of two convolutional blocks and two transformer encoder layers. An additional CNN block was introduced to better capture the more complex vibration characteristics of the signals in the BJTU dataset. The model architecture and hyperparameter settings for Case 2 are shown in Table 5.

Figure 7 shows the performance curves and confusion matrix for Case 2.

Figure 7a presents the training and validation curves for the BJTU planetary gearbox dataset over 200 epochs. The model converged efficiently and stably, with both training and validation accuracies reaching above 99% within the first 50 epochs. As training progressed, the loss continued to decrease smoothly, and no signs of over-fitting were observed. The corresponding confusion matrix is shown in Figure 7b. As illustrated, the model demonstrated strong and reliable classification accuracy for all five fault classes. Only a few misclassification cases were observed, mainly between the ‘Broken Tooth’ and ‘Missing Tooth’ classes, while the remaining categories were classified correctly in all cases. The mean test accuracy across the five folds reached 99.70%, further demonstrating that the proposed approach maintains high performance. The complete k-fold accuracy statistics over the independent test set are summarised in Table 6. These results suggest that the proposed method also performs very well in diagnosing planetary gearbox failures.
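The performance-based LR decay mentioned in the experimental setup can be illustrated with a small scheduler that multiplies the LR by 0.90 whenever the monitored metric stops improving. This is a sketch under assumptions: the class name `PlateauDecay`, the patience value, and the choice of validation accuracy as the monitored metric are not given in the text. In PyTorch, `torch.optim.lr_scheduler.ReduceLROnPlateau` offers comparable behaviour, although the paper does not state which scheduler was used.

```python
class PlateauDecay:
    """Multiply the learning rate by `factor` when the metric stops improving."""

    def __init__(self, initial_lr, factor=0.90, patience=5):
        self.lr = initial_lr
        self.factor = factor
        self.patience = patience  # assumed value, not given in the text
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_metric):
        """Update with a higher-is-better metric and return the current LR."""
        if val_metric > self.best:
            self.best = val_metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


# Hypothetical validation accuracies over six epochs
scheduler = PlateauDecay(initial_lr=2e-5, patience=2)
history = [0.90, 0.95, 0.95, 0.94, 0.95, 0.96]
lrs = [scheduler.step(acc) for acc in history]
print(lrs)
```

In this toy run the LR is reduced once, after three consecutive epochs without a new best value, and then stays at the reduced level.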

4.2. Performance Analysis

Table 7 and Table 8 present the ranked frequency bands selected for Case 1 and Case 2, respectively, based on importance scores obtained from the XGBoost feature selection algorithm.

As discussed previously, the number of selected frequency bands was predefined to ensure coverage of at least 75% of the total spectral importance. In Case 1, the top five bands collectively account for more than 85% of the total importance, while in Case 2, the top five bands contribute over 75%. These results confirm that the choice of using the top five bands is sufficient to retain the majority of relevant frequency-domain information in both cases. It is noteworthy that the most informative frequency bands (Figure 8), namely 7848–8448 Hz in Case 1 and 8256–8856 Hz in Case 2, do not coincide with the conventional fault characteristic frequencies. This finding suggests that manual frequency selection based solely on theoretical fault characteristic frequency calculations may fail to capture the most informative features present in real-world data for DL model training. In contrast, the data-driven selection approach employed here enables the identification of critical frequency bands that contribute directly to classification performance. The presence of the low-frequency band 240–840 Hz in Case 1 and the high-frequency band 11,640–12,240 Hz in Case 2 suggests that relevant diagnostic information is not confined to a specific spectral range and may vary depending on the machinery or fault scenario. This reinforces the strength of the proposed XGBoost-based band selection strategy, which identifies informative frequency bands through data-driven analysis rather than relying on manually defined fault-related frequencies.

As discussed in Section 2.2, the individual frequency bin importance scores were extracted using gain-based importance from the trained XGBoost classifier. After obtaining the gain score for each frequency bin, a sliding window approach with a stride of one bin was applied to aggregate the importance scores across contiguous frequency bands. To ensure non-overlapping selection, bands were selected iteratively in descending order of importance, skipping any frequency band that overlapped with a previously chosen one. This process continued until the predefined number of most discriminative frequency bands was identified. The selected frequency bands are shown in Figure 8 and Figure 9.
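The coverage criterion can be made concrete with a short helper that returns the smallest number of ranked bands whose cumulative share of importance reaches a target. The scores below are hypothetical and are not the values in Table 7 or Table 8; the function name is illustrative.

```python
def bands_needed(ranked_scores, target_share=0.75):
    """Smallest K such that the top-K bands reach target_share of the total."""
    total = sum(ranked_scores)
    if total <= 0:
        raise ValueError("importance scores must sum to a positive value")
    running = 0.0
    for k, score in enumerate(ranked_scores, start=1):
        running += score
        if running / total >= target_share:
            return k
    return len(ranked_scores)


# Hypothetical band scores, already sorted in descending order
scores = [30, 20, 15, 10, 5, 5, 5, 4, 3, 3]
print(bands_needed(scores, 0.75), bands_needed(scores, 0.90))
```

A check of this kind shows how quickly the cumulative importance saturates and therefore how many bands a given coverage threshold implies for a particular dataset.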
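The sliding-window aggregation and non-overlapping selection can be sketched as follows. This is a minimal illustration rather than the authors' code: `select_bands` is a hypothetical name, the tie-breaking rule (lower start bin first) is an assumption, and the toy importance vector is synthetic.

```python
def select_bands(bin_importance, band_width, top_k):
    """Greedy top-K selection of non-overlapping bands.

    Every window of band_width contiguous bins (stride of one bin) is scored
    by the sum of its bin importances. Windows are visited from the highest
    to the lowest score and kept only if they do not overlap an earlier pick.
    Returns (start_bin, end_bin_exclusive, score) tuples in rank order.
    """
    n = len(bin_importance)
    # Prefix sums make each window score an O(1) lookup
    prefix = [0.0]
    for value in bin_importance:
        prefix.append(prefix[-1] + value)
    windows = [(prefix[s + band_width] - prefix[s], s)
               for s in range(n - band_width + 1)]
    windows.sort(key=lambda w: (-w[0], w[1]))

    selected = []
    for score, start in windows:
        end = start + band_width
        if all(end <= s or start >= s + band_width for _, s in selected):
            selected.append((score, start))
            if len(selected) == top_k:
                break
    return [(start, start + band_width, score) for score, start in selected]


# Toy example: 20 bins with two clusters of important bins
importance = [0] * 20
importance[4], importance[5], importance[12] = 5, 3, 4
bands = select_bands(importance, band_width=3, top_k=2)
print(bands)
```

To convert bin indices to frequencies, multiply by the frequency resolution. With 24 Hz per bin (48 kHz sampling, 2000-point window) and a 25-bin window, a band starting at bin 327 spans 7848 to 8448 Hz, which agrees with the top band reported for Case 1; the 25-bin width is inferred from the 600 Hz band widths rather than stated explicitly.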

To visualize the effect of using frequency bands with varying importance levels and to simplify computation, the sliding window technique was modified to use non-overlapping windows instead of a stride of one bin. This resulted in 40 non-overlapping frequency bands across the entire frequency spectrum. These bands were ranked in descending order of importance and then divided into 8 groups based on their ranked position. The model was trained 8 times, each time using one group of frequency bands as input features, and the validation accuracy results for each group are presented in Figure 10. In both cases, it can be observed that groups containing the highest ranked frequency bands (specifically ranks 1–5 and 6–10) achieved substantially higher validation accuracy. This result demonstrates that in both scenarios, the top five most relevant bands are sufficient to achieve high classification performance while maintaining computational efficiency. On the other hand, as lower ranked frequency bands were used, the model performance gradually degraded. This trend indicates that a limited number of selected frequency bands can preserve most of the information needed for DL models, leading to efficient input reduction while maintaining high performance. The results also reinforce the validity of the proposed band selection strategy across different types of rotating machinery.
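The grouping experiment can be outlined in code as follows. The sketch partitions the bins into 40 equal bands, ranks them by summed importance, and forms 8 groups of 5 bands by rank. The function name and the synthetic importance values are assumptions made for illustration.

```python
def rank_groups(bin_importance, n_bands=40, group_size=5):
    """Partition bins into equal bands, rank bands, and group them by rank."""
    n_bins = len(bin_importance)
    if n_bins % n_bands != 0:
        raise ValueError("n_bins must be divisible by n_bands")
    width = n_bins // n_bands
    scores = [sum(bin_importance[b * width:(b + 1) * width])
              for b in range(n_bands)]
    # Descending score; ties broken by band index (an arbitrary choice)
    ranked = sorted(range(n_bands), key=lambda b: (-scores[b], b))
    groups = [ranked[i:i + group_size] for i in range(0, n_bands, group_size)]
    return groups, scores, width


# Synthetic per-bin importance for 1000 bins (not real gain values)
bin_importance = [(j * 37) % 101 for j in range(1000)]
groups, scores, width = rank_groups(bin_importance)
print(width, len(groups), groups[0])
```

Each group would then be used in turn as the model input, so that validation accuracy can be compared across rank positions as in Figure 10.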

To further compare the fault detection performance of the proposed method with existing machine learning approaches, a comparison with state-of-the-art machine learning-based fault detection techniques was conducted in the study. A summary of these techniques is presented in Table 9, highlighting their feature extraction strategies, model architectures, and reported classification accuracies. Among these techniques, the advanced Transformer model proposed in [42] uses frequency features and a multi-scale encoder to capture both local and global fault patterns; with a custom cross-flipped decoder, the model achieved 99.85% accuracy. The method proposed in [43] combines wavelet packet decomposition with a DBN classifier, where DBN hyper-parameters are optimized using a chaotic sparrow search algorithm. The method presented in [44] transforms vibration signals into time–frequency representations through variational mode decomposition and continuous wavelet transform, which are then classified through an improved CNN. This image-based approach enables spatial feature learning and achieves strong diagnostic performance across varying operating conditions. The approach described in [45] introduces a time-domain diagnosis model based on the KACSEN architecture, which leverages the Kolmogorov–Arnold framework for complex feature mapping. By integrating an attention mechanism, the model emphasizes informative signal components and performs effectively under different fault types. In [46], an end-to-end fault diagnosis framework combining a multi-scale CNN with LSTM layers is proposed, where the CNN extracts hierarchical features from both raw and down-sampled time-domain signals, and the LSTM captures temporal dependencies for final classification. Compared with these existing methods, the proposed model demonstrates strong performance in terms of accuracy, efficiency, and interpretability.
Unlike approaches that depend on signal pre-processing, manual frequency localization, image conversion, or complex in-model computations, our method directly utilizes selected frequency bands. In our experiments, only 1/8 of the total frequency bands were selected and used for model training, significantly reducing the input size without compromising performance. The application of XGBoost for band selection ensures that only the most informative spectral regions are retained, effectively minimizing redundancy. Combined with a hybrid CNN–Transformer architecture, the model captures both local and global signal patterns with high effectiveness.
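The one-eighth reduction can be checked from figures given elsewhere in the paper, assuming that each band spans $w = 25$ bins (1000 retained bins divided into 40 equal bands) and that $K = 5$ bands are selected:

```latex
\frac{K \cdot w}{N_{\mathrm{bins}}} = \frac{5 \times 25}{1000} = \frac{125}{1000} = \frac{1}{8},
\qquad
w \cdot \Delta f = 25 \times \frac{48\,000\ \mathrm{Hz}}{2000} = 25 \times 24\ \mathrm{Hz} = 600\ \mathrm{Hz}.
```

The second relation is consistent with the 600 Hz width of the reported bands, for example 7848–8448 Hz, which supports the inferred band width of 25 bins.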

Because the number of selected frequency bands can directly influence the performance of the proposed method, the study includes an additional set of experiments to examine how different quantities of ranked bands affect the downstream model. After computing the gain-based FFT-bin importance using XGBoost and forming non-overlapping ranked frequency bands, several band-selection configurations were constructed by taking the top 1, 2, 3, 5, 7, and 10 bands. These configurations cover a wide range of cumulative importance, from very small proportions (25%) of the total bin-level importance to more than 90%. For each configuration, the corresponding FFT bins were used to generate a reduced feature set. All other components of the workflow, including the data-splitting strategy, k-fold cross-validation procedure, model architecture, and hyperparameter settings, were kept identical to those used in the main experiments.
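Building the reduced feature sets for the different band counts can be sketched as below. The band positions are hypothetical, the 25-bin width is inferred rather than stated, and concatenating the selected bands in ascending frequency order is an assumption, since the ordering used in the paper is not specified.

```python
def reduced_features(spectrum, bands):
    """Concatenate the bins of the selected bands into one feature vector.

    bands: list of (start_bin, end_bin_exclusive) tuples. They are sorted by
    start bin so that the output keeps ascending frequency order (assumed).
    """
    return [spectrum[j] for start, end in sorted(bands) for j in range(start, end)]


BAND_WIDTH = 25  # inferred from 1000 bins split into 40 bands

# Hypothetical ranked band start bins (rank 1 first); not the paper's values
ranked_starts = [327, 10, 485, 600, 100, 200, 700, 800, 900, 400]
ranked_bands = [(s, s + BAND_WIDTH) for s in ranked_starts]

# Stand-in spectrum whose value equals its bin index, for easy inspection
spectrum = [float(j) for j in range(1000)]

configs = {k: reduced_features(spectrum, ranked_bands[:k])
           for k in (1, 2, 3, 5, 7, 10)}
for k, features in configs.items():
    print(k, len(features))
```

Under these assumptions the input length grows linearly with the number of bands, from 25 features for one band to 250 features for ten bands, which is the cost side of the trade-off examined in this experiment.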

The classification results obtained under different band-selection settings for Case 1 are summarised in Table 10 and Figure 11a. When only one or two frequency bands are selected, the mean classification accuracy remains noticeably lower than the accuracy achieved using larger numbers of bands. This shows that the spectral information contained within one or two isolated bands is insufficient to represent the vibration patterns required for reliable fault detection. In contrast, when three or more frequency bands are selected, the classification accuracy reaches a consistently high level across all five folds. The results for 3, 5, 7, and 10 bands are almost the same, reflecting the fact that the classifier already achieves near-saturation performance once a sufficient amount of informative spectral content is included.

Although the performance obtained using three bands is already high, five bands were adopted in the study because they retain a larger cumulative importance fraction and provide broader spectral coverage while keeping the dimensionality low. This choice offers a more reliable balance between information content and feature compactness, ensuring stable performance without introducing unnecessary complexity.

The results for Case 2, shown in Table 11 and Figure 11b, follow a similar pattern to Case 1. Using only one band leads to low accuracy, while increasing the number of selected bands quickly improves performance. With three bands, the mean accuracy rises sharply to above 0.91, and once five or more bands are used, the accuracy becomes consistently high across all folds.

Once five bands are selected, the results for 5, 7, and 10 bands are almost the same. This shows that the classifier reaches its performance ceiling once a sufficient portion of discriminative spectral content is included, and adding further bands does not provide any significant improvement. Although using ten bands yields a very slightly higher accuracy than using five bands, this comes at the cost of doubling the number of input features while offering no practical benefit. These observations confirm that setting the number of frequency bands to five is an appropriate and well-balanced choice for the study.

To evaluate the contribution of different architectural components and input representations, several baseline models were implemented and compared directly against the proposed CNN–Transformer method. The baseline configurations include a CNN-only model, a Transformer-only model, and a CNN with lightweight attention, each tested using both time-domain input and full frequency-domain input without band selection. All models were trained using the same settings and k-fold cross-validation to ensure a consistent comparison.

Table 12 summarises the results. The models using time-domain features show limited accuracy, indicating that raw vibration signals are difficult to classify without transformation. Accuracy increases substantially when full FFT spectra are used, confirming the advantage of frequency-domain features. Notably, the proposed method achieves accuracy comparable to the full-spectrum frequency-domain models while using only one-eighth of the frequency components, due to the XGBoost-guided selection of the most informative frequency bands. This demonstrates that the proposed band-selection strategy preserves the essential discriminative information while significantly reducing the input dimensionality, and that the CNN–Transformer architecture effectively leverages these compact spectral features.

5. Concluding Remarks

This study proposed a frequency-domain fault classification framework that integrates FFT-based feature extraction, data-driven discriminative frequency band selection, and a CNN–Transformer hybrid model for classification. Feature extraction was carried out by applying the FFT to segmented sets of raw vibration signals. The obtained frequency spectra were then divided into multiple frequency bands, and XGBoost was employed to evaluate and rank the importance of each frequency band based on its contribution to classification accuracy. The top-ranked frequency bands were regarded as the most discriminative bands and subsequently fed into the hybrid DL model for fault classification.

To evaluate the effectiveness of the proposed framework, a series of experiments were conducted on two open datasets: the CWRU bearing dataset and the BJTU planetary gearbox dataset. Experimental results show that the CNN–Transformer classifier, utilizing the identified discriminative frequency bands, consistently outperforms the approaches in the existing literature by achieving superior classification accuracy across diverse fault types and mechanical systems. This confirms that the combination of data-driven discriminative frequency band selection and DL holds strong potential for robust and efficient fault diagnosis.

In the present study, the method was evaluated using datasets collected under stationary rotational speeds, as the initial objective was to assess its feasibility under controlled conditions. However, the proposed method is also applicable to wind turbine non-stationary condition monitoring signals because wind turbines operate at very low rotational frequencies (<0.3 Hz), exhibit minor speed fluctuations in a short time, and are typically monitored using high sampling frequencies in the practice of drive train condition-monitoring (typically 20 kHz; the dataset used in this study was sampled at 48 kHz). Moreover, the XGBoost-based discriminative frequency band identification approach can be applied not only to frequency spectra but also to time–frequency representations obtained through techniques such as Continuous Wavelet Transform, Short-Time Fourier Transform, and Hilbert–Huang Transform, which are widely used for non-stationary signal analysis. Therefore, the approach presented in this paper can be further enhanced in future work through the incorporation of these advanced signal-processing methods.

Furthermore, the current frequency band identification method is restricted to one-dimensional features derived solely from the frequency spectra of vibration signals. Future work may explore extensions of this framework by incorporating higher-dimensional inputs, such as time–frequency representations, or by investigating multi-channel and feature-fusion techniques. These directions could enable the model to capture more complex temporal and spatial patterns, thereby enhancing its generalizability to real-world fault scenarios.

Author Contributions

Conceptualization, C.H., W.Y., O.G. and F.D.; methodology, C.H.; software, C.H.; validation, C.H.; formal analysis, C.H.; investigation, C.H.; resources, C.H. and Z.W.; data curation, C.H. and Z.W.; writing—original draft preparation, C.H.; writing—review and editing, C.H., W.Y., L.Z. and F.D.; visualization, C.H.; supervision, W.Y., O.G. and F.D.; project administration, W.Y.; funding acquisition, W.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The Case Western Reserve University (CWRU) bearing dataset is publicly available online at https://engineering.case.edu/bearingdatacenter [40] (accessed on 15 July 2025). The Beijing Jiaotong University (BJTU) planetary gearbox dataset is also publicly available online: https://github.com/Liudd-BJUT/WT-planetary-gearbox-dataset [41] (accessed on 15 July 2025). No new data were created in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Dibaj, A.; Gao, Z.; Nejad, A.R. Fault detection of offshore wind turbine drivetrains in different environmental conditions through optimal selection of vibration measurements. Renew. Energy 2023, 203, 161–176. [Google Scholar] [CrossRef]
Badihi, H.; Zhang, Y.; Jiang, B.; Pillay, P.; Rakheja, S. A Comprehensive Review on Signal-Based and Model-Based Condition Monitoring of Wind Turbines: Fault Diagnosis and Lifetime Prognosis. Proc. IEEE 2022, 110, 754–806. [Google Scholar] [CrossRef]
Zhang, Q.; Su, N.; Qin, B.; Sun, G.; Jing, X.; Hu, S.; Cai, Y.; Zhou, L. Fault Diagnosis for Rotating Machinery Based on Dimensionless Indices: Current Status, Development, Technologies, and Future Directions. Electronics 2024, 13, 4931. [Google Scholar] [CrossRef]
Tuirán, R.; Águila, H.; Jou, E.; Escaler, X.; Mebarki, T. Fault Diagnosis in a 2 MW Wind Turbine Drive Train by Vibration Analysis: A Case Study. Machines 2025, 13, 434. [Google Scholar] [CrossRef]
Wang, Y.; Liu, H.; Li, Q.; Wang, X.; Zhou, Z.; Xu, H.; Zhang, D.; Qian, P. Overview of Condition Monitoring Technology for Variable-Speed Offshore Wind Turbines. Energies 2025, 18, 1026. [Google Scholar] [CrossRef]
Fang, C.; Chen, Y.; Deng, X.; Lin, X.; Han, Y.; Zheng, J. Denoising method of machine tool vibration signal based on variational mode decomposition and Whale-Tabu optimization algorithm. Sci. Rep. 2023, 13, 1505. [Google Scholar] [CrossRef]
Chen, B.; Hai, Z.; Chen, X.; Chen, F.; Xiao, W.; Xiao, N.; Fu, W.; Liu, Q.; Tian, Z.; Li, G. A time-varying instantaneous frequency fault features extraction method of rolling bearing under variable speed. J. Sound Vib. 2023, 560, 117785. [Google Scholar] [CrossRef]
Xu, Y.; Yan, X.; Feng, K.; Sheng, X.; Sun, B.; Liu, Z. Attention-based multiscale denoising residual convolutional neural networks for fault diagnosis of rotating machinery. Reliab. Eng. Syst. Saf. 2022, 226, 108714. [Google Scholar] [CrossRef]
Alonso-Gonzalez, M.; Diaz, V.G.; Lopez Perez, B.; Cristina Pelayo G-Bustelo, B.; Anzola, J.P. Bearing Fault Diagnosis With Envelope Analysis and Machine Learning Approaches Using CWRU Dataset. IEEE Access 2023, 11, 57796–57805. [Google Scholar] [CrossRef]
Blockeel, H.; Devos, L.; Frénay, B.; Nanfack, G.; Nijssen, S. Decision trees: From efficient prediction to responsible AI. Front. Artif. Intell. 2023, 6, 1124553. [Google Scholar] [CrossRef]
Abdallah, I.; Dertimanis, V.; Mylonas, C.; Tatsis, K.; Chatzi, E.; Dervilis, N.; Worden, K.; Maguire, A. Fault Diagnosis of Wind Turbine Structures Using Decision Tree Learning Algorithms with Big Data. In Safety and Reliability—Safe Societies in a Changing World; CRC Press: Boca Raton, FL, USA, 2018; pp. 3053–3061. [Google Scholar] [CrossRef]
Lipinski, P.; Brzychczy, E.; Zimroz, R. Decision Tree-Based Classification for Planetary Gearboxes’ Condition Monitoring with the Use of Vibration Data in Multidimensional Symptom Space. Sensors 2020, 20, 5979.
Shubita, R.R.; Alsadeh, A.S.; Khater, I.M. Fault Detection in Rotating Machinery Based on Sound Signal Using Edge Machine Learning. IEEE Access 2023, 11, 6665–6672.
Alhams, A.; Abdelhadi, A.; Badri, Y.; Sassi, S.; Renno, J. Enhanced Bearing Fault Diagnosis Through Trees Ensemble Method and Feature Importance Analysis. J. Vib. Eng. Technol. 2024, 12, 109–125.
Dwyer, K.; Holte, R. Decision Tree Instability and Active Learning. In Proceedings of the Machine Learning: ECML 2007, Warsaw, Poland, 17–21 September 2007; Kok, J.N., Koronacki, J., Mantaras, R.L.d., Matwin, S., Mladenič, D., Skowron, A., Eds.; Springer: Berlin/Heidelberg, Germany, 2007; pp. 128–139.
Choudakkanavar, G.; Mangai, J.A.; Bansal, M. MFCC based ensemble learning method for multiple fault diagnosis of roller bearing. Int. J. Inf. Technol. 2022, 14, 2741–2751.
Hemalatha, S.; Kavitha, T.; Anand, P. Effectiveness of Classification Techniques for Fault Bearing Prediction. In Proceedings of the 2022 6th International Conference on Electronics, Communication and Aerospace Technology, Coimbatore, India, 1–3 December 2022; pp. 8–13.
Souza, V.F.; Cicalese, F.; Laber, E.S.; Molinaro, M. Decision trees with short explainable rules. Theor. Comput. Sci. 2025, 1047, 115344.
Nguyen, T.D.; Nguyen, T.H.; Do, D.T.B.; Pham, T.H.; Liang, J.W.; Nguyen, P.D. Efficient and Explainable Bearing Condition Monitoring with Decision Tree-Based Feature Learning. Machines 2025, 13, 467.
Tian, J.; Jiang, Y.; Zhang, J.; Wang, Z.; Rodríguez-Andina, J.J.; Luo, H. High-Performance Fault Classification Based on Feature Importance Ranking-XgBoost Approach with Feature Selection of Redundant Sensor Data. Curr. Chin. Sci. 2022, 2, 243–251.
Lin, Z.; Fan, Y.; Tan, J.; Li, Z.; Yang, P.; Wang, H.; Duan, W. Tool wear prediction based on XGBoost feature selection combined with PSO-BP network. Sci. Rep. 2025, 15, 3096.
Tama, B.A.; Vania, M.; Lee, S.; Lim, S. Recent advances in the application of deep learning for fault diagnosis of rotating machinery using vibration signals. Artif. Intell. Rev. 2023, 56, 4667–4709.
Chen, Y.; Liu, X.; Rao, M.; Qin, Y.; Wang, Z.; Ji, Y. Explicit speed-integrated LSTM network for non-stationary gearbox vibration representation and fault detection under varying speed conditions. Reliab. Eng. Syst. Saf. 2025, 254, 110596.
Wang, R.; Dong, E.; Cheng, Z.; Liu, Z.; Jia, X. Transformer-based intelligent fault diagnosis methods of mechanical equipment: A survey. Open Phys. 2024, 22, 20240015.
Zhu, Z.; Lei, Y.; Qi, G.; Chai, Y.; Mazur, N.; An, Y.; Huang, X. A review of the application of deep learning in intelligent fault diagnosis of rotating machinery. Meas. J. Int. Meas. Confed. 2023, 206, 112346.
Alam, T.E.; Ahsan, M.M.; Raman, S. Multimodal bearing fault classification under variable conditions: A 1D CNN with transfer learning. Mach. Learn. Appl. 2025, 21, 100682.
Zhang, S.; Wei, H.L.; Ding, J. An effective zero-shot learning approach for intelligent fault detection using 1D CNN. Appl. Intell. 2023, 53, 16041–16058.
Han, S.; Yao, L.; Duan, D.; Yang, J.; Wu, W.; Zhao, C.; Zheng, C.; Gao, X. Intelligent condition monitoring with CNN and signal enhancement for undersampled signals. ISA Trans. 2024, 149, 124–136.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 2017, pp. 5999–6009.
Kim, S.; Seo, Y.H.; Park, J. Transformer-based novel framework for remaining useful life prediction of lubricant in operational rolling bearings. Reliab. Eng. Syst. Saf. 2024, 251, 110377.
Ding, Y.; Jia, M.; Miao, Q.; Cao, Y. A novel time–frequency Transformer based on self–attention mechanism and its application in fault diagnosis of rolling bearings. Mech. Syst. Signal Process. 2022, 168, 108616.
Han, Y.; Zhang, F.; Li, Z.; Wang, Q.; Li, C.; Lai, P.; Li, T.; Teng, F.; Jin, Z. MT-ConvFormer: A Multitask Bearing Fault Diagnosis Method Using a Combination of CNN and Transformer. IEEE Trans. Instrum. Meas. 2024, 74, 3501816.
Lu, Z.; Liang, L.; Zhu, J.; Zou, W.; Mao, L. Rotating Machinery Fault Diagnosis Under Multiple Working Conditions via a Time-Series Transformer Enhanced by Convolutional Neural Network. IEEE Trans. Instrum. Meas. 2023, 72, 3533611.
Ahmed, S.F.; Alam, M.S.B.; Hassan, M.; Rozbu, M.R.; Ishtiak, T.; Rafa, N.; Mofijur, M.; Ali, A.B.M.S.; Gandomi, A.H. Deep learning modelling techniques: Current progress, applications, advantages, and challenges. Artif. Intell. Rev. 2023, 56, 13521–13617.
Saeed, A.; Khan, M.A.; Akram, U.; Obidallah, W.J.; Jawed, S.; Ahmad, A. Deep learning based approaches for intelligent industrial machinery health management and fault diagnosis in resource-constrained environments. Sci. Rep. 2025, 15, 1114.
Brigham, E.O.; Morrow, R.E. The fast Fourier transform. IEEE Spectr. 1967, 4, 63–70.
Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. arXiv 2018, arXiv:1708.02002.
LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998, 86, 2278–2324.
Case Western Reserve University Bearing Data Center. Bearing Data Center Website. Available online: https://engineering.case.edu/bearingdatacenter (accessed on 29 June 2025).
Liu, D.; Cui, L.; Cheng, W. A review on deep learning in planetary gearbox health state recognition: Methods, applications, and dataset publication. Meas. Sci. Technol. 2023, 35, 012002.
Hou, Y.; Wang, J.; Chen, Z.; Ma, J.; Li, T. Diagnosisformer: An efficient rolling bearing fault diagnosis method based on improved Transformer. Eng. Appl. Artif. Intell. 2023, 124, 106507.
Zhao, F.; Jiang, Y.; Cheng, C.; Wang, S. An improved fault diagnosis method for rolling bearings based on wavelet packet decomposition and network parameter optimization. Meas. Sci. Technol. 2023, 35, 025004.
Gu, J.; Peng, Y.; Lu, H.; Chang, X.; Chen, G. A novel fault diagnosis method of rotating machinery via VMD, CWT and improved CNN. Measurement 2022, 200, 111635.
Jin, H.; Li, X.; Yu, J.; Wang, T.; Yun, Q. A bearing fault diagnosis model with enhanced feature extraction based on the Kolmogorov–Arnold representation Theorem and an attention mechanism. Appl. Acoust. 2025, 240, 110903.
Chen, X.; Zhang, B.; Gao, D. Bearing fault diagnosis base on multi-scale CNN and LSTM model. J. Intell. Manuf. 2021, 32, 971–987.

Figure 1. Schematic of Proposed Method.

Figure 2. Experimental platform of the CWRU bearing dataset. (a) CWRU Experimental Setup. (b) CWRU Experimental Setup Schematic Diagram.

Figure 3. CWRU Dataset: Segmented Signals and Their Frequency Spectra.

Figure 4. BJTU Wind Turbine Planetary Gearbox Experimental Setup Schematic Diagram.

Figure 5. BJTU Dataset: Segmented Signals and Their Frequency Spectra.

Figure 6. Case 1 training results. (a) Case 1 Model Performance Curves. (b) Case 1 Confusion Matrix (Best Fold on Independent Test Set).

Figure 7. Case 2 training results. (a) Case 2 Model Performance Curves. (b) Case 2 Confusion Matrix (Best Fold on Independent Test Set).

Figure 8. Selected Frequency Bands (Case 1).

Figure 9. Selected Frequency Bands (Case 2).

Figure 10. Validation accuracy comparison for both cases. (a) Case 1 validation accuracy for eight different band groups. (b) Case 2 validation accuracy for eight different band groups.

Figure 11. Sensitivity analysis for the two cases. (a) Case 1 classification sensitivity to the number of selected frequency bands. (b) Case 2 classification sensitivity to the number of selected frequency bands.

Table 1. Description of fault index labels used in the CWRU dataset.

Index	File Name	Fault Type	Fault Size (mil)	Location
0	Normal_3.mat	Normal	–	–
1	B007_3.mat	Ball Fault	7	–
2	B014_3.mat	Ball Fault	14	–
3	B021_3.mat	Ball Fault	21	–
4	IR007_3.mat	Inner Race	7	–
5	IR014_3.mat	Inner Race	14	–
6	IR021_3.mat	Inner Race	21	–
7	OR007@6_3.mat	Outer Race	7	6:00
8	OR014@6_3.mat	Outer Race	14	6:00
9	OR021@6_3.mat	Outer Race	21	6:00

Table 2. Fault index mapping for sun gear conditions in the BJTU dataset.

Index	Condition	Description
0	Healthy	No damage on the sun gear
1	Broken Tooth	Partial removal (about one-third) of a sun gear tooth
2	Wear Gear	Gear tooth surface worn
3	Missing Tooth	Complete tooth removal
4	Root Crack	Crack introduced at the root of a sun gear tooth

Table 3. Model architecture and hyperparameter settings for the CWRU case.

Component	Setting
Selected bands	5
Band width	600 Hz
Bins per band	25
CNN layers	1
CNN filters	64
CNN kernel size/padding	5/2
Pooling	MaxPool1d (kernel = 2, stride = 2)
Embedding dimension ( $d_{model}$ )	64
Transformer layers	2
Attention heads	4
Dropout	0.1
Loss function	CrossEntropyLoss
Optimizer	Adam
Learning rate	0.0008
Batch size	100
Epochs	50
Cross-validation	5 folds
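
The settings in Table 3 imply a few derived quantities that are easy to check by hand. The sketch below is illustrative only and is not the authors' code: it assumes the five selected bands are concatenated into a single input sequence, that the convolution uses stride 1 with no dilation, and that the pooling layer uses no padding. The helper name `conv1d_out_len` is hypothetical.

```python
# Sketch only: concatenated band inputs, stride-1 convolution and unpadded
# pooling are assumptions, not details stated in Table 3.

def conv1d_out_len(length, kernel, padding=0, stride=1):
    """Output length of a 1D convolution or pooling layer (no dilation)."""
    return (length + 2 * padding - kernel) // stride + 1

selected_bands = 5
band_width_hz = 600
bins_per_band = 25

# Spectral resolution of one bin inside a band.
bin_width_hz = band_width_hz / bins_per_band

# Assumed input length: all selected bands laid end to end.
input_len = selected_bands * bins_per_band

# Kernel 5 with padding 2 keeps the sequence length unchanged.
after_conv = conv1d_out_len(input_len, kernel=5, padding=2)

# MaxPool1d (kernel = 2, stride = 2) roughly halves the sequence length.
after_pool = conv1d_out_len(after_conv, kernel=2, stride=2)

# Width of each attention head when d_model is split evenly across heads.
d_model, heads = 64, 4
head_dim = d_model // heads

print(bin_width_hz, input_len, after_conv, after_pool, head_dim)
```

Under these assumptions each bin spans 24 Hz, which agrees with the band edges listed in Table 7 all being multiples of 24 Hz.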

Table 4. K-Fold Accuracy Statistics Over the Testing Dataset (Case 1).

Metric	Value
Accuracies per fold	1.0000, 0.9990, 0.9990, 1.0000, 0.9990
Mean accuracy	0.9994
Variance	$2.4 \times 10^{- 7}$
Standard deviation	0.00049
Mean ± Std (approx.)	$0.9994 \pm 0.0005$
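
The summary statistics in Table 4 can be recomputed from the per-fold accuracies. The following minimal sketch uses only the Python standard library and assumes the population formulas (division by the number of folds), which is an inference from the reported values and is not stated in the table; the sample formulas (division by one fewer than the number of folds) would give a larger variance.

```python
import statistics

# Per-fold test accuracies copied from Table 4 (Case 1).
fold_acc = [1.0000, 0.9990, 0.9990, 1.0000, 0.9990]

mean = statistics.fmean(fold_acc)

# Population variance and standard deviation (assumed; divides by n).
var = statistics.pvariance(fold_acc)
std = statistics.pstdev(fold_acc)

# Sample variance for comparison (divides by n - 1).
sample_var = statistics.variance(fold_acc)

print(f"mean={mean:.4f} var={var:.2e} std={std:.5f} sample_var={sample_var:.2e}")
```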

Table 5. Model architecture and hyperparameter settings for the BJTU case.

Component	Setting
Selected bands	5
Band width	600 Hz
Bins per band	25
CNN layers	2
CNN filters	64
CNN kernel size/padding	(5/2), (3/1)
Pooling	MaxPool1d, kernel = 2, stride = 2
Embedding dimension ( $d_{model}$ )	64
Transformer layers	2
Attention heads	8
Dropout	0.1
Loss function	CrossEntropyLoss
Optimizer	Adam
Learning rate	0.00002
Batch size	64
Epochs	200
Cross-validation	5 folds

Table 6. K-Fold Accuracy Statistics Over the Testing Dataset (Case 2).

Metric	Value
Accuracies per fold	0.9965, 0.9976, 0.9969, 0.9969, 0.9972
Mean accuracy	0.9970
Variance	$1.25 \times 10^{- 7}$
Standard deviation	0.000354
Mean ± Std (approx.)	$0.9970 \pm 0.0004$

Table 7. Frequency Bands Importance Scores—Case 1.

Band Rank	Frequency Range	Importance Score
1	7848–8448 Hz	504.6780
2	240–840 Hz	449.1520
3	2424–3024 Hz	393.3041
4	1176–1776 Hz	316.0448
5	4008–4608 Hz	74.4110
6	3024–3624 Hz	73.6972
7	5448–6048 Hz	51.1371
8	1824–2424 Hz	20.9310
9	9192–9792 Hz	15.0191
10	18,384–18,984 Hz	10.6565

Table 8. Frequency Bands Importance Scores—Case 2.

Band Rank	Frequency Range	Importance Score
1	8256–8856 Hz	732.1145
2	4992–5592 Hz	505.1241
3	9216–9816 Hz	244.7976
4	11,640–12,240 Hz	226.8832
5	5592–6192 Hz	191.4862
6	1200–1800 Hz	69.0679
7	4392–4992 Hz	58.1534
8	0–600 Hz	52.7692
9	6192–6792 Hz	46.1512
10	3528–4128 Hz	44.1700

Table 9. Classification Performance Comparison on CWRU Bearing Dataset.

Study/Reference	Feature Type	Classifier	Accuracy (%)	Remarks
Hou et al. [42]	Frequency Domain	Improved Transformer	99.85	Transformer with multi-feature parallel fusion.
Zhao et al. [43]	WPD & Energy Features	WPD-CSSOA-DBN	98.24	Signal processing technique with energy feature selection and deep belief network.
Gu et al. [44]	Image	CNN	99.90	Signal pre-processing with image-based CNN classification.
Jin et al. [45]	Time Domain	KACSEN	99.27	Enhanced feature extraction with SE attention mechanism.
Chen et al. [46]	Time Domain	MCNN-LSTM	99.31	End-to-end fault classification model.
Proposed Method	Selected Frequency Bands	CNN + Transformer	99.94	High accuracy with frequency-band-based feature selection and hybrid classification.

Table 10. Classification performance with different numbers of selected frequency bands (Case 1).

	Number of Selected Frequency Bands
Metric	1	2	3	5	7	10
Importance Fraction	0.2591	0.4898	0.6917	0.8922	0.9563	0.9803
	Classification Accuracy
Fold 1	0.9209	0.9916	1.0000	1.0000	0.9990	0.9990
Fold 2	0.9092	0.9906	0.9979	0.9990	1.0000	0.9990
Fold 3	0.9008	0.9916	0.9979	0.9990	0.9990	0.9990
Fold 4	0.8935	0.9927	1.0000	1.0000	0.9990	0.9990
Fold 5	0.9165	0.9916	0.9990	0.9990	0.9979	1.0000
Mean	0.9045	0.9916	0.9990	0.9994	0.9990	0.9994
Std Deviation	0.0078	0.0007	0.0009	0.0005	0.0007	0.0005
Mean ± Std	0.9045 ± 0.0078	0.9916 ± 0.0007	0.9990 ± 0.0009	0.9994 ± 0.0005	0.9990 ± 0.0007	0.9994 ± 0.0005
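
The "Importance Fraction" row in Table 10 appears to be the cumulative importance of the top-k bands divided by the total importance over all candidate bands. That reading is an assumption; the total is not reported in the tables, so the sketch below infers it from the 10-band fraction and then checks whether the remaining fractions are consistent with the scores in Table 7.

```python
from itertools import accumulate

# Importance scores of the ten top-ranked bands, copied from Table 7.
scores = [504.6780, 449.1520, 393.3041, 316.0448, 74.4110,
          73.6972, 51.1371, 20.9310, 15.0191, 10.6565]

# Importance fractions reported in Table 10, keyed by number of bands.
reported = {1: 0.2591, 2: 0.4898, 3: 0.6917, 5: 0.8922, 7: 0.9563, 10: 0.9803}

# Running sum of the ranked scores.
cumulative = list(accumulate(scores))

# Assumption: total importance over all bands, inferred from the k = 10 entry.
implied_total = cumulative[-1] / reported[10]

# Recomputed fraction for each k under that assumption.
fractions = {k: cumulative[k - 1] / implied_total for k in reported}

for k, frac in fractions.items():
    print(k, round(frac, 4), reported[k])
```

Under this assumption the recomputed fractions agree with the reported ones to within rounding, which suggests that bands outside the top ten carry only about 2% of the total importance in Case 1.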

Table 11. Classification performance with different numbers of selected frequency bands (Case 2).

	Number of Selected Frequency Bands
Metric	1	2	3	5	7	10
Importance Fraction	0.3030	0.5121	0.6134	0.7865	0.8392	0.8984
	Classification Accuracy
Fold 1	0.4894	0.8482	0.9125	0.9969	0.9958	0.9990
Fold 2	0.4863	0.8437	0.9090	0.9983	0.9948	0.9993
Fold 3	0.4856	0.8479	0.9180	0.9969	0.9958	0.9986
Fold 4	0.4904	0.8506	0.9097	0.9962	0.9962	0.9983
Fold 5	0.4988	0.8510	0.9114	0.9965	0.9958	0.9990
Mean	0.4901	0.8483	0.9121	0.9969	0.9957	0.9988
Std Deviation	0.0047	0.0026	0.0032	0.0007	0.0005	0.0004
Mean ± Std	0.4901 ± 0.0047	0.8483 ± 0.0026	0.9121 ± 0.0032	0.9969 ± 0.0007	0.9957 ± 0.0005	0.9988 ± 0.0004

Table 12. Ablation study of different model architectures and input representations (Case 1).

	Time-Domain			Frequency-Domain (No Selection)			Proposed Method
Metric	CNN	Transformer	CNN-Attn	CNN	Transformer	CNN-Attn	CNN-Trans (XGB)
	Classification Accuracy
Fold 1	0.7810	0.3139	0.4578	1.0000	1.0000	0.9979	1.0000
Fold 2	0.7534	0.3092	0.4599	1.0000	0.9990	0.9990	0.9990
Fold 3	0.7570	0.3201	0.4129	1.0000	0.9990	0.9990	0.9990
Fold 4	0.7727	0.3264	0.4708	1.0000	1.0000	0.9990	1.0000
Fold 5	0.7633	0.3050	0.4187	1.0000	0.9990	0.9990	0.9990
Mean	0.7655	0.3149	0.4440	1.0000	0.9994	0.9987	0.9994
Std Deviation	0.0102	0.0076	0.0235	0.0000	0.0005	0.0004	0.0005

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

