Article

Enhanced Rotating Machinery Fault Diagnosis Using Hybrid RBSO–MRFO Adaptive Transformer-LSTM for Binary and Multi-Class Classification

1 Mechatronics Engineering Department, Faculty of Engineering and Materials Science (EMS), German University in Cairo (GUC), New Cairo 11835, Egypt
2 ARAtronics Laboratory, Mechatronics Engineering Department (MCTR), German University in Cairo (GUC), New Cairo 11835, Egypt
* Author to whom correspondence should be addressed.
Machines 2026, 14(2), 208; https://doi.org/10.3390/machines14020208
Submission received: 14 December 2025 / Revised: 27 January 2026 / Accepted: 3 February 2026 / Published: 10 February 2026
(This article belongs to the Section Machines Testing and Maintenance)

Abstract

Accurate fault diagnosis in rotating machinery is critical for predictive maintenance and operational reliability in industrial applications. Despite the effectiveness of deep learning, many models underperform due to manually selected hyperparameters, which can lead to premature convergence, overfitting, weak generalization, and inconsistent performance across binary and multi-class classification. To address these limitations, this study proposes a novel hybrid hyperparameter optimization framework that combines Robotic Brain Storm Optimization (RBSO) with Manta Ray Foraging Optimization (MRFO) to optimally fine-tune deep learning architectures, including MLP, LSTM, GRU-TCN, CNN-BiLSTM, and Transformer-LSTM models. The framework leverages RBSO for global search to promote diversity and prevent premature convergence, and MRFO for local search to enhance convergence toward optimal solutions, with their combined effect improving predictive model performance and methodological generalization. The approach was validated on three benchmark datasets: Case Western Reserve University (CWRU), industrial machine fault detection (TMFD), and the Machinery Fault Dataset (MaFaulDa). Before optimization, the Transformer-LSTM model achieved 98.35% and 97.21% accuracy on CWRU binary and multi-class classification, 99.52% and 98.57% on TMFD, and 98.18% and 92.82% on MaFaulDa. Following hybrid optimization, the Transformer-LSTM exhibited superior performance, with accuracies increasing to 99.72% for both CWRU tasks, 99.97% for TMFD, and 99.98% and 98.60% for MaFaulDa, substantially reducing misclassification. These results demonstrate that the proposed RBSO–MRFO framework provides a scalable, robust, and high-accuracy solution for intelligent fault diagnosis in rotating machinery.

1. Introduction

Rotating machinery, including motors, turbines, compressors, and pumps, plays a vital role in modern industrial systems. Unexpected failures in these machines can lead to significant economic losses, safety risks, and operational disruptions. Therefore, accurate fault diagnosis, encompassing the detection and classification of faults, is crucial for predictive maintenance and ensuring system reliability [1]. Despite advancements in monitoring technologies, fault diagnosis remains challenging due to the diversity of fault types, nonlinear signal behavior, and varying operating conditions [2].
Traditional diagnostic methods, such as vibration analysis and statistical signal processing, rely heavily on handcrafted features and expert knowledge. While these approaches can detect specific faults, they are often sensitive to noise, complex operational environments, and differences across machinery types, limiting their scalability and methodological generalization. Deep learning models offer an alternative by automatically extracting hierarchical features from raw sensor data. However, their performance is often constrained by manually selected hyperparameters, which can cause overfitting, weak generalization, and reduced accuracy in both binary and multi-class fault classification tasks [3,4]. These limitations underscore the need for robust, high-accuracy, and generalizable fault diagnosis approaches.
To address this gap, this study introduces a hybrid hyperparameter optimization framework that integrates Robotic Brain Storm Optimization (RBSO) with Manta Ray Foraging Optimization (MRFO). The framework leverages RBSO for global search to promote diversity and prevent premature convergence, and MRFO for local search to enhance convergence toward optimal solutions, with their combined effect improving overall model performance and methodological generalization. Unlike conventional approaches, this framework is automated, scalable, and adaptable, supporting both binary and multi-class fault diagnosis while bridging the gap between theoretical model design and industrial applications.
The proposed framework is applied to multiple deep learning architectures, including Multi-layer Perceptron (MLP), Long Short-Term Memory (LSTM), Gated Recurrent Unit–Temporal Convolutional Network (GRU-TCN), Convolutional Neural Network–Bidirectional LSTM (CNN-BiLSTM), and hybrid Transformer-Long Short-Term Memory (Transformer-LSTM) models, and evaluated on three benchmark datasets, including Case Western Reserve University (CWRU), industrial machine fault detection (TMFD), and the Machinery Fault Dataset (MaFaulDa). Key hyperparameters are optimized using the hybrid Robotic Brain Storm Optimization (RBSO)–Manta Ray Foraging Optimization (MRFO) approach to maximize predictive performance and capture complex fault patterns under diverse operating conditions.
Experimental results validate the effectiveness of the proposed framework. Before optimization, the Transformer-LSTM model achieved 98.35% and 97.21% accuracy on the CWRU binary and multi-class tasks, 99.52% and 98.57% on the TMFD tasks, and 98.18% and 92.82% on MaFaulDa. Following hybrid RBSO–MRFO optimization, the Transformer-LSTM exhibited superior performance, with accuracies increasing to 99.72% for both CWRU tasks, 99.97% for TMFD, and 99.98% and 98.60% for MaFaulDa, substantially reducing misclassification. These results confirm that the proposed framework enhances predictive accuracy, robustness, and methodological generalization, outperforming conventional hyperparameter selection methods.
This study addresses the limitations of existing fault diagnosis techniques, presents a novel hybrid RBSO–MRFO framework, and demonstrates its effectiveness on multiple datasets. By providing a scalable, reliable, and high-accuracy solution, the proposed approach offers practical applicability for predictive maintenance in industrial rotating machinery, bridging the gap between modern research and real-world applications. This research presents the following key contributions:
  • Introducing a novel hybrid hyperparameter optimization framework combining RBSO and MRFO for efficient, automated tuning of deep learning models in rotating machinery.
  • Leveraging RBSO’s global search to enhance diversity and prevent premature convergence, alongside MRFO’s local search to accelerate convergence toward optimal solutions, improving predictive performance and generalizability of the methodology.
  • Optimizing the Transformer-LSTM architecture using the hybrid optimization framework for robust and accurate fault detection.
  • Validating the methodology on binary and multi-class classification tasks using the CWRU, TMFD, and MaFaulDa datasets, demonstrating superior predictive performance and robustness compared with existing approaches.
The paper is organized into several sections. Section 2 provides a critical review of the existing literature; Section 3 outlines the methodologies employed in the development of the system; Section 4 presents the outcomes of predictive testing and evaluates their performance; Section 5 discusses the challenges encountered and limitations observed; and Section 6 concludes with a summary of findings and perspectives for future work.

2. Related Works

The rapid development of modern technologies has substantially increased the sophistication of industrial systems, which in turn demands more advanced approaches for machinery condition monitoring and fault detection [5]. Rolling bearings are particularly critical among mechanical components due to their pervasive use and their central role in maintaining operational safety [6]. Consequently, intelligent methods for diagnosing bearing faults have become a major focus within machinery health management research [7].
Rotating machinery fault detection methods are commonly classified into model-based and data-driven categories [5,8]. Traditional model-based techniques generally rely on manually extracting features from signals, followed by designing classifiers to distinguish between different fault types [9]. To enhance efficiency, dimensionality reduction methods are often employed alongside intelligent classifiers [10]. Distance-based approaches, including k-Nearest Neighbors (k-NN), determine fault classes by comparing new observations with reference samples [11]. Wavelet packet decomposition combined with support vector machines (SVMs) has been applied for feature extraction and classification [12]. Additional strategies include Bayesian networks, extreme learning machines, random forests, and empirical mode decomposition (EMD) techniques [13,14,15,16,17,18]. Despite their usefulness, these model-based methods are highly dependent on assumptions regarding signal behavior and prior expert knowledge, which can restrict their adaptability under varied operational conditions [19].
With the rise of deep learning, data-driven approaches that automatically learn features from raw signals have gained prominence. Such methods have found applications in computer vision, natural language processing, speech recognition, remaining useful life estimation, and fault detection [20,21,22,23,24]. By minimizing human involvement in feature engineering, deep learning approaches have become particularly attractive for bearing fault diagnosis [25,26].
Convolutional neural networks (CNNs) have been widely explored for their ability to capture localized spatial patterns efficiently. CNN-based fault diagnosis methods include one-dimensional CNNs with wide initial kernels for processing time-domain signals, attention-augmented CNNs for interpretable fault visualization, integration with multilinear subspace dimensionality reduction prior to feature extraction, and multimodal CNNs that fuse accelerometer and microphone data [27,28,29,30]. Recurrent neural networks (RNNs), particularly LSTM networks, have been applied to model temporal dependencies in vibration data, showing effectiveness in predictive maintenance [31]. Hybrid approaches exist, such as combining sparse autoencoders with gated recurrent units or integrating CNN and LSTM networks to handle noisy and variable-load conditions [32,33]. Despite their effectiveness, CNNs have an inductive bias toward local feature extraction and capture long-range dependencies only indirectly through stacked local receptive fields, making explicit global feature modeling less efficient [34]. RNNs, although capable of sequential modeling, suffer from computational inefficiency, difficulty in parallelization, and the potential for vanishing or exploding gradients in long sequences, which reduces model stability [35].
The Transformer framework has emerged as a solution to these issues. By leveraging self-attention mechanisms, Transformers can model long-range dependencies while supporting efficient parallel computation [36]. In fault diagnosis, Transformer-based architectures have achieved strong performance, including time–frequency Transformers for bearing analysis, hybrid multiscale CNN-Transformer models, wavelet-based multi-signal Transformers, and Siamese Network-Transformer designs for cross-domain diagnosis [35,37,38,39]. Nevertheless, these models can be prone to overfitting when data are limited or noisy, and their performance may decline under varying operating conditions [40].
Recent strategies aim to overcome these challenges. One method employs a Transformer encoder augmented with an unsupervised denoising module, enabling direct processing of raw time-domain vibration signals without pre-training or signal transformations. This architecture produces sparse, interpretable features and demonstrates strong performance on benchmark datasets such as IMS and CWRU, even under class imbalance, while remaining lightweight [41]. Another strategy combines metaheuristic optimization with signal decomposition and deep learning. For instance, the Improved Gorilla Troop Optimization algorithm has been used to optimize decomposition and classification in a hybrid VMD-PE-LSTM pipeline, yielding high accuracy and robustness across datasets, although with increased computational time [42].
Multimodal and cross-domain fusion has also been explored. Architectures integrating one-dimensional CNNs, multi-layer perceptrons, and bidirectional cross-attention mechanisms can simultaneously extract features from both time- and frequency-domain signals, achieving near-perfect results on CWRU datasets while remaining robust to extreme noise and variable operating conditions [43]. Likewise, CNNs enhanced with Gramian Angular Field representations enable early detection of low-amplitude faults, facilitating faster training and improved accuracy compared with conventional CNNs [44].
To address changing operational environments, domain adaptation and few-shot learning approaches have been developed. While conventional meta-learning improves generalization under limited data, it may overfit. Domain adaptation aligns feature distributions across domains but often ignores class-specific subdomain differences. Dynamic balance domain-adaptation frameworks have been proposed to align both global and subdomain features, enhancing diagnostic accuracy and resilience against domain shifts and class imbalance, as demonstrated on datasets such as CWRU and PU [45]. Recent research shows that machine learning models, deep learning approaches, digital twins, and optimization techniques enhance fault detection strategies, thereby improving predictive performance and reliability in industrial systems [46,47,48,49,50,51,52,53].
Optimizing neural network parameters remains challenging due to large search spaces, slow convergence, and local minima, which limit conventional methods. Enhanced BSO-based techniques have improved classification accuracy, convergence speed, and scalability across datasets [54]. HC-BSO has demonstrated benefits in multi-robot path planning by reducing conflicts and improving computational efficiency, while RBSO-inspired swarm strategies support robust multi-target exploration [55,56].
Beyond industrial applications, metaheuristic algorithms such as the MRFO have shown high effectiveness in deep learning hyperparameter tuning. MRFO-optimized CNNs for skin cancer classification outperform other methods in accuracy and generalization, and have been applied to optimize CNNs for CIFAR-10 image recognition as well as fine-tune stacked LSTM autoencoders for biomedical classification, enhancing reliability across diverse datasets [57,58,59].
Despite significant advances in deep learning and optimization-driven fault diagnosis, most conventional models often rely on manual parameter tuning, which can lead to premature convergence, overfitting, weak generalization, and inconsistent performance across binary and multi-class classification. To overcome these limitations, this study proposes a novel hybrid hyperparameter optimization framework that combines RBSO and MRFO for systematic and automated tuning of multiple deep learning architectures, including an enhanced hybrid Transformer-LSTM model. The framework integrates RBSO for global search to promote diversity and prevent premature convergence, with MRFO for local search to enhance convergence toward optimal solutions, collectively improving model performance and methodological generalization. By applying this hybrid optimization strategy to the Transformer-LSTM architecture, the proposed approach achieves state-of-the-art accuracy, robustness, and generalization on benchmark CWRU, TMFD, and MaFaulDa datasets, providing a scalable, reliable, and high-precision solution that outperforms existing fault diagnosis methods.

3. Materials and Methods

Deep learning models for binary and multi-class classification, including MLP, LSTM, GRU-TCN, CNN-BiLSTM, and Transformer-LSTM networks, were employed. MLPs capture static feature interactions, LSTMs model sequential dependencies, GRU-TCN combines temporal convolutions with recurrent units, CNN-BiLSTM integrates convolutional feature extraction with bidirectional LSTM, and the hybrid Transformer-LSTM combines attention mechanisms with LSTM for long-range and sequential dependencies. Models were trained with the Adam optimizer and evaluated using accuracy. Hyperparameters were optimized using a hybrid RBSO–MRFO algorithm.

3.1. Proposed Approach

3.1.1. Hybrid RBSO–MRFO Algorithm

To enhance optimization performance, the RBSO is hybridized with MRFO. RBSO provides idea-driven exploration (global search) through population-mean perturbations, while MRFO offers structured exploitation (local search) using chain, spiral, and somersault foraging behaviors. The proposed hybrid RBSO–MRFO algorithm is inspired by and adapted from the core mechanisms of Brain Storm Optimization (BSO) and Manta Ray Foraging Optimization (MRFO), with customized update equations, parameter schedules, and a hybrid coordination strategy to balance exploration and exploitation [56,60,61]. The algorithm is mathematically formulated as follows.
In RBSO, each individual updates its position using a perturbed population mean and a randomly selected leader. Gaussian perturbations are added both at the idea-generation stage and at the position-update stage to enhance exploration:

$$X_i^{t+1} = X_i^t + r_1 \odot \left( \bar{X}^t + \mathcal{N}(0, \sigma_{\mathrm{idea}}^2) - X_i^t \right) + r_2 \odot \left( X_{\mathrm{leader}}^t - X_i^t \right) + \mathcal{N}(0, \sigma_{\mathrm{pos}}^2)$$

where $r_1, r_2 \sim U(0,1)^D$ are uniform random vectors, $\bar{X}^t$ is the population mean, $X_{\mathrm{leader}}^t$ is a randomly chosen leader, and $\sigma_{\mathrm{idea}} = 0.5$, $\sigma_{\mathrm{pos}} = 0.1$. The MRFO chain simulates the foraging movement of manta rays in a leader–follower pattern. The first chain element follows the global best solution, while subsequent individuals follow their predecessors:

$$X_i^{t+1} = \begin{cases} X_t^{*} + \beta(t)\left( J \odot X_t^{*} - X_i^t \right), & i \text{ is the first chain member}, \\ X_p^t + \beta(t)\left( J \odot X_p^t - X_i^t \right), & \text{otherwise}, \end{cases}$$

where $X_t^{*}$ is the global best, $J \sim U(0,1)^D$ is a coefficient vector, and $p$ is the predecessor in the chain. In addition to chain movement, MRFO incorporates a spiral foraging mechanism to refine the search around the global best solution. This is implemented as:

$$X_i^{t+1} = X_i^{t+1,\mathrm{base}} + 0.1\left( \cos(2\pi r) + \sin(2\pi r) \right)\left( X_i^t - X_t^{*} \right)$$

where $r \sim U(0,1)$ is a random vector and $0.1$ is the spiral amplitude constant. To further prevent premature convergence, a probabilistic somersault foraging behavior is applied, enabling sudden jumps toward or away from the best solution:

$$X_i^{t+1} \leftarrow X_i^{t+1,\mathrm{base}} + S(t)\left( H \odot X_t^{*} - X_i^t \right), \quad \text{executed with probability } p_s$$

where $H \sim U(-1,1)^D$, $p_s$ denotes the somersault probability, and $S(t) = 2(1 - t/T)$ is the somersault scaling factor. Both chain and somersault steps are dynamically adjusted with iteration through time-varying coefficients:

$$\beta(t) = 2\exp\!\left( -\left( \frac{4t}{T} \right)^{2} \right), \qquad S(t) = 2\left( 1 - \frac{t}{T} \right)$$

where $t$ is the current iteration and $T$ is the maximum number of iterations.
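The update rules above can be combined into a single optimization loop. The following is a minimal NumPy sketch on a toy sphere objective; the alternating split between RBSO and MRFO candidates, the greedy acceptance rule, and the somersault probability `p_s = 0.3` are illustrative assumptions, while the σ_idea, σ_pos, β(t), and S(t) values follow the equations above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sphere(x):
    """Toy objective: global minimum 0 at the origin."""
    return float(np.sum(x ** 2))

def rbso_step(X, i, sigma_idea=0.5, sigma_pos=0.1):
    """RBSO global search: perturbed population mean plus attraction to a random leader."""
    D = X.shape[1]
    r1, r2 = rng.random(D), rng.random(D)
    leader = X[rng.integers(len(X))]
    idea = X.mean(axis=0) + rng.normal(0.0, sigma_idea, D)   # perturbed mean "idea"
    return X[i] + r1 * (idea - X[i]) + r2 * (leader - X[i]) + rng.normal(0.0, sigma_pos, D)

def mrfo_step(X, i, best, t, T, p_s=0.3):
    """MRFO local search: chain foraging, spiral refinement, probabilistic somersault."""
    D = X.shape[1]
    beta = 2.0 * np.exp(-(4.0 * t / T) ** 2)                 # time-varying chain coefficient
    J = rng.random(D)
    ref = best if i <= 1 else X[i - 1]                       # first chain member follows the best
    base = ref + beta * (J * ref - X[i])                     # chain movement
    r = rng.random(D)
    new = base + 0.1 * (np.cos(2 * np.pi * r) + np.sin(2 * np.pi * r)) * (X[i] - best)
    if rng.random() < p_s:                                   # somersault jump
        S = 2.0 * (1.0 - t / T)
        H = rng.uniform(-1.0, 1.0, D)
        new = base + S * (H * best - X[i])
    return new

def optimize(fn, D=5, n=20, T=50, bounds=(-5.0, 5.0)):
    lo, hi = bounds
    X = rng.uniform(lo, hi, (n, D))
    fit = np.array([fn(x) for x in X])
    for t in range(T):
        best = X[fit.argmin()].copy()
        for i in range(n):
            if i % 2 == 0:                                   # half the population: global search
                cand = rbso_step(X, i)
            else:                                            # the other half: local search
                cand = mrfo_step(X, i, best, t, T)
            cand = np.clip(cand, lo, hi)
            f_new = fn(cand)
            if f_new < fit[i]:                               # greedy acceptance
                X[i], fit[i] = cand, f_new
    return X[fit.argmin()], float(fit.min())

best, f_best = optimize(sphere)
```

Because candidates are accepted only when they improve, the best fitness is monotone non-increasing across iterations.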

3.1.2. Hybrid RBSO–MRFO Workflow for Transformer-LSTM Hyperparameter Optimization

In this study, a hybrid Robotic Brain Storm Optimization (RBSO) and Manta Ray Foraging Optimization (MRFO) framework was implemented for automated hyperparameter optimization of a Transformer-LSTM model targeting sensor-based classification tasks. As detailed in Figure 1, the process begins with dataset loading and preprocessing. The initial population of candidate solutions is then generated randomly within predefined lower and upper bounds, with each candidate representing a continuous position vector subsequently mapped to discrete hyperparameters, including the number of attention heads, key dimension, feed-forward network units, LSTM units, and dropout rate. An initial fitness evaluation is performed by constructing and training a Transformer-LSTM model for each candidate, computing the validation loss while incorporating stratified train-test splitting, class balancing, Gaussian noise, and learning rate adaptation. The core of the optimization comprises a generation loop, iterated until the maximum number of iterations or convergence. Within each iteration, a fraction of candidates undergo RBSO-based global search, wherein candidate positions are updated by combining random perturbations, the population mean, and influence from the current global best to maintain diversity and explore the search space globally. The remaining candidates are updated using MRFO-based local search, employing chain-following behavior, spiral updates around the global best, and stochastic somersault maneuvers to intensify the search locally in promising regions. Following these updates, the fitness of all candidates is re-evaluated, and if a candidate achieves a lower validation loss than the previous global best, the global best solution is updated; otherwise, the previous best is retained. This process continues iteratively, effectively balancing global and local search strategies until termination criteria are met. 
Upon convergence, the optimal hyperparameters corresponding to the global best position are extracted. The final Transformer-LSTM model is then trained on the full training dataset using these optimal hyperparameters, leveraging class weighting, learning rate reduction on plateau, and a custom confusion matrix callback to monitor per-epoch performance, including accuracy and class-level confusion matrices. Finally, the model is evaluated on the independent test set to obtain loss, accuracy, and a comprehensive confusion matrix, confirming the effectiveness of the hybrid RBSO–MRFO algorithm in achieving robust classification performance. This integrated metaheuristic framework, as illustrated in the workflow, ensures systematic global and local searches of the hyperparameter space, effectively improving convergence toward optimal configurations while providing interpretable performance monitoring during model training.
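The mapping from a continuous candidate position to discrete hyperparameters described above can be sketched as follows; the bounds mirror the Transformer-LSTM ranges in Section 3.1.3, while the clip-and-round decoding scheme and key names are assumptions for illustration.

```python
import numpy as np

# Search-space bounds (lower, upper, is_integer) per hyperparameter; ranges follow
# Section 3.1.3, but the decoding scheme itself is an illustrative assumption.
SPACE = {
    "num_heads":  (1,   8,   True),
    "key_dim":    (8,   128, True),
    "ffn_units":  (8,   512, True),
    "lstm_units": (8,   256, True),
    "dropout":    (0.0, 0.5, False),
}

def decode(position):
    """Map a continuous position vector to a discrete hyperparameter dict."""
    params = {}
    for value, (name, (lo, hi, is_int)) in zip(position, SPACE.items()):
        v = float(np.clip(value, lo, hi))            # keep candidates inside bounds
        params[name] = int(round(v)) if is_int else round(v, 3)
    return params

params = decode(np.array([3.7, 40.2, 600.0, 100.4, 0.91]))
# e.g. {'num_heads': 4, 'key_dim': 40, 'ffn_units': 512, 'lstm_units': 100, 'dropout': 0.5}
```

Each decoded dictionary is then used to build and train one candidate Transformer-LSTM, whose validation loss serves as the fitness value.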

3.1.3. Hyperparameter Search and Optimization of Evaluated Models

Five models were evaluated for sequence and feature-based learning, including MLP, LSTM, GRU-TCN, CNN-BiLSTM, and Transformer-LSTM, as detailed in Table 1. MLP captures static feature interactions, LSTM models sequential dependencies, GRU-TCN combines temporal convolutions with gated recurrent units, CNN-BiLSTM integrates convolutional feature extraction with bidirectional sequence modeling, and the Transformer-LSTM uses attention mechanisms with LSTM for long-range dependencies. All hyperparameters were optimized using a hybrid RBSO–MRFO algorithm.
The MLP model is a feedforward network designed for non-sequential tabular data. Hidden Units (discrete, 2 to 16) define the network width, Dropout Rate (continuous, 0.0 to 0.5) regulates regularization, and Learning Rate (continuous, 1 × 10−5 to 1 × 10−2) controls convergence speed. The hybrid RBSO–MRFO algorithm was used to efficiently optimize all hyperparameters, ensuring robust predictive performance [62,63,64].
The LSTM model captures sequential dependencies in time-series data. LSTM Units (discrete, 8 to 128) determine memory capacity, Dropout Rate (continuous, 0.0 to 0.5) controls overfitting, and Learning Rate (continuous, 1 × 10−5 to 1 × 10−2) regulates training stability. Hyperparameters were optimized using the hybrid RBSO–MRFO algorithm, enabling effective sequence learning [62,65,66].
The GRU-TCN model combines GRU layers with TCN blocks to capture both sequential and temporal features. GRU Units (discrete, 8 to 256) govern memory capacity, TCN Filters (discrete, 16 to 64) control temporal feature extraction, Dropout Rate (continuous, 0.0 to 0.5) ensures generalization, and Learning Rate (continuous, 1 × 10−5 to 1 × 10−2) regulates convergence. Hyperparameter optimization with the hybrid RBSO–MRFO algorithm ensures effective learning [62,65,66].
The CNN-BiLSTM model first extracts local temporal patterns using convolutional filters, followed by a Bidirectional LSTM layer to model sequence dependencies. CNN Filters (discrete, 16 to 128) define feature extraction capacity, LSTM Units (discrete, 8 to 128) determine memory capacity, Dropout Rate (continuous, 0.0 to 0.5) regulates regularization, and Learning Rate (continuous, 1 × 10−5 to 1 × 10−2) controls training dynamics. Hyperparameter optimization was performed using the hybrid RBSO–MRFO algorithm [64,65,66].
The Transformer-LSTM model integrates multi-head transformer attention with LSTM layers, capturing both long-range dependencies and sequential dynamics. Number of Heads (discrete, 1 to 8), Key Dimension (discrete, 8 to 128), FFN Units (discrete, 8 to 512), and LSTM Units (discrete, 8 to 256) define the model's representational capacity, while Dropout Rate (continuous, 0.0 to 0.5) controls overfitting. All hyperparameters were optimized via the hybrid RBSO–MRFO algorithm to maximize sequence modeling performance [62,66,67].
All hyperparameters were selected based on established practices in deep learning optimization, guided by prior studies and the use of the hybrid RBSO–MRFO algorithm. The defined ranges provide sufficient coverage for effective exploration of model configurations while ensuring efficient and robust training [62,63,64,65,66,67].
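For reference, the ranges above can be collected into a single search-space configuration; a sketch under the stated ranges, with illustrative key names:

```python
# Hyperparameter search spaces per model, as described in Section 3.1.3.
# Each entry maps a hyperparameter to its (lower, upper) bound; discrete
# ranges hold integers, continuous ranges hold floats.
SEARCH_SPACES = {
    "MLP": {
        "hidden_units":  (2, 16),
        "dropout_rate":  (0.0, 0.5),
        "learning_rate": (1e-5, 1e-2),
    },
    "LSTM": {
        "lstm_units":    (8, 128),
        "dropout_rate":  (0.0, 0.5),
        "learning_rate": (1e-5, 1e-2),
    },
    "GRU-TCN": {
        "gru_units":     (8, 256),
        "tcn_filters":   (16, 64),
        "dropout_rate":  (0.0, 0.5),
        "learning_rate": (1e-5, 1e-2),
    },
    "CNN-BiLSTM": {
        "cnn_filters":   (16, 128),
        "lstm_units":    (8, 128),
        "dropout_rate":  (0.0, 0.5),
        "learning_rate": (1e-5, 1e-2),
    },
    "Transformer-LSTM": {
        "num_heads":     (1, 8),
        "key_dim":       (8, 128),
        "ffn_units":     (8, 512),
        "lstm_units":    (8, 256),
        "dropout_rate":  (0.0, 0.5),
    },
}
```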

3.1.4. Transformer-LSTM Model

Hybrid architectures that integrate Transformer blocks with recurrent neural networks provide a powerful framework for sequential classification. In the present work, convolutional projection and Gaussian noise regularization are followed by a Transformer encoder block and an LSTM layer, before passing through dense layers for final classification [36,68,69,70]. This combination enables both global dependency modeling through attention and temporal dynamics extraction via recurrence. The Transformer core uses scaled dot-product attention to capture global dependencies across the feature sequence:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V$$

where $Q, K, V \in \mathbb{R}^{n \times d_k}$ are the query, key, and value matrices obtained from the input sequence, $d_k$ is the key dimension used for scaling, and the softmax ensures normalized attention weights. To enrich the feature representation, multiple attention heads are used and concatenated:

$$\mathrm{MHA}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}$$

where each head is defined as $\mathrm{head}_i = \mathrm{Attention}(X W_i^{Q}, X W_i^{K}, X W_i^{V})$, $W_i^{Q}, W_i^{K}, W_i^{V}$ are projection matrices, $W^{O}$ is the output projection, and $h$ is the number of heads. The position-wise feed-forward transformation with residual connection and normalization improves stability and nonlinearity:

$$\mathrm{FFN}(x) = \mathrm{LayerNorm}\!\left( x + \max(0,\, x W_1 + b_1)\, W_2 + b_2 \right)$$

where $W_1, W_2$ are trainable matrices, $b_1, b_2$ are biases, the ReLU activation $\max(0, \cdot)$ introduces nonlinearity, and the residual connection with LayerNorm stabilizes optimization. Sequential dependencies are modeled using the LSTM cell:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$h_t = o_t \odot \tanh(c_t)$$

where $i_t, f_t, o_t$ are the input, forget, and output gates, $c_t$ is the memory cell, $h_t$ the hidden state, and $\sigma$ denotes the sigmoid activation. The hidden representation is projected into class probabilities using softmax:

$$\hat{y} = \mathrm{softmax}(W h_T + b)$$

where $h_T$ is the final hidden state, $W$ and $b$ are trainable classifier parameters, and the softmax ensures a probabilistic interpretation across classes.
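The attention and LSTM equations above can be rendered directly in NumPy. A minimal sketch with illustrative dimensions (single head, no input projections or batching):

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    w = softmax(Q @ K.T / np.sqrt(d_k))          # each row of w sums to 1
    return w @ V, w

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h, c, p):
    """One LSTM step implementing the gate equations above."""
    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h + p["bi"])
    f = sigmoid(p["Wf"] @ x + p["Uf"] @ h + p["bf"])
    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h + p["bo"])
    c_new = f * c + i * np.tanh(p["Wc"] @ x + p["Uc"] @ h + p["bc"])
    h_new = o * np.tanh(c_new)
    return h_new, c_new

n, d_k, d_h, d_x = 6, 4, 3, 4                    # illustrative sequence/feature sizes
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
out, w = attention(Q, K, V)

p = {}
for g in "ifoc":                                 # input, forget, output, cell weights
    p[f"W{g}"] = rng.normal(size=(d_h, d_x))
    p[f"U{g}"] = rng.normal(size=(d_h, d_h))
    p[f"b{g}"] = np.zeros(d_h)
h_new, c_new = lstm_cell(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h), p)
```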
1. Binary Classification
The binary classification model employs a hybrid Transformer-LSTM network, as shown in Figure 2. As detailed in Table 2, the input layer is designed to match the number of features and is reshaped to (number of features, 1) to prepare the data for sequential processing. A Conv1D layer with 64 filters, kernel size 1, ReLU activation, and same padding is applied, followed by Gaussian noise with a standard deviation of 0.01. The network then includes a Transformer block with 1 attention head, key dimension of 4, dropout of 0.9, residual connections with layer normalization, and a feed-forward network with 16 units followed by dropout and a dense layer of d_model units with dropout. This is followed by an LSTM layer with 8 units (return_sequences = False) and dropout of 0.9, a dense layer with 128 units and ReLU activation with dropout of 0.2, and a final output layer with a single neuron and sigmoid activation for binary classification.
The model training is configured with the Adam optimizer and binary cross-entropy loss function, which is appropriate for binary classification, as summarized in Table 3. Model performance is evaluated using accuracy. Training is performed in batches of 128 samples with a learning rate of 0.001. A ReduceLROnPlateau learning rate scheduler reduces the learning rate by a factor of 0.5 if the validation loss does not improve for 3 consecutive epochs, with a minimum learning rate of 1 × 10−6. Confusion matrix visualization is included as a callback for interpretability. The architecture incorporates 1 Transformer head, key dimension of 4, feed-forward units of 16, an LSTM layer with 8 units, a dropout rate of 0.9 for Transformer and LSTM layers, and a dense layer with 128 units.
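The ReduceLROnPlateau schedule described above (factor 0.5, patience 3, minimum learning rate 1 × 10−6) can be sketched framework-independently; the class below is a simplified stand-in for the Keras callback, not the callback itself:

```python
class ReduceLROnPlateau:
    """Halve the learning rate after `patience` epochs without val-loss improvement,
    never going below `min_lr` (a simplified stand-in for the Keras callback)."""

    def __init__(self, lr=1e-3, factor=0.5, patience=3, min_lr=1e-6):
        self.lr, self.factor, self.patience, self.min_lr = lr, factor, patience, min_lr
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss):
        if val_loss < self.best:        # improvement: reset the patience counter
            self.best = val_loss
            self.wait = 0
        else:                           # stall: count epochs without improvement
            self.wait += 1
            if self.wait >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr

sched = ReduceLROnPlateau()
losses = [0.9, 0.7, 0.7, 0.7, 0.7]      # validation loss stalls after epoch 2
lrs = [sched.step(l) for l in losses]
# the learning rate drops from 1e-3 to 5e-4 once the loss has stalled for 3 epochs
```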
2. Multi-Class Classification
The multi-class classification model employs a hybrid Transformer-LSTM network, as shown in Figure 3. As detailed in Table 4, the input layer is designed to match the number of features and is reshaped to (number of features, 1) to prepare the data for sequential processing. A Conv1D layer with 64 filters, kernel size 1, ReLU activation, and same padding is applied, followed by Gaussian noise with a standard deviation of 0.01. The network then includes a Multi-Head Attention layer with 1 attention head and key dimension of 16, followed by dropout of 0.5 and residual connections with layer normalization. This is followed by a feed-forward network consisting of a dense layer with 64 units and ReLU activation, dropout of 0.5, a dense layer with d_model units and dropout of 0.5, and residual addition with layer normalization. An LSTM layer with 32 units (return_sequences = False) and dropout of 0.5 is applied, followed by a dense layer with 128 units and ReLU activation with dropout of 0.2, and a final output layer with a number of neurons equal to the number of classes with softmax activation for multi-class classification.
The model training is configured with the Adam optimizer and categorical cross-entropy loss function, which is suitable for multi-class classification, as summarized in Table 5. Model performance is evaluated using accuracy. Training is conducted in batches of 128 samples with a learning rate of 0.001. A ReduceLROnPlateau learning rate scheduler reduces the learning rate by a factor of 0.5 if the validation loss does not improve for 3 consecutive epochs, with a minimum learning rate of 1 × 10−6. Confusion matrix visualization is implemented as a callback for interpretability. The architecture includes 1 Transformer head, a key dimension of 16, feed-forward units of 64, an LSTM layer with 32 units, dropout rates of 0.5 for Transformer and LSTM layers and 0.2 for the dense layer, and a dense layer with 128 units.
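The softmax output and categorical cross-entropy loss used for the multi-class head can be illustrated in NumPy; the logits and one-hot labels below are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy over a batch of one-hot labels and softmax probabilities."""
    return float(-np.mean(np.sum(y_true * np.log(y_pred + eps), axis=-1)))

logits = np.array([[2.0, 0.5, -1.0],             # two samples, three fault classes
                   [0.1, 3.0,  0.2]])
probs = softmax(logits)
y = np.array([[1, 0, 0],                         # one-hot ground-truth labels
              [0, 1, 0]])
loss = categorical_cross_entropy(y, probs)
```

Confident, correct predictions (high probability on the true class) drive the loss toward zero, which is what the optimizer minimizes during training.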

3.2. Comparison with Other Deep Learning Models

3.2.1. MLP Model

For binary classification, a multi-layer perceptron (MLP) model was implemented as a baseline feature-based classifier. The input layer directly receives the extracted feature vector corresponding to each sample. The network consists of two fully connected hidden layers, each comprising two neurons with ReLU activation functions, followed by dropout layers with a rate of 0.4 to mitigate overfitting. The output layer contains a single neuron with a sigmoid activation function, enabling probabilistic binary classification between normal and faulty conditions. The model was trained using the Adam optimizer with a learning rate of 0.001 and binary cross-entropy loss. Training was conducted with a batch size of 128, and performance was evaluated using classification accuracy. To improve training stability and convergence, a ReduceLROnPlateau learning rate scheduler was employed, reducing the learning rate by a factor of 0.5 when the validation loss failed to improve over consecutive epochs, with a minimum learning rate of 1 × 10−6. A confusion matrix visualization callback was used to enhance interpretability of classification results.
For multi-class classification, the MLP architecture was expanded to increase representational capacity. The input layer again accepts the full feature vector, followed by two dense hidden layers with 64 neurons each and ReLU activation. Dropout with a rate of 0.1 was applied after each hidden layer to reduce overfitting while maintaining sufficient model capacity. The output layer consists of a number of neurons equal to the number of fault classes, using a softmax activation function to produce normalized class probabilities. The model was optimized using the Adam optimizer with a learning rate of 0.0001 and categorical cross-entropy loss. Training was performed with a batch size of 128, and accuracy was used as the evaluation metric. A ReduceLROnPlateau scheduler and confusion matrix visualization callback were incorporated to support stable optimization and result interpretation. Although computationally efficient, the MLP lacks explicit mechanisms for temporal dependency modeling, limiting its effectiveness for time-series fault diagnosis.
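The softmax output layer and categorical cross-entropy loss used by the multi-class MLP head can be written from scratch in a few lines; this is a generic sketch of the standard definitions, not the framework implementation.

```python
import math

def softmax(logits):
    m = max(logits)               # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]   # normalized class probabilities

def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """one_hot: target class distribution; probs: softmax output."""
    return -sum(t * math.log(p + eps) for t, p in zip(one_hot, probs))
```

The loss reduces to the negative log-probability assigned to the true class, so confident wrong predictions are penalized far more heavily than uncertain ones.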

3.2.2. LSTM Model

For binary classification, a unidirectional long short-term memory (LSTM) network was employed to explicitly model temporal dependencies in sensor signals. The input data were reshaped to a two-dimensional format of (number of features, 1) to enable sequential processing. A single LSTM layer with 16 memory units was used to capture temporal patterns, followed by a dropout layer with a rate of 0.2 to reduce overfitting. The output layer consists of a single neuron with sigmoid activation for binary classification. The model was trained using the Adam optimizer with a learning rate of 0.001 and binary cross-entropy loss. Training was conducted with a batch size of 128, and accuracy was used as the primary evaluation metric. A ReduceLROnPlateau scheduler and confusion matrix visualization callback were employed to support training stability and performance analysis.
For multi-class classification, the LSTM model was adjusted to reflect the increased complexity of the task. The input representation remained unchanged, while the LSTM layer was configured with four units and followed by a dropout layer with a rate of 0.5 to enhance methodological generalization. The output layer employed softmax activation with a number of neurons equal to the number of fault categories. The model was trained using the Adam optimizer with a learning rate of 0.1 and categorical cross-entropy loss. Training was performed with a batch size of 128, and performance was assessed using accuracy. Learning rate adaptation via ReduceLROnPlateau and confusion matrix visualization were incorporated. While LSTM networks effectively capture sequential dependencies, their reliance on local temporal context and limited parallelization restrict scalability and long-range dependency modeling compared to attention-based architectures.
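The temporal memory that distinguishes the LSTM from the MLP comes from its gating mechanism. The following single-unit, single-step sketch with toy scalar weights illustrates how the input, forget, and output gates update the cell state; it is an illustration of the standard LSTM equations, not the model above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM step; w maps each gate to an (input weight, recurrent
    weight, bias) triple."""
    i = sigmoid(w["i"][0] * x + w["i"][1] * h_prev + w["i"][2])   # input gate
    f = sigmoid(w["f"][0] * x + w["f"][1] * h_prev + w["f"][2])   # forget gate
    o = sigmoid(w["o"][0] * x + w["o"][1] * h_prev + w["o"][2])   # output gate
    g = math.tanh(w["g"][0] * x + w["g"][1] * h_prev + w["g"][2]) # candidate
    c = f * c_prev + i * g        # new cell state mixes old memory and input
    h = o * math.tanh(c)          # new hidden state exposed to later layers
    return h, c
```

Because the forget gate multiplies the previous cell state, gradients along the cell-state path decay more slowly than in plain recurrent networks, though, as noted above, long-range dependency capture remains weaker than with attention.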

3.2.3. GRU-TCN Model

For binary classification, a hybrid gated recurrent unit–temporal convolutional network (GRU-TCN) architecture was implemented to jointly capture short-term temporal features and long-term dependencies. The input layer accepts sensor data formatted as (number of features, 1). A causal one-dimensional convolutional layer with two filters and a kernel size of three was applied to extract local temporal patterns, followed by batch normalization and ReLU activation. A GRU layer with eight units was then used to model sequential dependencies in the extracted feature maps. A dropout layer with a high rate of 0.9 was applied before the output layer to mitigate overfitting. The output layer consists of a single neuron with sigmoid activation. The model was trained using the Adam optimizer with a learning rate of 0.06 and binary cross-entropy loss, incorporating class weights to address data imbalance. Training was conducted with a batch size of 128, and learning rate scheduling and confusion matrix visualization were applied to support convergence and interpretability.
For multi-class classification, the GRU-TCN architecture was scaled to handle increased class complexity. The convolutional layer employed 64 filters with a kernel size of three and causal padding, followed by batch normalization and ReLU activation. A GRU layer with 128 units captured long-term temporal dependencies, followed by dropout with a rate of 0.2. The output layer employed softmax activation with neurons equal to the number of fault classes. The model was optimized using the Adam optimizer with a learning rate of 0.001 and categorical cross-entropy loss. Training was performed with a batch size of 128, incorporating class weighting, learning rate reduction on plateau, and confusion matrix visualization. While the GRU-TCN model effectively combines convolutional and recurrent learning, its receptive field remains constrained by kernel size and sequential processing depth.
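The causal convolution used in the TCN branch guarantees that each output depends only on the current and past inputs, which is achieved by padding on the left only. A stdlib sketch of this operation (not the Keras layer):

```python
def causal_conv1d(signal, kernel):
    """1-D convolution with causal (left-only) zero padding."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(signal)   # pad on the left only
    # output at step t sees padded[t .. t+k-1], i.e. inputs up to time t
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(signal))]
```

Changing a future input never changes an earlier output, which is what makes the layer suitable for online fault monitoring; the receptive field, however, grows only with kernel size and depth, the structural limitation noted above.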

3.2.4. CNN-BiLSTM Model

For binary classification, a hybrid convolutional neural network with bidirectional LSTM (CNN-BiLSTM) architecture was employed. The input sensor data were formatted as (number of features, 1). A causal one-dimensional convolutional layer with 16 filters and a kernel size of three was applied, followed by batch normalization and ReLU activation to extract local temporal features. A bidirectional LSTM layer with eight units was then used to capture temporal dependencies in both forward and backward directions. A dropout layer with a rate of 0.2 was applied before the output layer, which consists of a single neuron with sigmoid activation. The model was trained using the Adam optimizer with a learning rate of 0.08 and binary cross-entropy loss, incorporating class weights. Training was conducted with a batch size of 128, and learning rate scheduling and confusion matrix visualization were applied.
For multi-class classification, the CNN-BiLSTM architecture was extended using a convolutional layer with 64 filters and a kernel size of three, followed by batch normalization. A bidirectional LSTM layer with 64 units modeled bidirectional temporal dependencies, followed by dropout with a rate of 0.2. The output layer employed softmax activation for multi-class prediction. The model was trained using the Adam optimizer with a learning rate of 0.001 and categorical cross-entropy loss with a batch size of 128. Learning rate adaptation and confusion matrix visualization were included. Although the CNN-BiLSTM model captures bidirectional temporal context, it remains limited in modeling global dependencies across long sequences.
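Several of the models above are trained with class weights to counteract data imbalance. A common heuristic, sketched here under the assumption that weights are inversely proportional to class frequency (the paper does not specify its exact formula), is:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n / (k * count_c): rare classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}
```

Passing these weights to the loss makes each misclassified minority-class sample contribute proportionally more to the gradient, so the model cannot minimize the loss by simply predicting the majority class.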

3.2.5. Comparative Analysis of the Proposed RBSO–MRFO Optimized Transformer-LSTM and Other Deep Learning Models

The key distinction among the evaluated models lies in how temporal dependencies, feature interactions, and global context are captured. MLP models operate purely on static feature representations and lack temporal awareness. LSTM-based architectures introduce sequential modeling but are constrained by vanishing gradients and limited long-range dependency capture. GRU-TCN and CNN-BiLSTM architectures improve temporal feature extraction by combining convolutional and recurrent layers; however, their receptive fields remain structurally limited and scale poorly with increasing sequence length.
In contrast, the proposed Transformer-LSTM architecture integrates self-attention mechanisms with recurrent modeling, enabling explicit learning of global dependencies across the entire input sequence while preserving fine-grained temporal dynamics. The Transformer component effectively captures long-range interactions through multi-head attention, while the LSTM layer reinforces sequential continuity and temporal ordering. Furthermore, the hybrid RBSO–MRFO framework provides a systematic and automated mechanism for tuning both architectural and training hyperparameters, balancing global and local search to avoid premature convergence. This results in superior methodological generalization, robustness, and classification accuracy across both binary and multi-class fault diagnosis tasks. The combined advantages of attention-based global modeling, recurrent temporal learning, and metaheuristic-driven hyperparameter optimization explain the consistent performance gains observed over conventional deep learning architectures on the CWRU, TMFD, and MaFaulDa datasets.
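The division of labour between global and local search described above can be sketched as a toy alternating loop: an exploration phase perturbs candidates broadly to preserve diversity, and an exploitation phase drifts them toward the incumbent best. This skeleton on a sphere objective only illustrates the principle; it is emphatically not the paper's RBSO–MRFO implementation, whose operators are far more elaborate.

```python
import random

def sphere(x):
    return sum(v * v for v in x)

def hybrid_search(dim=3, pop=10, iters=50, seed=0):
    rng = random.Random(seed)
    swarm = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(pop)]
    best = min(swarm, key=sphere)
    for t in range(iters):
        for i, x in enumerate(swarm):
            if t % 2 == 0:   # exploration: random perturbation (global search)
                cand = [v + rng.gauss(0, 1.0) for v in x]
            else:            # exploitation: move toward the best (local search)
                cand = [v + rng.random() * (b - v) for v, b in zip(x, best)]
            if sphere(cand) < sphere(x):      # greedy acceptance
                swarm[i] = cand
        best = min(swarm, key=sphere)
    return best, sphere(best)
```

Greedy acceptance makes the best fitness monotonically non-increasing, while the alternation keeps the population from collapsing onto a single basin too early, which is the premature-convergence failure mode the hybrid framework is designed to avoid.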

3.3. Case Western Reserve University (CWRU) Dataset Development

The Case Western Reserve University (CWRU) bearing dataset is widely used in fault diagnosis research and serves as a standard benchmark for evaluating diagnostic algorithms. The experimental setup consists of a 1.5 kW (2 hp) electric motor, a torque sensor, a power meter, and an electronic controller [71]. All experiments were conducted under no-load conditions (0 hp) with the motor rotating at approximately 1792 rpm, and vibration signals were collected from the drive-end bearing housing at a sampling frequency of 48 kHz. All ten fault classes were measured under these same operating conditions, ensuring that observed differences in the vibration signals are attributable solely to fault type rather than variations in motor load or speed.
The dataset includes ten bearing conditions, encompassing normal operation as well as faults in the inner race, outer race, and rolling elements. Artificial faults have diameters of 0.1778 mm, 0.3556 mm, and 0.5334 mm, each with a depth of 0.2794 mm. Faults are annotated as N (normal), I1–I3 (inner race), O1–O3 (outer race), and B1–B3 (rolling element), with the numerical suffix indicating increasing fault size.
Each of the ten bearing conditions contains 230 samples. Table 6 presents this sample distribution, confirming that every class is equally represented. This uniform distribution allows deep learning models to be trained and evaluated without bias toward any particular class.
Statistical analyses were performed for each fault type, including mean, standard deviation, minimum, maximum, and quartiles. Table 7 summarizes these statistics. Normal bearings exhibit a low mean vibration amplitude of 0.94 A, a standard deviation of 1.94 A, and a maximum peak of 11.70 A, reflecting stable operation. In contrast, outer race faults, such as OR_007_6_1 and OR_021_6_1, show significantly higher mean amplitudes of 11.41 A and 7.04 A, with maximum peaks of 313.74 A and 104.54 A, indicating severe faults. Inner race faults, such as IR_021_1, display elevated variability, with a standard deviation of 18.43 A and a maximum of 162.79 A. Ball faults, including Ball_021_1, present moderate amplitudes (mean of 2.36 A, maximum of 73.78 A). These patterns highlight the distinct characteristics of each fault type and provide essential information for accurate fault classification and predictive maintenance.
A boxplot was created to visually compare the distributions across all bearing conditions. Figure 4 shows that normal bearings have minimal variation, while faults in the outer race and rolling elements produce elevated peaks and wider distributions. This visualization enhances interpretability, highlights clear distinctions between fault types, and supports deep learning-based classification and predictive maintenance strategies.

3.4. Industrial Machine Fault Detection (TMFD) Dataset Development

A comprehensive dataset was collected from an industrial turning machine powered by a single-phase AC motor, which is protected by a slow-blow fuse to prevent prolonged overcurrent damage [47]. The motor’s current was continuously monitored using a current sensor, focusing on overload events that indicate electrical faults. Operational data were systematically captured via a programmable logic controller (PLC) S7-1200 (Siemens AG, Munich, Germany) integrated with a human–machine interface (HMI) and the Totally Integrated Automation Portal (TIA Portal v16) software, enabling uninterrupted recording of machine performance. The final dataset comprises 18,567 records, each corresponding to a distinct operational condition.
For deep learning-based fault detection and classification, four input features were selected: motor current, machine lathe velocity, total operating time, and operating system type. The dataset supports two outputs: a binary fault indicator (0 for normal, 1 for fault) and a three-class label identifying normal operation, steady-state overload, and transient overload. Table 8 summarizes the sample distribution for each operating condition. Normal operation dominates the dataset with 18,026 samples, while steady-state and transient overloads are represented by 320 and 220 samples, respectively. This distribution reflects the typical industrial scenario in which faults are rare yet critical to detect.
Statistical characterization of each operating condition was performed to analyze fault behavior and support interpretability. Table 9 presents key metrics, including mean, standard deviation, minimum, maximum, and quartiles. Normal operation exhibits a mean current of 73.52 A with a standard deviation of 125.45 A, whereas steady-state and transient overloads have comparable mean currents (70.04 A and 70.38 A) but higher variability (138.70 A and 138.53 A). These statistics highlight the distinct signatures of overload events compared to normal operation, providing essential information for predictive maintenance applications.
A boxplot was developed to visually compare the distributions across all operating conditions, as presented in Figure 5. Normal operation exhibits wider variations due to routine fluctuations, whereas steady-state and transient overloads display distinct peaks corresponding to fault events. This visualization enhances interpretability and supports the development of deep learning-based fault detection and predictive maintenance strategies, improving machine reliability and operational safety.

3.5. Machinery Fault (MaFaulDa) Dataset Development

The MaFaulDa dataset is a publicly available resource widely used for developing, training, and evaluating machine and deep learning models for fault detection in rotating machinery. The dataset was originally curated by Felipe Moreira Lopes Ribeiro of the Federal University of Rio de Janeiro, Brazil [72]. Data were acquired using the Machinery Fault Simulator (MFS), model ABTV (Spectra Quest), which replicates multiple mechanical fault conditions and generates realistic vibration signals. The dataset consists of 1951 sequences, each corresponding to one of seven predefined operating conditions, including normal operation and six imbalance fault levels (6 g, 10 g, 15 g, 20 g, 25 g, 30 g). Signals were recorded at 50 kHz over five seconds per sequence, producing 250,000 samples per sequence and a total of approximately 487.75 million measurements (~13 GB). Each sequence contains eight features, including a tachometer signal, tri-axial accelerometer readings from two sensors, and a single-channel audio signal.
To reduce noise and computational load, sequences were segmented into non-overlapping windows of 5000 samples, with each segment averaged. Binary labels were assigned, where 1 indicates normal operation and 0 corresponds to faults under different imbalance levels. These controlled operating conditions ensure that variations in the signals are directly attributable to fault severity rather than experimental inconsistencies.
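The segmentation step above (non-overlapping windows of 5000 samples, each reduced to its mean) can be sketched directly:

```python
def segment_and_average(signal, window=5000):
    """Split into non-overlapping windows and represent each by its mean;
    any incomplete trailing window is dropped."""
    n = len(signal) // window
    return [sum(signal[i * window:(i + 1) * window]) / window
            for i in range(n)]
```

A 250,000-sample sequence thus collapses to 50 averaged values per channel, which is what keeps the noise and the computational load manageable.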
Table 10 presents the sample distribution per fault type. Sample sizes are relatively balanced, ranging from 11.75 million to 12.25 million sequences per class, ensuring that the deep learning model can be trained and evaluated without bias toward any particular fault condition.
Statistical characterization was performed for each fault type, including mean, standard deviation, minimum, maximum, and quartiles (Q1, median, Q3). Table 11 summarizes these metrics. Normal operation exhibits low mean vibration amplitudes (0.0071) with moderate variability (standard deviation 0.7466) and maximum peaks of 5.11, whereas higher imbalance faults, such as 25 g and 30 g, show substantially larger variability (standard deviations of 0.9612 and 1.1439) and extreme maximum amplitudes (126.86 and 130.40, respectively). Intermediate imbalance levels (10 g, 15 g, 20 g) display moderate elevations in amplitude and variability. These statistical differences provide a quantitative foundation for distinguishing normal operation from fault conditions and are critical for accurate fault classification and predictive maintenance.
A boxplot was generated to provide a visual comparison of signal amplitudes across all fault types. Figure 6 shows that normal operation has a tight distribution with low-amplitude signals, whereas higher imbalance faults, particularly 25 g and 30 g, exhibit broader distributions and higher peaks. Intermediate fault levels (10 g, 15 g, 20 g) display moderate amplitude elevations. This visualization highlights the distinct characteristics of each fault condition, improves dataset interpretability, and supports the development of deep learning models for robust fault classification and predictive maintenance.
In this study, the TMFD dataset is used to evaluate electrical fault conditions, specifically motor overloads, which reflect abnormal current and voltage behavior in industrial turning machines. In contrast, the CWRU and MaFaulDa datasets are employed to assess mechanical fault scenarios, including bearing defects, imbalance, and rotational anomalies. The inclusion of both electrical and mechanical fault datasets enables a comprehensive evaluation of the proposed framework, highlighting its methodological generalization capability and robustness in practical industrial applications.

3.6. Data Preprocessing

3.6.1. CWRU Dataset

The CWRU bearing dataset was processed for both binary and multi-class classification using time-domain features extracted from vibration signals. Raw fault labels were mapped into numerical categories representing various bearing conditions, including inner race, outer race, and ball defects. Outliers were mitigated using Local Outlier Factor and z-score filtering on a per-class basis. The cleaned dataset was randomly shuffled and normalized using Min-Max scaling. For binary classification, the Normal class was labeled as one class and all fault conditions as the other, whereas for multi-class classification, the fault categories were retained and one-hot encoded for neural network training. The dataset was split into training and testing subsets using an 80:20 ratio, with random shuffling and a fixed random seed to ensure reproducibility.
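Two of the preprocessing steps above, Min-Max scaling and the seeded 80:20 shuffle-split, can be sketched with the standard library alone. The seed value 42 here is an illustrative assumption; the paper only states that a fixed seed was used.

```python
import random

def min_max_scale(column):
    """Rescale one feature column to [0, 1]."""
    lo, hi = min(column), max(column)
    span = hi - lo if hi > lo else 1.0        # guard against constant features
    return [(v - lo) / span for v in column]

def train_test_split(samples, test_ratio=0.2, seed=42):
    """Shuffle with a fixed seed, then split 80:20 for reproducibility."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = int(len(samples) * (1 - test_ratio))
    return [samples[i] for i in idx[:cut]], [samples[i] for i in idx[cut:]]
```

Fixing the seed makes every run produce the identical partition, which is what allows the before/after optimization comparisons later in the paper to be attributed to the models rather than to the split.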

3.6.2. TMFD Dataset

The TMFD dataset was preprocessed for binary and multi-class fault detection. Missing values were removed, duplicates eliminated, and class distributions verified. Binary classification distinguished between normal and fault conditions, whereas multi-class classification comprised three classes: no fault, steady-state overload, and transient overload. Features were normalized using Min-Max scaling. For binary and multi-class classification, the dataset was split into training and testing subsets using an 80:20 ratio. For multi-class classification, the split was stratified to preserve the original class distributions across both subsets, ensuring balanced representation of all fault types.
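Stratification, as described above, means shuffling and splitting each class separately so both subsets preserve the original class proportions. A stdlib sketch (seed 42 is an illustrative assumption):

```python
import random
from collections import defaultdict

def stratified_split(samples, labels, test_ratio=0.2, seed=42):
    """80:20 split performed per class so proportions are preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for s, y in zip(samples, labels):
        by_class[y].append(s)
    train, test = [], []
    for y, group in by_class.items():
        rng.shuffle(group)
        cut = int(len(group) * (1 - test_ratio))
        train += [(s, y) for s in group[:cut]]
        test += [(s, y) for s in group[cut:]]
    return train, test
```

For a heavily skewed dataset like TMFD, this guarantees the small overload classes appear in the test set at all, which a plain random split cannot promise.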

3.6.3. MaFaulDa Dataset

The MaFaulDa motor fault dataset was processed for both binary and multi-class classification using vibration signals under multiple load conditions. Normal and faulty states were aggregated, downsampled to reduce computational complexity, and transformed via frequency-domain analysis using FFT. For binary classification, normal states were distinguished from all fault conditions, while multi-class classification included seven classes corresponding to normal and six fault conditions. Features were normalized using Min-Max scaling. The dataset was divided into training and testing subsets using a 75:25 ratio with random shuffling to ensure all classes were represented appropriately in both subsets. Multi-class labels were one-hot encoded for training.
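The frequency-domain transform above produces a magnitude spectrum per window. For illustration, here is a naive one-sided DFT; production code would use a library FFT, which computes the same quantity far faster.

```python
import cmath

def magnitude_spectrum(signal):
    """One-sided magnitude spectrum via a direct DFT (O(n^2); for
    illustration only, an FFT gives identical values)."""
    n = len(signal)
    half = n // 2 + 1                          # keep only non-negative bins
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal)))
            for k in range(half)]
```

Imbalance faults concentrate energy at the rotation frequency and its harmonics, so the spectral bins make the seven MaFaulDa classes far more separable than the raw time series.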
To avoid data leakage, the training and testing sets were separated for the CWRU, TMFD, and MaFaulDa datasets, and all splits were performed using fixed random seeds to ensure full reproducibility.

3.7. Performance Metrics

Accurate fault detection in industrial machinery is particularly challenging due to the substantial imbalance typically observed in operational datasets. Failure events are extremely rare, usually constituting less than 1% of the data, whereas normal operation dominates with over 99% of samples. This imbalance can lead conventional deep learning models to favor the majority class, reducing their sensitivity to critical faults and limiting the informativeness of traditional metrics such as overall accuracy [47]. Overall accuracy is defined as the proportion of correctly classified instances, including true positives (TP) and true negatives (TN) out of all samples, which also comprise false positives (FP) and false negatives (FN):
Overall accuracy = (TP + TN) / (TP + FP + TN + FN)
Although commonly used, overall accuracy may give a misleading picture in highly skewed datasets. A model might achieve high accuracy primarily by predicting the majority class correctly, while failing to capture rare but crucial fault occurrences. To address this limitation, it is necessary to adopt metrics that provide a more granular view of model performance, particularly for minority class detection. The most informative measures are Precision, Recall, and the F1 score. Precision quantifies the proportion of correctly identified positive samples among all instances predicted as positive:
Precision = TP / (TP + FP)
Recall measures the fraction of actual positive instances that the model correctly identifies:
Recall = TP / (TP + FN)
The F1 score, which represents the harmonic mean of Precision and Recall, provides a single balanced metric, especially useful in datasets with severe class imbalance:
F1 score = 2 × (Precision × Recall) / (Precision + Recall)
In this study, overall accuracy was calculated as the ratio of correctly predicted samples to the total number of samples. Class-wise Precision, Recall, and F1-score were computed from the confusion matrix according to Equations (16)–(18), and weighted averages were subsequently calculated across all classes based on their support to account for class imbalance. This approach ensures that minority classes contribute proportionally to the overall performance metrics. Consequently, minor discrepancies may arise between the reported F1-score and values obtained from a direct formula-based calculation due to this weighted aggregation. This methodology provides a rigorous and fair evaluation of the model’s performance in highly imbalanced datasets, accurately reflecting its ability to detect rare fault events.
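The support-weighted aggregation described above can be made concrete: compute class-wise Precision, Recall, and F1 from the confusion matrix, then average the F1 scores weighted by each class's share of the samples. A stdlib sketch, where cm[i][j] counts true class i predicted as class j:

```python
def weighted_f1(cm):
    """Support-weighted F1 from a square confusion matrix."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    score = 0.0
    for c in range(k):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(k)) - tp   # predicted c, wrong
        fn = sum(cm[c]) - tp                        # true c, missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += (sum(cm[c]) / total) * f1          # weight by class support
    return score
```

Because the weighting uses class support rather than a uniform average, this is exactly the mechanism that can produce the small discrepancies against a direct formula-based calculation noted above.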
By emphasizing these metrics, the proposed methodology ensures a comprehensive assessment of the model’s capability to detect minority class instances, addressing the challenges posed by extreme class imbalance and guiding the improvement of industrial fault detection systems.

4. Results and Discussion

4.1. Model Experiment Results

The performance of multiple deep learning architectures was systematically evaluated for fault diagnosis using the CWRU, TMFD, and MaFaulDa datasets. Both binary and multi-class classification tasks were considered to comprehensively assess model capabilities under diverse operating conditions. Each model was carefully optimized through hyperparameter tuning using the RBSO–MRFO algorithm to ensure effective feature extraction, enabling robust representation of the underlying patterns in the data and providing a solid foundation for comparative analysis and performance evaluation.

4.1.1. CWRU Dataset

1.
Binary Classification
The optimized hyperparameters for the deep learning models used in binary classification of the CWRU dataset are summarized in Table 12. For the MLP model, the best configuration included 15 hidden units, a dropout rate of 0.0619, and a learning rate of 0.0098, balancing model complexity and regularization to enhance methodological generalization. The LSTM model achieved optimal performance with 128 LSTM units, a dropout rate of 0.1039, and a learning rate of 0.0100, enabling effective temporal feature learning while preventing overfitting. The GRU-TCN model combined 93 GRU units with 16 TCN filters, a dropout rate of 0.2318, and a learning rate of 0.0085, capturing both sequential dependencies and temporal convolution patterns efficiently. For the CNN-BiLSTM architecture, 67 CNN filters, 102 LSTM units, a dropout rate of 0.0998, and a learning rate of 0.0051 were selected, allowing effective extraction of spatial features prior to sequential modeling. The Transformer-LSTM model was optimized with 3 attention heads, a key dimension of 32, 431 FFN units, 161 LSTM units, and a dropout rate of 0.2067, providing a balanced architecture for attention-based feature representation and temporal dependency modeling, resulting in robust classification performance across the dataset.
2.
Multi-Class Classification
The deep learning models employed for multi-class classification of the CWRU dataset were tuned with specific optimized hyperparameter configurations to maximize performance, as summarized in Table 13. For the MLP network, the optimal parameters consisted of 14 hidden units, a dropout rate of 0.0170, and a learning rate of 0.0100, promoting efficient learning while maintaining training stability. The LSTM architecture performed best with 127 LSTM units, a relatively high dropout rate of 0.4771, and a learning rate of 0.0092, enabling effective temporal feature extraction and mitigating overfitting. The GRU-TCN model was configured with 156 GRU units, 54 TCN filters, a dropout rate of 0.2579, and a learning rate of 0.0065, allowing the network to capture both sequential dependencies and temporal convolutional patterns efficiently. For CNN-BiLSTM, 30 CNN filters, 120 LSTM units, a dropout rate of 0.3741, and a learning rate of 0.0078 were selected to enhance spatial feature extraction prior to sequential modeling. The Transformer-LSTM model utilized 2 attention heads, a key dimension of 95, 339 FFN units, 256 LSTM units, and no dropout, providing robust attention-based feature representation along with temporal dependency modeling, ensuring reliable multi-class classification across the dataset.

4.1.2. TMFD Dataset

1.
Binary Classification
The optimized hyperparameter tuning for deep learning models applied to binary classification of the TMFD dataset is presented in Table 14. For the MLP model, the optimal parameters comprised 8 hidden units, a dropout rate of 0.0302, and a learning rate of 0.0077, which ensured efficient feature representation while maintaining stable convergence during training. The LSTM architecture achieved its best configuration with 91 LSTM units, a dropout rate of 0.4810, and a learning rate of 0.0090, enabling robust temporal dependency learning while mitigating overfitting. In the GRU-TCN model, 122 GRU units were combined with 53 TCN filters, a dropout rate of 0.2888, and a learning rate of 0.0099, effectively capturing both sequential patterns and temporal convolutional features. The CNN-BiLSTM model reached optimal performance using 57 CNN filters, 113 LSTM units, a dropout rate of 0.0000, and a learning rate of 0.0081, supporting spatial feature extraction prior to sequential processing. For the Transformer-LSTM architecture, the best configuration included 4 attention heads, a key dimension of 74, 143 FFN units, 250 LSTM units, and a dropout rate of 0.5000, providing comprehensive attention-based feature modeling along with temporal sequence learning, resulting in robust binary classification across the TMFD dataset.
2.
Multi-Class Classification
The optimized hyperparameter configurations for deep learning models applied to multi-class classification of the TMFD dataset are summarized in Table 15. The MLP model achieved its best performance with 13 hidden units, a dropout rate of 0.0638, and a learning rate of 0.0097, supporting efficient feature mapping while maintaining training stability. The LSTM network was tuned with 66 LSTM units, a dropout rate of 0.1040, and a learning rate of 0.0100, facilitating effective temporal sequence learning and controlling overfitting. For the GRU-TCN architecture, 184 GRU units and 50 TCN filters were combined with a dropout rate of 0.1490 and a learning rate of 0.0088, capturing both sequential dependencies and temporal convolution features. The CNN-BiLSTM model utilized 98 CNN filters, 111 LSTM units, a dropout rate of 0.5000, and a learning rate of 0.0082, optimizing spatial feature extraction followed by sequential processing. In the Transformer-LSTM model, 2 attention heads, a key dimension of 94, 251 FFN units, 145 LSTM units, and a dropout rate of 0.0553 provided balanced attention-based feature representation alongside temporal modeling, ensuring robust multi-class classification performance across the TMFD dataset.

4.1.3. MaFaulDa Dataset

1.
Binary Classification
The hyperparameter optimization for deep learning models applied to binary classification of the MaFaulDa dataset is presented in Table 16. The MLP model achieved optimal performance with 16 hidden units, a dropout rate of 0.0414, and a learning rate of 0.0061, allowing efficient representation of input features while maintaining stability during training. The LSTM network was configured with 89 LSTM units, a dropout rate of 0.4077, and a learning rate of 0.0100, enabling effective temporal pattern recognition and reducing the risk of overfitting. For the GRU-TCN architecture, 50 GRU units were paired with 60 TCN filters, a dropout rate of 0.1740, and a learning rate of 0.0086, capturing both sequential dependencies and temporal convolutional features. The CNN-BiLSTM model performed best with 81 CNN filters, 84 LSTM units, a dropout rate of 0.2304, and a learning rate of 0.0053, combining spatial feature extraction with sequential modeling. In the Transformer-LSTM model, 4 attention heads, a key dimension of 80, 8 FFN units, 144 LSTM units, and a dropout rate of 0.1267 were selected, providing a robust balance between attention-based feature representation and temporal dependency modeling, ensuring high classification performance on the MaFaulDa dataset.
2.
Multi-Class Classification
The optimized hyperparameters for deep learning models applied to multi-class classification of the MaFaulDa dataset are reported in Table 17. The MLP model reached its best configuration with 14 hidden units, a minimal dropout rate of 0.0034, and a learning rate of 0.0100, allowing effective feature encoding while maintaining training stability. For the LSTM network, 75 LSTM units, a dropout rate of 0.1150, and a learning rate of 0.0092 provided optimal temporal representation while controlling overfitting. The GRU-TCN model was configured with 190 GRU units, 53 TCN filters, a dropout rate of 0.0251, and a learning rate of 0.0089, effectively integrating sequential memory and temporal convolution for accurate feature extraction. In the CNN-BiLSTM architecture, 67 CNN filters and 86 LSTM units, combined with a dropout rate of 0.1978 and a learning rate of 0.0100, facilitated robust spatial feature learning followed by sequential processing. The Transformer-LSTM model employed a single attention head, a key dimension of 11, 29 FFN units, 243 LSTM units, and a dropout rate of 0.4965, providing balanced attention-based feature extraction along with long-term temporal dependency modeling, ensuring reliable multi-class classification performance across the MaFaulDa dataset.
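For readers reimplementing the model, the optimized Transformer-LSTM settings for the MaFaulDa dataset can be collected into a plain configuration mapping. The values are transcribed from the text above; the dictionary layout itself is only an illustrative convention, not the authors' code:

```python
# Optimized Transformer-LSTM hyperparameters for the MaFaulDa dataset,
# transcribed from Tables 16 and 17 as reported in the text.
# The nesting by classification task is an illustrative convention.
MAFAULDA_TRANSFORMER_LSTM = {
    "binary": {
        "attention_heads": 4,    # discrete hyperparameter
        "key_dim": 80,
        "ffn_units": 8,
        "lstm_units": 144,
        "dropout": 0.1267,       # continuous hyperparameter
    },
    "multiclass": {
        "attention_heads": 1,
        "key_dim": 11,
        "ffn_units": 29,
        "lstm_units": 243,
        "dropout": 0.4965,
    },
}
```

A mapping of this shape makes the mixed continuous–discrete nature of the search space explicit, which is what the RBSO–MRFO optimizer must handle.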

4.2. Analysis of the Effects Before and After Applying Model Optimization

The performance of multiple deep learning architectures, including MLP, LSTM, GRU-TCN, CNN-BiLSTM, and Transformer-LSTM, was evaluated for both binary and multi-class fault diagnosis across the CWRU, TMFD, and MaFaulDa datasets. Integration of the RBSO–MRFO framework consistently enhanced predictive accuracy, reliability, and robustness for all models, demonstrating the general effectiveness of the proposed approach for machine fault diagnosis under diverse operating conditions.

4.2.1. Performance of Predictive Models of CWRU Dataset

1. Binary Classification
The performance of the Transformer-LSTM model on the CWRU dataset was evaluated using confusion matrices before and after optimization, as shown in Figure 7a,b. In the pre-optimized model, all 45 No Fault samples were correctly classified, while six Fault samples were misclassified, resulting in 312 accurate Fault predictions. Following optimization, the model preserved perfect classification for the No Fault class and reduced Fault misclassifications from six to one, achieving 317 correct Fault predictions. This improvement highlights the effectiveness of the RBSO–MRFO algorithm in enhancing fault detection accuracy, demonstrating the model’s robustness and reliability for machine fault diagnosis applications.
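The accuracy figures above follow directly from these confusion-matrix counts; a quick arithmetic check in plain Python, using only the counts stated in the text, reproduces the reported 98.35% and 99.72%:

```python
# Recompute binary accuracy on the CWRU test split from the
# confusion-matrix counts reported in the text
# (45 No Fault samples and 318 Fault samples in total).
def accuracy(tn, fp, fn, tp):
    """Fraction of correct predictions over all samples."""
    return (tn + tp) / (tn + fp + fn + tp)

# Before optimization: all 45 No Fault correct, 312 of 318 Fault correct.
acc_before = accuracy(tn=45, fp=0, fn=6, tp=312)
# After optimization: Fault misclassifications drop from six to one.
acc_after = accuracy(tn=45, fp=0, fn=1, tp=317)

print(f"before: {acc_before:.2%}  after: {acc_after:.2%}")
# → before: 98.35%  after: 99.72%
```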
The performance of various deep learning models for binary classification was evaluated before and after applying the RBSO–MRFO framework, as summarized in Table 18. The baseline MLP model achieved an accuracy of 96.96%, precision of 97.56%, recall of 96.96%, and an F-score of 97.10%, while the LSTM reached 97.24% in accuracy, precision, and recall, with an F-score of 97.12%. The GRU-TCN model recorded slightly lower performance, with an accuracy of 95.59% and F-score of 95.87%. CNN-BiLSTM and Transformer-LSTM models demonstrated stronger results, achieving accuracy values of 98.07% and 98.35%, respectively. After optimization, all models exhibited improved performance, with optimized MLP, LSTM, GRU-TCN, CNN-BiLSTM, and Transformer-LSTM attaining accuracy ranging from 98.89% to 99.72%, precision from 98.98% to 99.73%, recall from 98.89% to 99.72%, and F-score from 98.91% to 99.72%. Notably, the optimized Transformer-LSTM achieved the highest metrics across all measures, as illustrated in Figure 8, highlighting the effectiveness of the proposed optimization.
2. Multi-Class Classification
The performance of the Transformer-LSTM model for machine fault diagnosis was evaluated using confusion matrices before and after optimization. The class labels were arranged as Ball_014_1, Ball_021_1, Normal_1, IR_014_1, Ball_007_1, IR_021_1, OR_014_6_1, IR_007_1, OR_007_6_1, and OR_021_6_1. As presented in Figure 9, the pre-optimized model demonstrated satisfactory classification performance for several classes, particularly Normal_1, IR_014_1, and OR_021_6_1, while other classes exhibited misclassifications, such as Ball_014_1 being confused with OR_014_6_1 and OR_014_6_1 misclassified as Ball_021_1. These misclassifications indicate challenges in differentiating bearing faults with similar vibration characteristics. As shown in Figure 10, after optimization, the Transformer-LSTM achieved near-perfect classification across all classes, with correct predictions increasing and misclassifications significantly reduced. All ball and inner race fault types were accurately identified, and only minimal misclassification was observed in OR_007_6_1. The results indicate that the optimization process enhances the model’s ability to capture subtle variations in vibration signals, improving both accuracy and reliability. The optimized Transformer-LSTM provides a robust framework for precise multi-class fault diagnosis, demonstrating its suitability for real-time condition monitoring and fault detection in rotating machinery applications.
The evaluation of deep learning models for multi-class classification was conducted both before and after applying the optimization strategy, as detailed in Table 19. Initially, the MLP model achieved an accuracy of 96.14%, precision of 96.51%, recall of 96.14%, and an F-score of 96.05%, while the LSTM model recorded slightly lower values, with accuracy and recall of 95.31%, precision of 95.34%, and F-score of 95.29%. The GRU-TCN network demonstrated stronger performance, reaching 98.07% across accuracy, recall, and F-score, with precision at 98.10%. CNN-BiLSTM and Transformer-LSTM models showed accuracies of 96.14% and 97.21%, respectively, with corresponding precision, recall, and F-score values reflecting similar trends. After optimization, all models exhibited enhanced performance, with optimized MLP, LSTM, GRU-TCN, CNN-BiLSTM, and Transformer-LSTM achieving accuracy from 98.62% to 99.72%, precision from 98.70% to 99.73%, recall from 98.62% to 99.72%, and F-score from 98.62% to 99.72%. The optimized Transformer-LSTM reached the peak performance across all metrics, as illustrated in Figure 11, confirming the effectiveness of the proposed method.

4.2.2. Performance of Predictive Models of TMFD Dataset

1. Binary Classification
The classification performance of the Transformer-LSTM model for binary fault diagnosis was evaluated before and after optimization, as presented in Figure 12a,b. The classes were arranged as No Fault and Fault on both axes. In the pre-optimized model, the No Fault class achieved 3590 correct predictions with minor misclassifications, while the Fault class recorded 106 correct identifications with some errors. After optimization, the model demonstrated improved performance, achieving 3601 correct predictions for No Fault and 112 correct predictions for Fault, with misclassifications nearly eliminated. These results indicate that optimization enhances the model’s ability to accurately detect faults, increasing reliability and robustness for real-time condition monitoring applications.
The classification performance of various deep learning models for binary tasks was assessed before and after optimization, as summarized in Table 20. The unoptimized MLP achieved an accuracy of 93.88%, precision of 97.30%, recall of 93.88%, and F-score of 95.18%, while the LSTM network reached 97.38% for accuracy and recall, with a precision of 98.59% and F-score of 97.76%. GRU-TCN demonstrated comparable results with an accuracy of 97.39%, precision of 98.60%, recall of 97.39%, and F-score of 97.76%. CNN-BiLSTM and Transformer-LSTM further improved baseline performance, achieving accuracies of 98.55% and 99.52%, respectively. After applying the optimization procedure, all models exhibited marked improvements, with optimized MLP, LSTM, GRU-TCN, CNN-BiLSTM, and Transformer-LSTM reaching accuracy values ranging from 99.00% to 99.97%, precision from 99.24% to 99.97%, recall from 99.00% to 99.97%, and F-score from 99.07% to 99.97%. The optimized Transformer-LSTM attained the highest scores across all metrics, as depicted in Figure 13, highlighting the significant effectiveness of the optimization framework in enhancing classification reliability and methodological generalization.
2. Multi-Class Classification
The performance of the Transformer-LSTM model for the TMFD dataset was evaluated using pre-optimization and post-optimization confusion matrices. In the pre-optimized model, the No Fault class was perfectly classified, with all 3606 samples correctly identified. However, significant misclassifications were observed for overload conditions, where 32 Steady-State Overload samples were incorrectly predicted as Transient Overload, and 21 Transient Overload samples were classified as Steady-State Overload, highlighting limitations in distinguishing fault types, as shown in Figure 14. After applying the RBSO–MRFO algorithm, the optimized Transformer-LSTM model exhibited substantially improved classification performance. All no-fault samples remained correctly classified, while Steady-State Overload misclassifications decreased dramatically to a single instance, and all Transient Overload samples were correctly identified, as presented in Figure 15. These improvements indicate that the optimization algorithm effectively enhanced the model’s discriminative ability between overload conditions, thereby reducing false positives and false negatives. The optimized Transformer-LSTM demonstrates superior accuracy and reliability in machine fault diagnosis, ensuring robust performance in both normal and fault conditions. This underscores the significance of optimization techniques in improving deep learning-based fault classification systems for real-time monitoring applications.
The comparative evaluation of deep learning models for multi-class classification, presented in Table 21, demonstrates the impact of optimization on model performance. Before optimization, MLP achieved an accuracy of 93.21%, precision of 97.87%, recall of 93.21%, and F-score of 95.10%, whereas LSTM reached 98.51% accuracy, 98.56% precision, 98.51% recall, and 98.53% F-score. GRU-TCN showed comparatively lower performance with 69.17% accuracy, 95.81% precision, 69.17% recall, and 79.39% F-score. CNN-BiLSTM and Transformer-LSTM delivered higher results, with accuracies of 98.33% and 98.57%, and corresponding precision, recall, and F-score values. After applying the RBSO-MRFO algorithm, all models exhibited substantial improvement: optimized MLP achieved 99.11% accuracy, 99.12% precision, 99.11% recall, and 99.07% F-score; optimized LSTM reached 99.91% across all metrics; optimized GRU-TCN improved to 89.74% accuracy and 92.00% F-score; optimized CNN-BiLSTM attained 99.92% for all metrics; notably, the optimized Transformer-LSTM outperformed all models, achieving 99.97% in accuracy, precision, recall, and F-score, as depicted in Figure 16, highlighting the effectiveness of the proposed optimization framework.

4.2.3. Performance of Predictive Models of MaFaulDa Dataset

1. Binary Classification
The classification performance of the Transformer-LSTM model was analyzed on the MaFaulDa dataset before and after optimization, as depicted in Figure 17a,b. In the pre-optimized model, 7076 No Fault samples were correctly classified, while 148 were misclassified as Fault. Additionally, five Fault samples were incorrectly labeled as No Fault, with 1195 accurately predicted. After optimization, the model achieved substantial improvements, correctly classifying 7223 No Fault samples with only one misclassification and reducing Fault misclassifications to a single sample while correctly predicting 1199 Fault cases. These results demonstrate that the RBSO–MRFO algorithm significantly enhances classification accuracy and reliability for machine fault diagnosis applications.
The evaluation of deep learning models for binary classification was performed before and after applying the proposed optimization framework, as summarized in Table 22. Initially, the MLP network achieved an accuracy of 90.97%, precision of 90.39%, recall of 90.97%, and F-score of 90.03%, while the LSTM model reached 97.60% accuracy and recall, 97.91% precision, and an F-score of 97.67%. GRU-TCN exhibited similar results with 97.52% accuracy and recall, 97.78% precision, and 97.58% F-score. CNN-BiLSTM and Transformer-LSTM obtained baseline accuracy values of 95.98% and 98.18%, respectively, with corresponding precision, recall, and F-score demonstrating comparable trends. After optimization, all models showed substantial enhancement: optimized MLP, LSTM, GRU-TCN, CNN-BiLSTM, and Transformer-LSTM achieved accuracy between 99.30% and 99.98%, precision from 99.32% to 99.98%, recall from 99.30% to 99.98%, and F-score from 99.31% to 99.98%. Notably, the optimized Transformer-LSTM reached the highest scores across all metrics, as depicted in Figure 18, demonstrating the robustness and effectiveness of the optimization approach in improving classification reliability.
2. Multi-Class Classification
The performance of the Transformer-LSTM model for machine fault diagnosis was assessed using confusion matrices before and after optimization. The class labels were organized as 6G, Normal, 10G, 15G, 20G, 25G, and 30G. As shown in Figure 19, the pre-optimized model demonstrated high accuracy in several classes, including Normal with 1214 correct predictions and 30G with 1160 correct predictions. However, misclassifications were notable in intermediate fault classes, particularly 15G, 20G, and 25G, indicating difficulty in distinguishing similar fault patterns. After optimization, the Transformer-LSTM model showed marked improvement in classification performance across all classes, as presented in Figure 20. Correct predictions increased substantially for 10G from 1142 to 1194, 15G from 1097 to 1123, 20G from 1024 to 1200, and 25G from 937 to 1138, while misclassifications decreased significantly, especially between neighboring fault classes. The optimized model achieved nearly perfect recognition for Normal and 30G classes, with misclassification rates approaching zero. These improvements highlight the model’s enhanced ability to identify subtle differences in machine fault signatures. The results demonstrate that optimization of the Transformer-LSTM significantly enhances its diagnostic accuracy and reliability, making it highly suitable for precise multi-class machine fault detection in industrial applications.
The classification performance of multiple deep learning models for multi-class tasks was assessed before and after applying the proposed optimization approach, as presented in Table 23. Initially, the MLP achieved an accuracy of 88.05%, precision of 87.93%, recall of 88.05%, and F-score of 87.84%, whereas LSTM yielded lower values with 75.57% accuracy and recall, 74.93% precision, and 74.97% F-score. GRU-TCN demonstrated improved baseline performance with 91.17% accuracy, 91.07% precision, 91.17% recall, and 91.10% F-score. CNN-BiLSTM and Transformer-LSTM delivered higher initial metrics, with accuracies of 90.46% and 92.82%, respectively, and corresponding precision, recall, and F-score reflecting similar improvements. Following optimization, all models exhibited marked enhancement: optimized MLP, LSTM, GRU-TCN, CNN-BiLSTM, and Transformer-LSTM attained accuracy ranging from 92.23% to 98.60%, precision from 92.16% to 98.60%, recall from 92.23% to 98.60%, and F-score from 92.18% to 98.60%. The optimized Transformer-LSTM reached the peak performance across all metrics, as depicted in Figure 21, confirming the effectiveness of the proposed strategy in improving classification reliability and consistency.
The proposed RBSO–MRFO optimized Transformer-LSTM framework demonstrates consistently high performance across three datasets, including CWRU, TMFD, and MaFaulDa. The framework achieves very high accuracy in both binary and multi-class fault classification tasks, with only a minimal number of misclassifications observed in each dataset. The inclusion of the MaFaulDa dataset further demonstrates the generalizability of the methodology across different machines and fault conditions, highlighting its robustness and scalability in industrial applications. While hyperparameters were optimized separately for each dataset, the hybrid hyperparameter optimization, combined with regularization strategies such as Dropout and GaussianNoise layers, ensures that the model avoids overfitting while maintaining strong predictive performance. These results collectively indicate that the proposed framework is reliable, computationally efficient, and capable of adapting to diverse operational conditions, thereby addressing potential limitations associated with overfitting in deep learning-based fault diagnosis models.
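The Dropout and GaussianNoise layers mentioned here act as simple training-time transforms. The stand-alone sketch below (plain Python, mirroring but not reproducing the Keras layers of the same names) illustrates their effect; the feature vector and rates are arbitrary examples:

```python
import random

def gaussian_noise(x, stddev, rng):
    """Add zero-mean Gaussian noise to each feature (training time only)."""
    return [v + rng.gauss(0.0, stddev) for v in x]

def dropout(x, rate, rng):
    """Zero each feature with probability `rate`, scaling survivors by
    1/(1 - rate) so the expected activation is unchanged (inverted dropout)."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]

rng = random.Random(0)
features = [0.5] * 8                       # arbitrary example activations
noisy = gaussian_noise(features, stddev=0.1, rng=rng)
regularized = dropout(noisy, rate=0.25, rng=rng)
print(regularized)
```

Both transforms are disabled at inference time, which is why they regularize training without affecting the deployed model's deterministic predictions.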

4.3. Computational Cost of the Hybrid RBSO–MRFO Algorithm and Time and Memory Requirements of the Optimized Transformer-LSTM Model

The computational efficiency of deep learning models is a critical factor in practical applications, particularly for real-time or resource-constrained environments. In this study, the Transformer–LSTM model was optimized using the RBSO–MRFO algorithm, and the computational cost of the optimization process was evaluated. The performance of the optimized model was assessed in terms of training time, inference time, and memory consumption. These metrics provide insights into the model’s suitability for deployment by quantifying the computational cost and resource requirements across different datasets and classification tasks. The following subsections present a detailed analysis of these performance aspects.

4.3.1. Computational Cost of Hybrid RBSO–MRFO Algorithm

The Transformer-LSTM model was optimized using a hybrid RBSO–MRFO hyperparameter optimization (HPO) algorithm, which explores both continuous hyperparameters, such as dropout rate and layer units, and discrete hyperparameters, including the number of attention heads, as summarized in Table 24. Due to the computational complexity of optimizing multiple parameters in a deep learning framework, HPO was performed offline using a population size of 14 and 12 iterations, with 6 training epochs per candidate, balancing thoroughness of search with computational feasibility. The population size determines the number of candidate configurations evaluated per iteration, while the number of iterations controls the refinement cycles, together defining a total computational demand of 168 candidate evaluations (14 × 12). Caching previously evaluated hyperparameter configurations prevented redundant training, significantly reducing runtime. A moderate batch size of 128 balanced GPU memory usage and training speed, while stabilization layers, such as Dropout and GaussianNoise, improved convergence efficiency and training stability. The stopping criterion, based on a fixed number of iterations, ensured a controlled execution time. This HPO approach provides several key advantages: it enables efficient exploration of a mixed continuous-discrete hyperparameter space, ensures robust convergence, reduces computational cost and redundancy, and allows offline experimentation without compromising final model performance. Overall, it constitutes an effective and computationally efficient hyperparameter optimization framework for training the Transformer-LSTM model with strong predictive performance.
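The loop structure of this procedure can be sketched schematically. In the toy sketch below, a cheap surrogate objective stands in for the 6-epoch candidate training, and the RBSO-style and MRFO-style updates are simplified placeholders rather than the published update equations; only the skeleton (population of 14, 12 iterations, caching of repeated configurations) follows the text:

```python
import random

rng = random.Random(42)
POP, ITERS = 14, 12          # 14 x 12 = 168 candidate evaluations in total

def surrogate_loss(heads, dropout):
    # Hypothetical stand-in for validation loss after 6 training epochs.
    return (heads - 4) ** 2 * 0.01 + (dropout - 0.12) ** 2

cache = {}                   # (config -> loss): avoids redundant "training"
def evaluate(heads, dropout):
    key = (heads, round(dropout, 3))
    if key not in cache:
        cache[key] = surrogate_loss(heads, dropout)
    return cache[key]

# Mixed search space: discrete attention heads, continuous dropout rate.
pop = [(rng.randint(1, 8), rng.uniform(0.0, 0.5)) for _ in range(POP)]
best, best_loss = None, float("inf")

for it in range(ITERS):
    for cand in pop:
        loss = evaluate(*cand)
        if loss < best_loss:
            best, best_loss = cand, loss
    nxt = []
    for heads, drop in pop:
        if it < ITERS // 2:  # early phase: RBSO-style global exploration
            heads = rng.randint(1, 8)
            drop = min(0.5, max(0.0, drop + rng.gauss(0.0, 0.1)))
        else:                # late phase: MRFO-style local refinement
            heads = min(8, max(1, heads + rng.choice((-1, 0, 1))))
            drop += 0.5 * rng.random() * (best[1] - drop)
        nxt.append((heads, drop))
    pop = nxt

print("best config:", best, "unique evaluations:", len(cache))
```

The cache illustrates why reported runtime stays below the nominal 168 trainings: any configuration revisited by the swarm is looked up rather than retrained.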

4.3.2. Training Time of Optimized Transformer-LSTM Model

The training time performance of the Transformer-LSTM model, after applying the RBSO–MRFO algorithm, was assessed across multiple datasets and classification types, as detailed in Table 25. For the CWRU dataset, binary classification required 0.082 s per batch and 0.64 milliseconds per sample, whereas multi-class classification demonstrated a slight increase to 0.084 s per batch and 0.65 milliseconds per sample. In the TMFD dataset, binary classification exhibited a reduced training time of 0.076 s per batch and 0.59 milliseconds per sample, while multi-class classification required 0.079 s per batch and 0.62 milliseconds per sample. Notably, the MaFaulDa dataset achieved the most efficient training, with binary classification taking 0.045 s per batch and 0.35 milliseconds per sample, and multi-class classification requiring 0.048 s per batch and 0.37 milliseconds per sample. These results indicate that the RBSO–MRFO-enhanced Transformer-LSTM model consistently maintains low computational overhead, ensuring efficient training across diverse datasets and classification tasks, with only marginal increases in processing time for multi-class scenarios.
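The per-sample figures are consistent with the per-batch figures under a batch size of 128 (the value used during hyperparameter optimization in Section 4.3.1; its use as the training batch size here is an assumption). A quick sanity check:

```python
# Convert per-batch training time to per-sample time, assuming the
# batch size of 128 reported for the optimization setup.
BATCH_SIZE = 128

def ms_per_sample(seconds_per_batch, batch_size=BATCH_SIZE):
    return seconds_per_batch / batch_size * 1000.0

# CWRU binary: 0.082 s/batch -> ~0.64 ms/sample, as reported in Table 25.
print(f"{ms_per_sample(0.082):.2f} ms")   # → 0.64 ms
# MaFaulDa binary: 0.045 s/batch -> ~0.35 ms/sample.
print(f"{ms_per_sample(0.045):.2f} ms")   # → 0.35 ms
```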

4.3.3. Inference Time of Optimized Transformer-LSTM Model

The inference time of the Transformer-LSTM model, after applying the RBSO–MRFO algorithm, was evaluated across multiple datasets and classification types, as summarized in Table 26. For the CWRU dataset, binary classification required 0.071 s per batch and 0.55 milliseconds per sample, while multi-class classification exhibited a marginal increase to 0.074 s per batch and 0.57 milliseconds per sample. In the TMFD dataset, binary classification achieved an inference time of 0.069 s per batch and 0.54 milliseconds per sample, whereas multi-class classification required 0.075 s per batch and 0.59 milliseconds per sample. The MaFaulDa dataset demonstrated the lowest inference times, with binary classification taking 0.036 s per batch and 0.28 milliseconds per sample, and multi-class classification requiring 0.039 s per batch and 0.30 milliseconds per sample. These results indicate that the RBSO–MRFO-optimized Transformer-LSTM model consistently delivers low latency across diverse datasets and classification tasks, confirming its suitability for real-time machine fault diagnosis applications.

4.3.4. Memory Consumption of Optimized Transformer-LSTM Model

The memory consumption of the Transformer-LSTM model, following the implementation of the RBSO–MRFO algorithm, was analyzed across different datasets and classification tasks, as summarized in Table 27. For the CWRU dataset, binary classification required 1.29 MB of memory, while multi-class classification demanded 2.04 MB. In the TMFD dataset, memory utilization was 1.87 MB for binary classification and 1.08 MB for multi-class classification. For the MaFaulDa dataset, binary classification consumed 1.13 MB, whereas multi-class classification required 1.63 MB. These results indicate that the memory requirements of the optimized Transformer-LSTM model are generally low and dataset-dependent. The observed variations highlight the algorithm’s capability to efficiently manage resource allocation across different classification types, ensuring minimal memory overhead while maintaining high performance, thereby demonstrating the suitability of the model for machine fault diagnosis applications in environments with limited computational resources.
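Footprints of this magnitude correspond to a few hundred thousand float32 parameters; a rough conversion illustrates the scale (the parameter count below is a hypothetical example, not a figure from the paper):

```python
# Rough model-size estimate: float32 weights occupy 4 bytes each.
def model_size_mb(num_params, bytes_per_param=4):
    return num_params * bytes_per_param / (1024 ** 2)

# Hypothetical example: ~338k parameters -> ~1.29 MB, the same order as
# the CWRU binary model in Table 27.
print(f"{model_size_mb(338_000):.2f} MB")   # → 1.29 MB
```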

4.4. Comparison Experiments with Related Work

The comparison experiments reported in Table 28 and Table 29 were conducted to comprehensively evaluate the effectiveness of the proposed RBSO–MRFO optimized Transformer-LSTM model against state-of-the-art methods for both binary and multi-class fault diagnosis across multiple benchmark datasets, including CWRU, TMFD, and MaFaulDa.
For binary classification, the results in Table 28 indicate that traditional machine learning and deep learning approaches exhibit competitive diagnostic performance. On the CWRU dataset, classical methods such as KNN, MLP-BP, MLP-BP combined with SVM, and CWT-based ANN achieved accuracies ranging from 94.7% to 99.6% [73]. Although these methods demonstrate strong fault recognition capability, the proposed RBSO–MRFO optimized Transformer-LSTM model achieved the highest accuracy of 99.72%, reflecting its superior feature extraction and methodological generalization performance.
On the TMFD dataset, deep learning models, including DNN, CNN, LSTM, and GRU, obtained accuracies between 98.51% and 99.29%, whereas the proposed method further improved the classification accuracy to 99.97%, indicating enhanced robustness under varying operating conditions [47].
Furthermore, on the MaFaulDa dataset, conventional and optimized machine learning methods, including unoptimized SVM, optimized SVM, oversampled optimized SVM, unoptimized KNN, optimized KNN, oversampled optimized KNN, time-domain-based DNN, and FFT-based DNN, reported accuracies ranging from 85.9% to 99.7% [74]. In contrast, the proposed method achieved an accuracy of 99.98%, outperforming all comparative approaches and demonstrating its strong adaptability to complex and imbalanced fault datasets.
For multi-class classification, the results presented in Table 29 further confirm the superiority of the proposed method. On the CWRU dataset, existing hybrid and optimized deep learning models, including CNN-LSTM, HPSO-CNN-LSTM, TSFFCNN-PSO-SVM, 1-D CNN-PSO-SVM, CNN-LSTM with GRU, CNN-BiLSTM with Grid Search, and optimized 1-D CNN-LSTM, achieved accuracies between 94.20% and 99.35% [75,76]. The proposed RBSO–MRFO optimized Transformer-LSTM model attained the highest accuracy of 99.72%, demonstrating improved robustness and enhanced discrimination among multiple fault categories.
On the TMFD dataset, comparative deep learning models such as DNN, CNN, LSTM, and GRU achieved accuracies ranging from 97.09% to 99.86%, while the proposed method again yielded the best performance with an accuracy of 99.97% [47].
In addition, on the MaFaulDa dataset, conventional deep learning architectures, including DNN, CNN, LSTM, GRU, and Transformer-DNN, achieved accuracies between 90.51% and 98.39% [47]. The proposed method surpassed these approaches by achieving an accuracy of 98.60%, further validating its effectiveness in handling complex multi-class fault diagnosis scenarios.
These results consistently demonstrate that the proposed optimization-driven Transformer-LSTM architecture delivers superior diagnostic accuracy across different datasets and classification tasks, highlighting its strong methodological generalization capability and practical applicability for intelligent fault diagnosis.

4.5. Ablation Experiments

Ablation experiments were conducted to evaluate the contribution of each component within the proposed RBSO–MRFO optimized Transformer-LSTM model for both binary and multi-class bearing fault classification. These experiments systematically analyzed the performance impact of using the Transformer-LSTM hybrid architecture alone, the addition of either RBSO or MRFO algorithm, and the full integration of RBSO–MRFO with Transformer-LSTM.
For binary classification, the results in Table 30 show that the Transformer-LSTM model achieved accuracies of 98.35%, 99.52%, and 98.18% on the CWRU, TMFD, and MaFaulDa datasets, respectively. Introducing the RBSO algorithm module further improved accuracy to 99.45%, 99.87%, and 99.43%, indicating the positive effect of guided hyperparameter search. Similarly, the MRFO-optimized Transformer-LSTM model yielded accuracies of 99.17%, 99.84%, and 99.37%, demonstrating that MRFO effectively enhances discriminative capability. Ultimately, the full RBSO–MRFO optimized Transformer-LSTM model achieved the highest accuracies of 99.72%, 99.97%, and 99.98% on CWRU, TMFD, and MaFaulDa, respectively, confirming that the combination of both optimization strategies with the hybrid architecture maximizes performance.
For multi-class classification, the ablation results in Table 31 exhibit a similar outcome. The Transformer-LSTM model achieved accuracies of 97.21%, 98.57%, and 92.82% on CWRU, TMFD, and MaFaulDa, respectively. Incorporating the RBSO algorithm improved accuracy on CWRU and MaFaulDa to 98.35% and 97.71%, respectively, while the TMFD dataset showed a lower accuracy of 80.56%, suggesting dataset-specific sensitivity when using RBSO alone. The MRFO-optimized Transformer-LSTM model achieved 98.07%, 88.83%, and 97.55% on the three datasets, indicating moderate improvements. The full RBSO–MRFO optimized Transformer-LSTM model consistently outperformed all other variants, achieving accuracies of 99.72%, 99.97%, and 98.60%, respectively. These findings confirm that the integration of both RBSO and MRFO algorithms with the hybrid Transformer-LSTM architecture provides superior feature extraction capability, robustness, and highly reliable fault diagnosis performance in multi-class scenarios.
The superior performance of the proposed RBSO–MRFO optimized Transformer-LSTM model can be attributed to the synergistic integration of the hybrid architecture with dual metaheuristic optimization. The Transformer-LSTM combination captures both global features through self-attention and sequential dependencies through LSTM, enabling precise representation of complex bearing patterns. The RBSO and MRFO modules systematically explore and exploit the hyperparameter space, optimizing attention heads, feed-forward network units, LSTM units, and dropout rates. This dual optimization refines the model’s discriminative capacity and accelerates convergence, which is critical for distinguishing subtle fault patterns in both binary and multi-class tasks. Collectively, the ablation results demonstrate that each component contributes meaningfully, and their integration under the hybrid RBSO–MRFO framework is the key factor driving the model’s consistently superior accuracy across all datasets.

5. Challenges and Limitations

While the proposed RBSO–MRFO framework effectively enhances hyperparameter optimization for deep learning models in rotating machinery fault diagnosis, it presents several challenges. Integrating two metaheuristic algorithms increases computational complexity and can lead to longer training times, particularly for large-scale datasets and complex architectures such as Transformer-LSTM. Achieving a proper balance between RBSO’s global search and MRFO’s local search is critical, as misalignment may result in premature convergence or suboptimal hyperparameter selection. The framework is sensitive to initial parameter settings and the definition of search spaces, which can affect reproducibility and robustness across diverse fault scenarios. Additionally, the increased model complexity may reduce interpretability and demand higher memory and computational resources, posing practical constraints for real-time industrial deployment. Although the results are insightful, several limitations must be addressed to improve the framework’s reliability and its applicability in practical settings:
  • Integrating additional deep learning parameters and hybrid optimization strategies may require rigorous tuning to fully optimize the performance of deep learning models.
  • Enhancing generalizability of the methodology may require further validation to ensure robust and scalable deployment across diverse machines and operating conditions.

6. Conclusions

This study introduced a novel hybrid hyperparameter optimization framework combining RBSO and MRFO to enhance deep learning-based fault diagnosis in rotating machinery. By integrating RBSO’s global search capability with MRFO’s local search, the proposed approach effectively balances exploration and exploitation, mitigating premature convergence while improving convergence toward optimal hyperparameter configurations. The framework was applied to optimize multiple deep learning architectures, including MLP, LSTM, GRU–TCN, CNN–BiLSTM, and Transformer-LSTM models, with a particular focus on capturing complex temporal fault patterns through the Transformer-LSTM architecture. Empirical validation on three benchmark datasets, including CWRU, TMFD, and MaFaulDa, demonstrated substantial improvements in both binary and multi-class classification accuracy. Following hybrid RBSO–MRFO optimization, the Transformer-LSTM model exhibited superior performance, achieving 99.72% for both CWRU tasks, 99.97% for TMFD, and 99.98% and 98.60% for MaFaulDa, substantially reducing misclassification rates. These results confirm that the hybrid optimization strategy not only enhances predictive performance but also strengthens methodological generalization across diverse fault conditions. The proposed RBSO–MRFO framework offers a scalable, robust, and high-accuracy solution for intelligent fault diagnosis, providing a practical approach for advancing predictive maintenance and reliability in industrial rotating machinery systems.
Future research will focus on integrating additional deep learning parameters with advanced hybrid optimization strategies to further enhance the performance of deep learning models, including Transformer–LSTM, and on demonstrating the generalizability of the methodology across diverse machines and operating conditions, ensuring robust, scalable, and practical deployment in industrial settings. Strategies to reduce computational complexity, such as lightweight network design, model compression, and streamlined RBSO–MRFO hyperparameter optimization, will also be explored to improve efficiency while maintaining predictive performance. It is emphasized that the RBSO–MRFO algorithm will be run offline to produce the final trained model, which will then be deployed on edge or embedded devices for low-latency, resource-efficient, real-time fault diagnosis. In addition, full-scale online optimization with larger populations, more iterations, and extended training epochs will be investigated to further refine hyperparameters, enhance model performance, and evaluate computational efficiency in real-time applications.

Author Contributions

Conceptualization, A.R.A.; Methodology, A.R.A.; Software, H.K.; Validation, H.K.; Formal analysis, H.K.; Investigation, A.R.A.; Resources, A.R.A.; Data curation, H.K.; Writing—original draft, H.K.; Writing—review & editing, A.R.A.; Visualization, H.K.; Supervision, A.R.A.; Project administration, A.R.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw data supporting the conclusions of this article for the TMFD dataset will be made available by the authors upon reasonable request. The CWRU Dataset can be downloaded publicly from its official website. The MaFaulDa dataset can be downloaded publicly from its official website.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Alqunun, K.; Bechiri, M.B.; Naoui, M.; Khechekhouche, A.; Marouani, I.; Guesmi, T.; Alshammari, B.M.; AlGhadhban, A.; Allal, A. An efficient bearing fault detection strategy based on a hybrid machine learning technique. Sci. Rep. 2025, 15, 18739. [Google Scholar] [CrossRef]
  2. Ni, Y.; Li, S.; Guo, P. Discrete wavelet integrated convolutional residual network for bearing fault diagnosis under noise and variable operating conditions. Sci. Rep. 2025, 15, 16185. [Google Scholar] [CrossRef]
  3. Farag, M.M. Towards a standard benchmarking framework for domain adaptation in intelligent fault diagnosis. IEEE Access 2025, 13, 24426–24453. [Google Scholar] [CrossRef]
  4. Xu, S. Mechanical fault diagnosis based on combination of sparsely connected neural networks and a modified version of social network search. Egypt. Inform. J. 2025, 29, 100633. [Google Scholar] [CrossRef]
  5. Gao, Z.; Cecati, C.; Ding, S.X. A survey of fault diagnosis and fault-tolerant techniques—Part I: Fault diagnosis with model-based and signal-based approaches. IEEE Trans. Ind. Electron. 2015, 62, 3757–3767. [Google Scholar] [CrossRef]
  6. Li, J.; Huang, R.; He, G.; Liao, Y.; Wang, Z.; Li, W. A two-stage transfer adversarial network for intelligent fault diagnosis of rotating machinery with multiple new faults. IEEE/ASME Trans. Mechatron. 2021, 26, 1591–1601. [Google Scholar] [CrossRef]
  7. Smith, W.A.; Randall, R.B. Rolling element bearing diagnostics using the Case Western Reserve University data: A benchmark study. Mech. Syst. Signal Process. 2015, 64–65, 100–131. [Google Scholar] [CrossRef]
  8. Frank, P.M.; Ding, S.X.; Marcu, T. Model-based fault diagnosis in technical processes. Trans. Inst. Meas. Control 2000, 22, 57–101. [Google Scholar] [CrossRef]
  9. Cococcioni, M.; Lazzerini, B.; Volpi, S.L. Robust diagnosis of rolling element bearings based on classification techniques. IEEE Trans. Ind. Inform. 2013, 9, 2256–2263. [Google Scholar] [CrossRef]
  10. Xue, X.; Zhou, J. A hybrid fault diagnosis approach based on mixed-domain state features for rotating machinery. ISA Trans. 2017, 66, 284–295. [Google Scholar] [CrossRef]
  11. Ettefagh, M.M.; Ghaemi, M.; Yazdanian Asr, M. Bearing fault diagnosis using hybrid genetic algorithm k-means clustering. In Proceedings of the 2014 IEEE International Symposium on Innovations in Intelligent Systems and Applications (INISTA), Alberobello, Italy, 23–25 June 2014; pp. 84–89. [Google Scholar]
  12. Song, W.; Xiang, J. A method using numerical simulation and support vector machine to detect faults in bearings. In Proceedings of the 2017 International Conference on Sensing, Diagnostics, Prognostics, and Control (SDPC), Shanghai, China, 16–18 August 2017; pp. 603–607. [Google Scholar]
  13. Qu, J.; Zhang, Z.; Gong, T. A novel intelligent method for mechanical fault diagnosis based on dual-tree complex wavelet packet transform and multiple classifier fusion. Neurocomputing 2016, 171, 837–853. [Google Scholar] [CrossRef]
  14. Wu, J.; Wu, C.; Cao, S.; Or, S.W.; Deng, C.; Shao, X. Degradation data-driven time-to-failure prognostics approach for rolling element bearings in electrical machines. IEEE Trans. Ind. Electron. 2019, 66, 529–539. [Google Scholar] [CrossRef]
  15. Hu, Q.; Qin, A.; Zhang, Q.; He, J.; Sun, G. Fault diagnosis based on weighted extreme learning machine with wavelet packet decomposition and KPCA. IEEE Sens. J. 2018, 18, 8472–8483. [Google Scholar] [CrossRef]
  16. Wang, Z.; Zhang, Q.; Xiong, J.; Xiao, M.; Sun, G.; He, J. Fault diagnosis of a rolling bearing using wavelet packet denoising and random forests. IEEE Sens. J. 2017, 17, 5581–5588. [Google Scholar] [CrossRef]
  17. Van, M.; Kang, H. Bearing fault diagnosis using non-local means algorithm and empirical mode decomposition-based feature extraction and two-stage feature selection. IET Sci. Meas. Technol. 2015, 9, 671–680. [Google Scholar] [CrossRef]
  18. Fu, Q.; Jing, B.; He, P.; Si, S.; Wang, Y. Fault feature selection and diagnosis of rolling bearings based on EEMD and optimized Elman_AdaBoost algorithm. IEEE Sens. J. 2018, 18, 5024–5034. [Google Scholar] [CrossRef]
  19. Liu, R.; Yang, B.; Zio, E.; Chen, X. Artificial intelligence for fault diagnosis of rotating machinery: A review. Mech. Syst. Signal Process. 2018, 108, 33–37. [Google Scholar] [CrossRef]
  20. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 13–16 December 2015. [Google Scholar]
  21. Collobert, R.; Weston, J. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008. [Google Scholar]
  22. Hinton, G.; Deng, L.; Yu, D.; Dahl, G.; Mohamed, A.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T.; et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag. 2012, 29, 82–87. [Google Scholar] [CrossRef]
  23. Zhang, A.; Wang, H.; Li, S.; Cui, Y.; Liu, Z.; Yang, G.; Hu, J. Transfer learning with deep recurrent neural networks for remaining useful life estimation. Appl. Sci. 2018, 8, 2416. [Google Scholar] [CrossRef]
  24. Li, W.; Zhang, L.; Wu, C.; Cui, Z.; Niu, C. A new lightweight deep neural network for surface scratch detection. Int. J. Adv. Manuf. Technol. 2022, 123, 1999–2015. [Google Scholar] [CrossRef]
  25. Salakhutdinov, R. Deep learning. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014; p. 1973. [Google Scholar]
  26. Zhao, R.; Yan, R.; Chen, Z.; Mao, K.; Wang, P.; Gao, R.X. Deep learning and its applications to machine health monitoring. Mech. Syst. Signal Process. 2019, 115, 213–237. [Google Scholar] [CrossRef]
  27. Zhang, W.; Peng, G.; Li, C.; Chen, Y.; Zhang, Z. A new deep learning model for fault diagnosis with good anti-noise and domain adaptation ability on raw vibration signals. Sensors 2017, 17, 425. [Google Scholar] [CrossRef] [PubMed]
  28. Wang, H.; Xu, J.; Yan, R.; Gao, R.X. A new intelligent bearing fault diagnosis method using SDP representation and SE-CNN. IEEE Trans. Instrum. Meas. 2020, 69, 2377–2389. [Google Scholar] [CrossRef]
  29. Guo, Y.; Zhou, Y.; Zhang, Z. Fault diagnosis of multi-channel data by the CNN with the multilinear principal component analysis. Measurement 2021, 171, 108513. [Google Scholar] [CrossRef]
  30. Wang, X.; Mao, D.; Li, X. Bearing fault diagnosis based on vibro-acoustic data fusion and 1D-CNN network. Measurement 2021, 173, 108518. [Google Scholar] [CrossRef]
  31. Chen, Z.; Liu, Y.; Liu, S. Mechanical state prediction based on LSTM neural network. In Proceedings of the 2017 36th Chinese Control Conference (CCC), Dalian, China, 26–28 July 2017; pp. 3876–3881. [Google Scholar]
  32. Zhao, K.; Jiang, H.; Li, X.; Wang, R. An optimal deep sparse autoencoder with gated recurrent unit for rolling bearing fault diagnosis. Meas. Sci. Technol. 2020, 31, 015005. [Google Scholar] [CrossRef]
  33. Qiao, M.; Yan, S.; Tang, X.; Xu, C. Deep convolutional and LSTM recurrent neural networks for rolling bearing fault diagnosis under strong noises and variable loads. IEEE Access 2020, 8, 66257–66269. [Google Scholar] [CrossRef]
  34. Wu, B.; Xu, C.; Dai, X.; Wan, A.; Zhang, P.; Yan, Z.; Tomizuka, M.; Gonzalez, J.; Keutzer, K.; Vajda, P. Visual transformers: Token-based image representation and processing for computer vision. arXiv 2020, arXiv:2006.03677. [Google Scholar] [CrossRef]
  35. Kolen, J.F.; Kremer, S.C. Gradient flow in recurrent nets: The difficulty of learning long-term dependencies. In A Field Guide to Dynamical Recurrent Networks; IEEE: Piscataway, NJ, USA, 2001; pp. 237–243. [Google Scholar]
  36. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  37. Ding, Y.; Jia, M.; Miao, Q.; Cao, Y. A novel time–frequency transformer based on self-attention mechanism and its application in fault diagnosis of rolling bearings. Mech. Syst. Signal Process. 2022, 168, 108616. [Google Scholar] [CrossRef]
  38. Weng, C.; Lu, B.; Yao, J. A one-dimensional vision transformer with multiscale convolution fusion for bearing fault diagnosis. In Proceedings of the 2021 Global Reliability and Prognostics and Health Management (PHM-Nanjing), Nanjing, China, 15–17 October 2021; pp. 1–6. [Google Scholar]
  39. He, Q.; Li, S.; Bai, Q.; Zhang, A.; Yang, J.; Shen, M. A Siamese vision transformer for bearings fault diagnosis. Micromachines 2022, 13, 1656. [Google Scholar] [CrossRef] [PubMed]
  40. Tang, X.; Xu, Z.; Wang, Z. A novel fault diagnosis method of rolling bearing based on integrated vision transformer model. Sensors 2022, 22, 3878. [Google Scholar] [CrossRef] [PubMed]
  41. Vu, M.T.; Hiraga, M.; Miura, N.; Masuda, A. Failure mode classification for rolling element bearings using time-domain transformer-based encoder. Sensors 2024, 24, 3953. [Google Scholar] [CrossRef] [PubMed]
  42. Luo, X.; Yu, F.; Qian, J.; An, B.; Duan, N. An intelligent fault diagnosis model for rolling bearings based on IGTO-optimized VMD and LSTM networks. Appl. Sci. 2025, 15, 4338. [Google Scholar] [CrossRef]
  43. Sun, A.; He, K.; Dai, M.; Ma, L.; Yang, H.; Dong, F.; Liu, C.; Fu, Z.; Song, M. Bearing fault diagnosis based on golden cosine scheduler-1DCNN-MLP-cross-attention mechanisms (GCOS-1DCNN-MLP-cross-attention). Machines 2025, 13, 819. [Google Scholar] [CrossRef]
  44. Zhong, W.; Pang, B. Intelligent diagnosis method for early weak faults based on wave intercorrelation–convolutional neural networks. Electronics 2025, 14, 2808. [Google Scholar] [CrossRef]
  45. Zhang, Y.; Xia, K.; Chen, X. Dynamic balance domain-adaptive meta-learning for few-shot multi-domain motor bearing fault diagnosis under limited data. Symmetry 2025, 17, 1438. [Google Scholar] [CrossRef]
  46. Ali, A.R.; Kamal, H. Time-to-fault prediction framework for automated manufacturing in humanoid robotics using deep learning. Technologies 2025, 13, 42. [Google Scholar] [CrossRef]
  47. Ali, A.R.; Kamal, H. Robust fault detection in industrial machines using hybrid transformer-DNN with visualization via a humanoid-based telepresence robot. IEEE Access, 2025; in press. [Google Scholar] [CrossRef]
  48. Ali, A.R.; Kamal, H. Hybrid HHO–WHO Optimized Transformer-GRU Model for Advanced Failure Prediction in Industrial Machinery and Engines. Sensors 2026, 26, 534. [Google Scholar] [CrossRef]
  49. Ali, A.R.; Kamal, H. Advanced fault detection in power transmission systems using hybrid DT-MLP model with ROS and ENN techniques. In Proceedings of the 15th International Conference on Electrical Engineering (ICEENG), Cairo, Egypt, 12–15 May 2025; pp. 1–6. [Google Scholar]
  50. Ali, A.R.; Kamal, H. Advanced Machinery Fault Detection Using Hybrid AE-MLP Deep Learning Model. In Proceedings of the 6th Novel Intelligent and Leading Emerging Sciences Conference (NILES), Giza, Egypt, 25–27 October 2025; pp. 143–148. [Google Scholar]
  51. Ali, A.R.; Kamal, H. Hybrid RF-MLP Model for Enhanced Fault Detection in Power Transmission Systems Using Data Resampling Techniques. In Proceedings of the 2025 International Telecommunications Conference (ITC-Egypt), Cairo, Egypt, 28–31 July 2025; pp. 364–369. [Google Scholar]
  52. Ali, A.R.; Kamal, H. Real-time digital twin-driven optimization of industrial machinery. In Proceedings of the 15th International Conference on Electrical Engineering (ICEENG), Cairo, Egypt, 12–15 May 2025; pp. 1–6. [Google Scholar]
  53. Ali, A.R.; Kamal, H. Data resampling techniques for improved fault detection in power transmission systems using artificial neural network multilayer perceptron and support vector machines. In Proceedings of the 6th Novel Intelligent Leading Emerging Science Conference (NILES), Cairo, Egypt, 19–21 October 2024; pp. 188–193. [Google Scholar]
  54. Bacanin, N.; Alhazmi, K.; Zivkovic, M.; Venkatachalam, K.; Bezdan, T.; Nebhen, J. Training multi-layer perceptron with enhanced brain storm optimization metaheuristics. Comput. Mater. Contin. 2022, 70, 4199–4215. [Google Scholar] [CrossRef]
  55. Qiu, G.; Deng, J.; Li, J.; Wang, W. Hybrid clustering-enhanced brain storm optimization algorithm for efficient multi-robot path planning. Biomimetics 2025, 10, 347. [Google Scholar] [CrossRef] [PubMed]
  56. Yang, J.; Zhao, D.; Xiang, X.; Shi, Y. Robotic brain storm optimization: A multi-target collaborative searching paradigm for swarm robotics. In Proceedings of the International Conference on Swarm Intelligence, Qingdao, China, 17–21 July 2021; Springer: Cham, Switzerland, 2021; pp. 155–167. [Google Scholar]
  57. Adamu, S.; Alhussian, H.; Aziz, N.; Abdulkadir, S.J.; Alwadin, A.; Abdullahi, M.; Garba, A. Unleashing the power of manta rays foraging optimizer: A novel approach for hyper-parameter optimization in skin cancer classification. Biomed. Signal Process. Control 2025, 99, 106855. [Google Scholar] [CrossRef]
  58. Kamil, O.A.; Al-Shammari, S.W. Manta ray foraging optimization for hyper-parameter selection in convolutional neural network. IOP Conf. Ser. Mater. Sci. Eng. 2020, 978, 012051. [Google Scholar]
  59. Al-Rasheed, A.; Alzahrani, J.S.; Eltahir, M.M.; Mohamed, A.; Hilal, A.M.; Motwakel, A.; Zamani, A.S.; Eldesouki, M.I. Manta ray foraging optimization with machine learning based biomedical data classification. Comput. Mater. Contin. 2022, 73, 2. [Google Scholar] [CrossRef]
  60. Zhao, W.; Zhang, Z.; Wang, L. Manta ray foraging optimization: An effective bio-inspired optimizer for engineering applications. Eng. Appl. Artif. Intell. 2020, 87, 103300. [Google Scholar] [CrossRef]
  61. Shi, Y. Brain storm optimization algorithm. In Advances in Swarm Intelligence (ICSI 2011); LNCS 6728; Springer: Berlin, Germany, 2011; pp. 303–309. [Google Scholar]
  62. Wei, P.; Fan, C.; Yang, X.; Chen, X.; Gan, J.; Deng, X.; Wei, Y.; Li, Z. HOES: An efficient multi-evolutionary expert system for deep learning model optimization in time series prediction. Sci. Rep. 2025, 16, 527. [Google Scholar] [CrossRef]
  63. Tiwari, S.; Chadha, S.; Chauhan, R. Exploring climate change dynamics using machine learning and deep learning approaches. J. Inf. Syst. Eng. Manag. 2025, 10, 139–154. [Google Scholar] [CrossRef]
  64. Khan, S.; Mazhar, T.; Khan, M.A.; Shahzad, T.; Ahmad, W.; Bibi, A.; Saeed, M.M.; Hamam, H. Comparative analysis of deep neural network architectures for renewable energy forecasting: Enhancing accuracy with meteorological and time-based features. Discov. Sustain. 2024, 5, 533. [Google Scholar] [CrossRef]
  65. Sezgin, F.H.; Algorabi, Ö.; Sart, G.; Güler, M. Hyperparameter-Optimized RNN, LSTM, and GRU Models for Airline Stock Price Prediction: A Comparative Study on THYAO and PGSUS. Symmetry 2025, 17, 1905. [Google Scholar] [CrossRef]
  66. Wang, Y.; Jiang, H.; Tong, B.; Song, S. Rolling bearing fault diagnosis via Meta-BOHB optimized CNN–transformer model and time-frequency domain analysis. Sensors 2025, 25, 6920. [Google Scholar] [CrossRef]
  67. Hashi, A.O.; Hashim, S.Z.M.; Mirjalili, S.; Kebande, V.R.; Al-Dhaqm, A.; Nasser, M.; ASamah, A.B. A hybrid CNN-transformer framework optimized by Grey Wolf Algorithm for accurate sign language recognition. Sci. Rep. 2025, 15, 43550. [Google Scholar] [CrossRef] [PubMed]
  68. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar] [CrossRef]
  69. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  70. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
  71. Raj, K.K.; Kumar, S.; Kumar, R.R.; Andriollo, M. Enhanced fault detection in bearings using machine learning and raw accelerometer data: A case study using the Case Western Reserve University dataset. Information 2024, 15, 259. [Google Scholar] [CrossRef]
  72. Huynh, H.H.; Min, C.-H. Rotating machinery fault detection using support vector machine via feature ranking. Algorithms 2024, 17, 441. [Google Scholar] [CrossRef]
  73. Ahmad, I.; Iqbal, M.M.; Ramzan, S.; Majeed, S.; Butt, N. Diagnosis the bearings faults through exercising deep learning algorithms. J. Comput. Biomed. Inform. 2024, 6, 228–236. [Google Scholar]
  74. Ullah, I.; Khan, N.; Memon, S.A.; Kim, W.-G.; Saleem, J.; Manzoor, S. Vibration-Based Anomaly Detection for Induction Motors Using Machine Learning. Sensors 2025, 25, 773. [Google Scholar] [CrossRef]
  75. Tian, H.; Fan, H.; Feng, M.; Cao, R.; Li, D. Fault diagnosis of rolling bearing based on HPSO algorithm optimized CNN–LSTM neural network. Sensors 2023, 23, 6508. [Google Scholar] [CrossRef]
  76. Kalay, O.C. An optimized 1-D CNN–LSTM approach for fault diagnosis of rolling bearings considering epistemic uncertainty. Machines 2025, 13, 612. [Google Scholar] [CrossRef]
Figure 1. Hybrid RBSO–MRFO algorithm workflow for automated hyperparameter optimization of Transformer-LSTM model.
Figure 2. Model architecture of proposed hybrid Transformer-LSTM for binary classification.
Figure 3. Model architecture of proposed hybrid Transformer-LSTM for multi-class classification.
Figure 4. Visual comparison of signal distributions across operating conditions for CWRU dataset.
Figure 5. Visual comparison of signal distributions across operating conditions for TMFD dataset.
Figure 6. Visual comparison of signal distributions across operating conditions for MaFaulDa dataset.
Figure 7. Confusion matrix for binary classification of the Transformer-LSTM model on the CWRU dataset: (a) pre-optimization, (b) post-optimization.
Figure 8. Test accuracy of deep learning models pre-optimization and post-optimization for binary classification on the CWRU dataset.
Figure 9. Pre-optimization confusion matrix of the Transformer-LSTM model for multi-class classification on the CWRU dataset.
Figure 10. Post-optimization confusion matrix of the Transformer-LSTM model for multi-class classification on the CWRU dataset.
Figure 11. Test accuracy of deep learning models pre-optimization and post-optimization for multi-class classification on the CWRU dataset.
Figure 12. Confusion matrix for binary classification of the Transformer-LSTM model on the TMFD dataset: (a) pre-optimization, (b) post-optimization.
Figure 13. Test accuracy of deep learning models pre-optimization and post-optimization for binary classification on the TMFD dataset.
Figure 14. Pre-optimization confusion matrix of the Transformer-LSTM model for multi-class classification on the TMFD dataset.
Figure 15. Post-optimization confusion matrix of the Transformer-LSTM model for multi-class classification on the TMFD dataset.
Figure 16. Test accuracy of deep learning models pre-optimization and post-optimization for multi-class classification on the TMFD dataset.
Figure 17. Confusion matrix for binary classification of the Transformer-LSTM model on the MaFaulDa dataset: (a) pre-optimization, (b) post-optimization.
Figure 18. Test accuracy of deep learning models pre-optimization and post-optimization for binary classification on the MaFaulDa dataset.
Figure 19. Pre-optimization confusion matrix of the Transformer-LSTM model for multi-class classification on the MaFaulDa dataset.
Figure 20. Post-optimization confusion matrix of the Transformer-LSTM model for multi-class classification on the MaFaulDa dataset.
Figure 21. Test accuracy of deep learning models pre-optimization and post-optimization for multi-class classification on the MaFaulDa dataset.
Table 1. Hyperparameter search and optimization for evaluated models using hybrid RBSO–MRFO algorithm for binary and multi-class classification.
Model | Hyperparameter | Range | Type
MLP [62,63,64] | Hidden Units | (2, 16) | Discrete
 | Dropout Rate | (0.0, 0.5) | Continuous
 | Learning Rate | (1 × 10−5, 1 × 10−2) | Continuous
LSTM [62,65,66] | LSTM Units | (8, 128) | Discrete
 | Dropout Rate | (0.0, 0.5) | Continuous
 | Learning Rate | (1 × 10−5, 1 × 10−2) | Continuous
GRU–TCN [62,65,66] | GRU Units | (8, 256) | Discrete
 | TCN Filters | (16, 64) | Discrete
 | Dropout Rate | (0.0, 0.5) | Continuous
 | Learning Rate | (1 × 10−5, 1 × 10−2) | Continuous
CNN–BiLSTM [64,65,66] | CNN Filters | (16, 128) | Discrete
 | LSTM Units | (8, 128) | Discrete
 | Dropout Rate | (0.0, 0.5) | Continuous
 | Learning Rate | (1 × 10−5, 1 × 10−2) | Continuous
Transformer-LSTM [62,66,67] | Number of Heads | (1, 8) | Discrete
 | Key Dimension | (8, 128) | Discrete
 | FFN Units | (8, 512) | Discrete
 | LSTM Units | (8, 256) | Discrete
 | Dropout Rate | (0.0, 0.5) | Continuous
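Assuming uniform sampling within each range (the paper does not state the sampling scheme), the Transformer-LSTM search space from Table 1 can be encoded as a minimal sketch; the dictionary layout and function names are illustrative, not from the paper.

```python
import random

# Transformer-LSTM search space from Table 1: discrete entries are integer
# ranges, continuous entries are real intervals.
SPACE = {
    "num_heads":  ("discrete",   1, 8),
    "key_dim":    ("discrete",   8, 128),
    "ffn_units":  ("discrete",   8, 512),
    "lstm_units": ("discrete",   8, 256),
    "dropout":    ("continuous", 0.0, 0.5),
}

def sample(space, rng=random):
    """Draw one candidate configuration, respecting each parameter's type."""
    cfg = {}
    for name, (kind, lo, hi) in space.items():
        cfg[name] = rng.randint(lo, hi) if kind == "discrete" else rng.uniform(lo, hi)
    return cfg

cfg = sample(SPACE)
```

Each optimizer candidate is then simply one such dictionary, and the fitness function trains and evaluates a model built from it.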
Table 2. Transformer-LSTM Model Architecture for Binary Classification.
Layer | Configuration
Model type | Hybrid Transformer-LSTM network
Input layer | Input shape (number of features)
Reshape layer | Reshape to (number of features, 1)
Conv1D layer | 64 filters, kernel size 1, ReLU activation, same padding
Gaussian noise | Standard deviation 0.01
Transformer block | Multi-head attention: 1 head, key dimension 4, dropout 0.9, Add & LayerNorm; feed-forward: Dense 16, dropout, Dense d_model, dropout, Add & LayerNorm
LSTM layer | 8 units, return sequences False, dropout 0.9
Dense layer | 128 units, ReLU activation, dropout 0.2
Output layer | 1 unit, Sigmoid activation
Output | Binary classification
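A minimal NumPy sketch of the forward pass of the Transformer block configured as in Table 2 (1 head, key dimension 4, feed-forward width 16, Add & LayerNorm) illustrates the block's dataflow. Weights are random and dropout/noise layers are omitted for clarity; this is an illustration, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, key_dim, ffn_units, seq_len = 64, 4, 16, 10  # d_model = 64 Conv1D filters

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector (Add & LayerNorm step)
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True)
    return (x - mu) / (sd + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Single-head attention and feed-forward projections (random, for illustration)
Wq, Wk = rng.normal(size=(d_model, key_dim)), rng.normal(size=(d_model, key_dim))
Wv, Wo = rng.normal(size=(d_model, key_dim)), rng.normal(size=(key_dim, d_model))
W1, W2 = rng.normal(size=(d_model, ffn_units)), rng.normal(size=(ffn_units, d_model))

def transformer_block(x):
    # Scaled dot-product self-attention with one head of key dimension 4
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(key_dim)) @ v @ Wo
    x = layer_norm(x + attn)                    # Add & LayerNorm
    ffn = np.maximum(0.0, x @ W1) @ W2          # Dense 16 (ReLU) -> Dense d_model
    return layer_norm(x + ffn)                  # Add & LayerNorm

out = transformer_block(rng.normal(size=(seq_len, d_model)))
```

The block preserves the (sequence length, d_model) shape, which is what lets the subsequent LSTM layer consume its output directly.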
Table 3. Transformer-LSTM Model Hyperparameters for Binary Classification.
Hyperparameter | Value
Optimizer | Adam
Loss function | Binary cross-entropy
Metrics | Accuracy
Batch size | 128
Learning rate | 0.001
Learning rate schedule | ReduceLROnPlateau, patience 3, factor 0.5, min_lr 1 × 10−6, monitored on validation loss
Callbacks | Confusion matrix visualization
Transformer heads | 1
Key dimension | 4
Feed-forward units | 16
LSTM units | 8
Dropout rate | 0.9 (Transformer + LSTM), 0.2 (Dense layer)
Dense layer units | 128
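The learning-rate schedule in Table 3 is the standard ReduceLROnPlateau policy; a minimal re-implementation with the same settings (patience 3, factor 0.5, floor 1 × 10−6, monitored on validation loss) makes its behavior concrete. The class name is illustrative.

```python
class PlateauScheduler:
    """Minimal sketch of the ReduceLROnPlateau policy from Table 3: halve the
    learning rate after 3 epochs without validation-loss improvement, never
    dropping below the 1e-6 floor."""
    def __init__(self, lr=1e-3, factor=0.5, patience=3, min_lr=1e-6):
        self.lr, self.factor = lr, factor
        self.patience, self.min_lr = patience, min_lr
        self.best, self.wait = float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.wait = val_loss, 0       # improvement resets the counter
        else:
            self.wait += 1
            if self.wait >= self.patience:           # plateau detected
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr

sched = PlateauScheduler()
# Two improving epochs keep lr at 1e-3; three stagnant epochs then halve it.
for loss in [0.9, 0.8, 0.8, 0.8, 0.8]:
    lr = sched.step(loss)
```

Combined with the high dropout in Table 3, this schedule is what damps oscillation late in training rather than the optimizer itself.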
Table 4. Transformer-LSTM model architecture for multi-class classification.
Layer | Configuration
Model type | Hybrid Transformer-LSTM network
Input layer | Input shape (number of features)
Reshape layer | Reshape to (number of features, 1)
Conv1D layer | 64 filters, kernel size 1, ReLU activation, same padding
Gaussian noise | Standard deviation 0.01
Multi-head attention | 1 head, key dimension 16, dropout 0.5, Add & LayerNorm
Feed-forward network | Dense 64, ReLU, dropout 0.5, Dense d_model, dropout 0.5, Add & LayerNorm
LSTM layer | 32 units, return sequences False, dropout 0.5
Dense layer | 128 units, ReLU activation, dropout 0.2
Output layer | Number of classes units, Softmax activation
Output | Multi-class classification
Table 5. Transformer-LSTM model hyperparameters for multi-class classification.
Hyperparameter | Value
Optimizer | Adam
Loss function | Categorical cross-entropy
Metrics | Accuracy
Batch size | 128
Learning rate | 0.001
Learning rate schedule | ReduceLROnPlateau, patience 3, factor 0.5, minimum 1 × 10−6, monitored on validation loss
Callbacks | Confusion matrix visualization
Transformer heads | 1
Key dimension | 16
Feed-forward units | 64
LSTM units | 32
Dropout rate | 0.5 (Transformer + LSTM), 0.2 (Dense layer)
Dense layer units | 128
Table 6. Sample distribution per operating condition on CWRU dataset.
Fault Type | Number of Samples
Normal_1 | 230
IR_014_1 | 230
IR_007_1 | 230
IR_021_1 | 230
OR_007_6_1 | 230
OR_014_6_1 | 230
OR_021_6_1 | 230
Ball_007_1 | 230
Ball_014_1 | 230
Ball_021_1 | 230
Table 7. Statistical characterization of operating conditions on CWRU dataset.
Fault Type | Samples | Mean | Std | Min | Q1 | Median | Q3 | Max
Ball_014_1 | 230 | 2.24 | 4.46 | −2.49 | 0.013 | 0.168 | 3.066 | 40.87
Ball_021_1 | 230 | 2.36 | 6.05 | −2.82 | 0.008 | 0.175 | 0.715 | 73.78
Normal_1 | 230 | 0.94 | 1.94 | −0.43 | −0.119 | 0.064 | 0.222 | 11.70
IR_014_1 | 230 | 1.37 | 2.26 | −1.18 | 0.032 | 0.207 | 1.50 | 8.12
Ball_007_1 | 230 | 1.24 | 2.46 | −0.63 | 0.017 | 0.137 | 0.508 | 11.31
IR_021_1 | 230 | 6.93 | 18.43 | −3.11 | 0.014 | 0.605 | 2.561 | 162.79
OR_014_6_1 | 230 | 1.70 | 3.76 | −1.35 | 0.009 | 0.132 | 0.519 | 21.47
IR_007_1 | 230 | 2.60 | 4.37 | −1.57 | 0.021 | 0.286 | 4.501 | 16.90
OR_007_6_1 | 230 | 11.41 | 29.94 | −5.25 | 0.058 | 1.11 | 4.679 | 313.74
OR_021_6_1 | 230 | 7.04 | 14.12 | −6.29 | 0.014 | 0.699 | 6.923 | 104.54
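The columns in Table 7 are standard descriptive statistics of each condition's signal. A short sketch shows how they can be computed for one vibration segment with NumPy; the function name is illustrative.

```python
import numpy as np

def summarize(signal):
    """Compute the per-condition summary columns used in Table 7:
    mean, standard deviation, minimum, quartiles, and maximum."""
    q1, med, q3 = np.percentile(signal, [25, 50, 75])
    return {"mean": signal.mean(), "std": signal.std(),
            "min": signal.min(), "Q1": q1, "median": med,
            "Q3": q3, "max": signal.max()}

# Tiny synthetic segment standing in for one fault-condition recording
stats = summarize(np.array([0.0, 1.0, 2.0, 3.0, 4.0]))
```

The gap between median and maximum in rows such as OR_007_6_1 is what these columns expose: fault impacts produce heavy-tailed amplitude distributions that mean and standard deviation alone would understate.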
Table 8. Sample distribution per operating condition on TMFD dataset.
Operating Condition | Number of Samples
Normal | 18,026
Steady-State Overload | 320
Transient Overload | 220
Table 9. Statistical characterization of operating conditions on TMFD dataset.
Operating Condition | Samples | Mean | Std | Min | Q1 | Median | Q3 | Max
Normal | 18,026 | 73.52 | 125.45 | 0.0 | 1.0 | 3.7 | 100.0 | 580.0
Steady-State Overload | 320 | 70.04 | 138.70 | 0.0 | 0.0 | 1.2 | 20.9 | 3500.0
Transient Overload | 220 | 70.38 | 138.53 | 0.0 | 0.0 | 2.0 | 21.7 | 5500.0
Table 10. Sample distribution per operating condition on MaFaulDa dataset.
Fault Type | Number of Samples
Normal | 12,250,000
6 g | 12,250,000
10 g | 12,000,000
15 g | 12,000,000
20 g | 12,250,000
25 g | 11,750,000
30 g | 11,750,000
Table 11. Statistical characterization of operating conditions on MaFaulDa dataset.
Fault Type | Samples | Mean | Std | Min | Q1 | Median | Q3 | Max
6 g | 12,250,000 | 0.009379 | 0.753013 | −5.0265 | −0.25145 | −0.009956 | 0.12983 | 5.3841
Normal | 12,250,000 | 0.007108 | 0.746563 | −4.4835 | −0.29003 | −0.014158 | 0.16688 | 5.1078
10 g | 12,000,000 | 0.014068 | 0.801364 | −4.8189 | −0.28728 | −0.010551 | 0.14267 | 6.4163
15 g | 12,000,000 | 0.009017 | 0.873038 | −7.7810 | −0.34193 | −0.011001 | 0.14749 | 7.3737
20 g | 12,250,000 | 0.006443 | 0.924449 | −154.930 | −0.38188 | −0.012361 | 0.15468 | 35.2620
25 g | 11,750,000 | 0.009178 | 0.961216 | −204.120 | −0.40610 | −0.011347 | 0.16391 | 126.8600
30 g | 11,750,000 | −0.001049 | 1.143868 | −193.770 | −0.46202 | −0.018592 | 0.19607 | 130.4000
Table 12. Optimized hyperparameter values for deep learning models in binary classification of CWRU dataset.
Model | Hyperparameters | Best Value
MLP | Hidden Units | 15
 | Dropout Rate | 0.0619
 | Learning Rate | 0.0098
LSTM | LSTM Units | 128
 | Dropout Rate | 0.1039
 | Learning Rate | 0.0100
GRU-TCN | GRU Units | 93
 | TCN Filters | 16
 | Dropout Rate | 0.2318
 | Learning Rate | 0.0085
CNN-BiLSTM | CNN Filters | 67
 | LSTM Units | 102
 | Dropout Rate | 0.0998
 | Learning Rate | 0.0051
Transformer-LSTM | Number of Heads | 3
 | Key Dimension | 32
 | FFN Units | 431
 | LSTM Units | 161
 | Dropout Rate | 0.2067
Table 13. Optimized hyperparameter values for deep learning models in multi-class classification of CWRU dataset.
ModelHyperparametersBest Value
MLPHidden Units14
Dropout Rate0.0170
Learning Rate0.0100
LSTMLSTM Units127
Dropout Rate0.4771
Learning Rate0.0092
GRU-TCNGRU Units156
TCN Filters54
Dropout Rate0.2579
Learning Rate0.0065
CNN-BiLSTMCNN Filters30
LSTM Units120
Dropout Rate0.3741
Learning Rate0.0078
Transformer-LSTMNumber of Heads2
Key Dimension95
FFN Units339
LSTM Units256
Dropout Rate0.0000
Table 14. Optimized hyperparameter values for deep learning models in binary classification of TMFD dataset.

| Model | Hyperparameters | Best Value |
|---|---|---|
| MLP | Hidden Units | 8 |
| | Dropout Rate | 0.0302 |
| | Learning Rate | 0.0077 |
| LSTM | LSTM Units | 91 |
| | Dropout Rate | 0.4810 |
| | Learning Rate | 0.0090 |
| GRU-TCN | GRU Units | 122 |
| | TCN Filters | 53 |
| | Dropout Rate | 0.2888 |
| | Learning Rate | 0.0099 |
| CNN-BiLSTM | CNN Filters | 57 |
| | LSTM Units | 113 |
| | Dropout Rate | 0.0000 |
| | Learning Rate | 0.0081 |
| Transformer-LSTM | Number of Heads | 4 |
| | Key Dimension | 74 |
| | FFN Units | 143 |
| | LSTM Units | 250 |
| | Dropout Rate | 0.5000 |
Table 15. Optimized hyperparameter values for deep learning models in multi-class classification of TMFD dataset.

| Model | Hyperparameters | Best Value |
|---|---|---|
| MLP | Hidden Units | 13 |
| | Dropout Rate | 0.0638 |
| | Learning Rate | 0.0097 |
| LSTM | LSTM Units | 66 |
| | Dropout Rate | 0.1040 |
| | Learning Rate | 0.0100 |
| GRU-TCN | GRU Units | 184 |
| | TCN Filters | 50 |
| | Dropout Rate | 0.1490 |
| | Learning Rate | 0.0088 |
| CNN-BiLSTM | CNN Filters | 98 |
| | LSTM Units | 111 |
| | Dropout Rate | 0.5000 |
| | Learning Rate | 0.0082 |
| Transformer-LSTM | Number of Heads | 2 |
| | Key Dimension | 94 |
| | FFN Units | 251 |
| | LSTM Units | 145 |
| | Dropout Rate | 0.0553 |
Table 16. Optimized hyperparameter values for deep learning models in binary classification of MaFaulDa dataset.

| Model | Hyperparameters | Best Value |
|---|---|---|
| MLP | Hidden Units | 16 |
| | Dropout Rate | 0.0414 |
| | Learning Rate | 0.0061 |
| LSTM | LSTM Units | 89 |
| | Dropout Rate | 0.4077 |
| | Learning Rate | 0.0100 |
| GRU-TCN | GRU Units | 50 |
| | TCN Filters | 60 |
| | Dropout Rate | 0.1740 |
| | Learning Rate | 0.0086 |
| CNN-BiLSTM | CNN Filters | 81 |
| | LSTM Units | 84 |
| | Dropout Rate | 0.2304 |
| | Learning Rate | 0.0053 |
| Transformer-LSTM | Number of Heads | 4 |
| | Key Dimension | 80 |
| | FFN Units | 8 |
| | LSTM Units | 144 |
| | Dropout Rate | 0.1267 |
Table 17. Optimized hyperparameter values for deep learning models in multi-class classification of MaFaulDa dataset.

| Model | Hyperparameters | Best Value |
|---|---|---|
| MLP | Hidden Units | 14 |
| | Dropout Rate | 0.0034 |
| | Learning Rate | 0.0100 |
| LSTM | LSTM Units | 75 |
| | Dropout Rate | 0.1150 |
| | Learning Rate | 0.0092 |
| GRU-TCN | GRU Units | 190 |
| | TCN Filters | 53 |
| | Dropout Rate | 0.0251 |
| | Learning Rate | 0.0089 |
| CNN-BiLSTM | CNN Filters | 67 |
| | LSTM Units | 86 |
| | Dropout Rate | 0.1978 |
| | Learning Rate | 0.0100 |
| Transformer-LSTM | Number of Heads | 1 |
| | Key Dimension | 11 |
| | FFN Units | 29 |
| | LSTM Units | 243 |
| | Dropout Rate | 0.4965 |
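Tables 12–17 report the per-model optima found in a mixed search space (Table 24 lists the encoding): unit counts and head numbers are discrete, while dropout and learning rate are continuous. A hedged sketch of how a raw candidate vector might be decoded into a Transformer-LSTM configuration — the bounds, field names, and the unit-interval encoding are illustrative assumptions, not the authors' exact scheme:

```python
def decode_transformer_lstm(x):
    """Map a raw candidate vector in [0, 1]^5 to hyperparameters.

    Discrete fields are rounded to integers; continuous fields are
    scaled into their bounds. Bounds are illustrative assumptions.
    """
    bounds = {
        "num_heads":  (1, 4, True),
        "key_dim":    (8, 128, True),
        "ffn_units":  (8, 512, True),
        "lstm_units": (32, 256, True),
        "dropout":    (0.0, 0.5, False),
    }
    cfg = {}
    for xi, (name, (lo, hi, is_int)) in zip(x, bounds.items()):
        v = lo + xi * (hi - lo)          # scale into [lo, hi]
        cfg[name] = round(v) if is_int else round(v, 4)
    return cfg

cfg = decode_transformer_lstm([0.5, 0.2, 0.1, 0.5, 0.4])
```

Decoding in a common routine like this lets RBSO and MRFO operate on plain real vectors while each model's trainer receives valid, typed hyperparameters.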
Table 18. Performance comparison of deep learning models for binary classification before and after optimization on the CWRU dataset.

| Prediction Model | Accuracy | Precision | Recall | F-Score |
|---|---|---|---|---|
| MLP | 96.96% | 97.56% | 96.96% | 97.10% |
| LSTM | 97.24% | 97.24% | 97.24% | 97.12% |
| GRU-TCN | 95.59% | 96.75% | 95.59% | 95.87% |
| CNN-BiLSTM | 98.07% | 98.11% | 98.07% | 98.00% |
| Transformer-LSTM | 98.35% | 98.54% | 98.35% | 98.39% |
| Optimized MLP | 99.17% | 99.22% | 99.17% | 99.18% |
| Optimized LSTM | 98.89% | 98.98% | 98.89% | 98.91% |
| Optimized GRU-TCN | 99.17% | 99.23% | 99.17% | 99.19% |
| Optimized CNN-BiLSTM | 99.45% | 99.45% | 99.45% | 99.44% |
| Optimized Transformer-LSTM | 99.72% | 99.73% | 99.72% | 99.72% |
Table 19. Performance comparison of deep learning models for multi-class classification before and after optimization on the CWRU dataset.

| Prediction Model | Accuracy | Precision | Recall | F-Score |
|---|---|---|---|---|
| MLP | 96.14% | 96.51% | 96.14% | 96.05% |
| LSTM | 95.31% | 95.34% | 95.31% | 95.29% |
| GRU-TCN | 98.07% | 98.10% | 98.07% | 98.07% |
| CNN-BiLSTM | 96.14% | 96.65% | 96.14% | 96.06% |
| Transformer-LSTM | 97.21% | 97.29% | 97.21% | 97.16% |
| Optimized MLP | 99.44% | 99.46% | 99.44% | 99.45% |
| Optimized LSTM | 98.62% | 98.70% | 98.62% | 98.62% |
| Optimized GRU-TCN | 99.17% | 99.20% | 99.17% | 99.17% |
| Optimized CNN-BiLSTM | 99.45% | 99.47% | 99.45% | 99.45% |
| Optimized Transformer-LSTM | 99.72% | 99.73% | 99.72% | 99.72% |
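The precision, recall, and F-score columns in Tables 18–23 track accuracy closely, consistent with class-averaged metrics. A minimal sketch of the per-class building blocks for the binary case (the full tables likely use weighted averages across classes; the label vectors below are illustrative):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for the positive (fault) class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1

acc, prec, rec, f1 = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])
```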
Table 20. Performance comparison of deep learning models for binary classification before and after optimization on the TMFD dataset.

| Prediction Model | Accuracy | Precision | Recall | F-Score |
|---|---|---|---|---|
| MLP | 93.88% | 97.30% | 93.88% | 95.18% |
| LSTM | 97.38% | 98.59% | 97.38% | 97.76% |
| GRU-TCN | 97.39% | 98.60% | 97.39% | 97.76% |
| CNN-BiLSTM | 98.55% | 99.02% | 98.55% | 98.68% |
| Transformer-LSTM | 99.52% | 99.53% | 99.52% | 99.52% |
| Optimized MLP | 99.00% | 99.24% | 99.00% | 99.07% |
| Optimized LSTM | 99.94% | 99.94% | 99.94% | 99.94% |
| Optimized GRU-TCN | 99.35% | 99.47% | 99.35% | 99.38% |
| Optimized CNN-BiLSTM | 99.41% | 99.50% | 99.41% | 99.43% |
| Optimized Transformer-LSTM | 99.97% | 99.97% | 99.97% | 99.97% |
Table 21. Performance comparison of deep learning models for multi-class classification before and after optimization on the TMFD dataset.

| Prediction Model | Accuracy | Precision | Recall | F-Score |
|---|---|---|---|---|
| MLP | 93.21% | 97.87% | 93.21% | 95.10% |
| LSTM | 98.51% | 98.56% | 98.51% | 98.53% |
| GRU-TCN | 69.17% | 95.81% | 69.17% | 79.39% |
| CNN-BiLSTM | 98.33% | 98.94% | 98.33% | 98.51% |
| Transformer-LSTM | 98.57% | 98.62% | 98.57% | 98.58% |
| Optimized MLP | 99.11% | 99.12% | 99.11% | 99.07% |
| Optimized LSTM | 99.91% | 99.91% | 99.91% | 99.91% |
| Optimized GRU-TCN | 89.74% | 94.69% | 89.74% | 92.00% |
| Optimized CNN-BiLSTM | 99.92% | 99.92% | 99.92% | 99.92% |
| Optimized Transformer-LSTM | 99.97% | 99.97% | 99.97% | 99.97% |
Table 22. Performance comparison of deep learning models for binary classification before and after optimization on the MaFaulDa dataset.

| Prediction Model | Accuracy | Precision | Recall | F-Score |
|---|---|---|---|---|
| MLP | 90.97% | 90.39% | 90.97% | 90.03% |
| LSTM | 97.60% | 97.91% | 97.60% | 97.67% |
| GRU-TCN | 97.52% | 97.78% | 97.52% | 97.58% |
| CNN-BiLSTM | 95.98% | 96.48% | 95.98% | 96.11% |
| Transformer-LSTM | 98.18% | 98.37% | 98.18% | 98.23% |
| Optimized MLP | 99.66% | 99.66% | 99.66% | 99.66% |
| Optimized LSTM | 99.30% | 99.32% | 99.30% | 99.31% |
| Optimized GRU-TCN | 99.31% | 99.33% | 99.31% | 99.32% |
| Optimized CNN-BiLSTM | 99.50% | 99.50% | 99.50% | 99.50% |
| Optimized Transformer-LSTM | 99.98% | 99.98% | 99.98% | 99.98% |
Table 23. Performance comparison of deep learning models for multi-class classification before and after optimization on the MaFaulDa dataset.

| Prediction Model | Accuracy | Precision | Recall | F-Score |
|---|---|---|---|---|
| MLP | 88.05% | 87.93% | 88.05% | 87.84% |
| LSTM | 75.57% | 74.93% | 75.57% | 74.97% |
| GRU-TCN | 91.17% | 91.07% | 91.17% | 91.10% |
| CNN-BiLSTM | 90.46% | 90.33% | 90.46% | 90.37% |
| Transformer-LSTM | 92.82% | 93.03% | 92.82% | 92.82% |
| Optimized MLP | 92.23% | 92.16% | 92.23% | 92.18% |
| Optimized LSTM | 97.44% | 97.43% | 97.44% | 97.43% |
| Optimized GRU-TCN | 95.98% | 95.96% | 95.98% | 95.96% |
| Optimized CNN-BiLSTM | 96.23% | 96.22% | 96.23% | 96.22% |
| Optimized Transformer-LSTM | 98.60% | 98.60% | 98.60% | 98.60% |
Table 24. Detailed computational cost metrics of RBSO–MRFO algorithm.

| Parameter | Value |
|---|---|
| Population Size | 14 |
| Number of Iterations/Generations | 12 |
| Total Fitness Evaluations | 168 (population size × iterations) |
| Stopping Criteria | Fixed number of iterations |
| Hyperparameter Encoding | Mixed continuous and discrete |
| Training Epochs per Evaluation | 6 |
| Batch Size | 128 |
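The evaluation budget in Table 24 follows directly from the population and iteration settings; a quick check of the arithmetic:

```python
population_size = 14     # candidate solutions per generation
iterations = 12          # generations (fixed stopping criterion)
epochs_per_eval = 6      # short proxy training per fitness evaluation

total_evaluations = population_size * iterations   # models trained by the optimizer
total_epochs = total_evaluations * epochs_per_eval # total short-training epochs
```

With 168 evaluations at only 6 epochs each, the search cost stays bounded at 1008 training epochs per model–dataset pair.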
Table 25. Training time for optimized Transformer-LSTM model.

| Dataset | Classification Type | Training Time per Batch (s) | Training Time per Sample (ms) |
|---|---|---|---|
| CWRU | Binary classification | 0.082 | 0.64 |
| | Multi-class classification | 0.084 | 0.65 |
| TMFD | Binary classification | 0.076 | 0.59 |
| | Multi-class classification | 0.079 | 0.62 |
| MaFaulDa | Binary classification | 0.045 | 0.35 |
| | Multi-class classification | 0.048 | 0.37 |
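The per-sample figures in Tables 25 and 26 are consistent with dividing the per-batch time by the batch size of 128 (Table 24); for example, the CWRU binary training row:

```python
batch_time_s = 0.082   # CWRU binary, training time per batch (seconds)
batch_size = 128       # from Table 24

per_sample_ms = batch_time_s / batch_size * 1000  # convert to milliseconds
```

This yields approximately 0.64 ms per sample, matching the table entry.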
Table 26. Inference time for optimized Transformer-LSTM model.

| Dataset | Classification Type | Inference Time per Batch (s) | Inference Time per Sample (ms) |
|---|---|---|---|
| CWRU | Binary classification | 0.071 | 0.55 |
| | Multi-class classification | 0.074 | 0.57 |
| TMFD | Binary classification | 0.069 | 0.54 |
| | Multi-class classification | 0.075 | 0.59 |
| MaFaulDa | Binary classification | 0.036 | 0.28 |
| | Multi-class classification | 0.039 | 0.30 |
Table 27. Memory consumption for optimized Transformer-LSTM model.

| Dataset | Classification Type | Memory Consumption (MB) |
|---|---|---|
| CWRU | Binary classification | 1.29 |
| | Multi-class classification | 2.04 |
| TMFD | Binary classification | 1.87 |
| | Multi-class classification | 1.08 |
| MaFaulDa | Binary classification | 1.13 |
| | Multi-class classification | 1.63 |
Table 28. Results of comparison experiments of prior methods and proposed method of binary classification across CWRU, TMFD and MaFaulDa datasets.

| Dataset | Model | Accuracy |
|---|---|---|
| CWRU | KNN [73] | 94.7% |
| | MLP-BP [73] | 99.5% |
| | MLP-BP + SVM [73] | 98.8% |
| | CWT + ANN [73] | 99.6% |
| | RBSO–MRFO + Transformer-LSTM (Proposed method) | 99.72% |
| TMFD | DNN [47] | 99.29% |
| | CNN [47] | 98.51% |
| | LSTM [47] | 99.11% |
| | GRU [47] | 99.27% |
| | RBSO–MRFO + Transformer-LSTM (Proposed method) | 99.97% |
| MaFaulDa | Unoptimized SVM [74] | 85.9% |
| | Optimized SVM [74] | 90.4% |
| | Oversampled optimized SVM [74] | 95.4% |
| | Unoptimized KNN [74] | 87.4% |
| | Optimized KNN [74] | 89.8% |
| | Oversampled optimized KNN [74] | 92.8% |
| | Time-domain based DNN [74] | 95% |
| | FFT based DNN [74] | 99.7% |
| | RBSO–MRFO + Transformer-LSTM (Proposed method) | 99.98% |
Table 29. Results of comparison experiments of prior methods and proposed method of multi-class classification across CWRU, TMFD and MaFaulDa datasets.

| Dataset | Model | Accuracy |
|---|---|---|
| CWRU | CNN-LSTM [75] | 94.20% |
| | HPSO-CNN-LSTM [75] | 99.20% |
| | TSFFCNN-PSO-SVM [76] | 98.50% |
| | 1-D CNN-PSO-SVM [76] | 98.20% |
| | CNN-LSTM with Gated Recurrent Unit [76] | 99.29% |
| | CNN-BiLSTM with Grid Search [76] | 99.28% |
| | Optimized 1-D CNN-LSTM [76] | 99.35% |
| | RBSO–MRFO + Transformer-LSTM (Proposed method) | 99.72% |
| TMFD | DNN [47] | 99.67% |
| | CNN [47] | 99.86% |
| | LSTM [47] | 97.09% |
| | GRU [47] | 97.09% |
| | RBSO–MRFO + Transformer-LSTM (Proposed method) | 99.97% |
| MaFaulDa | DNN [47] | 97.04% |
| | CNN [47] | 90.51% |
| | LSTM [47] | 95.71% |
| | GRU [47] | 96.64% |
| | Transformer-DNN [47] | 98.39% |
| | RBSO–MRFO + Transformer-LSTM (Proposed method) | 98.60% |
Table 30. Results of ablation experiments of binary classification across CWRU, TMFD and MaFaulDa datasets.

| Experimental Method | CWRU Accuracy | TMFD Accuracy | MaFaulDa Accuracy |
|---|---|---|---|
| Transformer-LSTM | 98.35% | 99.52% | 98.18% |
| RBSO + Transformer-LSTM | 99.45% | 99.87% | 99.43% |
| MRFO + Transformer-LSTM | 99.17% | 99.84% | 99.37% |
| RBSO–MRFO + Transformer-LSTM (Proposed method) | 99.72% | 99.97% | 99.98% |
Table 31. Results of ablation experiments of multi-class classification across CWRU, TMFD and MaFaulDa datasets.

| Experimental Method | CWRU Accuracy | TMFD Accuracy | MaFaulDa Accuracy |
|---|---|---|---|
| Transformer-LSTM | 97.21% | 98.57% | 92.82% |
| RBSO + Transformer-LSTM | 98.35% | 80.56% | 97.71% |
| MRFO + Transformer-LSTM | 98.07% | 88.83% | 97.55% |
| RBSO–MRFO + Transformer-LSTM (Proposed method) | 99.72% | 99.97% | 98.60% |
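The ablations in Tables 30 and 31 probe the division of labor the framework relies on: a diversity-preserving global phase followed by a convergence-tightening local phase. A toy sketch of that two-phase structure on a synthetic objective — this is an illustrative simplification, not the authors' RBSO or MRFO update rules:

```python
import random

def hybrid_optimize(fitness, dim, pop=10, global_iters=20, local_iters=20, seed=0):
    """Toy two-phase hybrid: diverse random sampling stands in for RBSO's
    global exploration; a shrinking-neighborhood search stands in for
    MRFO's local exploitation. Minimizes `fitness` over [0, 1]^dim."""
    rng = random.Random(seed)
    best, best_f = None, float("inf")
    # Phase 1: global exploration keeps the population diverse
    for _ in range(global_iters):
        for _ in range(pop):
            x = [rng.random() for _ in range(dim)]
            f = fitness(x)
            if f < best_f:
                best, best_f = x, f
    # Phase 2: local refinement around the incumbent solution
    step = 0.1
    for _ in range(local_iters):
        x = [min(1.0, max(0.0, xi + rng.uniform(-step, step))) for xi in best]
        f = fitness(x)
        if f < best_f:
            best, best_f = x, f
        step *= 0.9   # shrink the neighborhood to tighten convergence
    return best, best_f

# Toy objective: squared distance to a known optimum at 0.3 per dimension
sol, val = hybrid_optimize(lambda x: sum((xi - 0.3) ** 2 for xi in x), dim=3)
```

The ablation pattern — either phase alone improving on the baseline, and the combination improving on both — is what such a global-plus-local split is designed to produce.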

Ali, A.R.; Kamal, H. Enhanced Rotating Machinery Fault Diagnosis Using Hybrid RBSO–MRFO Adaptive Transformer-LSTM for Binary and Multi-Class Classification. Machines 2026, 14, 208. https://doi.org/10.3390/machines14020208

