Article

Comparative Analysis of CNN and LSTM for Bearing Fault Mode Classification and Causality Through Representation Analysis

1 Department of Mechanical Engineering, Gachon University, 1342 Seongnamdaero, Sujeong-gu, Seongnam-si 461-701, Gyeonggi-do, Republic of Korea
2 LIG Nex1, 207, 333 Pangyo-ro, Bundang-gu, Seongnam-si 13488, Gyeonggi-do, Republic of Korea
* Author to whom correspondence should be addressed.
Lubricants 2026, 14(1), 12; https://doi.org/10.3390/lubricants14010012
Submission received: 23 November 2025 / Revised: 24 December 2025 / Accepted: 26 December 2025 / Published: 28 December 2025
(This article belongs to the Special Issue Advances in Wear Life Prediction of Bearings)

Abstract

This study investigates how the clarity of frequency-domain characteristics in vibration signals affects the performance of deep learning models for bearing fault classification. Two datasets were used: the CWRU benchmark dataset, which exhibits distinct and easily separable spectral signatures across fault modes, and a custom low-speed bearing dataset in which small defects do not significantly alter the frequency spectrum. To enable a clear and interpretable comparison, simplified CNN and LSTM architectures with a single core layer were deliberately employed. This design choice allows performance differences to be attributed directly to the inherent learning mechanisms of each architecture rather than to model complexity. Representation analysis shows that LSTM-F achieves the highest accuracy when the dataset contains clearly distinguishable spectral patterns, as in the CWRU case. In contrast, CNN-S outperforms both LSTM models on the experimental dataset, where fault-induced frequency characteristics are weak or ambiguous. Additional representation analyses further reveal that LSTM-F relies on consistent frequency-indexed patterns, whereas CNN-S captures more complex time–frequency interactions, making it more robust under low-separability conditions. These findings demonstrate that the optimal deep learning architecture for bearing fault classification depends on the degree of frequency separability in the data. LSTM-F is preferable for severe faults with distinct spectral features, while CNN-S is more effective for minor defects or systems exhibiting complex, weakly discriminative frequency behavior.

1. Introduction

Rotating machinery is ubiquitous in mechanical systems, ranging from light-duty to heavy-duty and from high-speed to low-speed applications. Bearings are among the most critical components for ensuring smooth rotation, serving key roles in systems such as motor shafts, milling machine spindles, conveyor rollers, wind turbine blades, crane booms, radar antennas, excavator bodies, and bridge piers [1,2]. The reliability of bearings becomes particularly critical in scenarios involving low-speed rotations and substantial mechanical stresses. Bearings predominantly experience malfunctions caused by faults in the inner race (IF), outer race (OF), or rolling elements (RF) [3,4,5]. These failures can severely degrade manufacturing quality or even cause catastrophic accidents, making accurate fault mode classification essential for mechanical systems.
Traditional fault classification methods extract statistical features from vibration signals [6,7], such as acceleration, in either the time or frequency domain [8,9,10,11]. Mathematically and physically defined equations have been used to distinguish faulty bearings from normal ones [12]. However, conventional approaches often fail to detect subtle defects, making it difficult to differentiate between normal and abnormal conditions. Frequency-based techniques such as the Fourier transform are particularly limited when fault signatures are weak or ambiguous [13,14]. The problem becomes even more challenging under low-speed and slightly damaged operating conditions, where signals contain less discriminative information [15]. To address these challenges, various studies have focused on extracting weak fault features from raw vibration data. Advanced signal processing methods, such as empirical mode decomposition (EMD), can separate hidden fault-related components and identify characteristic frequencies from noisy data using the Fourier transform [16,17,18,19]. Other approaches have proposed novel fault-sensitive features [20,21], which have demonstrated higher detection sensitivity than EMD-based methods. However, these studies were typically performed under significant fault conditions, not under slight or early-stage faults. Furthermore, approaches based on [20,21] generally classify bearings only as normal or abnormal, without specifying the failure mode. Since no clear numerical threshold exists for determining fault modes, diagnosis often depends on subjective human judgment, which may lead to false positives or negatives and, consequently, severe system failures.
To overcome these limitations, data-driven methods based on deep learning have emerged, providing automatic threshold determination and fault mode identification [22,23]. Deep learning models such as convolutional neural networks (CNNs) [22] and long short-term memory networks (LSTMs) [23] have demonstrated excellent performance in bearing fault classification. Their superiority arises from their ability to uncover complex correlations between input and output data beyond human-defined engineering rules, autonomously learning discriminative features [24]. An et al. [25] proposed a bearing fault diagnosis method incorporating periodic sparse attention within an LSTM framework, achieving a 2% improvement in accuracy compared to CNN-based models. Gu et al. [26] introduced a robust fault diagnosis approach that combines discrete wavelet transforms with multi-sensor fusion through Bi-LSTM, yielding up to a 20% accuracy increase over CNN models. Li et al. [27] integrated highway gates with an attention mechanism for improved representation learning, while Li et al. [28] demonstrated that a 1D-CNN could outperform LSTM by acting as a fault-frequency band filter. Their model achieved an F1 score exceeding 98% by using a loss function designed to extract the center frequencies of fault modes selectively. Zhang and Deng [29] further showed that a CNN-based model, when combined with short-time Fourier transform (STFT) features, achieved 99.96% accuracy, outperforming a bidirectional LSTM model (96.15%). Yang et al. [30] proposed converting 1D vibration signals (n² × 1) into 2D images (n × n) to be processed by CNNs, achieving a 7% accuracy improvement over LSTM when using a random forest classifier.
As summarized in [25,26,27,28,29,30], deep learning methods have substantially improved the accuracy of bearing fault diagnosis. However, results remain inconsistent regarding whether CNN-based or LSTM-based models are more effective. More importantly, most comparative studies utilize complex, deep models with multiple layers and sophisticated components. While these models achieve high accuracy, their “black-box” nature makes it difficult to discern whether performance differences stem from the core architectural principles of CNNs and LSTMs or simply from the model’s depth and complexity. This presents a fundamental challenge for both researchers and engineers when selecting an appropriate model under limited computational resources and for understanding the underlying failure mechanisms. Therefore, it is necessary to investigate when and why each architecture performs better at a fundamental level and to understand the root causes of performance differences. Since deep learning inherently functions as a form of representation learning [31], analyzing these differences through transparent model designs can provide valuable insights for improving cost efficiency and productivity in manufacturing. Despite the extensive use of CNN- and LSTM-based models in bearing fault diagnosis, it remains unclear whether their performance differences originate from fundamental architectural mechanisms or from increased model complexity. Most existing comparative studies rely on deep, multi-layered models, which obscure causal interpretation. This study explicitly addresses this gap by employing deliberately simplified CNN and LSTM architectures.
In this paper, the existing gap is addressed by introducing a simplified experimental framework for comparing CNN and LSTM architectures. Single-layer CNN and LSTM models are deliberately employed to ensure that any performance differences are directly attributable to their fundamental approaches to processing time–frequency data—with CNNs capturing local spatial patterns in spectrograms and LSTMs modeling sequential dependencies—rather than being confounded by model depth or parameter count. This design choice enables clearer visualization and interpretation of the learned representations, providing deeper insights into the causal relationships between data characteristics and model performance.
The study first compares the performance of these simplified CNN- and LSTM-based models for bearing fault detection using two representative deep learning architectures. To ensure generality, datasets from both high-speed, light-duty rotating machinery and low-speed, heavy-duty rotating machinery are employed. Because no benchmark dataset exists for the latter, a low-RPM bearing test rig was constructed to collect the necessary data.
Subsequently, the learned representations of both models are analyzed and visualized to identify the root causes of their performance differences, specifically examining how the extracted features contribute to classification. The objectives of this study are as follows:
  • To compare the performance of STFT-based CNN models and handcrafted-feature-based LSTM models in fault classification of rotating machinery using a simplified, interpretable framework.
  • To analyze the reasons behind performance differences from the perspective of representation learning and fundamental network architecture.
  • To correlate learned representations with physically interpretable features for deeper insight into fault characteristics and provide practical model selection guidelines.
To accomplish these objectives, Section 2 reviews the theoretical background of the two comparative methodologies. Section 3 describes the two datasets used: the Case Western Reserve University (CWRU) bearing dataset and the low-RPM bearing dataset developed in this study. Section 4 details the data size, deep learning architectures, hyperparameters, and evaluation metrics. Finally, Section 5 and Section 6 present and discuss the experimental results and conclusions.

2. Theoretical Backgrounds

2.1. Convolutional Neural Network

A convolutional neural network (CNN) is an artificial neural architecture designed to extract spatial patterns from input data [32]. Recent studies have demonstrated its extensive use in bearing fault detection, showing remarkable performance across various rotating machinery systems when combined with different 2D feature extraction techniques [33,34]. CNNs employ locally connected representations by convolving learnable weight matrices with the input data [35]. The striding operation enhances efficiency by sharing the same weights across different local regions, enabling the network to learn similar local patterns even when they appear at different spatial locations [36,37]. This characteristic—well established in computer vision applications—allows CNNs to robustly extract object patterns regardless of positional variation [38]. In this study, the effectiveness of these properties is further examined within the context of signal processing, as discussed in Section 5.

2.2. Short-Time Fourier Transform

Vibration refers to the repetitive motion of an object relative to a stationary reference frame [39]. The discrete Fourier transform (DFT) converts a discrete-time signal, which varies with respect to time, into a frequency-domain representation [40]. The short-time Fourier transform (STFT) extends this concept by capturing frequency information that evolves over time. It does so by segmenting the discrete-time signal into multiple overlapping windows and performing the Fourier transform within each segment [41]. As a result, the STFT produces a two-dimensional representation that simultaneously conveys time and frequency information. Among various two-dimensional transformation methods, the STFT has been shown to be particularly effective when combined with CNN architectures for rotating machinery fault diagnosis [42].
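As a concrete illustration of the transform described above, the following is a minimal sketch of how an STFT spectrogram can be computed from a one-dimensional vibration record with SciPy. The sampling rate and 50 s duration mirror the experimental dataset described later, while the window length and overlap are illustrative assumptions rather than the values used in this study.

```python
import numpy as np
from scipy.signal import stft

fs = 2000                          # sampling rate in Hz (2 kHz, as in the experimental rig)
t = np.arange(0, 50, 1 / fs)       # 50 s record length, matching the acquired segments
signal = np.random.randn(t.size)   # placeholder for a measured acceleration signal

# Segment the signal into overlapping windows and Fourier-transform each segment.
# nperseg (window length) and noverlap are illustrative values only.
f, time_bins, Zxx = stft(signal, fs=fs, nperseg=1024, noverlap=512)

# |Zxx| is the (frequency x time) magnitude map used as a single-channel image input.
spectrogram = np.abs(Zxx)
print(spectrogram.shape)           # (number of frequency bins, number of time bins)
```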

2.3. Long Short-Term Memories

The long short-term memory (LSTM) network employs cell states to facilitate effective learning of dependencies over long temporal sequences [43]. The core concept of its architecture lies in the constant error carousel (CEC) [44], which operates through three gate mechanisms: the forget gate, input gate, and output gate. The CEC mitigates gradient explosion and the vanishing-gradient problem during backpropagation through time [45], thereby enabling the model to capture long-term sequential dependencies more effectively. However, recursive architectures such as LSTM are inherently susceptible to error accumulation over time [46]. Prior studies have also reported that recursively designed deep learning models may suffer from similar issues [47,48]. Although the tasks in [48,49] differ from the one addressed in this study, these findings highlight a potentially significant concern. Therefore, Section 5 provides further discussion of the error accumulation problem observed in LSTM-based approaches.

2.4. Handcrafted Features

As reported in [7,8], several statistical features effectively represent the health condition of bearings. The use of such handcrafted features not only provides meaningful representations of system behavior but also enhances the computational efficiency of LSTM-based analysis. In this study, features are extracted from segmented vibration signals using a time-windowing process that preserves the system’s temporal characteristics, as detailed in Table 1. Because the principal advantage of deep learning lies in its ability to perform representation learning, it is crucial to select features that are both informative and relevant to fault diagnosis. To achieve this, existing feature extraction methods were reviewed, and the most suitable features for the application were identified [49,50]. The extracted features are summarized in Table 1.
Two distinct LSTM models were developed to evaluate their efficiency, with each model extracting features from either the time domain or the frequency domain. This approach enables a comparative analysis between time-sequential and frequency-sequential feature representations. Details are provided in Section 4.2.
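The windowed feature extraction can be sketched as follows. The statistical quantities shown (RMS, peak, crest factor, kurtosis, skewness, standard deviation) are common vibration features and only stand in for the twelve features actually listed in Table 1; the window length and hop are likewise placeholders.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def extract_features(segment: np.ndarray) -> np.ndarray:
    """Example statistical features for one window (stand-ins for the Table 1 set)."""
    rms = np.sqrt(np.mean(segment ** 2))
    peak = np.max(np.abs(segment))
    return np.array([rms, peak, peak / rms, kurtosis(segment), skew(segment), np.std(segment)])

def window_features(signal: np.ndarray, win_len: int, hop: int) -> np.ndarray:
    """Slide a window over the signal and stack one feature vector per window."""
    starts = range(0, len(signal) - win_len + 1, hop)
    return np.vstack([extract_features(signal[s:s + win_len]) for s in starts])

# Each row of the result is one window's feature vector; the row sequence is the LSTM input.
seq = window_features(np.random.randn(100_000), win_len=2048, hop=512)
print(seq.shape)
```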

3. Datasets

A benchmark dataset and an experimental dataset were used for verification to confirm that performance differences and representation analyses generalize across different system scales, independent of scale-specific effects.

3.1. Benchmark Dataset

The CWRU dataset contains vibration data collected from two bearings operating under three distinct conditions—rotational speed (RPM), fault severity levels, and fault types—as summarized in Table 2. This benchmark dataset is widely used to evaluate the effectiveness of various models in classifying fault modes in small ball bearings, with detailed descriptions available in [51]. While multiple analytical approaches can be derived from this dataset, the present study focuses specifically on identifying fault modes of the drive-end bearing.
A primary objective of this study is to compare the performance of two deep learning approaches. To avoid performance saturation, the dataset size was empirically adjusted by incrementally increasing the number of training samples until any model achieved an F1 score of 99%. Data expansion was terminated once one model reached this threshold while the other models remained below it. The resulting dataset sizes and model performance outcomes are summarized in Table 2 and Table 3.

3.2. Experimental Setup

Because existing benchmark bearing datasets primarily contain data from small, high-speed bearings, they are inadequate for general analyses involving large or slow-rotating bearings. Since the objective of this study is to identify which deep learning models best classify bearing failure modes and to clarify the underlying reasons for their performance, a custom test rig equipped with large, heavy-duty bearings—specifically single-row cylindrical slewing bearings [1]—was developed to address this limitation. Controlled artificial damage was introduced to the bearings to acquire vibration signals from slightly defective specimens, enabling a more detailed and comprehensive evaluation of different models for bearing fault classification.

3.2.1. Test Setup Configuration

We used a cylindrical roller bearing with an outer diameter of 320 mm, inner diameter of 280 mm, gear pitch diameter of 312 mm, bearing pitch diameter of 234 mm, and contact angle of 45°. The bearing featured 78 teeth and 61 cylindrical rollers, each with a diameter of 12 mm.
As shown in Figure 1, the experimental frame was constructed using aluminum plates and profiles. The base and ceiling plates were machined with bolt holes to enable attachment to the profile columns. The profiles were uniformly arranged along the perimeter of the frame, leaving one side open for motor and gearbox installation. After assembling seven vertical profiles, the ceiling plate was mounted, and smaller profiles were fixed across the middle to support the motor and gearbox. The gearbox was stabilized by adding four short profiles between the vertical supports of the ceiling. These profiles formed a gearbox mounting frame, to which a custom-designed aluminum plate was bolted. This structure prevented vibration or shaking during operation, ensuring stable rotation.
A pinion gear was mounted on the motor shaft, engaging with the driven gear connected to the lower bearing. Consequently, rotation of the motor caused the meshed gears to drive the lower bearing, and by design, the upper bearing rotated correspondingly. The pinion gear had an outer diameter of 80 mm, inner diameter of 70 mm, pitch circle diameter of 78 mm, and 18 teeth, resulting in a gear ratio of 1:4.33. The gearbox itself provided a 1:3 gear ratio. The driving source was a Higen FMA-CN06 induction motor, capable of continuously outputting 1.54 N·m or more. This capacity was selected to ensure continuous rotation under a friction coefficient of 0.05 with a moment arm of 80 mm, overcoming a load of 10,000 N.
At the center of the setup, a LINAK LA34 linear actuator applied a vertical load to the bearings through an aluminum contact plate. The actuator could generate artificial loading conditions up to 10,000 N, simulating realistic heavy-duty environments in which large bearings operate. The thickness and cross-sectional dimensions of the aluminum components were determined through ANSYS (2022R2) simulations. The floor plate was 30 mm thick, the ceiling plate 40 mm thick, and the profiles had a 4 cm² cross-sectional area. Simulation results confirmed that under a 10,000 N load applied by the actuator, the maximum stress was 73.8 MPa, which is well below aluminum’s yield strength, ensuring structural safety and rigidity. The actuator’s load transfer system was designed for uniform stress distribution. The load plate had a two-tier cylindrical structure consisting of a bottom layer (diameter: 170 mm, thickness: 30 mm) and a top layer (diameter: 200 mm, thickness: 20 mm), corresponding to half of the inner race length. This configuration allowed the upper cylindrical section to fit precisely into the bearing’s inner race, while the lower section maintained full contact, evenly transmitting the 10,000 N load circumferentially.
To allow bearing rotation under load, both the upper and lower bearings were integrated into a dual-bearing system. The upper bearing was bolted to the ceiling, while the outer races of both bearings were fixed together, enabling simultaneous rotation. Two sets of bolts were used: one to secure the inner race of the upper bearing to the ceiling and another to fasten the outer races of both bearings together. This ensured that the inner race of the upper bearing remained stationary, while its outer race rotated with that of the lower bearing. This dual-bearing configuration provided stable rotation while maintaining a uniform vertical load during operation.

3.2.2. Slightly Defected Bearings

The CWRU bearing dataset is one of the most widely adopted benchmark datasets that comprehensively includes vibration signals for the four bearing conditions essential for this study: normal, outer race fault (OF), inner race fault (IF), and roller fault (RF). Its well-established status and clear fault characteristics make it an ideal reference for fundamental model comparisons. To reproduce the configuration of the CWRU dataset and ensure a fair comparison, an equivalent experimental setup was constructed using four bearings and one sub-bearing, as illustrated in Figure 2. To simulate minor bearing defects, we artificially introduced controlled surface damage to specific bearing components. Four types of defective bearings were fabricated, each featuring precisely machined scratches on the outer race, inner race, or rollers. The defect geometries were as follows: the outer race fault (OF) had a surface scratch 70 mm in length and 0.5 mm in depth; the inner race fault (IF) had a surface scratch 50 mm in length and 0.5 mm in depth; and the roller fault (RF) involved six rollers, each scratched to a length of 3 mm and a depth of 0.2 mm. Compared with the CWRU benchmark dataset, the ratio between the bearing pitch diameter and the defect size in our experimental setup was significantly smaller. This indicates that the defects introduced in our system more closely represent incipient or early-stage bearing faults rather than large, easily detectable failures. This distinction provides an opportunity to evaluate the sensitivity and robustness of the proposed models under more challenging and realistic operating conditions.

3.2.3. Data Acquisition

ICP-type accelerometers (Model 626B02, PCB Piezotronics, New York, USA) were used to measure vertical and horizontal vibration signals, as shown in Figure 1. This sensor was selected for its capability to detect vibrations over a wide frequency range of 0.1 to 6000 Hz: the operating frequencies of slewing bearings typically lie in the low-frequency range, while the characteristic patterns indicative of various fault types appear at higher frequencies, and the sensor covers both. Using these sensors, we measured the frequency band from 0 to 1 kHz at a sampling rate of 2 kHz and constructed the dataset from signal segments with a total duration of 50 s. Because the experiment involves bearings designed for low rotational speeds (low-RPM conditions), the acquired signals contain DC offset components; the vibration data were therefore analyzed using an oscilloscope configured for AC coupling, which eliminated the offset and enabled accurate waveform observation. The corresponding vibration data were simultaneously recorded and stored on a hard drive for further analysis.
According to previous studies [4,5,15,16,17], most systems employing slewing bearings operate at rotational speeds below 10 RPM. Therefore, experiments were conducted under two speed conditions: 5 RPM, representing extremely low-speed operation, and 20 RPM, representing relatively higher-speed conditions. These conditions are summarized in Table 3. The acquired vibration samples exhibited comparable variance in the time domain and similar spectral characteristics in the frequency domain. The primary power frequencies were observed at 17.33 Hz for 20 RPM and 4.33 Hz for 5 RPM, with pronounced harmonic components detected at integer multiples of these fundamental frequencies, consistent with prior observations [52,53,54]. The analytical expression for the power frequency follows the formulation presented in [55], given in Equation (1):
$f_{power} = \dfrac{RPM \times N_{pole}}{60}$  (1)
where RPM represents the rotational speed, $N_{pole}$ denotes the number of motor pole pairs, and $f_{power}$ refers to the harmonic frequencies based on the fundamental frequency. As reported in [56], the failure modes in the CWRU dataset can be visually distinguished with relative ease. In contrast, the dataset constructed in this study presents a substantially greater challenge for visual classification, as illustrated in Figure 3 and Figure 4. While the amplitudes exhibit slight variations across different bearings—as shown in the first rows of Figure 3 and Figure 4—no significant differences are observed in the frequency-domain trends or amplitude variations of the time-domain signals, unlike those in the CWRU dataset.
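A worked check of Equation (1) is given below. The pole-pair count and the assumption that RPM refers to the motor-side speed (bearing speed multiplied by the 1:4.33 gear ratio and the 1:3 gearbox ratio) are illustrative choices that happen to reproduce the reported 17.33 Hz and 4.33 Hz power frequencies; they are not stated explicitly in the text.

```python
def power_frequency(bearing_rpm: float, gear_ratio: float = 4.33,
                    gearbox_ratio: float = 3.0, n_pole: int = 4) -> float:
    """Equation (1): f_power = RPM x N_pole / 60, with RPM taken at the motor side (assumed)."""
    motor_rpm = bearing_rpm * gear_ratio * gearbox_ratio   # assumed speed referral through the gear train
    return motor_rpm * n_pole / 60.0

print(round(power_frequency(20), 2))   # ~17.32 Hz, close to the reported 17.33 Hz
print(round(power_frequency(5), 2))    # ~4.33 Hz, matching the reported value
```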
Interestingly, the normal bearing demonstrates the largest amplitude, which appears to result from its unique mechanical characteristics. In comparison, the other three defective bearings exhibit similar amplitude levels and frequency responses, as evident in the lower rows of Figure 3 and Figure 4.
A total of 6400 vibration signal segments were collected during continuous operation at a single rotational speed after system stabilization. The dataset consists of 1600 segments per failure mode, with 800 signals measured in the axial direction and 800 in the radial direction, ensuring balanced representation across measurement axes. Consistent with the CWRU dataset, the dataset size in this experiment was carefully selected to prevent model performance saturation. Empirical results revealed a significant performance advantage of the CNN model over the LSTM model; therefore, training and validation were conducted according to the configuration summarized in Table 4.
Table 3. Configuration of the experimental low-speed bearing dataset under different rotational speeds.

Dataset   RPM   Failure Mode             Training Size   Validation Size   Test Size
Exp. A    5     Normal, OF, IF and RF    240             3080              3080
Exp. B    20    Normal, OF, IF and RF    240             3080              3080
Table 4. Bearing fault characteristic frequencies and preprocessing parameters.

                            Dataset 1   Dataset 2   Dataset 3   Dataset 4   Exp. A    Exp. B
Rolling element frequency   135 Hz      138 Hz      139 Hz      141 Hz      1.6 Hz    6.5 Hz
Outer pass frequency        103 Hz      105 Hz      106 Hz      107 Hz      2.5 Hz    9.8 Hz
Inner pass frequency        155 Hz      158 Hz      160 Hz      162 Hz      2.6 Hz    10.4 Hz
Selected H                  70          70          70          70          688       190
Selected L                  90          90          90          90          2344      586
Selected O                  20          20          20          20          1656      396

4. Validation

4.1. Preprocessing

This subsection describes the preprocessing and windowing strategy applied to preserve bearing fault characteristics while ensuring a fair comparison across models. Both the STFT and the handcrafted features rely on time windowing, which involves setting the window length (the amount of data extracted within each window) and the hopping interval (the distance between consecutive windows). Time windowing reveals previously unseen features of sequential data by condensing information; however, improper parameter selection may discard essential information about the system. It is crucial to account for the fact that, when a bearing malfunctions, distinct vibration patterns emerge at periodic intervals from the faulty region [57]. Because signals in the axial direction are slightly larger than those in the radial direction, this imbalance may degrade performance; therefore, min–max normalization was applied to the raw time signals. Parameter selection followed two principles that reflect the bearing structure and rotation. First, the window length must be long enough to capture all vibration components associated with the bearing fault frequencies. If the sampling frequency and window length are denoted $f_s$ and $L$, respectively, the resulting frequency resolution is $\Delta f = f_s / L$. In practice, the window length is chosen so that the frequency resolution is finer than the ball spin frequency (BSF), which is typically the lowest of the bearing fault frequencies; thus, the selected window length satisfies $L \ge f_s / f_{BSF}$. Second, the hopping interval must be small enough to extract distinct characteristics between the normal and defected regions. To meet this requirement, the hopping interval was set shorter than the period of the ball pass frequency of the inner race (BPFI), which represents the fastest periodic component among all fault modes. If $H$ is the hopping interval, the time resolution is $\Delta t = H / f_s$; the desirable hopping interval therefore satisfies $H \le f_s / f_{BPFI}$. These parameters are summarized in Table 4. Based on these principles, the deep learning models achieved strong performance regardless of variations in the parameters. After preprocessing, the STFT images are cropped so that each dimension is a power of two, as shown in Figure 5 and Figure 6. This is required because transposed convolution is later used to analyze which physical features the output of each layer considers important in the representation analysis, and the data dimensions before convolution and after transposed convolution must match, which calls for power-of-two sizes. The handcrafted features are cropped in the same manner for comparison. The input sizes of all preprocessed data are listed in Table 5.
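The two selection rules above can be summarized in a short sketch: the window length must satisfy $L \ge f_s / f_{BSF}$ and the hopping interval $H \le f_s / f_{BPFI}$. The numerical example uses the Exp. A fault frequencies from Table 4; the function names and the min–max helper are illustrative.

```python
import numpy as np

def windowing_bounds(fs: float, f_bsf: float, f_bpfi: float) -> tuple[float, float]:
    """Return (minimum window length, maximum hopping interval) in samples."""
    L_min = fs / f_bsf      # frequency resolution fs / L must be finer than the BSF
    H_max = fs / f_bpfi     # time resolution H / fs must be shorter than one BPFI period
    return L_min, H_max

def minmax_normalize(x: np.ndarray) -> np.ndarray:
    """Min-max normalization applied to the raw time signals before windowing."""
    return (x - x.min()) / (x.max() - x.min())

L_min, H_max = windowing_bounds(fs=2000, f_bsf=1.6, f_bpfi=2.6)
print(L_min, H_max)   # ~1250 and ~769; the Exp. A selections L = 2344 and H = 688 satisfy both bounds
```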

4.2. Interpretable Model Design for Comparative Analysis

The primary objective of this study is to perform a transparent and fundamental comparison between CNN and LSTM architectures by isolating their core learning mechanisms. To achieve this, simplified models with a single convolutional or LSTM layer were intentionally designed, excluding fully connected layers except for the final output layer [58]. This deliberate simplification serves two purposes: (1) it minimizes the black-box nature of deep learning, enabling direct visualization and analysis of the features learned by the core layer and allowing these features to be directly linked to performance outcomes, and (2) it ensures that observed performance differences arise from the distinct ways in which CNNs exploit local spatial patterns in STFT spectrograms versus how LSTMs capture sequential dependencies in feature vectors, rather than from the representational power of deeper, stacked layers. Although deeper models may achieve higher absolute accuracy, they would obscure the fundamental causal relationships this study aims to uncover. It should be emphasized that the simplified single-layer architectures are not intended to maximize classification accuracy but to isolate and analyze the fundamental learning mechanisms of CNNs and LSTMs in a transparent manner.
To ensure fairness in comparison, both models employed the same hyperparameter settings: a dropout rate of 40%, the Adam (adaptive moment estimation) optimizer with a learning rate of 0.001, and gradient clipping with a threshold of 0.01. The model weights that yielded the best performance on the validation set were saved, and training was terminated when the loss function failed to decrease for more than 50 consecutive epochs, using a batch size of 32.
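A minimal Keras-style sketch of this shared training configuration is given below. The loss function, the checkpoint file name, and the use of `clipvalue` for the 0.01 gradient-clipping threshold are assumptions; `build_model` stands in for either single-layer architecture.

```python
import tensorflow as tf

def compile_and_train(build_model, x_train, y_train, x_val, y_val):
    model = build_model()
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3, clipvalue=0.01),
        loss="sparse_categorical_crossentropy",   # assumed loss for the 4-class problem
        metrics=["accuracy"],
    )
    callbacks = [
        # Keep the weights that performed best on the validation set.
        tf.keras.callbacks.ModelCheckpoint("best_model.keras", save_best_only=True),
        # Stop when the validation loss has not improved for 50 consecutive epochs.
        tf.keras.callbacks.EarlyStopping(patience=50, restore_best_weights=True),
    ]
    model.fit(x_train, y_train, validation_data=(x_val, y_val),
              epochs=1000, batch_size=32, callbacks=callbacks)
    return model
```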

4.2.1. CNN–STFT

The architecture of the CNN-based model was designed with a simple structure consisting of a single convolutional layer, as illustrated in Figure 7. The input layer receives STFT images with a single channel, represented as $(F, T, 1)$. The data then passes through the convolutional layer employing a rectified linear unit (ReLU) activation function, and the final output predicts four class labels.
Due to the simplicity of the architecture, the kernel size and stride interval were carefully selected based on mathematical considerations, as shown in Equations (2) and (3). The following presents the mathematical formulation used to determine the optimal kernel size and stride interval over time.
$K_t \geq \mathrm{ceil}\left( P / \Delta t \right)$  (2)

$S_t \leq 0.2\,K_t$  (3)
The kernel size over time $K_t$ was chosen to cover one rotational period $P$ of the bearing. This choice is grounded in the physical characteristics of bearing vibration [59]: the rotational period represents the longest fundamental period of the system. All bearing fault frequencies—including ball pass frequencies (BPFI, BPFO) and ball spin frequency (BSF)—are harmonics or multiples of this fundamental rotational frequency, meaning they have shorter periods than $P$. Therefore, a time window spanning one full rotation is sufficient to capture at least one complete cycle of the highest-frequency fault component (typically BSF) while also encompassing the fundamental periodicity. A smaller window might miss low-frequency components or complete cycles of certain faults, whereas a larger window would introduce redundant information without additional discriminative benefit. The stride interval over time $S_t$ was chosen such that it does not exceed 20% of the kernel length, preventing the extraction of repetitive representations, as expressed in Equation (3). For the same rationale used in selecting the STFT parameters, the largest power of two within the allowable range was selected. Although it cannot be asserted that these expressions always yield optimal performance, our experiments demonstrated that they consistently provided sufficiently high accuracy. To ensure a fair comparison, identical kernel sizes were also applied to the LSTM-F model. This approach was adopted because the frequency spectrum of the vibration signals in defective regions is often ambiguous. The kernel size along the frequency axis was empirically determined to achieve an appropriate level of information condensation, and the selected values are summarized in Table 6. These results indicate that CNN-S is more robust under low frequency separability conditions, whereas LSTM-F benefits from clearly distinguishable spectral patterns.
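A minimal sketch of the resulting CNN-S model is shown below: a single Conv2D layer with ReLU acting on the $(F, T, 1)$ STFT image, followed directly by the four-class output layer. The kernel size over time follows Equation (2) and the stride Equation (3); the frequency-axis kernel, stride, and filter count are illustrative placeholders rather than the Table 6 values, and the dropout placement is an assumption.

```python
import math
import tensorflow as tf

def build_cnn_s(F: int, T: int, P: float, dt: float,
                K_f: int = 8, S_f: int = 2, n_filters: int = 8) -> tf.keras.Model:
    K_t = math.ceil(P / dt)            # Equation (2): time kernel spans one rotational period
    S_t = max(1, int(0.2 * K_t))       # Equation (3): stride at most 20% of the kernel length
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(F, T, 1)),
        tf.keras.layers.Conv2D(n_filters, kernel_size=(K_f, K_t),
                               strides=(S_f, S_t), activation="relu"),
        tf.keras.layers.Dropout(0.4),          # shared 40% dropout rate
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(4, activation="softmax"),   # Normal, OF, IF, RF
    ])
```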

4.2.2. LSTM–Handcrafted

Twelve handcrafted features, introduced in Section 2.4, were extracted from two domains—the raw vibration signal in the time domain and the Fourier-transformed spectral signal—as illustrated in Figure 8. The comparison between these two domains aims to evaluate their effectiveness in capturing the system’s characteristics—specifically, to determine whether spectral information or amplitude information provides a more representative description of the system’s behavior. For time-domain signals, referred to as LSTM-T (time-based), twelve features were extracted from the normalized data using the time-windowing parameters listed in Table 5. For frequency-domain signals, referred to as LSTM-F (frequency-based), the same features were extracted from the normalized first half of the spectrum, since the Fourier-transformed signal exhibits symmetry about its central frequency indices.
After normalizing each signal, the sequence with a total length of $T$ (for time-domain data) or $T/2$ (for frequency-domain data) was partitioned into $W$ segments, resulting in segment dimensions of $(W, T/W)$ or $(W, T/2W)$, respectively. Subsequently, both signals were cropped to a final shape of $(W, 12)$, consistent with the STFT preprocessing results, by extracting twelve features from each segment. The number of hidden units in the LSTM layer was determined such that the total number of trainable parameters was comparable to that of the CNN model, as summarized in Table 7.
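A minimal sketch of this pipeline is given below: the normalized half-spectrum is split into $W$ segments, twelve features are computed per segment to give a $(W, 12)$ sequence, and a single LSTM layer feeds the four-class output. `extract_features` stands in for the Table 1 feature set, and the hidden-unit count is illustrative (the actual value was matched to the CNN parameter count, Table 7).

```python
import numpy as np
import tensorflow as tf

def build_feature_sequence(spectrum: np.ndarray, W: int, extract_features) -> np.ndarray:
    """Split the non-redundant half of the spectrum into W segments and featurize each."""
    half = spectrum[: len(spectrum) // 2]
    half = (half - half.min()) / (half.max() - half.min())   # min-max normalization
    segments = np.array_split(half, W)
    return np.vstack([extract_features(seg) for seg in segments])   # shape (W, 12)

def build_lstm_f(W: int, n_features: int = 12, n_hidden: int = 32) -> tf.keras.Model:
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(W, n_features)),
        tf.keras.layers.LSTM(n_hidden),           # single LSTM layer (hidden size illustrative)
        tf.keras.layers.Dropout(0.4),             # shared 40% dropout rate
        tf.keras.layers.Dense(4, activation="softmax"),
    ])
```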
Model performance was evaluated using four standard metrics, as shown in Figure 9: accuracy, precision, recall, and F1 macro [60]. Accuracy measures the proportion of correct predictions, including both true positives (TP) and true negatives (TN), out of all predictions, as defined in Equation (4). It provides an overall assessment of the model’s correctness. Precision ($P_k$) quantifies the proportion of correctly predicted positive instances among all instances predicted as positive, as shown in Equation (5). It is particularly important when the cost of false positives (FP) is high. Recall ($R_k$) measures the proportion of correctly predicted positive instances among all actual positive instances, as defined in Equation (6). It becomes critical when the cost of false negatives (FN) is high. The F1 macro score provides a balanced evaluation by considering both precision and recall across multiple classes, as expressed in Equation (7). It represents the harmonic mean of precision and recall, offering a single comprehensive measure of the model’s classification performance.

$Accuracy = \dfrac{1}{4} \sum_{i=1}^{4} TP_{ii}$  (4)

$P_k = \dfrac{TP_{kk}}{TP_{kk} + \sum_{i=1}^{4} FP_{ik}}, \quad \text{if } i = k \text{ then } FP_{ik} = 0$  (5)

$R_k = \dfrac{TP_{kk}}{TP_{kk} + \sum_{i=1}^{4} FN_{ki}}, \quad \text{if } i = k \text{ then } FN_{ki} = 0$  (6)

$F1_{Macro} = \dfrac{1}{4} \sum_{i=1}^{4} \dfrac{2 P_i R_i}{P_i + R_i}$  (7)
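These metrics can be computed from a 4 × 4 confusion matrix as in the sketch below, where `C[i, k]` counts samples of true class i predicted as class k (so the diagonal holds the true positives). Accuracy is computed here as the overall proportion of correct predictions, matching the textual definition above; the example matrix is purely illustrative.

```python
import numpy as np

def macro_metrics(C: np.ndarray) -> dict:
    tp = np.diag(C).astype(float)
    precision = tp / C.sum(axis=0)         # Eq. (5): TP over column-wise predicted positives
    recall = tp / C.sum(axis=1)            # Eq. (6): TP over row-wise actual positives
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": tp.sum() / C.sum(),    # proportion of correct predictions
        "precision_macro": precision.mean(),
        "recall_macro": recall.mean(),
        "f1_macro": f1.mean(),             # Eq. (7): macro-averaged harmonic mean
    }

# Illustrative confusion matrix for the four classes (Normal, OF, IF, RF).
C = np.array([[48, 1, 1, 0],
              [0, 47, 2, 1],
              [1, 0, 49, 0],
              [0, 2, 0, 48]])
print(macro_metrics(C))
```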

5. Results and Discussion

The benchmark and experimental datasets yielded different outcomes, as shown in Table 8 and Figure 10. On the experimental dataset, CNN-S achieved the highest performance with an F1 score exceeding 99%, followed by LSTM-F, which attained an average accuracy of 93%. On the benchmark dataset, in contrast, LSTM-F achieved the best performance with an F1 score exceeding 99%, followed by CNN-S as the second best with an F1 score of 91%.
Upon completion of training, the CNN outputs were averaged pixel-by-pixel to visualize the physical information most critical for classification and to interpret performance differences from the perspective of the network architecture. Figure 11 presents the heatmaps corresponding to the four bearings from experiments A and B. High-intensity pixels indicate regions where the image patterns associated with specific fault modes are more prominently represented.
In the experimental dataset, the frequency ranges exhibit strong activation across all time intervals for all four bearings. Notably, higher-frequency components were inconsistently excited in all three faulty-bearing cases, regardless of rotational speed or fault frequency, unlike the patterns observed in Figure 5 and Figure 6. This phenomenon likely results from the rough surface texture of the defective regions, which generates irregular high-frequency components and elevated energy responses.
Moreover, as shown in Figure 11e–l, the CNN consistently focused on relatively high-frequency regions, suggesting that these ranges play an important role in fault discrimination. In Figure 5 and Figure 6, most fault-related energy is concentrated below 50 Hz, implying that the fault patterns would primarily be captured in the low-frequency range when using the STFT. However, the CNN representations in Figure 11 indicate that not only frequencies below 50 Hz but also high-frequency regions above 200 Hz are significant for classification.
In contrast, the benchmark dataset exhibits clearer distinctions in frequency trends across fault modes (OF, IF, and RF) [61], as shown in Figure 11a–d. The rotational characteristics of the bearings and motors are more prominently captured in these representations. Furthermore, the CNN representations derived from the benchmark dataset display noticeable temporal variations compared with those from the experimental dataset. This observation suggests that when fault-related frequency characteristics are distinctly defined, classification performance can be enhanced by emphasizing specific frequency bands rather than modeling the entire time–frequency complexity.
To further examine the physical significance of the learned representations, a transposed convolution (deconvolution) operation [58] was applied to the extracted feature maps to restore them to the original image dimensions. This operation was implemented using a layer with the same kernel size $(K_f, K_t)$, stride $(S_f, S_t)$, and number of filters $N$ as the original convolutional layer (Section 4.2.1, Table 6). The weights of this layer were fixed as the transpose of the weights from the corresponding trained convolutional layer, and zero padding was applied to ensure the output dimensions matched the original STFT input size. This process creates a heatmap that projects the learned feature activations back onto the input space, highlighting the time–frequency regions most influential for the CNN’s decision.
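A minimal sketch of this projection is shown below using a Keras Conv2DTranspose layer whose kernel size, strides, and weights are taken from the trained convolutional layer. With a single input channel in the original layer and a single output channel here, both kernels share the shape $(K_f, K_t, 1, N)$, so the trained weights can be assigned directly; the padding and cropping step used to match the STFT size is a simplification of the zero padding described above.

```python
import numpy as np
import tensorflow as tf

def project_features(trained_conv: tf.keras.layers.Conv2D,
                     feature_maps: np.ndarray, target_hw: tuple) -> np.ndarray:
    """feature_maps: (batch, H_f, T_f, N) activations of the trained convolutional layer."""
    feature_maps = feature_maps.astype("float32")
    deconv = tf.keras.layers.Conv2DTranspose(
        filters=1,
        kernel_size=trained_conv.kernel_size,
        strides=trained_conv.strides,
        use_bias=False,
    )
    deconv(feature_maps)                                  # first call builds the layer weights
    deconv.set_weights([trained_conv.get_weights()[0]])   # fix the kernel to the trained weights
    heat = deconv(feature_maps).numpy()[..., 0]
    # Zero-pad (and crop) to the original STFT size so the heatmap overlays the input image.
    H, T = target_hw
    heat = np.pad(heat, ((0, 0),
                         (0, max(0, H - heat.shape[1])),
                         (0, max(0, T - heat.shape[2]))))[:, :H, :T]
    return heat
```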
Figure 12 and Figure 13 show reconstructed STFT-based heatmaps and their corresponding magnified views, respectively. In Figure 12 and Figure 13e–l, certain frequency patterns are observed to recur periodically over time, indicating that the CNN successfully captured repetitive time–frequency features associated with bearing faults. In contrast, as shown in Figure 12 and Figure 13a–d, the benchmark dataset, which exhibits more distinct frequency characteristics, does not produce such consistent temporal recurrences. This difference arises because the benchmark dataset represents more severe fault conditions, leading to clear frequency distinctions among fault types. Meanwhile, the experimental dataset, which involves milder defects, lacks clearly separable frequency features, thereby requiring the CNN to learn more complex joint time–frequency representations.
The difference in representation between the two datasets is particularly significant because CNNs fundamentally rely on repetitive spatial patterns within images. When consistent feature patterns corresponding to specific labels are absent, the model’s training stability tends to decrease, as also reflected in Table 7. LSTM models, on the other hand, predict labels by processing the current sequence in conjunction with all preceding sequences. Owing to their recurrent structure, LSTMs must extract distinguishable representations at consistent temporal or spectral indices to maintain stable learning. To examine this behavior, both the input features and the output representations of the LSTM models were analyzed to determine whether discriminative patterns were consistently captured. Figure 14 visualizes the average output activations of LSTM cells across the entire training dataset. A higher average activation at a specific index (time or frequency) indicates that the model identified that index as containing critical information. As shown in Figure 14a, LSTM models trained on datasets yielding superior performance exhibit clearly distinguishable index activations, suggesting that the model successfully learned meaningful sequential dependencies. Conversely, in LSTM models that underperformed relative to CNN, the activation distributions lack such distinct features. This difference is particularly evident in the time-domain model (LSTM-T). Because LSTMs make predictions based on both current and previous index values, datasets that do not exhibit consistent sequential relationships hinder effective learning. This observation is supported by the performance trends summarized in Figure 14 and Table 8. Considering the results of CNN-S, LSTM-T, and LSTM-F, it was observed that when the modal frequency characteristics and time-domain vibration behaviors were highly similar—as shown in Figure 3, Figure 5 and Figure 6—the extracted representations became ambiguous and difficult to discriminate, as illustrated in the second and third rows of Figure 12 and Figure 13 and the second and third columns of Figure 14.
Consequently, all three methods required a relatively large amount of training data (approximately 400 samples) to achieve satisfactory performance, with CNN-S ultimately producing the most reliable results. In contrast, for major defects, as reported in [56] and [61], the frequency and time–frequency characteristics were clearly distinguishable. Under such conditions, even a small amount of training data (around 40 samples) was sufficient to achieve accurate predictions, and the LSTM-F model demonstrated the best performance. The extracted representations in this case exhibited well-separated vector distributions, as seen in Figure 12 and Figure 13a–d, and Figure 14a,d. However, the LSTM-T model consistently failed to yield stable or distinctive representations, resulting in the lowest accuracy among the three approaches. Therefore, the CNN-S approach is more appropriate for minor defect scenarios, whereas the LSTM-F model is preferable for major defect conditions. To assess this separability, one can calculate the mean power spectrum in the frequency domain and measure the average distance between classes. A larger value of this metric indicates more distinct frequency characteristics, suggesting that an algorithm such as LSTM-F would be suitable. Alternatively, separability can be evaluated by computing the silhouette score obtained from the clustering of frequency features [62,63].
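The separability check suggested above can be sketched as follows: compute the mean power spectrum of each fault class, measure the average pairwise distance between the class means, and compute a silhouette score on the individual spectra [62,63]. The normalization and distance choices are illustrative assumptions; larger values point toward a frequency-sequential model such as LSTM-F, smaller values toward CNN-S.

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import silhouette_score

def frequency_separability(signals: np.ndarray, labels: np.ndarray) -> tuple[float, float]:
    """signals: (n_samples, n_points) time signals; labels: (n_samples,) fault-class labels."""
    spectra = np.abs(np.fft.rfft(signals, axis=1)) ** 2       # per-sample power spectra
    spectra /= spectra.sum(axis=1, keepdims=True)             # normalize spectral energy
    class_means = {c: spectra[labels == c].mean(axis=0) for c in np.unique(labels)}
    mean_distance = np.mean([np.linalg.norm(class_means[a] - class_means[b])
                             for a, b in combinations(class_means, 2)])
    return mean_distance, silhouette_score(spectra, labels)
```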
To place our representation-based analysis in a broader context, it can be compared with recent physics-informed and mechanism-driven feature learning studies in the field of rotating machinery [64,65,66]. However, such an integration is beyond the scope of the present work and will be addressed in future research.

6. Conclusions

This study conducted a fundamental investigation into the performance of CNN and LSTM models for bearing fault diagnosis through a purposefully simplified and interpretable experimental framework. The utilization of both a widely adopted benchmark dataset (CWRU) and a custom-designed low-speed bearing dataset demonstrates that optimal model selection is critically dependent on the nature of the frequency-domain characteristics present in the vibration data. The conclusions presented in this section are directly derived from the comparative performance metrics and representation analyses discussed in Section 5.
The key finding of this study is that the superiority of a given architecture is not universal but is determined by the degree of data separability in the frequency domain. LSTM models, particularly when fed with frequency-domain handcrafted features (LSTM-F), excel in scenarios where bearing faults generate distinct and easily separable spectral signatures. This was clearly observed in the CWRU dataset, which contains significant faults. In contrast, CNN models processing STFT spectrograms (CNN-S) demonstrated superior performance when dealing with minor or incipient faults that do not produce clearly distinguishable frequency patterns, as evidenced by our experimental low-speed bearing dataset. The time-domain LSTM (LSTM-T) consistently underperformed in both scenarios, highlighting the importance of frequency-domain information for this task.
The simplified, single-layer architecture employed in this study was instrumental in enabling a transparent analysis of the learned representations. This design choice allows for performance differences to be directly attributed to the core learning mechanisms of each model: LSTMs’ ability to capture sequential dependencies in well-defined feature sequences versus CNNs’ strength in identifying complex local spatial patterns within time–frequency representations. The representation analysis provided clear evidence that LSTM-F relies on consistent, index-specific patterns, whereas CNN-S leverages more complex time–frequency interactions, making it more robust under low-separability conditions.
Based on these findings, a practical guideline for model selection in industrial applications is proposed: LSTM-F is the preferred choice for diagnosing severe faults with distinct spectral features, while CNN-S is more effective for detecting minor defects or for systems exhibiting complex and weakly discriminative frequency behavior.
For future work, the fundamental insights gained from this controlled study should be validated and extended. A natural progression is to investigate whether the same performance trends and selection guidelines hold for more complex, state-of-the-art models such as deep convolutional networks and Transformer-based architectures. In addition, integrating our representation analysis framework with recent physics-informed and mechanism-driven feature learning approaches in rotating machinery—such as surrogate dynamic modeling and fault frequency-guided feature enhancement—presents a promising direction for further improving interpretability and robustness. Furthermore, applying these guidelines to real-world noisy industrial environments and exploring automated methods through which to assess dataset frequency separability for model selection would provide valuable contributions to the field of predictive maintenance.

Author Contributions

Methodology, validation, investigation, data curation, writing, J.-W.K.; conceptualization, data curation, writing—review and editing, funding acquisition, J.-H.L., D.-H.S. and S.-H.C.; conceptualization, writing, supervision, project administration, K.-S.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. RS-2025-00515526); by RF systems Co., Ltd. (202510040001); and by the Gachon University research fund of 2023 (GCU-202308090001).

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

Authors Jong-Hak Lee, Dong-Hun Son, Sung-Hyun Choi were employed by the company LIG Nex1. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. SKF. Slewing Bearings. Available online: https://cdn.skfmediahub.skf.com/api/public/0901d196809590fe/pdf_preview_medium/0901d196809590fe_pdf_preview_medium.pdf (accessed on 15 September 2005).
  2. NSK. Technical Report. Available online: https://www.nsk.com/tools-resources/technical-report/ (accessed on 1 October 2023).
  3. Peng, B.; Bi, Y.; Xue, B.; Zhang, M.; Wan, S. A survey on fault diagnosis of rolling bearings. Algorithms 2022, 15, 347. [Google Scholar] [CrossRef]
  4. Zhang, R.; Guo, L.; Zong, Z.; Gao, H.; Qian, M.; Chen, Z. Dynamic modeling and analysis of rolling bearings with rolling element defect considering time-varying impact force. J. Sound Vib. 2023, 526, 117820. [Google Scholar] [CrossRef]
  5. Jin, X.; Chen, Y.; Wang, L.; Han, H.; Chen, P. Failure prediction, monitoring and diagnosis methods for slewing bearings of large-scale wind turbine: A review. Measurement 2021, 172, 108855. [Google Scholar] [CrossRef]
  6. Jain, P.; Bhosle, S. Analysis of vibration signals caused by ball bearing defects using time-domain statistical indicators. Int. J. Adv. Technol. Eng. Explor. 2022, 9, 700. [Google Scholar] [CrossRef]
  7. Lebold, M.; Mcclintic, K.; Campbell, R.; Byington, C.; Maynard, K. Review of vibration analysis methods for gearbox diagnostics and prognostics. In Proceedings of the 54th Meeting of the Society for Machinery Failure Prevention Technology, Virginia Beach, VA, USA, 1–4 May 2000; Volume 16. [Google Scholar]
  8. Mclnerny, S.A.; Dai, Y. Basic vibration signal processing for bearing fault detection. IEEE Trans. Educ. 2003, 46, 149–156. [Google Scholar] [CrossRef]
  9. Kiral, Z.; Karagulle, H. Vibration analysis of rolling element bearings with various defects under the action of an unbalanced force. Mech. Syst. Signal Process 2006, 20, 1967–1991. [Google Scholar] [CrossRef]
  10. Ocak, H.; Loparo, K. Estimation of the running speed and bearing defect frequencies of an induction motor from vibration data. Mech. Syst. Signal Process 2004, 18, 515–533. [Google Scholar] [CrossRef]
  11. Yang, H.; Mathew, J.; Ma, L. Vibration feature extraction techniques for fault diagnosis of rotating machinery: A literature survey. In Proceedings of the 10th Asia-Pacific Vibration Conference, Gold Coast, Australia, 12–14 November 2003; pp. 801–807. [Google Scholar]
  12. Saruhan, H.; Saridemir, S.; Qicek, A.; Uygur, I. Vibration analysis of rolling element bearings defects. J. Appl. Res. Technol. 2014, 12, 384–395. [Google Scholar] [CrossRef]
  13. Kapangowda, N.; Krishna, H.; Vasanth, S.; Thammaiah, A. Internal combustion engine gearbox bearing fault prediction using J48 and random forest classifier. Int. J. Electr. Comput. Eng. 2023, 13, 4. [Google Scholar]
  14. Knight, A.; Bertani, S. Mechanical fault detection in a medium-sized induction motor using stator current monitoring. IEEE Trans. Energy Convers 2005, 20, 753–760. [Google Scholar] [CrossRef]
  15. Caesarendra, W. Vibration and Acoustic Emission-Based Condition Monitoring and Prognostic Methods for Very Low Speed Slew Bearing. Ph.D. Thesis, University of Wollongong, Wollongong, Australia, 2015. [Google Scholar]
  16. Caesarendra, W.; Kosasih, P.; Tieu, A.; Moodie, C.; Choi, B. Condition monitoring of naturally damaged slow speed slewing bearing based on ensemble empirical mode decomposition. J. Mech. Sci. Technol. 2013, 27, 2253–2262. [Google Scholar] [CrossRef]
  17. Caesarendra, W.; Park, J.; Kosasih, P.; Choi, B. Condition monitoring of low speed slewing bearings based on ensemble empirical mode decomposition method. Trans. Korean Soc. Noise Vib. Eng. 2013, 23, 131–143. [Google Scholar] [CrossRef]
  18. Žvokelj, M.; Zupan, S.; Prebil, I. Multivariate and multiscale monitoring of large-size low-speed bearings using ensemble empirical mode decomposition method combined with principal component analysis. Mech. Syst. Signal Process 2010, 24, 1049–1067. [Google Scholar] [CrossRef]
  19. Han, T.; Liu, Q.; Zhang, L.; Tan, A. Fault feature extraction of low speed roller bearing based on teager energy operator and CEEMD. Measurement 2019, 138, 400–408. [Google Scholar] [CrossRef]
  20. Caesarendra, W.; Kosasih, B.; Tieu, A.; Moodie, C. Circular domain features based condition monitoring for low speed slewing bearing. Mech. Syst. Signal Process. 2014, 45, 114–138. [Google Scholar] [CrossRef]
  21. Caesarendra, W.; Kosasih, B.; Tieu, A.; Moodie, C. Application of the largest Lyapunov exponent algorithm for feature extraction in low speed slew bearing condition monitoring. Mech. Syst. Signal Process 2015, 50, 116–138. [Google Scholar] [CrossRef]
  22. Wu, C.; Zheng, S. Fault Diagnosis method of rolling bearing based on MSCNN-LSTM. Comput. Mater. Contin. 2024, 79, 4395–4411. [Google Scholar] [CrossRef]
  23. Han, K.; Wang, W.; Guo, J. Research on a bearing fault diagnosis method based on a CNN-LSTM-GRU model. Machines 2024, 12, 927. [Google Scholar] [CrossRef]
  24. Xu, M.; Yu, Q.; Chen, S.; Lin, J. Rolling bearing fault diagnosis based on CNN-LSTM with FFT and SVD. Information 2024, 15, 399. [Google Scholar] [CrossRef]
  25. An, Y.; Zhang, K.; Liu, Q.; Chai, Y.; Huang, X. Rolling bearing fault diagnosis method base on periodic sparse attention and LSTM. IEEE Sens. J. 2022, 22, 12044–12053. [Google Scholar] [CrossRef]
  26. Gu, K.; Zhang, Y.; Liu, X.; Li, H.; Ren, M. DWT-LSTM-based fault diagnosis of rolling bearings with multi-sensors. Electronics 2021, 10, 2076. [Google Scholar] [CrossRef]
  27. Li, X.; Su, K.; He, Q.; Wang, X.; Xie, Z. Research on fault diagnosis of highway Bi-LSTM based on attention mechanism. Eksploat. I Niezawodn.-Maint. Reliab. 2023, 25, 162937. [Google Scholar] [CrossRef]
  28. Li, C.; Xu, J.; Xing, J. A frequency feature extraction method based on convolutional neural network for recognition of incipient fault. IEEE Sensors J. 2023, 24, 564–572. [Google Scholar] [CrossRef]
  29. Zhang, Q.; Deng, L. An intelligent fault diagnosis method of rolling bearings based on short-time Fourier transform and convolutional neural network. J. Fail. Anal. Prev. 2023, 23, 795–811. [Google Scholar] [CrossRef]
  30. Yang, S.; Yang, P.; Yu, H.; Bai, J.; Feng, W.; Su, Y.; Si, Y. A 2DCNN-RF model for offshore wind turbine high-speed bearing-fault diagnosis under noisy environment. Energies 2022, 15, 3340. [Google Scholar] [CrossRef]
  31. Zhou, Z.; Ai, Q.; Lou, P.; Hu, J.; Yan, J. A novel method for rolling bearing fault diagnosis based on Gramian angular field and CNN-ViT. Sensors 2024, 24, 3967. [Google Scholar] [CrossRef] [PubMed]
  32. Albawi, S.; Mohammed, T.A.; Al-Zawi, S. Understanding of a convolutional neural network. In Proceedings of the 2017 International Conference on Engineering Technology (ICET), Antalya, Turkey, 21–23 August 2017; pp. 1–6. [Google Scholar]
  33. Wei, Q.; Yang, Y. Bearing fault diagnosis with parallel CNN and LSTM. Math. Biosci. Eng. 2024, 21, 2385–2406. [Google Scholar] [CrossRef] [PubMed]
  34. Liao, J.-X.; Wei, S.-L.; Xie, C.-L.; Zeng, T.; Sun, J.-W.; Zhang, S.; Zhang, X.; Fan, F.-L. Bearing PGA-Net: A Lightweight and deployable bearing fault diagnosis network via decoupled knowledge distillation and FPGA Acceleration. IEEE Trans. Instrum. Meas. 2023, 73, 1–14. [Google Scholar] [CrossRef]
  35. Alzubaidi, L.; Zhang, J.; Humaidi, A.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaria, J.; Fadhel, M.; Al-Amidie, M.; Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021, 8, 1–74. [Google Scholar] [CrossRef]
  36. Liu, H.; Zhang, F.; Tan, Y.; Huang, L.; Li, Y.; Huang, G.; Luo, S.; Zeng, A. Multi-scale quaternion CNN and BiGRU with cross self-attention feature fusion for fault diagnosis of bearing. Meas. Sci. Technol. 2024, 35, 086138. [Google Scholar] [CrossRef]
  37. Kulevome, D.K.B.; Qiu, M.; Cao, F.; Opoku-Mensah, E. Evaluation of time-frequency representations for deep learning-based rotating machinery fault diagnosis. Int. J. Eng. Technol. Innov. 2025, 15, 314–331. [Google Scholar] [CrossRef]
  38. Protas, E.; Bratti, J.; Gaya, J.; Drews, P.; Botelho, S. Visualization methods for image transformation convolutional neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2018, 30, 2231–2243. [Google Scholar] [CrossRef]
  39. Rao, S. Mechanical Vibrations; Pearson/Prentice Hall: Singapore, 2001. [Google Scholar]
  40. Sundararajan, D. The Discrete Fourier Transform: Theory, Algorithms and Applications; World Scientific: Singapore, 2001. [Google Scholar]
  41. Allen, J.; Rabiner, L. A unified approach to short-time Fourier analysis and synthesis. Proc. IEEE 1977, 65, 1558–1564. [Google Scholar] [CrossRef]
  42. Li, Y.; Gu, X.; Wei, Y. A deep learning-based method for bearing fault diagnosis with few-shot learning. Sensors 2024, 24, 7516. [Google Scholar] [CrossRef] [PubMed]
  43. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  44. Staudemeyer, R.; Morris, E. Understanding LSTM—A tutorial into long short-term memory recurrent neural networks. arXiv 2019, arXiv:1909.09586. [Google Scholar]
  45. Zargar, S. Introduction to Sequence Learning Models: RNN, LSTM, GRU. Preprint, 2021; 37988518. [Google Scholar] [CrossRef]
  46. Zhang, X.; Wang, Z.; Liu, Y. Error accumulation in recurrent neural networks: Analysis and mitigation for time-series prediction. Neural Netw. 2023, 165, 505–519. [Google Scholar]
  47. Li, Q.; Tan, Z.; Chen, H. Limitations of recurrent architectures for long-horizon sequence modeling: An empirical study. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 1823–1836. [Google Scholar]
  48. Cai, C.; Tao, Y.; Zhu, T.; Deng, Z. Short-term load forecasting based on deep learning bidirectional LSTM neural network. Appl. Sci. 2021, 11, 8129. [Google Scholar] [CrossRef]
  49. Park, S.; Park, K. A pre-trained model selection for transfer learning of remaining useful life prediction of grinding wheel. J. Intell. Manuf. 2023, 35, 2295–2312. [Google Scholar] [CrossRef]
  50. Caesarendra, W.; Tjahjowidodo, T. A review of feature extraction methods in vibration-based condition monitoring and its application for degradation trend estimation of low-speed slew bearing. Machines 2017, 5, 21. [Google Scholar] [CrossRef]
  51. Case Western Reserve University Bearing Dataset. Available online: https://engineering.case.edu/bearingdatacenter (accessed on 1 March 2023).
  52. Pietrzak, P.; Wolkiewicz, M. Stator winding fault detection of permanent magnet synchronous motors based on the short-time Fourier transform. Power Electron. Drives 2022, 7, 112–133. [Google Scholar] [CrossRef]
  53. Ruiz, J.; Rosero, J.; Espinosa, A.; Romeral, L. Detection of demagnetization faults in permanent-magnet synchronous motors under nonstationary conditions. IEEE Trans. Magn. 2009, 45, 2961–2969. [Google Scholar] [CrossRef]
  54. Rosero, J.; Ortega, J.; Urresty, J.; Cárdenas, J.; Romeral, L. Stator short circuits detection in PMSM by means of higher order spectral analysis (HOSA). In Proceedings of the 2009 Twenty-Fourth Annual IEEE Applied Power Electronics Conference and Exposition, Washington, DC, USA, 15–19 February 2009; pp. 964–969. [Google Scholar]
  55. Belbali, A.; Makhloufi, S.; Kadri, A.; Abdallah, L.; Seddik, Z. Mathematical Modelling of a 3-Phase Induction Motor; IntechOpen: London, UK, 2023. [Google Scholar]
  56. Smith, W.; Randall, R. Rolling element bearing diagnostics using the Case Western Reserve University data: A benchmark study. Mech. Syst. Signal Process. 2015, 64, 100–131. [Google Scholar] [CrossRef]
  57. Iunusova, E.; Gonzalez, M.K.; Szipka, K.; Archenti, A. Early fault diagnosis in rolling element bearings: Comparative analysis of a knowledge-based and a data-driven approach. J. Intell. Manuf. 2023, 35, 2327–2347. [Google Scholar] [CrossRef]
  58. Erhan, D.; Bengio, Y.; Courville, A.; Vincent, P. Visualizing higher-layer features of a deep network. Univ. Montr. 2009, 1341, 1. [Google Scholar]
  59. Harris, T.A.; Kotzalas, M.N. Rolling Bearing Analysis: Essential Concepts of Bearing Technology, 5th ed.; CRC Press: Boca Raton, FL, USA, 2006. [Google Scholar]
  60. Grandini, M.; Bagli, E.; Visani, G. Metrics for multi-class classification: An overview. arXiv 2020, arXiv:2008.05756. [Google Scholar] [CrossRef]
  61. Yoo, Y.; Jo, H.; Ban, S. Lite and efficient deep learning model for bearing fault diagnosis using the CWRU dataset. Sensors 2023, 23, 3157. [Google Scholar] [CrossRef]
  62. Lu, W.; Liang, B.; Cheng, Y.; Meng, D.; Yang, J.; Zhang, T. Deep model based domain adaptation for fault diagnosis. IEEE Trans. Ind. Electron. 2017, 64, 2296–2305. [Google Scholar] [CrossRef]
  63. Liu, R.; Yang, B.; Zio, E.; Chen, X. Artificial intelligence for fault diagnosis of rotating machinery: A review. Mech. Syst. Signal Process. 2022, 165, 108376. [Google Scholar] [CrossRef]
  64. Cheng, Y.; Yan, J.; Zhang, F.; Li, M.; Zhou, N.; Shi, C. Surrogate modeling of pantograph–catenary system interactions. Mech. Syst. Signal Process. 2025, 224, 112134. [Google Scholar] [CrossRef]
  65. Cheng, Y.; Zhou, N.; Wang, Z. CFFsBD: A Candidate Fault Frequencies-based Blind Deconvolution. IEEE Trans. Instrum. Meas. 2023, 72, 3506412. [Google Scholar] [CrossRef]
  66. Cheng, Y.; Wang, S.; Chen, B.; Mei, G.; Zhang, W.; Peng, H.; Tian, G. An improved envelope spectrum via candidate fault frequency optimization-gram. J. Sound Vib. 2022, 523, 116746. [Google Scholar] [CrossRef]
Figure 1. Experimental setup for low-speed slewing bearing vibration measurement.
Figure 2. Slightly defective bearing components: (a) outer race fault, (b) inner race fault, and (c) rolling element fault.
Figure 3. Representative vibration signal samples measured at 5 RPM in the axial direction: (a) normal, (b) outer race fault, (c) inner race fault, and (d) rolling element fault.
Figure 4. Representative vibration signal samples measured at 20 RPM in the radial direction: (a) normal, (b) outer race fault, (c) inner race fault, and (d) rolling element fault.
Figure 5. STFT results of the Exp. B dataset with selected parameters: radial direction (top row) and axial direction (bottom row) for the (a) normal, (b) outer race fault, (c) inner race fault, and (d) rolling element fault.
Figure 6. Zoomed-in STFT results of the Exp. B dataset highlighting low-frequency components: radial direction (top row) and axial direction (bottom row) for the (a) normal, (b) outer race fault, (c) inner race fault, and (d) rolling element fault.
Figure 7. Architecture of the proposed simplified CNN model.
Figure 8. Architecture of the proposed simplified LSTM model using handcrafted features.
Figure 9. Confusion matrix for multi-class bearing fault classification.
Figure 10. Confusion matrices of the three models for experimental datasets: (a) LSTM-T, (b) LSTM-F, and (c) CNN-S.
Figure 11. Summed feature map representations obtained from the CNN model: (ad) CWRU dataset (normal, outer race fault, inner race fault, and rolling element fault), (eh) Exp. A dataset, and (il) Exp. B dataset.
Figure 12. Deconvolution-based reconstructed feature maps for fault localization: (ad) CWRU dataset, (eh) Exp. A dataset, and (il) Exp. B dataset.
Figure 13. Zoomed-in views of the deconvolution-based reconstructed feature maps corresponding to Figure 12: (a–d) CWRU dataset, (e–h) Exp. A dataset, and (i–l) Exp. B dataset.
Figure 14. Average LSTM output activations across time or frequency indices: (a) CWRU LSTM-F, (b) Exp. A LSTM-F, (c) Exp. B LSTM-F, (d) CWRU LSTM-T, (e) Exp. A LSTM-T, and (f) Exp. B LSTM-T.
Table 1. Handcrafted statistical features extracted from vibration signals.
Mean: $f_1 = \frac{1}{N}\sum_{i} x_i$
Mean amplitude: $f_2 = \frac{1}{N}\sum_{i} |x_i|$
Root mean square: $f_3 = \sqrt{\frac{1}{N}\sum_{i} x_i^2}$
Square root amplitude: $f_4 = \left(\frac{1}{N}\sum_{i} \sqrt{|x_i|}\right)^2$
Peak to peak: $f_5 = \max(x) - \min(x)$
Standard deviation: $f_6 = \sqrt{\frac{1}{N-1}\sum_{i} (x_i - \mu)^2}$
Kurtosis: $f_7 = \frac{1}{N-1}\sum_{i} \left(\frac{x_i - \mu}{\sigma}\right)^4$
Skewness: $f_8 = \frac{1}{N-1}\sum_{i} \left(\frac{x_i - \mu}{\sigma}\right)^3$
Crest factor: $f_9 = \left(\max(x) - \min(x)\right) \big/ \sqrt{\frac{1}{N}\sum_{i} x_i^2}$
Shape factor: $f_{10} = \sqrt{\frac{1}{N}\sum_{i} x_i^2} \Big/ \left(\frac{1}{N}\sum_{i} |x_i|\right)$
Clearance factor: $f_{11} = \left(\max(x) - \min(x)\right) \big/ \left(\frac{1}{N}\sum_{i} |x_i|\right)$
Entropy: $f_{12} = -\sum_{i} p(x_i)\log_2 p(x_i)$
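For reproducibility, the listing below is a minimal NumPy sketch of how the twelve features in Table 1 could be computed for a single vibration window; the function name and the histogram-based entropy estimate are illustrative assumptions rather than the exact implementation used in this study.

```python
import numpy as np

def handcrafted_features(x: np.ndarray) -> np.ndarray:
    """Twelve statistical features of Table 1 for a 1-D vibration window x."""
    N = len(x)
    mu = x.mean()                                   # f1, mean
    sigma = x.std(ddof=1)                           # f6, standard deviation
    mean_abs = np.mean(np.abs(x))                   # f2, mean amplitude
    rms = np.sqrt(np.mean(x ** 2))                  # f3, root mean square
    sqrt_amp = np.mean(np.sqrt(np.abs(x))) ** 2     # f4, square root amplitude
    p2p = x.max() - x.min()                         # f5, peak to peak
    kurt = np.sum(((x - mu) / sigma) ** 4) / (N - 1)   # f7, kurtosis
    skew = np.sum(((x - mu) / sigma) ** 3) / (N - 1)   # f8, skewness
    crest = p2p / rms                               # f9, crest factor
    shape = rms / mean_abs                          # f10, shape factor
    clearance = p2p / mean_abs                      # f11, clearance factor (as defined in Table 1)
    # f12: entropy estimated from a normalized amplitude histogram (binning is an assumption)
    counts, _ = np.histogram(x, bins=32)
    p = counts[counts > 0] / counts.sum()
    entropy = -np.sum(p * np.log2(p))
    return np.array([mu, mean_abs, rms, sqrt_amp, p2p, sigma,
                     kurt, skew, crest, shape, clearance, entropy])

features = handcrafted_features(np.random.randn(1024))
```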
Table 2. Configuration of the Case Western Reserve University (CWRU) bearing dataset under different rotational speeds.
Dataset | RPM | Failure Mode | Training Size | Validation Size | Test Size
Dataset 1 | 1730 | Normal, OF, IF and RF | 40 | 281 | 281
Dataset 2 | 1750 | Normal, OF, IF and RF | 40 | 281 | 281
Dataset 3 | 1773 | Normal, OF, IF and RF | 40 | 281 | 281
Dataset 4 | 1797 | Normal, OF, IF and RF | 40 | 231 | 231
Table 5. Input dimensions of preprocessed features for each dataset.
Dataset and feature type | Input size (H, W)
Exp. A STFT | (1024, 128)
Exp. A Handcrafted | (128, 12)
Exp. B STFT | (256, 512)
Exp. B Handcrafted | (512, 12)
CWRU STFT | (32, 32)
CWRU Handcrafted | (32, 12)
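The (H, W) values in Table 5 correspond to the number of frequency bins and time frames of the STFT representation. The snippet below is an illustrative sketch using scipy.signal.stft; the sampling rate, segment length, window length, and overlap shown are placeholder assumptions, not the parameters used for each dataset.

```python
import numpy as np
from scipy.signal import stft

fs = 12_000                           # sampling rate in Hz (placeholder)
x = np.random.randn(16_384)           # stand-in for one measured vibration segment
f, t, Z = stft(x, fs=fs, nperseg=256, noverlap=128, window="hann")
H, W = np.abs(Z).shape                # H = nperseg // 2 + 1 frequency bins, W = time frames
print(H, W)
```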
Table 6. Kernel sizes, striding intervals, and channels of CNN-based models.
Parameter | Exp. A | Exp. B | CWRU
Kernel size ($K_f$, $K_t$, $N$) | (256, 16, 64) | (64, 31, 64) | (8, 5, 64)
Striding interval ($S_f$, $S_t$) | (32, 4) | (8, 4) | (1, 1)
Number of parameters | 262,208 | 127,040 | 2624
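The parameter counts in Table 6 are consistent with a single two-dimensional convolution layer with bias acting on a one-channel input. The following check is a sketch under that assumption, not the full model definition.

```python
def conv2d_params(k_f: int, k_t: int, n_ch: int, in_ch: int = 1, bias: bool = True) -> int:
    """Parameter count of one 2-D convolution layer: weights plus an optional bias per output channel."""
    return k_f * k_t * in_ch * n_ch + (n_ch if bias else 0)

print(conv2d_params(256, 16, 64))   # 262208 (Exp. A)
print(conv2d_params(64, 31, 64))    # 127040 (Exp. B)
print(conv2d_params(8, 5, 64))      # 2624   (CWRU)
```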
Table 7. Number of hidden units and parameters for LSTM-based models.
Parameter | Exp. A | Exp. B | CWRU
Number of hidden units | 250 | 172 | 20
Number of parameters | 263,000 | 127,008 | 2640
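Assuming a single LSTM layer with bias operating on 12-dimensional handcrafted feature sequences (Table 1 and Table 5), the standard parameter-count formula $4\,(dh + h^2 + h)$ reproduces the Exp. A and CWRU entries of Table 7, as sketched below; this is a consistency check under that assumption rather than the model code used in the paper.

```python
def lstm_params(d: int, h: int, bias: bool = True) -> int:
    """Parameter count of a single LSTM layer: 4 gates, each with input, recurrent, and (optional) bias weights."""
    return 4 * (d * h + h * h + (h if bias else 0))

print(lstm_params(12, 250))   # 263000 (Exp. A column of Table 7)
print(lstm_params(12, 20))    # 2640   (CWRU column of Table 7)
```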
Table 8. Accuracy and F1 macro of all experimental results.
Model | Dataset 1 (Acc./F1) | Dataset 2 (Acc./F1) | Dataset 3 (Acc./F1) | Dataset 4 (Acc./F1) | Exp. A (Acc./F1) | Exp. B (Acc./F1)
CNN | 0.92/0.92 | 0.91/0.91 | 0.91/0.90 | 0.92/0.92 | 0.99/0.99 | 0.99/0.99
LSTM-T | 0.80/0.79 | 0.80/0.80 | 0.81/0.81 | 0.80/0.80 | 0.90/0.90 | 0.78/0.76
LSTM-F | 0.99/0.99 | 0.99/0.99 | 0.99/0.99 | 1.0/1.0 | 0.90/0.90 | 0.96/0.96
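Table 8 reports accuracy together with the macro-averaged F1 score [60], which weights the four fault classes equally. A minimal sketch of both metrics computed from a confusion matrix (such as those in Figure 10) is given below; the example matrix is hypothetical and only illustrates the calculation.

```python
import numpy as np

def accuracy_and_macro_f1(cm: np.ndarray) -> tuple[float, float]:
    """Accuracy and macro-F1 from a confusion matrix (rows = true class, columns = predicted class)."""
    tp = np.diag(cm).astype(float)
    precision = tp / np.maximum(cm.sum(axis=0), 1)
    recall = tp / np.maximum(cm.sum(axis=1), 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return float(tp.sum() / cm.sum()), float(f1.mean())

# Hypothetical 4-class confusion matrix (Normal, OF, IF, RF) for illustration only
cm = np.array([[70, 0, 0, 0],
               [1, 68, 1, 0],
               [0, 2, 67, 1],
               [0, 0, 3, 67]])
print(accuracy_and_macro_f1(cm))
```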
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
