3.1. Dataset
In this study, a multi-domain vibration dataset covering compound machine fault scenarios was used [36]. This dataset provides a comprehensive collection of vibration signals obtained using a deep groove ball bearing (MOCHU 6204) under various fault conditions for fault diagnosis in rotating machinery. The dataset includes three single bearing faults, seven single rotating component faults, and 21 combined fault scenarios [37]. Data were collected at rotational speeds of 600, 800, 1000, 1200, 1400, and 1600 RPM, at sampling rates of 8 kHz and 16 kHz, and for different bearing types. Each vibration signal was recorded for 160 s at the 8 kHz sampling rate or 80 s at the 16 kHz sampling rate, so every recording contains 1,280,000 samples. The dataset is organized hierarchically by sampling rate and rotational speed, with 32 data files available for each speed category.
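As a quick consistency check on the recording lengths quoted above, the following sketch multiplies duration by sampling rate for both configurations. The variable names are illustrative and are not taken from the dataset.

```python
# Recording configurations described in the text:
# label -> (sampling rate in Hz, recording duration in seconds).
recordings = {"8 kHz": (8_000, 160), "16 kHz": (16_000, 80)}

# Number of samples per recording = sampling rate * duration.
samples_per_recording = {
    name: rate * duration for name, (rate, duration) in recordings.items()
}

for name, n_samples in samples_per_recording.items():
    print(f"{name}: {n_samples:,} samples")
```

Both configurations give the same recording length in samples, which is why the halved duration at 16 kHz does not reduce the amount of data per file.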
The dataset defines conditions for both the rotating components and the bearings, so that each can be examined under different fault scenarios. Rotating component conditions are labeled H (healthy), M (misalignment), U (imbalance), and L (mechanical looseness). Misalignment faults have three severity levels (M1, M2, M3), corresponding to shaft displacements of 0.6 mm, 0.8 mm, and 1.0 mm, respectively. Similarly, imbalance faults have three severity levels (U1, U2, U3), corresponding to additional masses of 3 g, 4 g, and 5 g attached to the rotor disk, respectively. Bearing conditions are labeled H (healthy), B (ball fault), IR (inner ring fault), and OR (outer ring fault).
In this study, data collected at a 16 kHz sampling frequency and 1000 RPM rotational speed were used. The 16 kHz sampling frequency was preferred over 8 kHz because it doubles the usable bandwidth (the Nyquist frequency rises from 4 kHz to 8 kHz), so that high-frequency components associated with faults can be captured without aliasing and critical fault signatures are preserved. The rotational speed of 1000 RPM was selected because it represents a mid-range operating condition within the dataset: fault-related vibration amplitudes are sufficiently pronounced for reliable detection, while the excessive noise and harmonic distortions observed at higher speeds and the weak fault signatures typical of lower speeds are avoided.
Rotating component faults and bearing faults were classified separately. For the classification of rotating component faults, all data files containing the same type of rotating component fault were combined into one class, regardless of bearing condition. Each class therefore contains recordings with healthy bearings as well as recordings with ball, inner ring, or outer ring faults. Similarly, for the classification of bearing faults, all data files containing the same type of bearing fault were combined, regardless of rotating component condition. Each bearing class therefore contains recordings with healthy rotating components as well as recordings with misalignment, imbalance, or mechanical looseness.
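The grouping described above can be sketched as follows. The sketch assumes one recording per combination of bearing condition and rotating component condition, which is consistent with the 32 files per speed category but is not stated explicitly in the source; the helper name group_by is hypothetical.

```python
from itertools import product

# Condition labels for the two classification tasks.
BEARING = ["H", "B", "IR", "OR"]
ROTATING = ["H", "L", "M1", "M2", "M3", "U1", "U2", "U3"]

# Assumption: one recording per (bearing, rotating) pair, giving 4 * 8 = 32.
conditions = list(product(BEARING, ROTATING))


def group_by(pairs, position):
    """Group (bearing, rotating) pairs by the label at the given position."""
    groups = {}
    for pair in pairs:
        groups.setdefault(pair[position], []).append(pair)
    return groups


# Rotating component task: 8 classes, each mixing all bearing conditions.
rotating_task = group_by(conditions, position=1)
# Bearing task: 4 classes, each mixing all rotating component conditions.
bearing_task = group_by(conditions, position=0)

print(rotating_task["M2"])
print(len(bearing_task["IR"]))
```

Under this assumption, every rotating component class pools four recordings and every bearing class pools eight, so the condition of the other component always varies within a class.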
Subsequently, four transformation algorithms were applied to the dataset to obtain different time–frequency representations for feature extraction. First, STFT was employed to analyze the frequency components of the signals within short time windows, generating spectrogram images. Then, CWT was used to produce scalogram images, offering an adaptive frequency resolution. Additionally, the WVD, which is based on the instantaneous autocorrelation of the signal, was applied to obtain the Wigner–Ville spectrum. Finally, the HHT was used to determine the instantaneous frequency components of the signals, producing the Hilbert spectrum. The time–frequency images obtained through these transformations were used to train a deep learning-based model for machine fault diagnosis and condition monitoring.
In the time–frequency analysis conducted using STFT, a sampling frequency of 16,000 Hz was used. Spectrograms were computed with scipy.signal.spectrogram, a built-in SciPy function that computes a spectrogram via STFT [38], using a Hann window with a length of 256 samples. To maintain temporal continuity between successive frames, an overlap of 32 samples (one-eighth of the window length) was applied. The length of the fast Fourier transform (FFT) was set equal to the window length, 256 points, as a compromise between computational efficiency and frequency resolution. The resulting spectrograms were log-scaled and visualized using Gouraud shading, which smooths the rendering of the time–frequency representation.
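The STFT settings above can be sketched with scipy.signal.spectrogram as follows. The input is a synthetic stand-in signal (a 1 kHz tone plus noise), not data from the dataset, and the small constant added before the logarithm is an assumption to avoid taking the log of zero. The Hann window is passed explicitly here, since the SciPy documentation lists a Tukey window as this function's default.

```python
import numpy as np
from scipy import signal

fs = 16_000                      # sampling frequency in Hz
t = np.arange(fs) / fs           # one second of signal (16,000 samples)

# Synthetic stand-in for a vibration segment: 1 kHz tone plus weak noise.
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 1000 * t) + 0.1 * rng.standard_normal(fs)

# Hann window of 256 samples, 32-sample overlap, 256-point FFT.
freqs, times, Sxx = signal.spectrogram(
    x, fs=fs, window="hann", nperseg=256, noverlap=32, nfft=256
)

# Log scaling in decibels; 1e-12 is an assumed floor to avoid log(0).
Sxx_db = 10 * np.log10(Sxx + 1e-12)

print(Sxx_db.shape)              # (frequency bins, time frames)
print(freqs[1] - freqs[0])       # frequency resolution in Hz
```

With these settings the frequency resolution is 16,000 / 256 = 62.5 Hz. For display, a call such as matplotlib's pcolormesh(times, freqs, Sxx_db, shading="gouraud") would apply the Gouraud shading mentioned in the text.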
In the scalogram-based time–frequency analysis, CWT was carried out using the Morlet wavelet, which is known for providing a favorable trade-off between time and frequency localization. Each segment of the signal was composed of 16,000 samples. A range of scales, corresponding to wavelet widths varying from 1 to 30, was employed in the analysis. This range of scales was selected to enable the effective extraction of both low-frequency and high-frequency components from the signal, thereby ensuring a comprehensive representation of its spectral content.
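To illustrate how a scalogram over widths 1 to 30 can be formed, the sketch below implements a direct Morlet CWT with NumPy. The source does not name the implementation used, so this is a stand-in under stated assumptions: the center frequency parameter w0 = 5, the truncation of each wavelet to 10 widths, and the function names are all illustrative choices.

```python
import numpy as np


def morlet_wavelet(n_points, width, w0=5.0):
    """Complex Morlet wavelet sampled on n_points, dilated by width."""
    t = np.arange(n_points) - (n_points - 1) / 2.0
    x = t / width
    wavelet = np.pi ** -0.25 * np.exp(1j * w0 * x) * np.exp(-0.5 * x ** 2)
    return wavelet / np.sqrt(width)      # energy normalization across scales


def morlet_cwt(x, widths, w0=5.0):
    """Correlate the signal with a Morlet wavelet at every width."""
    out = np.empty((len(widths), len(x)), dtype=complex)
    for i, width in enumerate(widths):
        # Assumption: truncate the wavelet support to 10 widths.
        n_points = min(int(10 * width), len(x))
        wavelet = morlet_wavelet(n_points, width, w0)
        out[i] = np.convolve(x, np.conj(wavelet)[::-1], mode="same")
    return out


fs = 16_000
t = np.arange(fs) / fs
segment = np.sin(2 * np.pi * 500 * t)    # synthetic 16,000-sample segment

widths = np.arange(1, 31)                # wavelet widths 1 to 30
scalogram = np.abs(morlet_cwt(segment, widths))

print(scalogram.shape)
```

Small widths correspond to short, high-frequency wavelets and large widths to longer, lower-frequency ones, which is the adaptive resolution the scalogram provides.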
For the time–frequency analysis based on the HHT, EMD was applied using the mask sift approach, with the decomposition limited to a maximum of five intrinsic mode functions (IMFs). Each input segment consisted of 16,000 samples, consistent with the STFT and CWT analyses. After decomposition, the normalized Hilbert transform was applied to each extracted IMF to derive instantaneous frequency and amplitude. The resulting time–frequency representations were constructed using 150 frequency bins logarithmically spaced from 1 Hz to 8000 Hz. This configuration was chosen to enable a detailed characterization of both the low-frequency and high-frequency components present in the signal.
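The final binning step of the Hilbert spectrum can be sketched as below. This is a simplified illustration, not the procedure used in the study: the EMD step is omitted (two pure tones stand in for IMFs), and the plain Hilbert transform from SciPy replaces the normalized Hilbert transform. The function name hilbert_spectrum is hypothetical.

```python
import numpy as np
from scipy.signal import hilbert


def hilbert_spectrum(imfs, fs, n_bins=150, f_min=1.0, f_max=8000.0):
    """Accumulate instantaneous amplitude into log-spaced frequency bins.

    imfs has shape (n_samples, n_imfs). Returns (bin edges, spectrum) where
    spectrum has shape (n_bins, n_samples).
    """
    edges = np.logspace(np.log10(f_min), np.log10(f_max), n_bins + 1)
    n_samples = imfs.shape[0]
    spectrum = np.zeros((n_bins, n_samples))
    for k in range(imfs.shape[1]):
        analytic = hilbert(imfs[:, k])                # plain Hilbert transform
        amplitude = np.abs(analytic)
        phase = np.unwrap(np.angle(analytic))
        inst_freq = np.gradient(phase) * fs / (2 * np.pi)
        bin_idx = np.digitize(inst_freq, edges) - 1
        valid = (bin_idx >= 0) & (bin_idx < n_bins)   # drop out-of-range values
        np.add.at(spectrum, (bin_idx[valid], np.nonzero(valid)[0]),
                  amplitude[valid])
    return edges, spectrum


fs = 16_000
t = np.arange(fs) / fs
# Stand-ins for IMFs: a 1000 Hz tone and a 50 Hz tone.
imfs = np.column_stack([np.sin(2 * np.pi * 1000 * t),
                        np.sin(2 * np.pi * 50 * t)])

edges, spectrum = hilbert_spectrum(imfs, fs)
print(spectrum.shape)
```

Logarithmic spacing gives finer bins at low frequencies and coarser bins near 8000 Hz, so both ends of the band remain visible in a single image.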
For the WVD analysis, each original signal segment consisting of 16,000 samples was divided into smaller subsegments of 2000 samples in order to reduce the computational and memory demands typically associated with generating the full WVD matrix. If the WVD were to be computed over the entire segment, a matrix of size 16,000 × 16,000—containing 256 million elements—would be required, which is considered impractical due to significant memory constraints. By employing subsegments of 2000 samples, the matrix size was effectively reduced to 2000 × 2000, resulting in only 4 million elements and thereby allowing for more efficient and feasible computation. The time–frequency representations that were obtained through this method were subsequently normalized by scaling the absolute values to the [0, 1] range, ensuring consistency, comparability, and interpretability across different signal samples.
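The subsegmentation and normalization described above can be sketched as follows. The discrete WVD shown here is a textbook formulation written for illustration (function names are hypothetical), and the source does not specify which implementation was used. To keep the example light, the transform is demonstrated on a short 256-sample tone rather than on a full 2000-sample subsegment.

```python
import numpy as np
from scipy.signal import hilbert


def wigner_ville(x):
    """Discrete WVD of a real signal; rows are frequency bins, columns time.

    Row k corresponds to the frequency k * fs / (2 * len(x)).
    """
    z = hilbert(x)                       # analytic signal
    n = len(z)
    kernel = np.zeros((n, n), dtype=complex)
    for t in range(n):
        max_lag = min(t, n - 1 - t)      # lags that stay inside the signal
        lags = np.arange(-max_lag, max_lag + 1)
        kernel[lags % n, t] = z[t + lags] * np.conj(z[t - lags])
    # The kernel is Hermitian in the lag variable, so its FFT is real.
    return np.real(np.fft.fft(kernel, axis=0))


def normalize(wvd):
    """Scale absolute values to the [0, 1] range."""
    magnitude = np.abs(wvd)
    return magnitude / magnitude.max()


# Splitting a 16,000-sample segment into eight 2000-sample subsegments.
segment = np.random.default_rng(0).standard_normal(16_000)
subsegments = segment.reshape(-1, 2_000)
print(subsegments.shape, 2_000 * 2_000)  # matrix elements per subsegment

# Demonstration on a short 2000 Hz tone sampled at 16 kHz.
fs, n = 16_000, 256
tone = np.cos(2 * np.pi * 2000 * np.arange(n) / fs)
wvd_image = normalize(wigner_ville(tone))
print(wvd_image.shape)
```

Because the matrix grows with the square of the segment length, reducing the length by a factor of 8 reduces the number of elements by a factor of 64, from 256 million to 4 million.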
The image samples obtained are presented in Figure 1. Among these, the WVD representation initially resulted in a much larger dataset due to the subsegmentation strategy applied to handle its quadratic time and memory complexity. To ensure a fair comparison with the STFT, CWT, and HHT representations, an equal number of samples was randomly selected from the WVD dataset to create the training, validation, and testing sets. Consequently, for all four methods, a total of 2560 samples were used for each representation, distributed as 1792 for training, 384 for validation, and 384 for testing.
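The sample counts above can be reproduced with the sketch below. It rests on assumptions that the source does not state explicitly: that each of the 32 recordings is cut into non-overlapping 16,000-sample segments (which yields exactly 2560 images), that WVD produces eight subsegment images per segment, and that selection and splitting are purely random. The seed and variable names are illustrative.

```python
import numpy as np

samples_per_file = 1_280_000
segment_length = 16_000
files_per_speed = 32

# Assumed non-overlapping segmentation: 80 segments per file, 2560 in total.
segments_per_file = samples_per_file // segment_length
n_images = segments_per_file * files_per_speed

# WVD uses 2000-sample subsegments, i.e. 8 images per segment (assumed).
n_wvd_images = n_images * (segment_length // 2_000)

rng = np.random.default_rng(seed=0)      # illustrative seed

# Randomly keep as many WVD images as the other representations have.
selected = rng.choice(n_wvd_images, size=n_images, replace=False)

# Shuffle and split into 1792 / 384 / 384 (70% / 15% / 15%).
shuffled = rng.permutation(selected)
train_idx, val_idx, test_idx = np.split(shuffled, [1792, 1792 + 384])

print(n_images, n_wvd_images)
print(len(train_idx), len(val_idx), len(test_idx))
```

Under these assumptions the WVD pool is eight times larger than the other three, which is why random subsampling is needed to equalize the set sizes.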
3.2. Method
The overall structure of the ViT-based fault classification approach used in this study is illustrated in Figure 2. Vibration signals are first transformed into time–frequency representations using one of four methods: STFT, CWT, HHT, or WVD. The resulting 2D image is resized to 224 × 224 pixels and fed into a ViT architecture. The input image is divided into non-overlapping patches of size 16 × 16 pixels, flattened, and linearly projected. Positional embeddings are then added, and the embedded patches are processed through transformer encoder blocks consisting of multi-head self-attention and feed-forward (MLP) layers.
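The patch extraction and embedding steps can be sketched in NumPy as follows. The three color channels, the embedding size of 768 (the hidden size of ViT-Base), the prepended class token, and the random weights are assumptions made for illustration; in the actual model these parameters come from the pre-trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

image_size, patch_size, channels = 224, 16, 3   # channels is an assumption
embed_dim = 768                                 # ViT-Base hidden size

# Placeholder for a resized time-frequency image.
image = rng.random((image_size, image_size, channels))

# Split into a 14 x 14 grid of non-overlapping 16 x 16 patches and flatten.
grid = image_size // patch_size
patches = (
    image.reshape(grid, patch_size, grid, patch_size, channels)
    .transpose(0, 2, 1, 3, 4)
    .reshape(grid * grid, patch_size * patch_size * channels)
)

# Linear projection of each flattened patch (random placeholder weights).
projection = rng.standard_normal((patches.shape[1], embed_dim)) * 0.02
tokens = patches @ projection

# Prepend a class token and add positional embeddings (placeholders).
class_token = np.zeros((1, embed_dim))
tokens = np.concatenate([class_token, tokens], axis=0)
tokens = tokens + rng.standard_normal(tokens.shape) * 0.02

print(patches.shape, tokens.shape)
```

Each 16 × 16 × 3 patch flattens to 768 values, and the sequence passed to the encoder has one more entry than the number of patches because of the class token.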
To provide additional clarity, the detailed internal mechanism of the multi-head self-attention block is illustrated in Figure 3. In this mechanism, each embedded patch is first linearly projected into three distinct representations: queries (Q), keys (K), and values (V). The dot product between Q and K determines the similarity (attention score) between patches, indicating how much focus each patch should receive relative to others. These scores are scaled and passed through a softmax function to obtain attention weights, which are then used to combine the V vectors. Multiple attention heads perform this operation in parallel, capturing diverse relationships among patches. The outputs from all heads are concatenated and passed through a linear transformation to generate the final attention output. This process allows the model to effectively learn global dependencies between different fault-related features in the time–frequency representations.
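The attention computation described above can be written compactly in NumPy. This is a minimal sketch with random placeholder weights and no bias terms; the sequence length of 197 tokens, the dimension of 768, and the 12 heads follow the standard ViT-Base configuration and are assumptions with respect to the source.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, n_heads):
    """Scaled dot-product self-attention over several heads (no biases)."""
    n_tokens, dim = tokens.shape
    head_dim = dim // n_heads

    def split_heads(m):
        # (n_tokens, dim) -> (n_heads, n_tokens, head_dim)
        return m.reshape(n_tokens, n_heads, head_dim).transpose(1, 0, 2)

    q, k, v = (split_heads(tokens @ w) for w in (w_q, w_k, w_v))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)   # scaled Q.K^T
    weights = softmax(scores, axis=-1)                      # attention weights
    heads = weights @ v                                     # weighted sum of V
    concat = heads.transpose(1, 0, 2).reshape(n_tokens, dim)
    return concat @ w_o, weights


rng = np.random.default_rng(0)
n_tokens, dim, n_heads = 197, 768, 12
tokens = rng.standard_normal((n_tokens, dim))
w_q, w_k, w_v, w_o = (rng.standard_normal((dim, dim)) * 0.02 for _ in range(4))

output, weights = multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, n_heads)
print(output.shape, weights.shape)
```

Each row of the attention weights sums to one, so every output token is a convex combination of the value vectors within its head.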
The output token is finally passed to a classification head (MLP head) to predict the corresponding fault class. A pre-trained ViT-Base model was employed, and a transfer learning strategy was applied. To reduce computational cost and training time, all layers were frozen except for the final classification head, which was fine-tuned on the target dataset. For visual clarity, only 9 patches are shown in Figure 2; in practice, the input image is divided into 14 × 14 = 196 patches.
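To give a sense of how few parameters remain trainable when only the head is fine-tuned, the sketch below counts the weights of a classification head for both tasks. It assumes the head is a single linear layer on the 768-dimensional ViT-Base output, which is common for pre-trained ViT models but is not confirmed by the source; the function name is hypothetical.

```python
HIDDEN_DIM = 768   # ViT-Base output dimension


def linear_head_parameters(n_classes, hidden_dim=HIDDEN_DIM):
    """Weights plus biases of a single linear classification layer."""
    return hidden_dim * n_classes + n_classes


# Trainable parameters per task when every other layer is frozen (assumed
# single linear head).
trainable = {
    "rotating component (8 classes)": linear_head_parameters(8),
    "bearing (4 classes)": linear_head_parameters(4),
}

for task, count in trainable.items():
    print(f"{task}: {count:,} trainable parameters")
```

Under this assumption, only a few thousand parameters are updated during fine-tuning, compared with roughly 86 million in the full ViT-Base backbone.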
To evaluate the model’s performance, two independent classification tasks were conducted. In the first task, an eight-class rotating component fault classification (H, L, M1, M2, M3, U1, U2, U3) was performed, and in the second a four-class bearing fault classification (H, B, IR, OR) was carried out. In both tasks, the samples were intentionally constructed to include varying conditions of the other component. That is, bearing fault samples were collected under different rotating component states, while rotating component fault samples were collected under varying bearing conditions. This structure ensured that the model learned to identify fault types independently of variations in other mechanical components, allowing for a more realistic and robust evaluation.