1. Introduction
Gearboxes are widely used in automobiles, industrial machinery, wind turbines, and electric motors to transfer power from one rotating component to another [
1]. A gearbox consists of a set of gears that control torque and rotational speed according to the gear ratio. A gearbox can be manual, automatic, or semi-automatic [
2]. Gearbox failure can result in a number of issues, such as loss of power transfer, mechanical damage, overheating, sudden stoppage, lower efficiency, safety hazards, increased vibration [
3,
4], and many others. A number of variables, such as inadequate lubrication, contamination, excessive stress or overloading, improper gear alignment, poor components, and wear and tear (aging), can cause the gearbox to fail [
5]. A gearbox can be prevented from failing by employing several preventative measures, such as ensuring the gears are properly lubricated, checking the lubricant frequently, avoiding overloading, preventing overheating, performing routine inspections, and so on. These methods fall under the umbrella of preventive maintenance, which involves inspecting and replacing components on a regular basis [
6]. Some of these methods are time-consuming, while others are not cost-effective.
Predictive maintenance is a method that uses condition monitoring, data analytics, and ML to predict equipment or a machine’s Remaining Useful Life (RUL) [
7]. The RUL can be used to estimate when a component needs to be repaired or replaced. This method offers several advantages, including reduced downtime, cost savings, increased equipment useful life, high safety, and data-driven decision-making ability [
8]. A number of steps, such as data collection, data analysis and modelling, condition evaluation, and maintenance decision-making, can be employed to perform predictive maintenance for gearbox condition monitoring. A gearbox’s predictive maintenance can be efficiently performed by collecting vibration, temperature, lubricant condition, acoustic, load, and torque data [
9]. An appropriate sensor can be used to record the data for each data type. The collected data can be processed using several methods for extracting useful insights. The processing methods can be classified as signal processing-based, thresholding, or AI-based methods [
10]. For efficient gearbox defect prediction, AI-based techniques generally require huge datasets; however, they often rely less on explicit domain knowledge than physics-based methods. On the other hand, the thresholding-based method uses thresholding logic on a signal parameter, such as amplitude, to monitor the condition of the gearbox [
11]. Signal processing-based methods, in turn, analyze the signals in greater depth, especially through visualizations such as the frequency response and signal fluctuations [
12].
A problem with continuously monitored machines is that the collected signals span long intervals: data gathered through continuous monitoring accumulates over months or years. Because of its high dimensionality, analyzing such long-interval data with ML is challenging. The goal of this research is to analyze such data using a moving window approach, in which a subset of the signal is selected for analysis at a time. The objective is to evaluate ML classifiers and determine whether increasing the moving window size used for feature extraction improves their performance.
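As a minimal sketch of the idea (the function name and the non-overlapping stride are our assumptions, not taken from the paper), a moving window slides over a long 1-D signal so that only a small subset of samples is held and processed at each step:

```python
import numpy as np

def moving_windows(signal, window_size):
    """Yield consecutive, non-overlapping windows of a long 1-D signal.

    Only `window_size` samples are handled per step, so a months-long
    recording never has to be analyzed in one piece.
    """
    n_windows = len(signal) // window_size  # trailing remainder is dropped
    for i in range(n_windows):
        yield signal[i * window_size:(i + 1) * window_size]

# Example: one second of a 20 kHz signal split into 300-sample windows.
signal = np.random.default_rng(0).normal(size=20_000)
windows = list(moving_windows(signal, 300))
print(len(windows), windows[0].shape)  # 66 windows, each of shape (300,)
```

Each window can then be summarized by a handful of statistics instead of feeding all raw samples to a classifier, which is what keeps the dimensionality manageable.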
1.1. Motivation
In manufacturing and industrial applications, the gearbox is responsible for transmitting power from the motor to other machine components [
13]. The rate of power transfer from the motor to other components is controlled by controlling the rotational speed and torque of the gears inside the gearbox [
14]. A defective gearbox could cause issues such as power loss, more energy utilization, and inefficient machine performance [
15]. A fault-free gearbox can increase the rate of production in manufacturing and industrial operations by transmitting power more effectively and consuming energy more efficiently. Therefore, maintaining a gearbox in the best possible condition could result in increased efficiency, reduced downtime, improved product quality, more safety, and cost savings. Following a safety inspection, a gearbox can be preserved to remain in good working condition [
16].
1.2. Problem Statement
A gearbox’s condition can be monitored using gearbox operational data. When a gearbox is operating, a number of parameters, including vibration, temperature, speed, and torque, can be monitored [
17]. An appropriate method can be selected to process the collected data for extracting useful insights from the collected operational data [
18]. In most cases, the data is collected as signals, which can be processed in the time domain, frequency domain, or time-frequency domain. These domain-based methods require domain knowledge, or a domain specialist, to differentiate between faulty and healthy data samples. Moreover, gearboxes operating in different environments exhibit different healthy and faulty signal patterns, each requiring its own domain expertise.
These domain-based techniques do not allow predictive monitoring, because they require an analyst to continuously watch the signals for faults. They also fail to predict the RUL. In the era of digital transformation, Industry 4.0, Industry 5.0, and other technological innovations, where processes are moving from manual to automated operation, such human-dependent monitoring is error-prone, making existing techniques time-consuming, costly, and insufficient [
19].
1.3. Our Contribution
This research analyzed a publicly available benchmark dataset [
20] to enable an AI-based gearbox fault detection system. The dataset was collected by recording vibration data from a drivetrain under various load sizes in both healthy and faulty scenarios. The vibration data was recorded using four sensors placed on a gearbox at four different positions. These vibration signals were further processed for feature extraction, and the extracted features were used to train ML classifiers for gearbox fault detection. The following are the research contributions (RCs) of this paper.
- RC-1:
Moving window: In this research, the dataset was processed using a moving window approach for feature extraction. The moving window size was set to six values (300, 400, 500, 600, 700, and 800 raw vibration samples), chosen to examine how increasing temporal context affects the stability of statistical feature extraction under varied load conditions. This resulted in six datasets, one per moving window size.
- RC-2:
Statistical feature extraction: During the feature extraction, ten statistical features were extracted from each moving window (moving window sizes 300, 400, 500, 600, 700, and 800). These features were the mean, standard deviation, variance, skewness, kurtosis, peak-to-peak (P2P), crest factor, shape factor, impulse factor, and root mean square (RMS).
- RC-3:
Data partitioning: This research methodology is validated using five-fold cross-validation, which allows for reliable performance evaluation. This approach improves generalization, decreases bias from a single train–test split, and allows a more reliable evaluation of model performance for real-world applications.
- RC-4:
ML model performance: This study compares the performance of seven ML classifiers based on accuracy, precision, and recall for gearbox fault detection. The ML classifiers were Decision Tree, GBC, KNN, Logistic Regression, Naïve Bayes, Random Forest, and SVM.
The experimental results revealed that the SVM, Logistic Regression, and GBC outperformed the other classifiers. The experimental results also revealed that using a larger moving window size for feature extraction improved the performance of ML classifiers.
The rest of the paper is organized as follows:
Section 2 describes the background and literature review. The proposed methodology, which includes the data description, experimental setup, preprocessing, ML classifiers, and evaluation metrics, is discussed in
Section 3. The results are presented in
Section 4, and the discussion is given in
Section 5. Finally, the paper is concluded in
Section 6, followed by recommendations for future work.
3. Proposed Framework
This section presents the proposed framework, where
Figure 2 shows the overall methodology diagram. Initially, the vibration data from the gearbox was collected using four sensors and was obtained in a dataset from Kaggle [
20]. The features were extracted from these raw signals using a moving window-based method, with window sizes of 300, 400, 500, 600, 700, and 800 raw vibration signal samples. The data in this dataset was collected at a sampling frequency of 20 kHz, which gives window durations of 15 ms for 300 samples, 20 ms for 400 samples, 25 ms for 500 samples, 30 ms for 600 samples, 35 ms for 700 samples, and 40 ms for 800 samples. For each moving window size, one dataset of features extracted from the vibration signals was created. The k-fold cross-validation method was used to partition the resulting datasets, with k = 5, so that each fold was used exactly once as a test set while the remaining folds were used for training, and performance was averaged across all folds.
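The sample-count-to-duration conversion is simple arithmetic (duration = window size / sampling rate); a quick check reproduces the durations quoted above:

```python
FS = 20_000  # sampling frequency of the dataset, in Hz (20 kHz)

# Duration of each window in milliseconds: samples * 1000 / Hz.
durations = {w: w * 1000 / FS for w in (300, 400, 500, 600, 700, 800)}
print(durations)  # {300: 15.0, 400: 20.0, 500: 25.0, 600: 30.0, 700: 35.0, 800: 40.0}
```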
3.1. Data Description
The gearbox fault diagnosis dataset [
20] was used in this study, which was acquired from Kaggle. The data in the dataset was collected by placing vibration sensors on SpectraQuest’s Gearbox Fault Diagnostics simulator with a sampling frequency of 20 kHz, as shown in
Figure 3. To measure vibration signals along four different directions, four vibration sensors were placed at different positions. The data was collected in two scenarios: healthy and broken-tooth (faulty) conditions. During data collection, the load was varied from 0% to 90% in 10% steps, i.e., 0%, 10%, 20%, and so on. The collected files and their dimensions are summarized in
Table 2, where the first number is the number of samples in the file and the second number is the number of attributes (one per sensor, for a total of four sensors).
3.2. Experimental Environment
The experiments in this study were performed on a powerful laptop, the Alienware Core i9 12th generation [
37]. The laptop’s processor ran at a clock rate of 2.50 GHz, i.e., 2.50 billion clock cycles per second. The laptop was equipped with an SSD and a 64-bit operating system running Windows 11 version 23H2. It also included an NVIDIA GeForce RTX 3080 Ti Graphics Processing Unit (GPU), which allows faster training of ML models. Python version 3.10.9 was used for the experimentation, covering everything from preprocessing to training the ML models. All programming was carried out in a Jupyter notebook.
3.3. Preprocessing
The preprocessing pipeline employed in this case study is shown in
Figure 4. Initially, a single file was picked from the dataset for feature extraction. Features are used to train ML models, which is very helpful when the data is high-dimensional. As mentioned in
Section 3.1, the vibration data from the gearbox was acquired using four sensors; therefore, each file includes readings from all four. For feature extraction, we used a moving window-based technique with window sizes of 300, 400, 500, 600, 700, and 800 raw vibration signal samples, which correspond to 15 ms, 20 ms, 25 ms, 30 ms, 35 ms, and 40 ms at a sampling frequency of 20 kHz. The moving window-based approach is helpful because it reduces noise, improves temporal resolution, captures local fluctuations, and exposes short-term patterns that may be missed if the whole signal is considered at once. The window size was varied to capture gradually larger portions of the vibration signal. Given that the rotational frequency of a gearbox shaft is usually between 10 Hz and 50 Hz (periods of 20 ms to 100 ms), these window sizes ensure that each window covers a significant portion of, or up to several cycles of, the shaft rotation. Longer windows improve the stability and physical relevance of the extracted statistical features, whereas the shortest window (300 samples, 15 ms) was employed to evaluate model sensitivity under low temporal aggregation. The moving window size was initially set to 300, so 300 raw vibration signal samples were read from each sensor, and each window was used for feature extraction. A total of 40 features were generated by extracting 10 statistical features from each sensor’s window, i.e., 10 features per sensor. The extracted features were the mean (μ), root mean square (RMS), standard deviation (σ), variance (σ²), skewness, kurtosis, peak-to-peak value (P2P), crest factor (CF), shape factor (SF), and impulse factor (IF). As each file also has a load value and a class label, the extracted 40 features, the load value, and the class label together formed a single record in the dataset corresponding to a moving window size of 300 raw vibration signal samples. This feature extraction process was repeated, taking 300 more raw vibration signal samples each time, until the entire file and dataset were processed. At the end, a dataset corresponding to the moving window size of 300 was formed by combining the extracted features from healthy and faulty vibration signals. For the next iteration of experiments, the moving window size was set to 400, so 400 raw vibration signal samples from each sensor were considered for feature extraction at a time. The same procedure was then repeated with window sizes of 500, 600, 700, and 800. In all experiments, ten features were extracted from each sensor’s data, resulting in a total of 40 features (the same ten feature types, computed once per sensor). The extracted features are defined in the equations below, from (
1) to (
10). These statistical features were retrieved because they are most suited to capturing changes in signal properties caused by mechanical issues such as tooth cracking, gear wear, pitting, and misalignment [
38]. The gearbox vibration signal’s variability, amplitude, and impulsive characteristics could be captured by such features.
- 1.
Mean: The mean is an important statistical feature because of its use in describing central tendency, in feature normalization, and in data imputation [
39]. The mean represents the average value of the signal and provides a measure of central tendency. Moreover, normalizing or scaling the features to have a mean of zero and a standard deviation of one may improve the performance of ML models, especially those sensitive to feature scale. A dataset may also contain missing values for a number of reasons, in which case the mean can be used to fill them in. In our case, we computed each moving window’s mean value and used it as a feature to train ML models, because a signal’s mean value indicates its average vibration intensity over time, a helpful indicator of steady-state misalignment and load imbalance. Since we have four sensor readings and process four moving windows at the same time, we obtained four mean values, one per sensor moving window.
- 2.
RMS: The RMS value is an important metric that weights the magnitude of an error or fluctuation, giving a higher weight to the higher deviations [
40]. As indicated in Equation (
2), it is measured by first squaring the signal values, then averaging these squared values, and finally obtaining the square root of the average. The RMS value represents the total vibration energy, which could indicate gearbox fault severity and overall mechanical degradation.
- 3.
Standard deviation: The standard deviation is an important measure because it indicates how far a data point deviates from the mean value [
41]. A low standard deviation indicates that the data points lie close to the mean, whereas a high standard deviation indicates that they are spread out around the mean. Understanding the data distribution, consistency, and variability supports better model development and evaluation. The standard deviation is computed by taking the squared deviations of the data points from the mean, averaging them, and then taking the square root, as indicated by Equation (
3). The standard deviation of a signal indicates amplitude dispersion, and in the case of a gearbox fault, it is sensitive to surface degradation and wear.
- 4.
Variance: The variance is another metric that measures the degree to which the data points in a dataset deviate from the mean [
42]. The variance is calculated as the average of the squared deviations between each data point and the mean, as indicated in Equation (
4). A higher variance value indicates that the data points are more widely distributed in the dataset. The variance of a signal reflects its energy fluctuation, which, in the case of a gearbox fault, can increase with wear and irregular gear contact.
- 5.
Skewness: Skewness is another metric for measuring the asymmetry of a data distribution [
40]. It can determine whether data points are positively or negatively skewed. Equation (
5) can be used to calculate skewness. A positive skewness value indicates that the tail is longer on the right side (right-skewed), whereas a negative value indicates that the tail is longer on the left (left-skewed). A balanced distribution has a skewness of zero, as in the normal distribution. Skewness captures the asymmetry of the signal distribution, which can help detect localized faults that cause uneven impacts in gearbox vibration signals.
- 6.
Kurtosis: The kurtosis measures the sharpness of a probability distribution’s peak [
43]. It is useful in determining the variation caused by extreme values (outliers) in the dataset. A high kurtosis value indicates a heavy-tailed distribution with more outliers, whereas a low value indicates light tails and fewer outliers. A kurtosis near that of the Gaussian distribution suggests approximately normal data. The kurtosis value can be computed via Equation (
6). Kurtosis describes the impulsiveness and sharp transients in a signal, making it a highly sensitive indicator of early-stage faults in gearbox vibration signals.
- 7.
P2P: The P2P value measures the difference between the maximum and minimum values in the signal [
44]. In our case, we measured the P2P value of each moving window. It is useful in capturing the variability, detecting outliers, and performing normalization. Equation (
7) is used to measure the P2P value. In this research, it can be used to indicate the range of vibration extremes, which is useful for detecting abnormal gear meshing and backlash in gearbox vibration signals.
- 8.
CF: The ratio of a signal’s peak value to its RMS value is known as the CF [
45]. The CF relates the signal’s peak amplitude to its overall energy. Equation (
8) can be used to calculate the CF. The CF in gearbox vibration signals highlights the impact severity regardless of signal energy.
- 9.
SF: The SF can be used to characterize signal distribution. It is defined as the ratio of the RMS value to the mean absolute value [
46] and can be calculated with Equation (
9).
- 10.
IF: The IF can be used to detect sharp peaks in a signal. It is defined as the ratio of the maximum absolute value of the signal to its mean absolute value [
45]. It is highly useful for fault detection, signal processing, and vibration analysis. The IF can be calculated via Equation (
10).
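The ten features above can be sketched as follows. This is a hedged illustration rather than the paper’s exact code: the conventions chosen here (population variance, moment-ratio skewness and kurtosis, absolute-mean denominators for SF and IF) are standard but assumed:

```python
import numpy as np

def window_features(w):
    """Ten statistical features of one sensor window (standard definitions)."""
    w = np.asarray(w, dtype=float)
    mean = w.mean()
    rms = np.sqrt(np.mean(w ** 2))
    std = w.std()                               # population standard deviation
    var = w.var()                               # population variance
    skew = np.mean((w - mean) ** 3) / std ** 3  # third standardized moment
    kurt = np.mean((w - mean) ** 4) / std ** 4  # fourth standardized moment
    p2p = w.max() - w.min()
    peak = np.max(np.abs(w))
    mean_abs = np.mean(np.abs(w))
    return {"mean": mean, "rms": rms, "std": std, "var": var,
            "skewness": skew, "kurtosis": kurt, "p2p": p2p,
            "crest": peak / rms,        # CF: peak over RMS
            "shape": rms / mean_abs,    # SF: RMS over mean absolute value
            "impulse": peak / mean_abs} # IF: peak over mean absolute value

# One 300-sample window with readings from four sensors (synthetic stand-in):
rng = np.random.default_rng(0)
window = rng.normal(size=(300, 4))
record = {f"{name}_S{s + 1}": value
          for s in range(4)
          for name, value in window_features(window[:, s]).items()}
print(len(record))  # 40 features per record, as in the paper
```

Appending the file’s load value and class label to such a record yields one row of the dataset for a given window size.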
3.4. Data Distribution
The number of samples generated in each dataset is dependent on the window size, because different sizes of moving windows are used for feature extraction. The dataset corresponding to moving window size 300 has 6676 samples, with 3328 healthy and 3348 faulty samples. The dataset with a moving window size of 400 includes 5006 samples, 2496 of which are healthy and 2510 of which are faulty. Similarly, the dataset with a moving window size of 500 comprises 3998 samples, with 1994 healthy samples and 2004 faulty samples. As seen in
Figure 5, increasing the size of the moving window reduces the number of samples in the corresponding dataset. This is because a larger window consumes more raw data points per step, so fewer moves are required to traverse each sensor’s complete readings, and features are extracted once per move.
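This inverse relationship can be checked with integer arithmetic. The raw length below is back-calculated from the reported 6676 records at window size 300 and is therefore an assumption; real windows do not cross file boundaries, so the reported counts for larger windows differ slightly:

```python
# Approximate raw signal length implied by 6676 records at window size 300.
N_RAW = 6676 * 300

# Non-overlapping window counts for each window size.
counts = {w: N_RAW // w for w in (300, 400, 500, 600, 700, 800)}
print(counts)  # monotonically decreasing as the window grows
```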
An imbalanced dataset can also affect the ML model’s performance. An imbalanced dataset is a dataset that has fewer samples from a few classes and more samples from other classes, causing the resulting model to fail to generalize well. The imbalanced dataset-based trained model could face several challenges, such as bias toward the majority class, poor generalization, and misleading accuracy [
47]. Several data-balancing techniques, including undersampling, oversampling, and SMOTE, can be used to address the problem of an imbalanced dataset [
48]. Apart from data balancing, evaluation metrics other than accuracy, such as precision, recall, F1 score, and the AUC-ROC curve, can be used to evaluate the performance of a model trained on an imbalanced dataset. Therefore, we checked whether the generated datasets corresponding to the moving window sizes of 300, 400, 500, 600, 700, and 800 were balanced or not.
Figure 5 indicates that these datasets were balanced.
To ensure a robust and unbiased evaluation of the models, five-fold cross-validation (k = 5) was used instead of a predefined train–test split (e.g., a 90:10, 80:20, or 70:30 ratio). Fixed train–test splits were infeasible because larger window sizes (15–40 ms) substantially reduce the total number of samples in each dataset, which can leave very small test sets. In five-fold cross-validation, each dataset was divided into five folds; in each iteration, one fold acted as the test set while the remaining four were used for training. This procedure continued until each fold had been tested once, after which the performance metrics were averaged across all folds. This method ensured that all samples were analyzed, avoided temporal data leakage, and provided more reliable estimates of model generalization than ratio-based splits.
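A minimal sketch of this partitioning with scikit-learn’s `cross_validate`, using random stand-in data with the paper’s 40-features-plus-load shape (the real features come from Section 3.3):

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Hypothetical stand-in data: 400 records of 40 features + load, binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 41))
y = rng.integers(0, 2, size=400)

# Five folds: each fold serves as the test set exactly once,
# and the per-fold metrics are averaged at the end.
scores = cross_validate(SVC(), X, y, cv=5,
                        scoring=("accuracy", "precision", "recall"))
print({k: round(v.mean(), 3) for k, v in scores.items()
       if k.startswith("test_")})
```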
3.5. Classification Models
3.5.1. Decision Tree
A Decision Tree is a supervised ML classifier used for solving classification problems. It has a flowchart-like design that includes concepts such as root node, internal nodes, leaf nodes, splitting criteria, maximum depth, and pruning [
49]. The dataset is divided at each node according to the attribute values that best split the dataset. The root node (topmost node) in the tree splits the entire dataset based on the attribute that best splits the dataset on the selected criterion. Variance reduction, entropy, and Gini impurity are possible splitting criteria [
50]. The Gini impurity measures the level of class impurity in classification tasks. A dataset is considered purer if its Gini impurity score is low, showing that it mostly consists of a single class. Entropy, in contrast, is used as a splitting criterion in information gain-based Decision Trees (such as ID3); it measures the degree of uncertainty in a dataset and decreases as the classes are separated. Variance reduction is used with Decision Trees when solving regression problems: it selects splits that minimize the variance within nodes, resulting in more homogeneous groups. The max depth specifies the maximum number of splits or levels in a tree. A tree with excessive depth can overfit, becoming overly complicated and merely memorizing the training data. Pruning, which removes some nodes or subtrees, is one approach to dealing with overfitting in Decision Trees.
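As a concrete illustration of the Gini criterion (a standalone sketch, not the paper’s code), the impurity of a node is 1 minus the sum of squared class proportions:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a node's class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["healthy"] * 10))                  # 0.0, a pure node
print(gini_impurity(["healthy"] * 5 + ["faulty"] * 5))  # 0.5, maximally mixed
```

A split is chosen to minimize the weighted impurity of the resulting child nodes.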
3.5.2. SVM
SVM is a supervised ML classifier that is commonly used for solving classification problems. It is efficient and robust for high-dimensional data as well as for linear and non-linear classification problems [
51]. The hyperplane, support vectors, margin, and kernel are some of the key concepts of SVM [
52]. SVM works by constructing a hyperplane that best separates the data points into their respective classes. In 2D the hyperplane is a line, in 3D it is a plane, and in larger feature spaces it is a higher-dimensional boundary. The best hyperplane is the one that maximizes the distance between itself and the closest data point of each class. The support vectors are the data points nearest to the hyperplane; they are important because they define both the margin and the hyperplane. The margin is the distance between the hyperplane and the nearest data points of any class. SVM maximizes this margin, thereby increasing the separation between the class data points and the hyperplane, which improves classification accuracy. When the data is non-linearly separable, the SVM maps the input into a higher-dimensional space using a kernel function [
53], allowing for a linear separation.
3.5.3. Random Forest
Random Forest is a supervised ML classifier that can solve classification problems. It is an ensemble learning technique, as it combines multiple Decision Trees to improve results and creates a model that performs better than a single Decision Tree model [
54]. Random Forest uses a subset of the training data to train the Decision Tree model, which is selected with replacement. This ensures that each tree is trained on a slightly different dataset, thus reducing overfitting. The final classification is determined by majority voting. It combines the predictions of each tree in the forest and then uses majority voting to obtain the final outcome [
55].
3.5.4. KNN
KNN is a non-parametric approach that can be applied to both regression and classification problems. KNN predicts the class of a new data point from the labels of the k closest data points in the training dataset. The key parameter in KNN is ‘k’, which defines the number of nearest neighbors to consider. A smaller value of ‘k’ may make the model more vulnerable to noise, while a larger value of ‘k’ smooths the predictions, possibly blurring class boundaries. The class label is assigned based on a majority vote of the ‘k’ nearest neighbors [
56].
3.5.5. Logistic Regression
Logistic Regression is a popular supervised learning algorithm that works as a classifier and is best suited to binary classification tasks. Despite its name, it predicts categorical outputs rather than continuous values. Logistic Regression uses the sigmoid function to map predictions to a probability score between 0 and 1 [
56]. It uses the binary cross-entropy or log loss function, which is refined throughout training. The goal is to find the weights that minimize the average log loss across all instances.
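A tiny worked example of the sigmoid and the per-sample log loss it feeds (the helper names are ours):

```python
import math

def sigmoid(z):
    """Map a raw score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y_true, p):
    """Binary cross-entropy for a single sample with true label y_true."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

p = sigmoid(2.0)                  # a fairly confident "positive" score
print(round(p, 3))                # 0.881
print(round(log_loss(1, p), 3))   # small loss for a correct, confident prediction
```

Training adjusts the weights that produce the raw score `z` so that the average log loss over all samples is minimized.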
3.5.6. Naïve Bayes
Naïve Bayes is a probabilistic algorithm used in supervised learning settings. It applies Bayes’ theorem under the naïve assumption that all features are conditionally independent of one another given the target class [
57]. It uses Bayes’ theorem to compute the probability of the target class from prior knowledge of conditions related to that class.
3.5.7. GBC
GBC combines many weak classifiers (mostly Decision Trees) to create a strong classifier through ensemble learning [
58]. In GBC, each new classifier is trained to correct the errors of the classifiers built before it, gradually improving the ensemble’s overall prediction.
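The seven classifiers above can be compared in a few lines with scikit-learn. The default hyperparameters and the synthetic stand-in data below are our assumptions, since the paper’s exact settings and features are not restated here:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# The seven classifiers compared in this study (default hyperparameters).
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "GBC": GradientBoostingClassifier(random_state=0),
}

# Hypothetical stand-in features; the real X comes from Section 3.3.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 41))
y = (X[:, 0] + 0.1 * rng.normal(size=300) > 0).astype(int)

# Five-fold cross-validated accuracy for each model.
results = {name: cross_val_score(m, X, y, cv=5, scoring="accuracy").mean()
           for name, m in models.items()}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```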
3.6. Evaluation Metrics
The proposed solution was evaluated based on accuracy, precision, and recall. The F1 score was not considered, because it is the harmonic mean of precision and recall, and we already considered precision and recall for our comparison. In Equations (
11)–(
13), the True Positives (TPs) are samples that the model correctly predicts as positive samples, whereas True Negatives (TNs) are samples that are correctly predicted as negative samples by the model. On the other hand, False Positives (FPs) are samples incorrectly predicted by the model as positive samples that are actually negative samples. Similarly, False Negatives (FNs) are samples that the model incorrectly predicted as negative but are actually positive samples.
3.6.1. Accuracy
Accuracy is a metric used for evaluating the performance of an ML classifier. It calculates the percentage of correctly classified samples with respect to the total samples and can be calculated using the equation given below (
11).
3.6.2. Precision
Precision measures the ratio of correctly predicted positive samples to all positive predictions made by the model, as computed using Equation (
12):
3.6.3. Recall
Recall is a measure of how many actual positive samples the model accurately predicts as positive. Recall can be calculated using the equation given below (
13).
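For completeness, the standard definitions behind Equations (11)–(13) can be written as:

```latex
\begin{align}
\mathrm{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \tag{11}\\
\mathrm{Precision} &= \frac{TP}{TP + FP} \tag{12}\\
\mathrm{Recall}    &= \frac{TP}{TP + FN} \tag{13}
\end{align}
```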
4. Results
4.1. Exploratory Data Analysis
The exploratory data analysis of the datasets obtained using moving window sizes of 300, 400, 500, 600, 700, and 800 from the gearbox fault dataset revealed several noteworthy characteristics of the data. Correlation analysis was employed to identify how strongly each attribute (feature) in the feature set relates to whether a feature vector is healthy or faulty.
Figure 6 presents the correlation values of the features, sorted in descending order, for each dataset corresponding to a moving window size. The bar height represents the feature’s correlation with the class label. The correlation values for moving window sizes from 300 to 800 (15 ms to 40 ms) are presented in
Figure 6a–f, with each bar representing a feature’s correlation value. The features peak-to-peak, RMS, standard deviation, variance, shape factor, kurtosis, and impulse factor—all computed from sensor 1 (designated as _S1)—have consistently high correlations across all datasets, whereas the correlation values of the remaining features can be observed from
Figure 6a–f.
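A sketch of this correlation ranking, using a small hypothetical DataFrame in place of the real 40-feature dataset (the column names and the synthetic label are our assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in frame; the real one holds the 40 features plus the
# class label produced in Section 3.3.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 3)),
                  columns=["RMS_S1", "P2P_S1", "Mean_S2"])
df["label"] = (df["RMS_S1"] + 0.5 * rng.normal(size=500) > 0).astype(int)

# Absolute Pearson correlation of each feature with the class label,
# sorted in descending order as in Figure 6.
corr = df.corr()["label"].drop("label").abs().sort_values(ascending=False)
print(corr)  # RMS_S1 ranks highest, since the label was derived from it
```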
4.2. Confusion Matrix
The number of True Positives (TPs), True Negatives (TNs), False Positives (FPs), and False Negatives (FNs) is a reliable indicator of how well a method performs across different evaluation metrics. Although the percentages are reported via accuracy, precision, and recall, the raw counts are also informative to examine.
Table 3 presents the number of TPs, TNs, FPs, and FNs.
4.3. Accuracy
Table 4 presents the accuracy scores of the classifiers trained using five-fold cross-validation on the datasets corresponding to moving window sizes of 300, 400, 500, 600, 700, and 800. At a moving window size of 300 (equivalent to a 15 ms window), the SVM outperformed all other classifiers with an accuracy score of 99.51%, while the GBC, Logistic Regression, and Random Forest achieved accuracy scores of 99.18%, 99.13%, and 99.07%, respectively, outperforming the Decision Tree, Naïve Bayes, and KNN. The Naïve Bayes classifier performed the worst, with an accuracy score of 87.07%. When the dataset with a moving window size of 400 (equal to 20 ms) was used, all of the classifiers’ accuracy scores improved slightly. The SVM again outperformed all other classifiers with an accuracy score of 99.80%, up from 99.51%. The Logistic Regression, GBC, and Random Forest classifiers also improved their accuracy scores to 99.66%, 99.50%, and 99.36%, up from 99.13%, 99.18%, and 99.07%, respectively. The weaker classifiers, KNN, Decision Tree, and Naïve Bayes, improved their accuracy scores to 98.76%, 97.38%, and 90.25% from 97.98%, 91.21%, and 87.07%, respectively.
The same trend of improved classifier accuracy was observed when the corresponding dataset with a moving window size of 500 (equivalent to 25 ms) was employed. The highest accuracy score of 99.90% was achieved by Logistic Regression, up from its prior score of 99.66%. The SVM, which previously achieved the highest accuracy score of 99.80%, continued to perform at that level. The GBC, Random Forest, and KNN also improved their accuracy scores from 99.50%, 99.36%, and 98.76% to 99.57%, 99.55%, and 99.02%, respectively, surpassing Decision Tree and Naïve Bayes. The weaker Naïve Bayes and Decision Tree classifiers likewise improved their accuracy scores from 90.25% and 97.38% to 92.37% and 97.85%, respectively. When the dataset with a moving window size of 600 (equal to 30 ms) was then used, the SVM achieved an accuracy score of 100%, an improvement over its previous score of 99.80%, whereas Logistic Regression, Random Forest, and GBC improved their accuracy scores to 99.94%, 99.76%, and 99.70%, respectively, from their previous scores of 99.90%, 99.55%, and 99.57%. Similarly, KNN and Decision Tree improved their results from 99.02% and 97.85% to 99.25% and 98.08%, respectively, outperforming Naïve Bayes. The Naïve Bayes classifier, still the weakest with an accuracy score of 94.66%, improved from its previous score of 92.37% upon increasing the moving window size from 500 to 600 (25 ms to 30 ms).
However, when the moving window size was extended from 600 (30 ms) to 700 (35 ms) and 800 (40 ms), the accuracy scores of some classifiers improved, while others declined marginally. Of all the classifiers, Logistic Regression improved its accuracy score from 99.94% to 100% at moving window sizes of 700 and 800. At moving window sizes of 700 (35 ms) and 800 (40 ms), the Naïve Bayes accuracy score increased from 94.66% to 96.22% and 97.72%, respectively, though it remained the weakest performer. At window sizes of 700 and 800, the SVM's accuracy score dropped from 100% to 99.96%. Decision Tree, Random Forest, and GBC also showed slightly reduced accuracy scores upon increasing the window size to 700 and 800.
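The five-fold evaluation protocol underlying these accuracy scores can be sketched as follows; this is a hedged illustration using scikit-learn and a synthetic stand-in for the windowed feature dataset (the real features, classes, and hyperparameters are not reproduced here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Five-fold cross validation of two of the classifiers on synthetic data
# standing in for the features extracted at one moving window size.
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           random_state=0)
for name, clf in [("SVM", SVC()),
                  ("Logistic Regression", LogisticRegression(max_iter=1000))]:
    model = make_pipeline(StandardScaler(), clf)   # scale, then classify
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.4f} (+/- {scores.std():.4f})")
```

Repeating this loop once per window-size dataset yields a table of the same shape as Table 4.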
4.4. Precision
Table 5 presents the precision scores resulting from Random Forest, Decision Tree, Naïve Bayes, SVM, Gradient Boosting Classifier, KNN, and Logistic Regression for moving window sizes of 300 to 800. This table demonstrates that increasing the moving window size used for feature extraction could improve the precision scores of the different classifiers. Initially, when the classifiers were trained on the moving window size of 300 (corresponding to 15 ms), SVM achieved a precision score of 99.51%, outperforming all other classifiers. GBC, Logistic Regression, and Random Forest outperformed Decision Tree, Naïve Bayes, and KNN, achieving precision scores of 99.18%, 99.13%, and 99.07%, respectively. With precision scores of 97.99% and 97.22%, respectively, KNN and Decision Tree in turn performed better than Naïve Bayes, which scored 87.29%. All of the classifiers' precision scores increased when the moving window size grew from 300 (15 ms) to 400 (20 ms). The SVM achieved the highest precision score of 99.80% at the moving window size of 400, an improvement over its prior precision score of 99.51%. At the same time, the precision scores of Logistic Regression, GBC, and Random Forest increased from 99.13%, 99.18%, and 99.07% to 99.66%, 99.50%, and 99.36%, respectively. The precision scores of KNN and Decision Tree similarly increased from 97.99% and 97.22% to 98.77% and 97.39%, respectively. The lowest-performing classifier, Naïve Bayes, also improved its precision score, from 87.29% to 90.35%.
All of the classifiers then showed a further improvement in their precision scores when the moving window size of 500 (equivalent to 25 ms) was taken into consideration. With precision scores of 99.90% and 99.80%, respectively, Logistic Regression and SVM outperformed all other classifiers. Compared to their previous precision scores of 99.50%, 99.36%, and 98.77% obtained at a moving window size of 400, GBC, Random Forest, and KNN achieved precision scores of 99.58%, 99.55%, and 99.03%, respectively. Although the Decision Tree and Naïve Bayes classifiers had the lowest precision scores, they also improved as the moving window size increased, going from 97.39% and 90.35% to 97.85% and 92.47%, respectively. The same pattern of slightly increasing precision continued when the moving window size 600 (corresponding to 30 ms) dataset was considered. SVM and Logistic Regression improved their precision scores from 99.80% and 99.90% to 100% and 99.94%, respectively, performing better than all other classifiers. Random Forest, GBC, and KNN outperformed Decision Tree and Naïve Bayes, improving their precision scores to 99.76%, 99.70%, and 99.26% from 99.55%, 99.58%, and 99.03%, respectively. Decision Tree and Naïve Bayes, which had the lowest precision scores of 98.09% and 94.71%, likewise improved from 97.85% and 92.47%, respectively.
Furthermore, some classifiers improved their precision scores, while others declined, when the moving window size increased to 700 (equivalent to 35 ms) and 800 (equivalent to 40 ms). Logistic Regression improved from its prior score of 99.94% at window size 600 to a precision score of 100% at moving window sizes of 700 and 800. Meanwhile, SVM's precision scores decreased slightly from 100% to 99.97% and 99.96% for window sizes of 700 and 800, respectively. KNN's score improved from 99.26% to 99.27% and 99.76%, whereas Naïve Bayes' scores improved from 94.71% to 96.23% and 97.72%, respectively. The GBC precision score dropped from 99.70% to 99.69% and 99.48% at moving window sizes of 700 and 800, respectively.
4.5. Recall
The recall scores of classifiers trained on datasets corresponding to moving window sizes 300, 400, 500, 600, 700, and 800 partitioned using k-fold cross-validation are presented in
Table 6 below. With a recall score of 99.51%, the SVM outperformed all other classifiers, whereas GBC, Logistic Regression, and Random Forest outperformed Decision Tree, Naïve Bayes, and KNN with recall scores of 99.18%, 99.13%, and 99.07%, respectively. KNN and Decision Tree surpassed Naïve Bayes, which had a recall score of 87.07%, with recall scores of 97.98% and 97.21%, respectively. All of the classifiers' recall scores then improved when the corresponding dataset with a moving window size of 400 was used. With a recall score of 99.80%, SVM outperformed all other classifiers, an improvement over its prior recall score of 99.51% on the dataset corresponding to a moving window size of 300. With recall scores of 99.66%, 99.50%, and 99.36%, respectively, Logistic Regression, GBC, and Random Forest performed better than Decision Tree, KNN, and Naïve Bayes, with Logistic Regression improving from 99.13% to 99.66%, GBC from 99.18% to 99.50%, and Random Forest from 99.07% to 99.36%. KNN and Decision Tree in turn outperformed Naïve Bayes, which had a recall score of 90.25%, with recall scores of 98.76% and 97.38%, respectively; here KNN improved from 97.98% to 98.76%, Decision Tree from 97.21% to 97.38%, and Naïve Bayes from 87.07% to 90.25%.
The same pattern of recall improving with moving window size was observed at window sizes of 500 and 600. At a moving window size of 500, Logistic Regression outperformed the other classifiers with a recall score of 99.90%, up from its prior recall score of 99.66% on the window size 400 dataset. Naïve Bayes, which had the lowest recall score of 92.37% on the window size 500 dataset, nevertheless improved from its score of 90.25% on the window size 400 dataset. On the dataset corresponding to a moving window size of 600, SVM outperformed all other classifiers with a 100% recall score, up from its 99.80% recall on the window size 500 dataset. With a recall score of 94.66%, Naïve Bayes again performed worst on the window size 600 dataset, although, like the other classifiers, it improved from its 92.37% recall on the window size 500 dataset.
Some classifiers showed a slight decrease in recall scores at moving window sizes of 700 and 800, while others showed improvements. On the datasets with moving window sizes of 700 and 800, the Logistic Regression classifier achieved 100% recall scores, an improvement over its prior 99.94% recall on the window size 600 dataset. The SVM's recall score, by contrast, dropped from 100% on the window size 600 dataset to 99.96% at window sizes of 700 and 800. Similarly, the GBC recall score decreased from 99.70% at window size 600 to 99.68% and 99.48% at window sizes of 700 and 800, respectively, and the Random Forest recall score dropped from 99.76% at window size 600 to 99.75% and 99.64% at window sizes of 700 and 800, respectively. KNN and Naïve Bayes, meanwhile, showed a pattern of increasing recall with increasing moving window size.
5. Discussion
Signals obtained from gearboxes in real-world settings can often be contaminated by operational disturbances and noise from surrounding machinery. The statistical features extracted from moving windows could provide robustness to high-frequency noise by averaging out temporal variations. Another important characteristic of industrial environments is load variation, because changes in load can affect vibration amplitude and frequency.
This study focuses on time-domain statistical features extracted using window lengths corresponding to sub-rotational, single-rotational, and multi-cycle rotational intervals, even though frequency-domain and time-frequency domain features are frequently used in vibration-based defect detection. In particular, window sizes of 300 samples (slightly less than one full rotation), 400 samples (one full rotation), and 500–800 samples (covering prolonged rotational intervals of up to two full cycles) are taken under consideration. Prior comparative analyses have shown that across such window sizes, time-domain statistical features perform similarly to or better than frequency-domain and fused feature representations for several machine learning classifiers. Moreover, frequency-domain features do not always improve performance and in some cases reduce classification performance. As a result, time-domain features were chosen in this study to ensure methodological clarity, computational efficiency, and reliable fault discrimination.
In this research, the dataset was initially processed for feature extraction, which involved extracting statistical features from the dataset using a moving window-based approach. The moving window-based approach uses a fixed-size window to traverse the signal at each step of the analysis [
59]. The dataset employed for this study was collected under a variety of load conditions ranging from 0% to 100%, with a step size of 10%, which corresponds to realistic operational scenarios commonly observed in industrial systems. The size of the moving window used for feature extraction was determined by the gearbox's physical properties and the shaft's rotational frequency. In an average gearbox system, the shaft rotational frequency ranges between 10 and 50 Hz, corresponding to rotational periods of 20 to 100 ms. To ensure that the extracted features could capture both partial and complete rotational information, multiple window sizes were evaluated: 300, 400, 500, 600, 700, and 800 samples, with durations ranging from about 15 to 40 ms. A window size of 300 corresponds to a duration shorter than one complete shaft rotation, but it was deliberately included to examine the behavior of the feature extraction method over partial cycles. This allows for a sensitivity analysis of the proposed framework to sub-cycle information, which could reveal the effects of incomplete rotational coverage on feature stability and fault discriminability. A window size of 400 samples spans approximately one full rotational cycle, whereas a window size of 800 samples captures up to two cycles, enhancing feature robustness at the cost of temporal resolution; window sizes between 400 and 800 therefore cover single to double rotational cycles. Evaluating window lengths spanning partial, single, and double rotational cycles provides a balanced trade-off between rotational completeness and temporal resolution.
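The moving-window extraction step can be sketched as below; the non-overlapping step size and the exact feature list are assumptions for illustration, mirroring the time-domain statistics named earlier (RMS, peak-to-peak, variance, standard deviation, kurtosis):

```python
import numpy as np

# Sketch of moving-window statistical feature extraction over a vibration
# signal; window sizes of 300-800 samples correspond to the study's settings.
def window_features(signal, window, step=None):
    step = step or window          # non-overlapping windows by default (assumption)
    rows = []
    for start in range(0, len(signal) - window + 1, step):
        w = signal[start:start + window]
        rms = np.sqrt(np.mean(w ** 2))
        std = np.std(w)
        # kurtosis as the fourth standardized moment (not excess kurtosis)
        kurt = np.mean(((w - w.mean()) / std) ** 4)
        rows.append([rms, np.ptp(w), np.var(w), std, kurt])
    return np.array(rows)          # one feature row per window

# Synthetic signal completing one rotation every 400 samples.
sig = np.sin(2 * np.pi * np.arange(4000) / 400)
feats = window_features(sig, window=400)
print(feats.shape)  # (10, 5): ten windows, five features each
```

Each row of the resulting array becomes one training sample for the classifiers, so larger windows yield fewer but statistically more stable samples.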
This manuscript employs several ML classifiers evaluated using five-fold cross validation to systematically analyze the effect of the window size used for feature extraction on classifier performance. The results, presented in terms of accuracy, precision, and recall, show that the temporal length of the input window significantly impacts classifier performance. When the moving window size was increased from 300 samples (equivalent to 15 ms) to 600 samples (equivalent to 30 ms), performance progressively improved for most classifiers. This pattern was especially prominent for linear and probabilistic models such as Logistic Regression and Naïve Bayes. Due to incomplete rotational information, Naïve Bayes achieved accuracy, precision, and recall of about 87% at a moving window size of 300, demonstrating poor class separability. As the window size increased to 600–800 samples, performance improved significantly, reaching 97.7%, indicating that larger windows generate more stable statistical feature distributions, which are required by probabilistic classifiers that assume conditional independence. This behavior demonstrates that sub-cycle windows fail to capture enough rotational dynamics for accurate probability estimation.
With accuracies over 97% at 300 samples, tree-based models such as Decision Tree, Random Forest, and GBC performed well even at smaller window sizes. Their stability stems from their ability to accommodate noisy or partially informative inputs and to model non-linear interactions between features. As the window size increased, their performance continued to improve, peaking between moving window sizes of 600 and 700. Beyond this range, the marginal gains saturated or slightly declined, especially for GBC, indicating that excessively large windows could introduce redundant information and reduce sensitivity to localized fault signatures. With 100% accuracy, precision, and recall at a window size of 600, SVM consistently obtained the best performance across all window sizes. When the window size corresponded to at least one full shaft rotation, the feature space became clearly separable, showing the robust margin maximization capability of SVM. The stabilization of performance beyond a moving window size of 600 suggests that this temporal range already contains the discriminative information required for the most accurate classification. The KNN classifier showed a steady improvement in performance with increasing window size, achieving almost 98% accuracy at a moving window size of 300 and approximately 99.8% at a moving window size of 800. Larger window sizes can generate smoother and more consistent feature representations, thus improving neighborhood consistency; this pattern shows KNN's sensitivity to feature-space geometry. However, the relatively small improvements after a moving window size of 600 suggest diminishing returns as larger windows add increasingly redundant features.
Logistic Regression performed well across all window sizes, surpassing 99% accuracy from window size 400 onwards and obtaining perfect classification (100%) at window sizes of 700 and 800. This indicates that when enough rotational information is captured, the extracted features are highly linearly separable. The rapid increase in Logistic Regression's performance also shows how well the feature extraction method captures the characteristics of gearbox faults. Although it covers less than one full shaft rotation, the moving window size of 300 provides useful information about classifier sensitivity in a partial-cycle configuration. Several classifiers still achieve useful accuracy at this window size, indicating the presence of relevant transient patterns, even though performance frequently drops, particularly for Naïve Bayes. This sensitivity analysis shows that while sub-cycle windows can include discriminative information, optimal and stable performance occurs when the window size spans at least one full rotational cycle. The reported performance patterns are reliable and independent of a specific train–test split due to the use of five-fold cross-validation. By averaging the results over several folds, the evaluation minimizes overfitting and provides an accurate estimate of generalization performance for each window size. The consistent improvement patterns observed across classifiers further confirm the credibility of the experimental results.
In general, window sizes between 400 and 600 samples represent an ideal balance between feature robustness and temporal resolution, approximating the gearbox's core rotational dynamics while maintaining sensitivity to fault-related features. These results show that window size, feature stability, and model characteristics interact to significantly affect classifier performance, and they demonstrate the importance of physically informed window selection.
The purpose of this research was to evaluate a general and computationally lightweight statistical feature-based framework. This framework can be used without requiring extensive knowledge of gearbox design, rotational speed, or accurate synchronization information. For this reason, Time-Synchronous Averaging (TSA) and Envelope Analysis, which are effective tools for isolating periodic fault signatures and suppressing noise, are not considered. This research focused on simplicity, interpretability, and ease of deployment, which are important considerations in many industrial situations where access to exact shaft speed measurements or consistent operating conditions could be limited.
6. Conclusions and Future Work
In this research, the performance of several traditional ML classifiers was evaluated for gearbox fault detection using vibration signals. The vibration signals were processed using moving window sizes of 300, 400, 500, 600, 700, and 800 to generate features that were then used to train the ML classifiers. In conclusion, SVM, Logistic Regression, and GBC could outperform other classifiers for predicting gearbox defects using vibration signals. The results showed that increasing the moving window size used to extract features from the gearbox vibration signal could improve the ML classifier’s performance.
The proposed framework was designed with practical industrial deployment in mind. Signals obtained from gearboxes in real-world settings can often be contaminated by operational disturbances and noise from surrounding machinery. The statistical features extracted from moving windows could provide robustness to high-frequency noise by averaging out temporal variations.
Another important feature of industrial environments is load variation, because changes in load can have an effect on vibration amplitude and frequency. The dataset employed for this study was collected under a variety of load conditions ranging from 0% to 100%, with a step size of 10%, which corresponds to realistic operational scenarios commonly observed in industrial systems.
In the future, the moving window segmentation method will be employed to generate data segments for training unsupervised or one-class learning models for fault detection. During training, only healthy samples will be provided, allowing the model to learn the gearbox's normal operating patterns. The trained model will then identify defective or unusual samples based on deviations from the learned healthy behavior. This method will allow for early identification of defects without requiring labeled faulty data, and it could potentially be extended to improve detection robustness and sensitivity using traditional ML or lightweight deep learning-based approaches.
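A minimal sketch of this direction, assuming scikit-learn's OneClassSVM and synthetic statistical feature windows (the feature values, `nu` setting, and fault offset are illustrative assumptions, not results from this study), would be:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# One-class sketch: fit on healthy-condition feature windows only, then flag
# deviations from the learned healthy behavior as potential faults.
rng = np.random.default_rng(0)
healthy = rng.normal(0.0, 1.0, size=(500, 5))  # synthetic healthy feature windows
faulty = rng.normal(4.0, 1.0, size=(50, 5))    # synthetic shifted (faulty) windows
model = make_pipeline(StandardScaler(), OneClassSVM(nu=0.05, gamma="scale"))
model.fit(healthy)                              # trained on healthy data only
pred = model.predict(faulty)                    # -1 = anomaly, +1 = normal
print(np.mean(pred == -1))  # fraction of faulty windows flagged as anomalous
```

In deployment, each new moving-window feature vector would be scored the same way, so no labeled faulty data is needed at training time.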