A Comparative Study of Imbalance-Handling Methods in Multiclass Predictive Maintenance

Alnahhal, Mohammed; Tabash, Mosab I.; Safi, Samir K.; Al-Absy, Mujeeb Saif Mohsen; Mamadiyarov, Zokir

doi:10.3390/computation14040088


by Mohammed Alnahhal ¹, Mosab I. Tabash ²,*, Samir K. Safi ³,*, Mujeeb Saif Mohsen Al-Absy ⁴ and Zokir Mamadiyarov ⁵,⁶

¹ Mechanical Engineering Department, American University of Ras Al Khaimah, Ras Al Khaimah P.O. Box 10021, United Arab Emirates

² Department of Business Administration, College of Business, Al Ain University, Al Ain P.O. Box 64141, United Arab Emirates

³ Department of Statistics and Business Analytics, College of Business and Economics, United Arab Emirates University, Al Ain P.O. Box 15551, United Arab Emirates

⁴ Accounting and Financial Science Department, College of Administrative and Financial Science, Gulf University, Sanad 26489, Bahrain

⁵ Department of Economics, Mamun University, Khiva P.O. Box 220900, Uzbekistan

⁶ Department of Finance and Tourism, Termez University of Economics and Service, Termez P.O. Box 190100, Uzbekistan

* Authors to whom correspondence should be addressed.

Computation 2026, 14(4), 88; https://doi.org/10.3390/computation14040088

Submission received: 7 March 2026 / Revised: 31 March 2026 / Accepted: 4 April 2026 / Published: 7 April 2026

(This article belongs to the Section Computational Engineering)


Abstract

Predictive maintenance plays a key role in digitalization initiatives; however, in real settings, issues related to failure prediction occur when failure instances are rare compared to normal instances, leading to class imbalance. In this study, we systematically compare five machine learning (ML) models—random forest, XGBoost, support vector machine, k-nearest neighbors, and multinomial logistic regression (MLR)—to detect multiclass rare failures using four imbalance-handling approaches (i.e., no handling, manual oversampling, selective manual oversampling, and class weighting), forming 20 configurations. Using the AI4I 2020 predictive maintenance dataset, which contains five failure types, we determined that XGBoost with no handling achieved the highest macro-averaged F1 (macro-F1) score (0.842) but obtained 0% recall for tool wear failure (TWF). MLR with selective manual oversampling achieved approximately 50% TWF recall with lower overall performance (0.636 macro-F1) than top-performing models such as XGBoost. We also found that very rare classes remain difficult to detect. Even high-performing models fail to consistently detect all five failure types. Overall, no single strategy can achieve a high detection rate across all performance measures.

Keywords:

predictive maintenance; class imbalance; oversampling strategies; multiclass classification; rare failure detection

1. Introduction

Industry 4.0 initiatives lead to the generation of big data that can be utilized to predict machine failures, thereby helping to reduce maintenance costs and machine downtime [1]. Although many machine learning (ML) approaches exist for predictive maintenance (PdM), selecting the best method to predict rare failures still requires further investigation [2]. This is because models biased toward predicting all instances as normal may appear efficient but can result in rare, costly failures remaining undetected [3].

Many approaches exist to address class imbalance (i.e., rare instances being outnumbered by numerous normal instances), such as resampling and synthetic oversampling that introduce variations rather than simple duplication of rare data points. One widely used synthetic oversampling method is the Synthetic Minority Oversampling Technique (SMOTE) [4]. The majority of previous research has used a binary assumption (normal versus failure); however, it is sometimes necessary to determine the failure type to facilitate appropriate intervention in advance. In other words, multiclass PdM is needed, but such settings involve extremely rare failure classes that are difficult to predict.

Many studies have applied different imbalance-handling methods to rare failure instances; however, they have neither systematically examined how well specific failure types are detected nor evaluated how these methods affect different performance measures across ML approaches. The present study seeks to close this knowledge gap.

In this study, we employ five ML approaches with and without imbalance-handling preprocessing, namely random forest (RF), XGBoost, support vector machine (SVM), k-nearest neighbors (k-NN) algorithm, and multinomial logistic regression (MLR). The main objective is to examine the effect of the imbalance-handling methods of these approaches on their ability to detect various types of failures, such as tool wear failure (TWF), heat dissipation failure (HDF), power failure (PWF), overstrain failure (OSF), and random failures (RNF). The main focus of this study is to analyze the trade-offs between overall performance and rare failure detection. Through this analysis, we also provide practical guidance for selecting imbalance-handling strategies.

While existing studies often aggregate failure types or focus on binary classification, our work provides a systematic evaluation of five ML models and four imbalance-handling strategies in a multiclass setting with five distinct failure types. We identify clear trade-offs between overall performance and the detection of specific failures, and we highlight practical detection limits: even aggressive oversampling fails to identify the rarest failure classes reliably. These findings offer insights for realistic deployment decisions. The key contributions of this work are as follows:

A comparison of different ML approaches for PdM based on various imbalance-handling strategies.
An analysis of tradeoffs between overall accuracy and effective failure detection in imbalanced PdM tasks.
Guidance for industry practitioners in terms of selecting imbalance-handling methods for the construction of reliable PdM systems.

Recent studies have applied a range of ML models to publicly available PdM datasets, such as the AI4I 2020 dataset, a synthetic benchmark that reflects real-world milling processes with imbalanced failure classes. In the present study, the AI4I 2020 dataset [5,6] is employed to achieve the above contributions. A particularly challenging aspect of this dataset is that, although the target contains seven categories in theory, two critical failure types (TWF and RNF) are so rare that most models fail to detect them, even though they are operationally notable failures. It is important to note that the dataset used can affect the results obtained. For example, Autran et al. [7] critically reviewed public PdM datasets and found that most lack real-world complexities. Consequently, they introduced the AI4I-PMDI dataset, an enhanced version of AI4I 2020, to incorporate some realistic irregularities (such as missing data, fleet context, and irregular timestamps). However, in this study, we use the original dataset because our research focuses specifically on evaluating class imbalance-handling strategies.

2. Literature Review

This section focuses primarily on three areas: (1) PdM in the era of Industry 4.0, (2) the application of ML approaches in PdM, and (3) the systematic investigation of different imbalance-handling methods and their influence on the performance of ML approaches, as covered or overlooked by previous studies. This review also presents the specific contributions of this study.

2.1. Predictive Maintenance in the Industry 4.0 Era

Among the positive effects of Industry 4.0 is the generation of large amounts of data, which can be used effectively to predict failures prior to their occurrence using effective ML approaches [1]. Consequently, modern industrial environments have shifted from preventive maintenance toward PdM, leading to better machine utilization, higher quality, improved system reliability, and optimized maintenance schedules [8]. Some studies have conducted literature reviews on different PdM practices for predicting rare but serious failures [9]. For example, Hassan et al. [10] found that AI-based PdM requires data labeling, which is often difficult in practice because failures are rare; therefore, models with high overall accuracy can often fail to detect serious failures.

2.2. Evolution of ML Models in PdM Research

Many previous studies have focused on the development and comparison of various AI models for failure prediction. Ensemble methods have been widely adopted due to their robustness and accuracy [11]. These methods, including bagging (e.g., RF usage) and boosting (e.g., XGBoost usage), combine multiple models to enhance robustness and performance; these approaches are useful in the case of class imbalance. More recently, gradient-boosting machines, such as XGBoost, have demonstrated excellent performance in various classification tasks, including PdM [12]. For example, Farooq et al. [13] compared XGBoost, an RF, an SVM, and long short-term memory (LSTM) for ball bearing fault detection using vibration data, demonstrating that XGBoost offers the best tradeoff between accuracy and computational efficiency. Alternative approaches, such as the use of an SVM and a k-NN algorithm, have also been successfully applied, particularly in scenarios with clear class separation or for which their simplicity and interpretability are advantageous [14,15].

Recent comparative studies have benchmarked these algorithms using public datasets. Hosseinzadeh et al. [16] evaluated ML, DL, and DHL models using the AI4I 2020 dataset, finding that tree-based ensemble methods such as LightGBM achieved the highest accuracy (greater than 90%). Using the same dataset, Assagaf et al. [17] found that an accuracy of approximately 90% can be achieved when using a tuned SVM. In a subsequent study [18], they found that an accuracy of 96% can be achieved through a multilayer perceptron (MLP). Extending beyond standard classification, Yürek and Birant [19] introduced an ordinal classification approach (OPMEB) demonstrating superior performance over nominal classification on several datasets, including an ordinal-transformed version of AI4I 2020.

In recent studies, ensemble learning has been applied to improve PdM performance. For example, Çiftpinar et al. [20] compared bagging and majority voting ensembles using logistic regression, decision trees, SVMs, and k-NN algorithms on the AI4I 2020 dataset, showing that ensemble strategies are superior to both individual models and RFs; however, they did not compare imbalance-handling methods or their effects on detecting certain rare failures. This gap is addressed in our study.

2.3. Pervasive Challenge of Class Imbalance

A key challenge is how to handle severe class imbalance, which biases PdM models toward the majority class, typically the class representing normal working conditions. As a result, rare but costly failures can go undetected [21]. Despite recent progress in PdM, class imbalance continues to present a challenge for modelers [2].

Three categories of methods have been proposed to handle the class imbalance problem: data-level, algorithm-level, and hybrid-level methods. Data-level methods are represented by techniques such as oversampling, where rare instances are simply repeated to balance data instances. SMOTE and its variants (e.g., Borderline-SMOTE and ADASYN) represent the most important synthetic oversampling methods, in which some data mutation is performed. At the algorithm level, ML training assigns more weight (cost) to minority class instances.

Recent studies have applied class imbalance approaches—such as manual oversampling—to the AI4I 2020 dataset. For example, Stow [3] employed manual oversampling before training a Conv-LSTM model. Ucar et al. [22] investigated class imbalance in a general context, comparing data- and algorithm-level methods. Although manual oversampling does not introduce any synthetic instances, it remains a practical baseline in many studies because it is transparent, easy to interpret, and can be used in many practical fields of PdM [21]. However, these studies considered class imbalance handling as a preliminary step for ML, and did not include it in a systematic comparative investigation to detect certain rare failures or assess the influence of different imbalance approaches on different models.

Although oversampling methods such as SMOTE and its variants are widely adopted in recent PdM studies [4,23], they are usually tested using aggregate performance measures without assessing their impact on detecting certain rare failures. This is because SMOTE is primarily used for binary class problems; for example, studies such as Bektasoglu et al. [24] and Ghasemkhani et al. [25] used SMOTE with the AI4I 2020 dataset to enhance general classification performance, but not to improve the detection of specific rare failures, such as TWF or RNF. Furthermore, Ghasemkhani et al. [25] proposed a Balanced Hoeffding Tree Forest combining SMOTE with a novel undersampling technique. However, these studies focus on algorithms rather than the development of deployable strategies, especially for multiclass problems in which rare failure types must be distinguished. Moreover, the interaction between models and imbalance-handling strategy still requires further exploration. This study addresses these gaps by evaluating four class imbalance strategies across five ML approaches, with a focus on rare failure detection and the tradeoffs between general performance metrics and the recall of specific rare failures.

In PdM, false negatives (missed failures) typically incur far higher costs than false positives. Algorithm-level approaches, such as class weighting, address this asymmetry by emphasizing minority failure classes during training [26]. In addition to modeling strategies, the choice of evaluation metrics is important in PdM. Many studies rely on accuracy or aggregate measures, which do not focus on the detection of rare failures. Consequently, balanced and failure-specific metrics are recommended, particularly in multiclass settings [27].

2.4. Recent Advances in RUL Prediction

While our study focuses on multiclass failure type detection, recent advances in remaining useful life (RUL) prediction have introduced sophisticated deep learning methods that are relevant to the broader PdM landscape. For example, methods based on interpretable serialized variational autoencoders have been proposed to model uncertainty in RUL prediction using generative neural networks [28]. Similarly, Zhang et al. [29] developed a graph convolutional neural network-based method for RUL prediction, which includes a two-stage process to maintain data privacy without requiring access to source data. In addition, Xu et al. [30] addressed the issue of varying working conditions by using subdomain adaptation, dynamically adjusting subdomain boundaries, assigning higher weights to important features, and clustering similar features during training. In contrast, the present study systematically evaluates classical ML models and four imbalance handling strategies on a tabular dataset with multiple rare failure classes, providing insights for real world industrial settings where simple and efficient solutions are often preferred.

2.5. Research Gaps and Contributions

Recent research has improved PdM through different approaches, such as ensemble learning, which combines multiple models to increase effectiveness [20], and metaheuristic optimization, which is applied to select the most important features [31]. For example, the optimal number of trees in an RF can be determined to enhance performance. However, previous studies have overlooked a key problem: how to systematically address severe class imbalance and its effect on the detection of different failure types.

Many studies have adopted binary classification frameworks to address this class imbalance. Studies such as Çiftpinar et al. [20] and Khedr et al. [31] aggregate all failure types (TWF, HDF, PWF, OSF, and RNF) into a single failure class. Bektasoglu et al. [24] employed a binary pairwise classification approach to analyze failure modes such as TWF-HDF and PWF-OSF in isolation. This reduces the problem to a series of binary decisions, instead of simultaneous multifailure discrimination. This binary simplification has three implications:

The combined failure class contains more samples than individual failure types.
Models can achieve high accuracy by simply predicting the most common failures while neglecting the rarest failures.
Different failure types require different interventions. Binary classification neglects these distinctions; this is a knowledge gap on which we focus.

Consequently, we employ a multiclass setup to predict different failure types. Furthermore, many studies only briefly mention the class imbalance problem without comparing different ways to fix it. To close this gap, we conducted a detailed investigation into class imbalance methods for PdM. Using the AI4I 2020 dataset, we compare five ML models with four ways of handling imbalance: no handling, manual oversampling, selective manual oversampling, and class weighting.

The contributions of this study are threefold:

We provide a comparison of imbalance-handling strategies across a diverse set of ML models for a multiclass PdM task.
We evaluate the tradeoff between high overall performance metrics (e.g., macro-averaged F1 (macro-F1)) and the ability to detect specific, rare failure types.
We examine factors such as computational feasibility and consistency across training runs.

3. Methodology

This section outlines the framework used to evaluate ML models for PdM under class imbalance by describing the dataset, preprocessing steps, imbalance-handling strategies, model selection, and evaluation metrics. Figure 1 presents a flowchart showing the methodological steps, which are explained in the following subsections. The framework consists of seven steps. First, the AI4I 2020 dataset is preprocessed by removing irrelevant data and creating a multiclass target variable with seven categories: NoFailure, TWF, HDF, PWF, OSF, RNF, and Multiple. Second, the dataset is split into training (80%) and test (20%) sets in a way that preserves the original class distribution. Third, four imbalance-handling strategies are applied to the training data: no handling (baseline), manual oversampling of all minority classes, selective oversampling of only the rarest classes, and class weighting as an algorithm-level adjustment. Fourth, five machine learning models (RF, XGB, SVM, k-NN, and MLR) are trained on the processed data. Fifth, all trained models are evaluated on the same test set using multiple metrics, including macro-F1 for overall performance and class-specific recall for failure detection. Sixth, each configuration is repeated three times with different random seeds. This results in a total of 60 experimental runs across 20 configurations (5 models × 4 imbalance strategies). Finally, the results are compared to identify trade-offs between overall performance and rare failure detection, practical detection limits, and recommendations for deployment.

3.1. Dataset and Preprocessing

This study utilizes the publicly available AI4I 2020 dataset, which is a synthetic dataset mimicking real-world conditions and including data relating to five types of machine failures in milling processes. The dataset contains 10,000 rows and 14 features [5,6], including parameters such as air temperature, process temperature, rotational speed, and torque. The five failure modes are as follows: TWF, HDF, PWF, OSF, and RNF. Using a publicly available dataset supports reproducibility and comparison with the existing literature. A new column was created to represent a multiclass target variable with seven categories: “NoFailure,” “Multiple” (for concurrent failures), plus the five individual failure types. Irrelevant columns were removed. The dataset was split into a training set (80%) and a test set (20%). We kept the original severe class imbalance in the test set.
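The target construction and split described above can be sketched as follows. This is a minimal illustration in plain Python, not the code used in the study (which was run in R). The flag names `TWF`, `HDF`, `PWF`, `OSF`, and `RNF` are assumed column names, the helper names `make_target` and `stratified_split` are hypothetical, and the toy records are invented for demonstration.

```python
import random
from collections import Counter, defaultdict

# Assumed names of the five binary failure-flag columns.
FAILURE_FLAGS = ["TWF", "HDF", "PWF", "OSF", "RNF"]


def make_target(row):
    """Map the five binary failure flags of one record to a single label."""
    active = [flag for flag in FAILURE_FLAGS if row[flag] == 1]
    if not active:
        return "NoFailure"
    if len(active) > 1:
        return "Multiple"  # concurrent failures
    return active[0]


def stratified_split(labels, test_fraction=0.2, seed=1):
    """Return (train_idx, test_idx) with roughly the same class mix in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy records standing in for dataset rows (illustrative values only).
rows = (
    [{"TWF": 0, "HDF": 0, "PWF": 0, "OSF": 0, "RNF": 0}] * 90
    + [{"TWF": 0, "HDF": 1, "PWF": 0, "OSF": 0, "RNF": 0}] * 5
    + [{"TWF": 1, "HDF": 0, "PWF": 0, "OSF": 1, "RNF": 0}] * 5
)
labels = [make_target(r) for r in rows]
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

Splitting within each class keeps rare labels present in both partitions, which is one way to preserve the original imbalance in the test set.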

3.2. Imbalance-Handling Strategies

To evaluate the impact of class imbalance handling, we applied four practical strategies to the training data:

No Handling: Models were trained directly on the original data.
Manual Oversampling: The minority classes were repeated to approximately three times the size of the largest minority class. Each minority class contained approximately 255–300 samples. The majority class (NoFailure) was kept unchanged at 7722 samples. Consequently, the total includes approximately 9000 training samples.
Selective Manual Oversampling (Targeted Balancing): Only the two rarest failure classes—TWF and RNF—were oversampled. Specifically, the TWF class increased from 34 samples to 340 samples through a 10× expansion (i.e., nine duplicates per original instance). Similarly, the RNF class increased from 15 training samples to 300 samples, corresponding to a 20× increase achieved through 19 duplicates per original sample. Therefore, the total number of duplicated samples added was 591 (306 + 285) per experimental run.
Class Weighting (Algorithmic Adjustment): Class weights were calculated as follows: class weights = total samples/(number of classes ∗ class counts). Weights were bounded between 0.3× and 3× to prevent extreme values capable of destabilizing the training.
These four strategies represent different approaches to handling class imbalance. Manual oversampling and selective oversampling modify the training data by duplicating minority class instances, which makes them easy to implement in practice. Selective oversampling targets only the rarest classes to focus on the most critical failures. Class weighting is an algorithm-level method that does not change the training data and therefore does not enlarge the training set; it is a practical alternative when data duplication is undesirable.
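As a rough illustration of the duplication and weighting arithmetic described above, the following Python sketch reproduces the selective oversampling counts (34 to 340 for TWF, 15 to 300 for RNF, 591 added samples) and the clipped class-weight formula. The function names are hypothetical, reading "bounded between 0.3× and 3×" as clipping to the interval [0.3, 3.0] is an assumption, and only three classes are included, so the printed weights are not the values used in the study.

```python
from collections import Counter


def oversample_by_duplication(labels, factors):
    """Repeat every sample of the listed classes `factor` times in total.

    Returns indices into the original data; classes that are not listed
    keep a factor of 1 (unchanged).
    """
    indices = []
    for idx, label in enumerate(labels):
        indices.extend([idx] * factors.get(label, 1))
    return indices


def clipped_class_weights(counts, low=0.3, high=3.0):
    """weight_c = total / (n_classes * count_c), clipped to [low, high]."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {
        label: min(high, max(low, total / (n_classes * count)))
        for label, count in counts.items()
    }


# Training counts quoted in the text; the other classes are omitted here.
labels = ["NoFailure"] * 7722 + ["TWF"] * 34 + ["RNF"] * 15

# Selective oversampling: 10x for TWF and 20x for RNF.
resampled = oversample_by_duplication(labels, {"TWF": 10, "RNF": 20})
new_counts = Counter(labels[i] for i in resampled)
added = len(resampled) - len(labels)
print(new_counts["TWF"], new_counts["RNF"], added)  # 340 300 591

# Class weighting on the unmodified labels.
weights = clipped_class_weights(Counter(labels))
print(weights)
```

In this toy setting the raw weights of the two rare classes far exceed the upper bound, so both are clipped to 3.0, which shows why the bound limits how strongly very rare classes can be emphasized.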

3.3. ML Models

Five ML models were selected: RF, XGBoost, SVM, k-NN, and MLR. While deep learning approaches have gained popularity, we selected conventional ML models for several reasons. First, our focus is on comparing imbalance-handling strategies and understanding error transfer mechanisms, both of which are easier to analyze with conventional models. Second, the AI4I 2020 dataset is tabular with 10,000 samples, where tree-based models often outperform deep learning [32]. Third, conventional ML models offer greater interpretability and computational efficiency. More details about the models are as follows:

RF: It is an ensemble of decision trees that provides stable performance and is easy to understand.
XGBoost (XGB): It is a gradient boosting model that builds models sequentially, with each new model correcting the errors of its predecessors. It is known for its rapid and accurate performance.
SVM: It is a margin-based classifier that, with a nonlinear kernel, can separate classes that are not linearly separable.
k-NN algorithm: It is a simple classifier that can decide the class of an object based on the classes of its closest neighbors in the data.
MLR: It is a linear probabilistic model used for multiclass classification problems, where the target variable has more than two categories.

We tested these ML models using four different strategies to handle class imbalance. This resulted in 20 possible configurations (5 × 4). To ensure the reliability of our results, we executed each configuration thrice with different random seeds (1, 2, and 3), yielding a total of 60 experimental runs.
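The experimental grid can be enumerated as in the short Python sketch below. The strategy identifiers are illustrative labels chosen here, not names taken from the study's code.

```python
from itertools import product

MODELS = ["RF", "XGBoost", "SVM", "k-NN", "MLR"]
# Illustrative identifiers for the four imbalance-handling strategies.
STRATEGIES = [
    "no_handling",
    "manual_oversampling",
    "selective_oversampling",
    "class_weighting",
]
SEEDS = [1, 2, 3]

# Every model paired with every strategy gives the 20 configurations.
configurations = list(product(MODELS, STRATEGIES))

# Repeating each configuration once per seed gives the 60 runs.
runs = [(model, strategy, seed)
        for (model, strategy) in configurations
        for seed in SEEDS]

print(len(configurations), len(runs))  # 20 60
```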

3.4. Implementation Details

All experiments were conducted using R version 4.4.3 with the following packages: caret (6.0-94), xgboost (1.7.5.1), randomForest (4.7-1.1), e1071 (1.7-13), and nnet (7.3-19). A ThinkPad computer (manufactured by Lenovo, Beijing, China) with an Intel Core i7-8565U processor and 8 GB of RAM running Windows 10 was used.

Hyperparameters were set to default values from the respective R packages to establish a consistent baseline. No extensive tuning was performed, as the focus of this study is on comparing imbalance-handling strategies rather than optimizing individual model performance. The following hyperparameter settings were used:

RF: 100 trees, minimum terminal node size = 20;
XGBoost: 50 boosting rounds, multiclass softmax objective, max depth = 6, eta = 0.3;
k-NN algorithm: k = 5, Euclidean distance with feature scaling;
SVM: radial basis function kernel trained on original or subsampled data; and
MLR: default parameters with multinomial logit, maximum iterations = 2000.

To ensure that the results were reliable, we repeated each experiment three times using different random seeds. The performance scores are reported as the averages across these three runs.

3.5. Performance Metrics

In case of class imbalance, traditional accuracy can be misleading because, for example, a model can achieve 96% accuracy by consistently predicting the majority class. Therefore, a comprehensive set of metrics was employed as follows:

Primary metrics for overall performance:

Macro-F1 Score: the unweighted mean of the per-class F1 scores, where each F1 score is the harmonic mean of that class's precision and recall; it therefore accounts for both false positives (low precision) and false negatives (low recall).
Geometric Mean (G-mean): in the multiclass setting, computed as the geometric mean of per-class recall values, which generalizes the binary √(True Positive Rate × True Negative Rate) formulation. It can be calculated using Equation (1):

\text{G-mean} = \exp\left(\frac{1}{C} \sum_{c=1}^{C} \log(\text{Recall}_{c})\right)

(1)

where C denotes the total number of classes, c indexes the class (c = 1, …, C), and Recall_c represents the recall of class c, defined as

\text{Recall}_{c} = \frac{TP_{c}}{TP_{c} + FN_{c}}

where TP_c and FN_c denote the number of true positives and false negatives for class c, respectively. This formulation penalizes the poor detection of any failure mode by decreasing the overall score.
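A minimal Python sketch of Equation (1) is given below, computing per-class recall from label lists and then the multiclass G-mean. The function names are hypothetical. Returning 0.0 when any class has zero recall is an assumption made here (it is the limiting value of the formula, since the logarithm of zero is undefined); the text does not state how zero recalls were handled.

```python
import math


def per_class_recall(y_true, y_pred, classes):
    """Recall_c = TP_c / (TP_c + FN_c) for each class c."""
    recalls = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        if tp + fn == 0:
            raise ValueError(f"class {c!r} does not occur in y_true")
        recalls[c] = tp / (tp + fn)
    return recalls


def g_mean(recalls):
    """exp of the mean log-recall, i.e. the geometric mean of the recalls."""
    values = list(recalls.values())
    if any(r == 0.0 for r in values):
        return 0.0  # assumption: one undetected class drives the score to 0
    return math.exp(sum(math.log(r) for r in values) / len(values))


# Toy labels (illustrative only): recalls are A = 0.75, B = 1.0, C = 0.5.
y_true = ["A", "A", "A", "A", "B", "B", "C", "C"]
y_pred = ["A", "A", "A", "B", "B", "B", "C", "A"]
recalls = per_class_recall(y_true, y_pred, ["A", "B", "C"])
score = g_mean(recalls)
print(recalls, round(score, 4))
```

Because the recalls are multiplied rather than summed, a single poorly detected class lowers the score much more than it would lower an arithmetic mean.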

Secondary metrics for specific insights:

Class-specific recall: evaluation of detection of extremely rare failures, such as TWF and RNF.
Balanced accuracy: mean of per-class recall rates, weighting all classes equally.
Kappa coefficient: measures how much the agreement between predictions and true labels exceeds the agreement expected by chance.

These metrics were selected for specific reasons. Macro-F1 is chosen because it avoids the bias of accuracy toward the majority class. G-mean is included because it is suitable for imbalanced multiclass problems, where failing to detect one rare class should significantly reduce the score. Class-specific recall is essential for evaluating detection of the extremely rare TWF and RNF. Balanced accuracy is used because it gives equal weight to each class regardless of sample size. Finally, Kappa is reported to provide a baseline-adjusted view of performance.
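The remaining metrics can be sketched in plain Python as follows. This is an illustration of the standard definitions, not the study's implementation; the function names are hypothetical, the kappa shown is Cohen's kappa, and assigning an F1 of 0.0 to a class with no true or predicted instances is a convention assumed here.

```python
from collections import Counter


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # assumed convention
    return sum(scores) / len(scores)


def balanced_accuracy(y_true, y_pred, classes):
    """Mean of per-class recall; every class counts equally."""
    total = 0.0
    for c in classes:
        support = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        total += hits / support
    return total / len(classes)


def cohen_kappa(y_true, y_pred):
    """(p_o - p_e) / (1 - p_e): agreement beyond what chance would give."""
    n = len(y_true)
    p_o = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 0.0


# Toy labels (illustrative only).
classes = ["A", "B", "C"]
y_true = ["A", "A", "A", "A", "B", "B", "C", "C"]
y_pred = ["A", "A", "A", "B", "B", "B", "C", "A"]
f1 = macro_f1(y_true, y_pred, classes)
bal_acc = balanced_accuracy(y_true, y_pred, classes)
kappa = cohen_kappa(y_true, y_pred)
print(round(f1, 3), round(bal_acc, 3), round(kappa, 3))

# A model that always predicts the majority class scores kappa = 0.
kappa_majority = cohen_kappa(y_true, ["A"] * len(y_true))
```

The last line illustrates why kappa is a useful complement to accuracy: always predicting the majority class yields 50% accuracy on these toy labels but no agreement beyond chance.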

Given the practical nature of our imbalance-handling strategies, we also introduced validation checks. These included verifying the training completion rate to ensure that models were successfully trained on processed data, and assessing the number of distinct failure types predicted by each model, which ranged from two to seven out of the maximum seven possible types.

4. Results and Analysis

The experimental framework was executed across three random seeds, resulting in the successful completion of 57 out of 60 runs (a 95% success rate). The total of 60 runs came from the combination of 5 ML models, 4 imbalance-handling methods, and 3 seeds. The three runs that failed involved an SVM using the manual oversampling method. In such cases, the dataset exceeded computational (memory and runtime) limits. Performance metrics were averaged across the three successful runs for each configuration.

4.1. Overall Performance: Macro-F1 Perspective

Table 1 shows the overall performance of the different configurations. It shows the trade-off between overall performance and rare failure detection. As outlined above, the SVM with manual oversampling was not successful.

XGBoost exhibited the best performance, demonstrating the highest macro-F1 score (0.842) when no imbalance handling was used or when class weighting was applied. However, in both tests, XGBoost failed to detect TWF. This represents a clear tradeoff: when oversampling methods were used to enhance failure detection, the overall macro-F1 of XGBoost dropped (to 0.762 with full oversampling and 0.714 with selective oversampling), while its ability to detect rare failures exhibited a slight improvement (16.7% TWF recall at best). This shows that improving sensitivity to rare failures usually decreases the overall performance.

MLR revealed the clearest performance shifts with different methods. While maintaining a strong performance with no handling (0.729 macro-F1), MLR achieved its best TWF detection with selective oversampling (50.0% recall and 0.636 macro-F1) and class weighting (45.8% recall and 0.614 macro-F1). Figure 2 (presenting the top configurations for TWF detection) shows that an inverse relation exists between overall performance (macro-F1) and critical failure detection. The top performing model for macro-F1, as shown in Table 1 (XGBoost with class weighting: 0.842), achieves 0% TWF recall, whereas the best TWF detector achieves a 24.6% lower macro-F1 (0.636).

Random failures (15 training samples) are challenging to detect. None of the tested configurations achieved greater than 22.2% RNF recall (k-NN algorithm with manual oversampling), and most methods detected 0% RNF instances. This highlights the practical detection limit of ML for classes with fewer than 20 samples, even when aggressive selective oversampling (20× duplication) was applied.

4.2. Effects of Model Architecture on Performance

Table 2 shows the performance trade-off according to model architecture. Different models responded distinctively to imbalance handling. For example, MLR is the most responsive to selective oversampling for TWF detection (up to 50% recall), with a favorable trade-off ratio (5.38). In contrast, XGBoost shows only modest improvement in TWF detection with oversampling (max 16.7% recall) and a poor trade-off ratio (1.31). Similarly, RF demonstrates moderate TWF detection improvement (12.5% recall) with manual oversampling. Finally, the k-NN algorithm shows consistent TWF detection (16.7% recall) and the best RNF detection (22.2% recall) but poor overall performance.

4.3. Practical Implications and Deployment Recommendations

Based on the comprehensive performance analysis, we propose clear deployment strategies (Table 3):

4.4. Average Number of Classes Predicted per Configuration

Table 4 shows the number of classes predicted per configuration. The k-NN algorithm with manual oversampling is the only configuration capable of predicting all seven classes in every seed. In contrast, the SVM with manual oversampling showed a complete failure (i.e., skipped in all seeds). The SVM with no handling predicts only three classes on average (i.e., severe underprediction). Class weighting generally helps models to predict more classes compared to no handling. For example, for the RF, it increases from 4.33 to 6. Similarly, selective oversampling improves class coverage for some models (e.g., from 3 to 4.33 for the SVM).
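The class-coverage measure can be illustrated with a short Python sketch: the number of distinct labels a model predicts is counted per seed and then averaged. The per-seed predictions below are hypothetical and chosen only to show how a fractional average such as 4.33 can arise (for example, from counts of 4, 4, and 5); the actual per-seed counts are not given in the text.

```python
from statistics import mean


def count_predicted_classes(y_pred):
    """Number of distinct labels that appear among the predictions."""
    return len(set(y_pred))


# Hypothetical test-set predictions for one configuration, one list per seed.
predictions_by_seed = {
    1: ["NoFailure", "HDF", "PWF", "OSF", "NoFailure"],
    2: ["NoFailure", "HDF", "PWF", "OSF", "HDF"],
    3: ["NoFailure", "HDF", "PWF", "OSF", "Multiple"],
}

per_seed = {seed: count_predicted_classes(preds)
            for seed, preds in predictions_by_seed.items()}
average = mean(per_seed.values())
print(per_seed, round(average, 2))  # {1: 4, 2: 4, 3: 5} 4.33
```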

4.5. Overall Performance Across Multiple Metrics

Table 5 shows performance patterns across failure types and imbalance-handling strategies. MLR with class weighting achieved the highest overall failure recall (0.624), exhibiting a consistent performance across HDF, PWF, OSF, and multiple failures. In contrast, XGBoost excelled at HDF and PWF detection but failed on TWF and RNF regardless of the method applied. The k-NN algorithm detected RNF (22.2% recall) with manual oversampling but performed poorly on other failure types. No single model performed best for all failure types, highlighting a serious challenge facing the real-world application of PdM: it is difficult to balance reliable general monitoring with the ability to detect rare failures.

Table 6 shows different performance metrics for different configurations (Accuracy, Macro-F1, Balanced Accuracy, G-Mean, and Kappa). The difference between “Average” in Table 5 and “Balanced Accuracy” in Table 6 is that the first is the average across only failure types (excluding the majority “No Failure” class), whereas the second includes all classes (including “No Failure”). For this reason, the balanced accuracy is higher than the simple average; consequently, it is useful to consider both. The Kappa coefficient was also used to understand to what degree the models outperformed random guessing. Although XGBoost scored highest in both accuracy and macro-F1, it failed to detect TWF. In contrast, balanced accuracy, G-Mean, and Kappa revealed that MLR with class weighting and the SVM with class weighting provided a more balanced performance across classes. The SVM with class weighting exhibited the highest G-Mean, i.e., it provided the optimal balance of detection between the common (normal) and rare (failure) classes. XGBoost with manual oversampling exhibited the highest Kappa score (0.770). In contrast, all k-NN configurations featured lower values, with Kappa values ranging from 0.301 to 0.408. These findings show that it is important to select evaluation metrics aligned with operational priorities: accuracy and macro-F1 for general monitoring versus balanced accuracy, G-Mean, and Kappa for ensuring detection across all failure types.
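The relationship between these metrics can be illustrated with a short Python sketch that derives them from a confusion matrix. The 3×3 matrix below is hypothetical (it is not taken from the study), and G-Mean is omitted because its exact multiclass formula is not restated in this section. The sketch uses the standard definitions of accuracy, balanced accuracy (mean recall over all classes), macro-F1, and Cohen's Kappa, plus a failure-only average that excludes the majority class.

```python
def recalls(cm):
    """Per-class recall; rows are true classes, columns are predicted classes."""
    return [row[k] / sum(row) if sum(row) else 0.0 for k, row in enumerate(cm)]

def accuracy(cm):
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total

def balanced_accuracy(cm):
    """Mean recall over all classes, including the majority class."""
    r = recalls(cm)
    return sum(r) / len(r)

def failure_only_average(cm, majority_index=0):
    """Mean recall over failure classes only (majority class excluded)."""
    r = [x for k, x in enumerate(recalls(cm)) if k != majority_index]
    return sum(r) / len(r)

def macro_f1(cm):
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp
        fp = sum(cm[i][k] for i in range(n)) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n

def cohen_kappa(cm):
    """Agreement beyond chance: (p_o - p_e) / (1 - p_e)."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    p_o = accuracy(cm)
    p_e = sum(sum(cm[k]) * sum(cm[i][k] for i in range(n)) for k in range(n)) / total**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical matrix: classes are [NoFailure, HDF, TWF]; TWF is never detected.
cm = [
    [96, 4, 0],
    [2, 8, 0],
    [5, 0, 0],
]
print(f"accuracy             = {accuracy(cm):.3f}")
print(f"balanced accuracy    = {balanced_accuracy(cm):.3f}")
print(f"failure-only average = {failure_only_average(cm):.3f}")
print(f"macro-F1             = {macro_f1(cm):.3f}")
print(f"kappa                = {cohen_kappa(cm):.3f}")
```

In this toy case, accuracy stays above 0.9 even though one failure class is missed entirely, and the balanced accuracy exceeds the failure-only average because the high recall of the majority class is included in the former.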

4.6. Error Transfer Analysis

To understand why models with high aggregate performance fail to detect rare failures, we analyzed the confusion matrices of key configurations. Error transfer refers to the class to which misclassified instances are assigned. For example, if a TWF sample is predicted as NoFailure, we say the error is “transferred” to the NoFailure class. We selected four representative configurations from the 20 evaluated. These configurations were chosen to capture:

Baseline behavior: XGBoost with no handling (highest macro-F1, 0.842) and XGBoost with class weighting (algorithm-level handling without data modification).
Best failure detection: MLR with selective oversampling (highest TWF recall, 37.5%).
Alternative model behavior: k-NN with manual oversampling.

Table 7 presents the confusion matrix for MLR with selective oversampling. It shows that 3 out of 8 TWF instances are correctly classified (37.5% recall), while the remaining 5 TWF instances are misclassified as NoFailure. However, all 3 RNF instances are misclassified as NoFailure (0% recall), which means that even 20× oversampling fails to enable RNF detection. Moreover, 2 out of 4 Multiple instances are correctly classified, while the other 2 are misclassified as OSF.

Table 7 shows that PWF and OSF are relatively easy to detect, with recalls of 91% and 92% (10 out of 11 and 11 out of 12 detected), respectively. In contrast, Multiple, TWF, and RNF are more difficult to detect, with recalls of 50%, 38%, and 0%, respectively. Table 8 shows the error transfer analysis for the key configurations. XGBoost with no handling and XGBoost with class weighting exhibit the same pattern: all TWF and RNF errors transfer to NoFailure, while Multiple errors transfer to OSF. This explains why these configurations achieve high macro-F1 scores (0.842) yet fail to detect rare failures: the model simply predicts the majority class for the rare instances. This demonstrates that high aggregate metrics can mask a failure to detect critical failures. Moreover, for MLR with selective oversampling, only 62.5% of TWF instances go to NoFailure, while the remaining 37.5% are correctly detected. However, this improvement (37.5%) comes at the cost of redistributing errors rather than eliminating them, and the macro-F1 score decreases to 0.636. For the k-NN algorithm, TWF errors transfer to both NoFailure (62.5%) and HDF (25%), indicating different decision boundary dynamics compared to tree-based models.

RNF remained undetected across all configurations, even with 20× oversampling. This suggests a practical detection limit. Classes with fewer than 20 original samples may require different approaches rather than simple oversampling.
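The error transfer bookkeeping used in this section can be sketched in a few lines of Python. The counts below are the TWF, RNF, and Multiple rows of the confusion matrix for MLR with selective oversampling as described in the text; the function name and the dictionary layout are illustrative assumptions rather than the authors' implementation.

```python
def error_transfer(row, true_label):
    """Return (recall, transfers) for one true class.

    `row` maps predicted label -> count for instances of `true_label`.
    `transfers` maps each wrong destination class to its share of all
    instances of the true class.
    """
    total = sum(row.values())
    recall = row.get(true_label, 0) / total
    transfers = {
        predicted: count / total
        for predicted, count in row.items()
        if predicted != true_label and count > 0
    }
    return recall, transfers

# Rows as described for MLR with selective oversampling (Table 7).
twf_row = {"TWF": 3, "NoFailure": 5}
rnf_row = {"NoFailure": 3}
multiple_row = {"Multiple": 2, "OSF": 2}

twf_recall, twf_transfers = error_transfer(twf_row, "TWF")
rnf_recall, rnf_transfers = error_transfer(rnf_row, "RNF")
multiple_recall, multiple_transfers = error_transfer(multiple_row, "Multiple")

print(twf_recall, twf_transfers)            # 0.375 {'NoFailure': 0.625}
print(rnf_recall, rnf_transfers)            # 0.0 {'NoFailure': 1.0}
print(multiple_recall, multiple_transfers)  # 0.5 {'OSF': 0.5}
```

Reading the transfers alongside the recall makes it explicit whether a missed failure is absorbed by the majority class or confused with another failure type.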

5. Discussion

The results obtained provide evidence (within the scope of this study) that class imbalance handling is necessary for the development of effective PdM systems. Models trained without balancing achieve high accuracy but have a lower ability to detect failure.

5.1. Tradeoff Between Overall Performance and Failure Detection

A clear tradeoff exists between overall performance and critical failure detection. XGBoost with class weighting gives the best overall score (0.842 macro-F1); however, it fails to detect TWF. Meanwhile, MLR with selective oversampling exhibits a lower overall performance (0.636 macro-F1) but much better TWF detection (50% recall). Averaged across the four models in Table 2, every one-percentage-point increase in TWF recall is accompanied by a drop of approximately 0.470 percentage points in macro-F1 (about 0.0047 on the 0–1 scale). This rate varies between models, with MLR exhibiting the best balance, i.e., providing the greatest improvement in TWF recall for the smallest decrease in macro-F1 (ratio of 5.38). In contrast, XGBoost was less efficient (ratio of 1.31).

This difference comes from how each model works. XGBoost builds trees one after another and focuses on getting most predictions right by correcting the errors made by previous trees. Because TWF has only 34 samples, the model simply labels all TWF cases as normal. MLR works differently. It is a linear model that calculates probabilities for each class. When we add more TWF samples through oversampling, MLR shifts its decision boundary to catch some TWF cases. However, this shift causes more normal cases to be wrongly labeled as failures. This shows a basic tradeoff: helping a model detect rare failures often means it will make more mistakes on normal cases.
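The tradeoff figures discussed in this subsection can be reproduced approximately with the Python sketch below. It assumes that the tradeoff ratio is the gain in TWF recall divided by the loss in macro-F1 (a definition inferred from the tabulated values, not stated explicitly), and it uses the rounded macro-F1 and TWF recall values from Table 1, so small rounding differences from the published ratios are possible.

```python
# (macro-F1 with no handling, macro-F1 at the best-TWF method, best TWF recall),
# read from Table 1. TWF recall with no handling is 0 for all four models,
# so the best TWF recall equals the gain in TWF recall.
RESULTS = {
    "MLR": (0.729, 0.636, 0.500),      # best TWF: selective oversampling
    "k-NN": (0.560, 0.489, 0.167),     # best TWF: manual oversampling
    "XGBoost": (0.842, 0.714, 0.167),  # best TWF: selective oversampling
    "RF": (0.732, 0.669, 0.125),       # best TWF: manual oversampling
}

def tradeoff_ratio(f1_none, f1_best, twf_gain):
    """TWF recall gained per unit of macro-F1 lost (assumed definition)."""
    return twf_gain / (f1_none - f1_best)

def f1_drop_per_recall_point(f1_none, f1_best, twf_gain):
    """Macro-F1 percentage points lost per percentage point of TWF recall gained."""
    return (f1_none - f1_best) / twf_gain

ratios = {model: tradeoff_ratio(*values) for model, values in RESULTS.items()}
drops = [f1_drop_per_recall_point(*values) for values in RESULTS.values()]
average_drop = sum(drops) / len(drops)

for model, ratio in ratios.items():
    print(f"{model}: tradeoff ratio = {ratio:.2f}")
print(f"average macro-F1 drop per recall point = {average_drop:.3f}")
```

Under these assumptions, the per-model drop is simply the reciprocal of the tradeoff ratio, and the average over the four models comes out at roughly 0.47 percentage points of macro-F1 per percentage point of TWF recall.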

5.2. Toward a Decision Framework for PdM System Selection

To move beyond empirical observations and provide a systematic way to think about the trade-off between overall performance and rare failure detection, we propose the following decision framework. The challenge in PdM is balancing overall performance against the detection of rare but critical failures. This involves fundamental considerations of risk, cost, and decision priority. Failing to detect a rare failure such as TWF or RNF can lead to high operational or safety costs, while false alarms waste resources. Figure 3 presents a decision framework defined by two dimensions: overall performance (macro-F1) on the x-axis and priority for rare fault detection on the y-axis. Four quadrants emerge:

Efficiency Focus (Bottom-Right): High overall performance but low priority for rare fault detection. Configurations in this zone (e.g., XGBoost with class weighting) achieve high macro-F1 but may completely miss rare failures. This zone is suitable when the cost of missing a rare failure is low and the decision priority is general system monitoring.
Critical Fault Focus (Top-Left): Low overall performance but high priority for rare fault detection. Configurations in this zone sacrifice overall accuracy to achieve better detection of rare failures. This zone includes MLR with selective oversampling, which achieved 37.5% TWF recall. It is suitable when missing a rare failure has very high operational or safety impact.
Underdetection (Bottom-Left): Low overall performance and low detection priority. Configurations in this zone should be avoided. This zone includes k-NN with manual oversampling.
Balanced Monitoring (Top-Right): High overall performance and high detection priority. No configuration in this study achieved this zone, highlighting the fundamental trade-off in imbalanced multiclass PdM.

This framework guides practitioners to select configurations based on their priorities rather than aggregate metrics alone.

5.3. Selective Versus Full Oversampling

Our results also have implications related to differences between selective and full oversampling. Selective oversampling, which duplicates only the rarest classes (TWF and RNF), was the most efficient strategy because it adds minimal data (591 samples) and enables MLR to detect 50% of TWF. Although full oversampling improves the multiclass balance (e.g., 29.2% TWF recall with MLR), it requires more samples (1249) and causes SVM to fail entirely.
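To make the distinction concrete, the following Python sketch implements oversampling by plain duplication with a per-class replication factor. The function name, the toy labels, and the way the factors are assigned are hypothetical; only the 20× factor is borrowed from the text, and the study's own R implementation may differ. Such duplication would normally be applied to the training split only, so that test instances are never copied into training.

```python
from collections import Counter

def oversample_by_duplication(samples, labels, factors):
    """Repeat each sample factors.get(label, 1) times; other classes are kept as-is."""
    out_samples, out_labels = [], []
    for sample, label in zip(samples, labels):
        repeats = factors.get(label, 1)
        out_samples.extend([sample] * repeats)
        out_labels.extend([label] * repeats)
    return out_samples, out_labels

# Toy training set (illustrative only): feature vectors are placeholders.
labels = ["NoFailure"] * 10 + ["HDF"] * 3 + ["TWF"] * 2 + ["RNF"] * 1
samples = [[float(i)] for i in range(len(labels))]

# Selective oversampling: duplicate only the rarest classes.
selective = {"TWF": 20, "RNF": 20}
sel_samples, sel_labels = oversample_by_duplication(samples, labels, selective)

# Full oversampling: duplicate every failure class (factors are hypothetical).
full = {"HDF": 3, "TWF": 20, "RNF": 20}
full_samples, full_labels = oversample_by_duplication(samples, labels, full)

print(Counter(sel_labels))
print(Counter(full_labels))
print(len(sel_labels) - len(labels), len(full_labels) - len(labels))  # samples added
```

Selective duplication always adds fewer rows than full duplication, which is the source of the lower computational cost noted above.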

5.4. Multidimensional Analysis of Model Behavior and Metric Tradeoffs

Based on the results shown in Table 4, Table 5 and Table 6, further insights for strategy selection in imbalanced PdM can be obtained.

5.4.1. Class Coverage and Model Confidence

Table 4 shows that several models predicted only a subset of failure types, ignoring the rarest classes. This behavior is common for models trained without imbalance handling, which ignore minority classes while still achieving high aggregate performance. The k-NN algorithm with manual oversampling was the only configuration to predict seven classes in every experimental run; however, this did not translate to high recall for those classes (Table 5). This is because duplicated samples created local clusters that influenced predictions for every class, but these clusters were often in the wrong places, leading to incorrect predictions. Predicting a class does not guarantee predicting it correctly. The SVM with no handling predicted only three classes on average, exhibiting severe “class blindness” toward minority failures. Class weighting consistently improved class coverage; for example, the coverage obtained by the RF increased from 4.33 to 6 classes, and SVM coverage increased from 3 to 5 classes. This demonstrates that algorithm-level adjustments encourage models to consider all classes. Selective oversampling also boosted coverage for some models (e.g., from 3 to 4.33 classes for the SVM).

Ensuring that a PdM system attempts to recognize all failure types requires either data-level oversampling or algorithm-level weighting. However, class coverage alone is insufficient; it must be paired with recall evaluation to confirm actual detection capability.

5.4.2. Failure-Specific Detection Patterns

Table 5 breaks down recall by individual failure type, exposing which strategies work for which failures. MLR with class weighting achieved the highest average recall across all failure types (0.624), performing well on HDF, PWF, OSF, and multiple failures. XGBoost performed well in detecting HDF and PWF but failed to detect TWF and RNF across imbalance-handling methods, indicating that tree-based boosting models may overlook extremely rare classes regardless of the balancing strategy used. The k-NN algorithm with manual oversampling was the only configuration capable of detecting RNF (22.2% recall); however, it exhibited poor performance in detecting other failure types. This aligns with the instance-based learning approach of the k-NN algorithm, which can capture localized rare patterns when they are artificially reinforced through duplication. No single configuration exhibited the best performance across all failure types. Practitioners must therefore select imbalance-handling approaches according to the failure type of highest concern. For example, if TWF detection is the most important, MLR with selective oversampling is recommended; in contrast, if RNF is the priority, the k-NN algorithm with manual oversampling may be the only feasible option among the tested methods.

5.4.3. Metric Consistency and Operational Alignment

Table 6 presents five performance metrics. Although XGBoost with class weighting achieved the highest accuracy (0.9853) and macro-F1 (0.8421), it registered 0% TWF recall. This shows that it is not good practice to rely only on aggregate metrics in imbalanced settings. Using balanced accuracy, G-Mean, and Kappa provided a more balanced view. The SVM with class weighting has the highest G-Mean (0.8498), indicating a better balance between majority- and minority-class detection. In contrast, MLR with class weighting achieved the highest balanced accuracy (0.6732), indicating stronger overall recall across classes. Meanwhile, XGBoost, with manual oversampling, achieved the highest Kappa (0.770), indicating that its predictions were much better than random guessing. By comparison, k-NN configurations consistently yielded the lowest Kappa values (0.301–0.408), indicating that their predictions were only marginally superior to chance, despite exhibiting reasonable recall for some rare failures.

The choice of evaluation metric should therefore be made carefully. Systems optimized for accuracy or macro-F1 will favor majority-class performance, whereas those optimized for balanced accuracy, G-Mean, or Kappa will better reflect detection capability across all failure types. Organizations must decide whether their PdM system is intended for general health monitoring (prioritizing macro-F1) or critical-failure alerts (prioritizing balanced metrics).

5.4.4. Practical Recommendations for Model Selection

Table 4, Table 5 and Table 6 indicate that selecting a model and imbalance-handling strategy for PdM requires a multidimensional assessment. To guide this process, we recommend first verifying that the model predicts all failure classes of interest. Next, it is important to evaluate recall for vital failures (e.g., TWF and RNF), and to choose the model strategy that meets the minimum detection thresholds. At the same time, overall performance metrics, including accuracy and macro-F1, should remain within acceptable bounds. Finally, balanced metrics, such as G-Mean, balanced accuracy, and Kappa, should be examined to ensure that no hidden weaknesses exist, particularly when the system is expected to detect a wide range of failures.
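The threshold-based part of this procedure can be sketched as a simple filter over candidate configurations. The rows below are a subset of Table 1; the function name and the threshold values are hypothetical choices for illustration, and the class coverage check from the first step would need the per-configuration coverage from Table 4, which is not encoded here.

```python
# (model, method, macro-F1, TWF recall, RNF recall): a subset of Table 1.
CONFIGS = [
    ("XGBoost", "Class weighting", 0.842, 0.000, 0.000),
    ("XGBoost", "None", 0.842, 0.000, 0.000),
    ("XGBoost", "Selective oversampling", 0.714, 0.167, 0.000),
    ("MLR", "Selective oversampling", 0.636, 0.500, 0.000),
    ("MLR", "Manual oversampling", 0.628, 0.292, 0.000),
    ("MLR", "Class weighting", 0.614, 0.458, 0.000),
    ("k-NN", "Manual oversampling", 0.489, 0.167, 0.222),
    ("k-NN", "Selective oversampling", 0.471, 0.167, 0.222),
]

def select_configuration(configs, min_twf=0.0, min_rnf=0.0, min_macro_f1=0.0):
    """Keep configurations meeting all thresholds; return the one with the best macro-F1."""
    eligible = [
        c for c in configs
        if c[3] >= min_twf and c[4] >= min_rnf and c[2] >= min_macro_f1
    ]
    if not eligible:
        return None  # no tested configuration satisfies the requirements
    return max(eligible, key=lambda c: c[2])

general = select_configuration(CONFIGS)
twf_critical = select_configuration(CONFIGS, min_twf=0.4)
rnf_focus = select_configuration(CONFIGS, min_rnf=0.2)
both = select_configuration(CONFIGS, min_twf=0.4, min_macro_f1=0.8)

print(general[:2])       # ('XGBoost', 'Class weighting')
print(twf_critical[:2])  # ('MLR', 'Selective oversampling')
print(rnf_focus[:2])     # ('k-NN', 'Manual oversampling')
print(both)              # None
```

With these illustrative thresholds, the filter returns the same configurations that the text recommends for each priority, and it returns nothing when both high macro-F1 and high TWF recall are demanded.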

5.4.5. Implications for Practice

The findings in Section 5.4.1, Section 5.4.2, Section 5.4.3 and Section 5.4.4 have several practical implications. First, class coverage (Table 4) shows that models without imbalance handling often ignore rare failure types. Practitioners should verify that their model predicts all failure classes before evaluating performance. Second, Table 5 shows that no single configuration performs best across all failure types. If TWF detection is critical, MLR with selective oversampling is recommended despite lower overall metrics. If RNF detection is the priority, k-NN with manual oversampling may be the only viable option among those tested. Third, Table 6 shows that relying only on accuracy or macro-F1 can mask poor failure detection. Balanced accuracy, G-Mean, and Kappa provide a more realistic view of model performance in imbalanced settings.

5.5. Theoretical Analysis of Model Behavior

The observed differences in how models respond to imbalance handling come from how each model learns and makes decisions. For example, MLR is a linear model that learns global decision boundaries by maximizing the multinomial likelihood. Selective oversampling reweights the loss function in favor of rare classes such as TWF. This leads to a global adjustment of the decision boundaries toward minority classes. This explains why MLR shows the biggest improvement in TWF detection under oversampling. However, because this adjustment is global, it increases the likelihood of misclassifying majority class instances as minority classes, resulting in higher false positives.
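The claim that duplication acts as a reweighting of the loss can be checked numerically. The Python sketch below evaluates the multinomial negative log-likelihood for fixed class scores in two ways: on a dataset in which one rare sample is duplicated 20 times, and on the original dataset with a weight of 20 on that sample. The scores and labels are hypothetical, and the helper names are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def negative_log_likelihood(score_rows, labels, weights=None):
    """Weighted multinomial negative log-likelihood for fixed class scores."""
    if weights is None:
        weights = [1.0] * len(labels)
    return -sum(
        w * math.log(softmax(scores)[label])
        for scores, label, w in zip(score_rows, labels, weights)
    )

# Hypothetical linear-model scores for three samples over two classes
# (class 0 = NoFailure, class 1 = TWF); the third sample is the rare one.
score_rows = [[2.0, -1.0], [1.5, 0.0], [0.8, 0.3]]
labels = [0, 0, 1]

# Option A: duplicate the rare sample 20 times.
dup_rows = score_rows[:2] + [score_rows[2]] * 20
dup_labels = labels[:2] + [labels[2]] * 20
loss_duplicated = negative_log_likelihood(dup_rows, dup_labels)

# Option B: keep one copy but weight its loss term by 20.
loss_weighted = negative_log_likelihood(score_rows, labels, weights=[1.0, 1.0, 20.0])

print(loss_duplicated, loss_weighted)
```

Because the two losses coincide for any fixed parameters, a likelihood-based linear model trained on duplicated data optimizes the same objective as one trained with the corresponding sample weights, which is consistent with the global boundary shift described above.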

XGBoost builds trees one after another, where each new tree tries to fix the mistakes of the previous ones. The algorithm always picks the split that best separates the data at each step. When a class is very rare (like TWF with only 34 samples), splitting on features that separate that class adds very little benefit compared to splits that separate the majority class. As a result, without resampling the algorithm does not learn to recognize these rare failures, which is why it obtains 0% TWF recall with no handling and with class weighting.

k-NN classifies a new sample by finding its k nearest neighbors in the training set and choosing the most common class among those neighbors. Oversampling makes copies of rare samples, which creates small groups of identical points. These groups can influence nearby predictions. This allows k-NN to occasionally detect RNF because duplicated samples can dominate local neighborhoods. However, this same mechanism leads to over-prediction (false positives) as duplicated samples also influence neighborhoods where they should not.
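This mechanism can be reproduced with a toy one-dimensional example. The Python sketch below implements majority-vote k-NN with k = 3 and compares predictions before and after a single RNF sample is duplicated 20 times; the data points, the query values, and the function name are hypothetical and chosen only to illustrate the effect.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k training points closest to `query` (scalar feature)."""
    neighbors = sorted(train, key=lambda point: abs(point[0] - query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy training set: five NoFailure points and one isolated RNF point.
original = [(0.0, "NoFailure"), (0.2, "NoFailure"), (0.4, "NoFailure"),
            (0.6, "NoFailure"), (0.8, "NoFailure"), (0.5, "RNF")]

# Manual oversampling: the RNF point now appears 20 times.
oversampled = [p for p in original if p[1] != "RNF"] + [(0.5, "RNF")] * 20

near_rnf = 0.47     # closest to the RNF point
near_normal = 0.38  # closest to the NoFailure point at 0.4

before = (knn_predict(original, near_rnf), knn_predict(original, near_normal))
after = (knn_predict(oversampled, near_rnf), knn_predict(oversampled, near_normal))

print(before)  # ('NoFailure', 'NoFailure')
print(after)   # ('RNF', 'RNF')
```

After duplication, the query next to the RNF point is detected, but the query whose nearest neighbor is a normal sample is also labeled RNF, which mirrors the over-prediction described above.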

SVM finds a decision boundary by looking at the most important samples, called support vectors. As the dataset gets larger, the time and memory needed grow very fast. When oversampling made the dataset bigger, the number of support vectors grew so much that the computer ran out of memory and could not finish training. This is why SVM failed with manual oversampling.

These differences come from how each model handles duplicated samples. For linear models like MLR, duplication acts like giving more weight to those samples, which directly shifts the model. For tree-based models like XGBoost, duplication only matters if it changes which splits the tree chooses. For instance-based models like k-NN, duplication creates local clusters that influence nearby predictions. This explains why the same oversampling method affects each model differently. These theoretical distinctions highlight that no single imbalance-handling strategy works uniformly across models.

5.6. Comparison with Existing Literature

Our results agree with previous studies showing that XGBoost exhibits a good overall performance [13]; however, we go further by testing this approach under conditions of severe class imbalance and showing that it can fail to detect specific rare failures. We confirm the importance of handling imbalance, showing that ensemble methods are strong overall [3,16]. Our work also makes a key distinction. Although previous research such as Atere and Kivrak [23] achieved near-perfect recall using SMOTE with an RF on the same AI4I 2020 dataset, they used binary classification by aggregating all failure types into a single minority class for easier detection. Ghasemkhani et al. [25] also addressed concurrent failures through a multilabel framework using hybrid resampling and incremental learning; however, our study is multiclass. It requires a choice to be made for one failure type at a time. This is more challenging, especially for rare failures such as TWF or RNF, because each type is characterized by unique data patterns and scarcity. For this reason, simpler setups do not always work in practice, and it is necessary to identify the exact failure occurrence to plan the right repair.

Previous studies have used oversampling methods in PdM (e.g., [24,25]); in contrast, our work adds two important practical contributions. First, we highlight real-world limits that are often ignored in theory (SVM failure with oversampled data and the unique RNF capability of the k-NN algorithm). Second, we extend the approach beyond simply comparing methods (cf. [20]) by testing which aspects of the approach work and/or fail when handling multiple, very rare failure types.

Our error transfer analysis (Section 4.6) provides a novel contribution. While prior work such as Çiftpinar et al. [20] and Hosseinzadeh et al. [16] demonstrated that ensemble methods achieve high accuracy on the AI4I 2020 dataset, they did not conduct an error transfer analysis to check where errors go when models fail. Our findings show that high macro-F1 scores can be achieved while rare failure errors are transferred to the majority class. This insight explains why models with high aggregate performance can completely miss critical failures.

5.7. Managerial Implications

The findings of this study show that it is important for managers and practitioners to consider the following aspects:

Although PdM models often fail to predict a given failure, even a low detection rate can yield substantial savings for the failures that are predicted.
Determining whether the system prioritizes general monitoring or specific failure detection is important. Moreover, it should be recognized that improving critical failure detection typically reduces overall metrics.
Imbalance-handling approaches should be selected based on model architecture and computational constraints.
The practical detection limits for very rare failure classes must be understood.

6. Conclusions and Future Work

This study shows that class imbalance handling is important for developing effective PdM systems; however, different strategies should be used for different objectives. By evaluating four practical approaches (no handling, manual oversampling, selective manual oversampling, and class weighting) across five ML models on a multiclass PdM task, we provide insights for both researchers and practitioners.

This study identifies when and why different strategies should be selected based on operational requirements. For PdM with rare failure classes, achieving reliable critical failure detection implies computational costs (manual oversampling), methodological adaptations (selective approaches), and/or performance tradeoffs. For example, to manage critical failures with very few examples (<50), selective manual oversampling provides the best strategy for their reliable detection. Our results demonstrate a tradeoff between overall performance metrics and critical failure detection. XGBoost with class weighting delivers the optimal overall performance (0.842 macro-F1); however, it fails to detect TWF. Meanwhile, MLR with selective manual oversampling achieves the best TWF detection (50% recall) but with lower overall metrics (0.636 macro-F1).

Different models respond differently to imbalance handling. MLR shows the best response to selective oversampling. XGBoost maintains a robust overall performance but shows only a small improvement in TWF detection (16.7%). The k-NN algorithm provides a balanced but modest performance (16.7% TWF, 22.2% RNF recall). RF responds inconsistently across methods. SVM has computational constraints that limit its practical applicability. This study also reveals the practical detection limits for extremely rare failure classes. Random failures (15 training samples) remained difficult to detect even with 20× duplication (maximum 22.2% recall). Models with high aggregate scores may fail to recognize certain failure types, highlighting the importance of balanced evaluation and failure-specific analysis in imbalanced multiclass settings.

This study has some inherent limitations. First, it relies on a single synthetic dataset (AI4I 2020), which may not capture all real-world complexities or time-series PdM data. Second, hyperparameters were not tuned; default settings from the R packages were used to ensure fair comparison across imbalance-handling strategies. This may underrepresent the potential performance of some models. Third, only four imbalance-handling strategies were evaluated. Fourth, the selective oversampling strategy used fixed numbers that were chosen arbitrarily, not based on data-driven rules. This method does not consider how similar different failure classes are or the different costs of different types of errors. Future work should develop smarter oversampling methods that adjust based on the data and address these issues. Fifth, while our error transfer analysis (Section 4.6) reveals patterns suggesting blurred inter-class boundaries and feature overlap (e.g., TWF errors transferring to HDF, Multiple errors to OSF), we did not systematically quantify these factors using metrics such as inter-class distances or feature space measures. The impact of sample noise on detection difficulty was also not analyzed. Future work should address these aspects. Future work should also extend evaluation to time-series PdM data where imbalances may exist across classes and time. Additionally, future work should test strategies across different industrial datasets. Another direction of future research could involve multiclass synthetic generation, where the development of multiclass oversampling using interpolation is required. It may also be advantageous to apply and validate the findings of this study across different industrial domains with varying failure patterns. Future work could also compare the findings from conventional ML models with more advanced approaches such as graph neural networks or transfer learning methods, particularly for time-series PdM data or cross-domain applications.

Author Contributions

Conceptualization, M.A.; methodology, M.A.; formal analysis, M.A.; investigation, M.A.; resources, M.I.T., M.S.M.A.-A. and Z.M.; writing—original draft preparation, M.A.; writing—review and editing, M.I.T. and S.K.S.; visualization, M.A.; supervision, M.I.T.; project administration, M.I.T.; proofreading and validation of similarity and AI-assisted content compliance, S.K.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data used in this study are publicly available. The AI4I 2020 Predictive Maintenance Dataset can be accessed from the publication by Matzka [5] and the associated repository. No new datasets were generated during the current study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Lee, J.; Bagheri, B.; Kao, H.A. A cyber-physical systems architecture for Industry 4.0-based manufacturing systems. Manuf. Lett. 2015, 3, 18–23.
2. Sipos, R.; Fradkin, D.; Moerchen, F.; Wang, Z. Log-based predictive maintenance. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Association for Computing Machinery: New York, NY, USA, 2014; pp. 1867–1876.
3. Stow, M.T. Hybrid deep learning approach for predictive maintenance of industrial machinery using convolutional LSTM networks. Int. J. Comput. Sci. Eng. 2024, 12, 1–11.
4. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
5. Matzka, S. Explainable artificial intelligence for predictive maintenance applications. In 2020 Third International Conference on Artificial Intelligence for Industries (AI4I); IEEE: Piscataway, NJ, USA, 2020; pp. 69–74.
6. Torcianti, A.; Matzka, S. Explainable artificial intelligence for predictive maintenance applications using a local surrogate model. In 2021 4th International Conference on Artificial Intelligence for Industries (AI4I); IEEE: Piscataway, NJ, USA, 2021; pp. 86–88.
7. Autran, J.; Kuhn, V.; Diguet, J.; Dubois, M.; Buche, C. AI4I-PMDI: Predictive maintenance datasets with complex industrial settings’ irregularities. Procedia Comput. Sci. 2024, 246, 1201–1209.
8. Achouch, M.; Dimitrova, M.; Ziane, K.; Sattarpanah Karganroudi, S.; Dhouib, R.; Ibrahim, H.; Adda, M. On predictive maintenance in Industry 4.0: Overview, models, and challenges. Appl. Sci. 2022, 12, 8081.
9. Dalzochio, J.; Kunst, R.; Pignaton, E.; Binotto, A.; Sanyal, S.; Favilla, J.; Barbosa, J. Machine learning and reasoning for predictive maintenance in Industry 4.0: Current status and challenges. Comput. Ind. 2020, 123, 103298.
10. Hassan, I.; Panduru, K.; Walsh, J. Predictive maintenance in Industry 4.0: A review of data processing methods. Procedia Comput. Sci. 2025, 257, 896–903.
11. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
12. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
13. Farooq, U.; Ademola, M.; Shaalan, A. Comparative analysis of machine learning models for predictive maintenance of ball bearing systems. Electronics 2024, 13, 438.
14. Widodo, A.; Yang, B.S. Support vector machine in machine condition monitoring and fault diagnosis. Mech. Syst. Signal Process. 2007, 21, 2560–2574.
15. Saidi, L.; Ali, J.B.; Fraiech, F. Application of higher order spectral features and support vector machines for bearing faults classification. ISA Trans. 2015, 54, 193–206.
16. Hosseinzadeh, A.; Chen, F.F.; Shahin, M.; Bouzary, H. A predictive maintenance approach in manufacturing systems via AI-based early failure detection. Manuf. Lett. 2023, 35, 1179–1186.
17. Assagaf, I.; Ga, J.L.; Sukandi, A.; Abdillah, A.A.; Arifin, S. Machine predictive maintenance by using support vector machines. Recent Eng. Sci. Technol. 2023, 1, 31–35.
18. Assagaf, I.; Sukandi, A.; Abdillah, A.A. Machine failure detection using deep learning. Recent Eng. Sci. Technol. 2023, 1, 26–31.
19. Yürek, O.E.; Birant, D. Ordinal predictive maintenance with ensemble binary decomposition (OPMEB). Turk. J. Electr. Eng. Comput. Sci. 2024, 32, 534–554.
20. Çiftpinar, A.B.; Kanar, P.; Erzurum Cicek, Z.I. Failure prediction using ensemble learning: A comparative study with synthetic and real world datasets. Afyon Kocatepe Univ. Fen Ve Mühendis. Bilim. Derg. 2025, 25, 785–797.
21. He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284.
22. Ucar, A.; Karakose, M.; Kırımça, N. Artificial intelligence for predictive maintenance applications: Key components, trustworthiness, and future trends. Appl. Sci. 2024, 14, 898.
23. Atere, A.; Kivrak, H. A comparative evaluation of interpolation and generative oversampling techniques for predictive maintenance. In Proceedings of the International Symposium on AI-Driven Engineering Systems (ISADES), Tokat, Türkiye, 19–20 June 2025; SETSCI Conference Proceedings. Volume 22, pp. 20–26.
24. Bektasoglu, N.; Narin, A.; İleri, U. Performance analysis of the cheetah optimization algorithm in predictive maintenance forecasting. In Proceedings of the 5th International Artificial Intelligence and Data Science Congress (ICADA 2025); Necmettin Erbakan University Press: Konya, Türkiye, 2025; pp. 547–564.
25. Ghasemkhani, B.; Kut, R.A.; Birant, D.; Yilmaz, R. Balanced Hoeffding Tree Forest (BHTF): A novel multi-label classification with oversampling and undersampling techniques for failure mode diagnosis in predictive maintenance. Mathematics 2025, 13, 3019.
26. Araf, I.; Idri, A.; Chairi, I. Cost-sensitive learning for imbalanced medical data: A review. Artif. Intell. Rev. 2024, 57, 80.
27. Saito, T.; Rehmsmeier, M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 2015, 10, e0118432.
28. Zhang, J.; Chen, K.; He, R.; Huang, T.; Tian, J.; Wu, S.; Yan, P.; Cheng, Y. Remaining Useful Life Prediction Based on Interpretable Serialized Variational Autoencoder: A Drift-Diffusion Stochastic Equation Perspective. IEEE Trans. Ind. Inform. 2026. early access.
29. Zhang, J.; Wang, C.; Quan, Q.; Shen, Y. Source-Free Domain Adaptation for Cross-Domain Remaining Useful Life Prediction: A Distributed Federated Learning Perspective. Reliab. Eng. Syst. Saf. 2026, 271, 112271.
30. Xu, Z.; Chow, C.W.K.; Rahman, M.M.; Rameezdeen, R.; Law, Y.W. Remaining Useful Life Prediction for Bearings Across Domains via a Subdomain Adaptation Network Driven by Spectral Clustering. Sensors 2025, 25, 6919.
31. Khedr, A.B.A.; P V, P.R.; Khedr, A.M. Optimizing predictive maintenance in Industrial IoT using enhanced Genghis Khan Shark Optimizer. Procedia Comput. Sci. 2025, 270, 6076–6085.
32. Shwartz-Ziv, R.; Armon, A. Tabular data: Deep learning is not all you need. Inf. Fusion 2022, 81, 84–90.

Figure 1. Study Methodology Steps.

Figure 2. Top Configurations for TWF Detection.

Figure 3. Decision framework for PdM system selection.

Table 1. Overall Performance Ranking Using Macro-F1 Score.

Model	Method	Macro-F1	TWF Recall	RNF Recall
XGBoost	Class weighting	0.842	0	0
XGBoost	None	0.842	0	0
XGBoost	Manual oversampling	0.762	0.083	0
RF	None	0.732	0	0
MLR	None	0.729	0	0
XGBoost	Selective oversampling	0.714	0.167	0
SVM	Class weighting	0.696	0	0
RF	Selective oversampling	0.678	0	0
RF	Manual oversampling	0.669	0.125	0
SVM	Selective oversampling	0.658	0	0
SVM	None	0.656	0	0
RF	Class weighting	0.656	0	0
MLR	Selective oversampling	0.636	0.5	0
MLR	Manual oversampling	0.628	0.292	0
MLR	Class weighting	0.614	0.458	0
k-NN algorithm	None	0.56	0	0
k-NN algorithm	Class weighting	0.559	0	0
k-NN algorithm	Manual oversampling	0.489	0.167	0.222
k-NN algorithm	Selective oversampling	0.471	0.167	0.222

Table 2. Performance Tradeoff by Model Architecture.

Model	ΔMacro-F1 (None to Best-TWF Method)	ΔTWF Recall (None to Best-TWF Method)	Tradeoff Ratio
MLR	−0.093	0.5	5.38
k-NN algorithm	−0.071	0.167	2.35
XGBoost	−0.128	0.167	1.31
RF	−0.063	0.125	5.04
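
The tradeoff ratio in Table 2 can be read as TWF recall gained per unit of macro-F1 given up. The sketch below recomputes it under that assumed definition (ΔTWF recall divided by |ΔMacro-F1|), which is inferred from the MLR, k-NN, and XGBoost rows rather than stated in the table. The function name `tradeoff_ratio` and the `TABLE_2` structure are illustrative. Under this assumption the RF row evaluates to about 1.98 from the listed deltas, not the tabulated 5.04, so the sketch is an interpretation and not necessarily the authors' exact computation.

```python
# Rows of Table 2: (model, change in macro-F1, change in TWF recall, printed ratio).
TABLE_2 = [
    ("MLR", -0.093, 0.500, 5.38),
    ("k-NN", -0.071, 0.167, 2.35),
    ("XGBoost", -0.128, 0.167, 1.31),
    ("RF", -0.063, 0.125, 5.04),
]


def tradeoff_ratio(delta_macro_f1: float, delta_twf_recall: float) -> float:
    """Assumed definition: TWF recall gained per unit of macro-F1 given up."""
    if delta_macro_f1 == 0:
        raise ValueError("macro-F1 did not change, so the ratio is undefined")
    return delta_twf_recall / abs(delta_macro_f1)


# Recompute the ratio for every row from the two delta columns.
recomputed = {
    model: tradeoff_ratio(d_f1, d_recall) for model, d_f1, d_recall, _ in TABLE_2
}

# Print the recomputed value next to the value printed in the table.
for model, _, _, printed in TABLE_2:
    print(f"{model:8s} recomputed={recomputed[model]:.2f} printed={printed:.2f}")
```

A higher ratio means more TWF recall is bought for each point of macro-F1 lost, which is why MLR stands out among the rows that reproduce.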

Table 3. Recommended Configurations by Operational Priority.

Priority	Recommended Configuration	Performance	Rationale
General monitoring	XGBoost + class weighting	0.842 macro-F1, 0% TWF recall	Optimal overall metrics with minimal overhead
Critical TWF detection	MLR + selective oversampling	50% TWF recall, 0.636 macro-F1	Only viable path to reliable TWF detection
Balanced multiple failure	MLR + manual oversampling	29.2% TWF recall, 0.628 macro-F1	Compromise approach for multiple failure types
RNF detection focus	k-NN algorithm + manual oversampling	22.2% RNF recall, 0.489 macro-F1	Best RNF detection despite poor overall metrics

Table 4. Average Number of Classes Predicted Across Three Random Seeds by Model and Imbalance-Handling Method.

Method	Model	Seed 1	Seed 2	Seed 3	Avg Predicted Classes
No handling	RF	4	4	5	4.33
No handling	XGBoost	6	6	6	6
No handling	SVM	3	3	3	3
No handling	k-NN algorithm	5	6	5	5.33
No handling	MLR	5	5	5	5
Manual oversampling	RF	6	5	6	5.67
Manual oversampling	XGBoost	6	6	7	6.33
Manual oversampling	k-NN algorithm	7	7	7	7
Manual oversampling	MLR	6	6	6	6
Selective oversampling	RF	6	6	5	5.67
Selective oversampling	XGBoost	6	7	6	6.33
Selective oversampling	SVM	5	4	4	4.33
Selective oversampling	k-NN algorithm	6	7	7	6.67
Selective oversampling	MLR	6	6	6	6
Class weighting	RF	6	6	6	6
Class weighting	XGBoost	6	6	6	6
Class weighting	SVM	5	5	5	5
Class weighting	k-NN algorithm	5	6	5	5.33
Class weighting	MLR	6	6	6	6

Table 5. Average Recall Across All Failure Types by Model and Imbalance-Handling Method.

Method	Model	TWF	HDF	PWF	OSF	RNF	Multiple	Average
None	XGBoost	0	0.889	0.813	0.644	0	0.25	0.433
None	RF	0	0.635	0.75	0.4	0	0.083	0.311
None	MLR	0	0.556	0.708	0.822	0	0.417	0.417
None	SVM	0	0	0.438	0.311	0	0	0.125
None	k-NN algorithm	0	0.206	0.354	0.444	0	0.167	0.195
Manual oversampling	XGBoost	0.083	0.905	0.833	0.756	0	0.417	0.499
Manual oversampling	RF	0.125	0.762	0.771	0.578	0	0.333	0.428
Manual oversampling	MLR	0.292	0.698	0.854	0.822	0	0.667	0.556
Manual oversampling	k-NN algorithm	0.167	0.556	0.625	0.733	0.222	0.5	0.467
Selective oversampling	XGBoost	0.167	0.873	0.813	0.689	0	0.333	0.479
Selective oversampling	RF	0	0.429	0.688	0.422	0	0.083	0.270
Selective oversampling	MLR	0.5	0.556	0.708	0.822	0	0.333	0.487
Selective oversampling	SVM	0	0	0.438	0.311	0	0	0.125
Selective oversampling	k-NN algorithm	0.167	0.222	0.354	0.444	0.222	0.083	0.249
Class weighting	XGBoost	0	0.889	0.813	0.644	0	0.25	0.433
Class weighting	RF	0	0.889	0.875	0.756	0	0.417	0.490
Class weighting	MLR	0.458	0.921	0.938	0.844	0	0.583	0.624
Class weighting	SVM	0	0.952	0.896	0.889	0	0.333	0.512
Class weighting	k-NN algorithm	0	0.206	0.354	0.444	0	0.167	0.195

Table 6. Comparative Performance Metrics by Model and Imbalance-Handling Method.

Method	Model	Accuracy	Macro-F1	Balanced Accuracy	G-Mean	Kappa
Class weighting	XGBoost	0.9853	0.8421	0.5133	0.8105	0.749
None	XGBoost	0.9853	0.8421	0.5133	0.8105	0.749
Manual oversampling	XGBoost	0.986	0.7623	0.57	0.6653	0.77
Selective oversampling	XGBoost	0.9861	0.7139	0.5531	0.5735	0.766
None	RF	0.9808	0.7318	0.4095	0.6207	0.63
Class weighting	RF	0.969	0.6555	0.5593	0.7459	0.603
Manual oversampling	RF	0.9823	0.6694	0.5093	0.5695	0.696
Selective oversampling	RF	0.9781	0.6783	0.3743	0.5433	0.56
None	MLR	0.9818	0.7287	0.4999	0.6624	0.679
Class weighting	MLR	0.9623	0.6142	0.6732	0.7551	0.588
Manual oversampling	MLR	0.9731	0.6279	0.6166	0.671	0.638
Selective oversampling	MLR	0.9751	0.6355	0.5582	0.6041	0.63
None	SVM	0.972	0.6561	0.2498	0.506	0.315
Class weighting	SVM	0.9653	0.6964	0.5777	0.8498	0.591
Selective oversampling	SVM	0.9706	0.6583	0.2496	0.5058	0.307
None	k-NN algorithm	0.9736	0.5597	0.31	0.3802	0.406
Class weighting	k-NN algorithm	0.9736	0.5588	0.31	0.3802	0.408
Manual oversampling	k-NN algorithm	0.9439	0.4888	0.5372	0.5327	0.401
Selective oversampling	k-NN algorithm	0.9539	0.4707	0.3529	0.3696	0.301

Table 7. Confusion Matrix for MLR with Selective Oversampling (Seed 1).

Actual/Predicted	NoFailure	TWF	HDF	PWF	OSF	RNF	Multiple
NoFailure	1914	5	9	5	2	3	0
TWF	5	3	0	0	0	0	0
HDF	5	0	12	1	0	0	2
PWF	1	0	0	10	0	0	0
OSF	1	0	0	0	11	0	0
RNF	3	0	0	0	0	0	0
Multiple	0	0	0	0	2	0	2
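
Per-class recall, precision, and F1 follow directly from the confusion matrix in Table 7 (rows are actual classes, columns are predicted classes). The following is a minimal, dependency-free sketch. It assumes the macro-F1 and balanced accuracy are unweighted means over all seven classes, including NoFailure, which the table itself does not confirm. The names `CONFUSION` and `per_class_metrics` are illustrative. Because the matrix covers seed 1 only, the values need not match the seed-averaged figures in Tables 5 and 6; for example, TWF recall here is 3/8 = 0.375.

```python
LABELS = ["NoFailure", "TWF", "HDF", "PWF", "OSF", "RNF", "Multiple"]

# Table 7: rows = actual class, columns = predicted class (same order as LABELS).
CONFUSION = [
    [1914, 5, 9, 5, 2, 3, 0],
    [5, 3, 0, 0, 0, 0, 0],
    [5, 0, 12, 1, 0, 0, 2],
    [1, 0, 0, 10, 0, 0, 0],
    [1, 0, 0, 0, 11, 0, 0],
    [3, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 2, 0, 2],
]


def per_class_metrics(matrix, labels):
    """Return recall, precision, and F1 for each class of a square confusion matrix."""
    n = len(labels)
    metrics = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        actual = sum(matrix[i])                          # row sum
        predicted = sum(matrix[r][i] for r in range(n))  # column sum
        recall = true_pos / actual if actual else 0.0
        precision = true_pos / predicted if predicted else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"recall": recall, "precision": precision, "f1": f1}
    return metrics


metrics = per_class_metrics(CONFUSION, LABELS)

# Assumption: unweighted averages over all seven classes.
macro_f1 = sum(m["f1"] for m in metrics.values()) / len(metrics)
balanced_accuracy = sum(m["recall"] for m in metrics.values()) / len(metrics)

for label in LABELS:
    m = metrics[label]
    print(f"{label:10s} recall={m['recall']:.3f} precision={m['precision']:.3f} f1={m['f1']:.3f}")
print(f"macro-F1={macro_f1:.3f} balanced accuracy={balanced_accuracy:.3f}")
```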

Table 8. Error Transfer Analysis for Key Configurations.

Configuration	True Class	Total	Correct	Recall	Error Transfer (Destination/Count/%)
XGBoost + None	TWF	8	0	0%	NoFailure: 8 (100%)
	RNF	3	0	0%	NoFailure: 3 (100%)
	Multiple	3	0	0%	OSF: 3 (100%)
XGBoost + Class Weighting	TWF	8	0	0%	NoFailure: 8 (100%)
	RNF	3	0	0%	NoFailure: 3 (100%)
	Multiple	3	0	0%	OSF: 3 (100%)
MLR + Selective Oversampling	TWF	8	3	37.5%	NoFailure: 5 (62.5%)
	RNF	3	0	0%	NoFailure: 3 (100%)
	Multiple	4	2	50%	OSF: 2 (50%)
k-NN + Manual Oversampling	TWF	8	1	12.5%	NoFailure: 5 (62.5%), HDF: 2 (25%)
	RNF	3	0	0%	NoFailure: 3 (100%)
	Multiple	4	2	50%	NoFailure: 1 (25%), PWF: 1 (25%)
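
The error transfer entries in Table 8 can be derived from one confusion-matrix row per true class. The sketch below does this for the MLR + selective oversampling rows, using the TWF, RNF, and Multiple rows of Table 7. Percentages are taken relative to the class total, which matches the printed values (for example, 5 of 8 TWF cases sent to NoFailure is 62.5%). The helper name `error_transfer` and the returned dictionary layout are illustrative.

```python
LABELS = ["NoFailure", "TWF", "HDF", "PWF", "OSF", "RNF", "Multiple"]

# Rows of Table 7 (MLR + selective oversampling, seed 1) for the classes in Table 8.
ROWS = {
    "TWF": [5, 3, 0, 0, 0, 0, 0],
    "RNF": [3, 0, 0, 0, 0, 0, 0],
    "Multiple": [0, 0, 0, 0, 2, 0, 2],
}


def error_transfer(true_label, row, labels):
    """Summarize where the misclassified samples of one true class end up."""
    total = sum(row)
    if total == 0:
        raise ValueError(f"no samples for class {true_label!r}")
    correct = row[labels.index(true_label)]
    # Destination class -> (count, percentage of the class total), errors only.
    destinations = {
        labels[j]: (count, 100.0 * count / total)
        for j, count in enumerate(row)
        if labels[j] != true_label and count > 0
    }
    return {
        "total": total,
        "correct": correct,
        "recall_pct": 100.0 * correct / total,
        "destinations": destinations,
    }


for true_label, row in ROWS.items():
    summary = error_transfer(true_label, row, LABELS)
    moved = ", ".join(
        f"{dest}: {count} ({pct:g}%)"
        for dest, (count, pct) in summary["destinations"].items()
    )
    print(
        f"{true_label:8s} total={summary['total']} correct={summary['correct']} "
        f"recall={summary['recall_pct']:g}% -> {moved}"
    )
```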

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

