Article

Balanced Hoeffding Tree Forest (BHTF): A Novel Multi-Label Classification with Oversampling and Undersampling Techniques for Failure Mode Diagnosis in Predictive Maintenance

1 Graduate School of Natural and Applied Sciences, Dokuz Eylul University, Izmir 35390, Turkey
2 Department of Computer Engineering, Dokuz Eylul University, Izmir 35390, Turkey
3 Department of Electrical and Electronics Engineering, Dokuz Eylul University, Izmir 35390, Turkey
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(18), 3019; https://doi.org/10.3390/math13183019
Submission received: 5 August 2025 / Revised: 4 September 2025 / Accepted: 9 September 2025 / Published: 18 September 2025
(This article belongs to the Special Issue Artificial Intelligence for Fault Detection in Manufacturing)

Abstract

Predictive maintenance (PdM) is essential for reducing equipment downtime and enhancing operational efficiency. However, PdM datasets frequently suffer from significant class imbalance and are often limited to single-label classification, which fails to reflect the complexity of real-world industrial systems where multiple failure modes can occur simultaneously. As the main contribution, we propose the Balanced Hoeffding Tree Forest (BHTF), a novel multi-label classification framework that combines oversampling and undersampling strategies to mitigate data imbalance. BHTF leverages the binary relevance method to decompose the multi-label problem into multiple binary tasks and utilizes an ensemble of Hoeffding Trees to ensure scalability and adaptability to streaming data. In particular, BHTF unifies three learning paradigms, namely multi-label learning (MLL), ensemble learning (EL), and incremental learning (IL), providing a comprehensive and scalable approach for predictive maintenance applications. The key contribution of the proposed method is a hybrid data preprocessing strategy that introduces a novel undersampling technique, named Proximity-Driven Undersampling (PDU), and combines it with the Synthetic Minority Oversampling Technique (SMOTE) to deal with the class imbalance issue in highly skewed datasets. Experimental results on the benchmark AI4I 2020 dataset showed that BHTF achieved an average classification accuracy of 97.44%, outperforming state-of-the-art methods (88.94%) with an improvement of 11% on average. These findings highlight the potential of BHTF as a robust artificial intelligence-based solution for complex fault detection in manufacturing predictive maintenance applications.

1. Introduction

Predictive maintenance (PdM) has emerged as a critical strategy in modern industrial systems to ensure operational reliability, minimize unplanned downtime, and reduce maintenance costs. Unlike traditional maintenance approaches—either reactive (performed after a failure) or preventive (conducted at scheduled intervals)—PdM adopts a proactive approach by analyzing real-time operational data to detect early signs of equipment degradation. This enables maintenance interventions to be scheduled precisely when needed, thereby extending equipment lifespan and refining overall productivity. The increasing availability of sensor data, coupled with advancements in artificial intelligence (AI), has considerably enriched the applicability of PdM solutions across various industrial domains, particularly in fault prediction [1].
Modern predictive maintenance frameworks employ data-driven methods to continuously monitor machinery conditions, detect operational anomalies, and make informed, real-time decisions regarding maintenance scheduling. These systems process vast amounts of high-frequency sensor data embedded within industrial equipment, uncovering patterns and insights that are often imperceptible to human operators. A typical PdM workflow encompasses several critical stages: data acquisition through IoT-enabled sensors, preprocessing to clean and transform raw signals, machine learning (ML)-driven fault detection (FD) and prognosis to detect current failures and predict future breakdowns, and maintenance planning based on predictive analytics [2]. By seamlessly integrating these components, PdM solutions improve equipment reliability, minimize downtime, and enable proactive maintenance in complex industrial environments [3].
Despite notable advances in predictive maintenance, current PdM models predominantly rely on single-label classification frameworks, where each data instance is assigned only one failure mode or a binary label such as failure versus no failure. This simplification ignores the complexity of real-world industrial environments, where multiple failure modes often occur simultaneously (for example, concurrent tool wear and thermal faults), making single-label models insufficient for capturing the true condition of equipment. Moreover, PdM datasets typically exhibit severe data imbalance, with failure events being rare compared to normal operation, which can lead to biased models and reduced diagnostic accuracy [4,5]. These challenges underscore the need for advanced multi-label classification methods that can handle class imbalance and support robust predictive maintenance. Our study focuses on filling these gaps by introducing a novel multi-label classification method with both oversampling and undersampling techniques.
Unlike traditional single-label methods, multi-label classification models enable each data instance to be associated with multiple categories simultaneously [6]. Multi-label learning (MLL) is the machine learning methodology designed for such cases, where an observation may belong to more than one class at the same time, making it particularly suitable for complex domains like predictive maintenance. In our study, this capability allows for more nuanced modeling, captures interdependencies among failure modes, and supports targeted maintenance strategies. For instance, a machine component might simultaneously exhibit symptoms of both overheating and vibration-related wear—treating these as separate but co-occurring failure modes allows maintenance teams to apply a more accurate and efficient intervention. Incorporating multi-label learning into predictive maintenance frameworks can significantly boost their capacity to respond to the intricate failure patterns commonly encountered in modern industrial environments.
Although multi-label classification can provide a more accurate and comprehensive framework for failure diagnosis in predictive maintenance, its efficacy can be hindered by the prevalent issue of data imbalance. In industrial manufacturing datasets, certain failure modes appear far less frequently than others, resulting in biased learning, where rare but critical failure types may be overlooked. This problem becomes even more pronounced in multi-label scenarios, where some combinations of various classes are scarcely represented. To mitigate these challenges, we employed data-level balancing strategies—namely oversampling and undersampling. The oversampling technique increases the presence of minority labels by generating synthetic or duplicated instances, thereby increasing the model’s sensitivity to rare conditions. Conversely, undersampling reduces the dominance of majority classes by selectively removing redundant examples, contributing to preventing overfitting. When integrated appropriately, these artificial intelligence-powered techniques can significantly improve the performance of multi-label models, leading to more balanced and reliable fault detection in predictive maintenance applications.
To bridge the gap between the complexity of real-world failure scenarios and the limitations of current PdM approaches, this study introduces a novel method called the Balanced Hoeffding Tree Forest (BHTF). Tailored for multi-label learning in predictive maintenance, BHTF builds on the Hoeffding Tree algorithm [7], a fast, incremental learning (IL)-based decision tree that is well-suited for high-volume data because it continuously updates the model as new data stream in, without requiring complete retraining. BHTF extends this algorithm into an ensemble learning (EL)-based framework, where multiple classifiers are combined to improve stability, robustness, and predictive accuracy. To model multiple co-occurring failure modes, BHTF applies the binary relevance strategy, decomposing the multi-label problem into a set of independent binary classification tasks. This decomposition allows the ensemble to learn each failure type separately, while still capturing their potential co-occurrence patterns, thus improving both interpretability and diagnostic detail. Another key innovation of BHTF lies in its integrated handling of imbalanced classes, a common challenge in PdM datasets, where certain failure types are significantly underrepresented. To this end, BHTF introduces a novel undersampling technique and combines it with oversampling at the data preprocessing stage, achieving a more balanced class distribution and improving model performance without introducing excessive noise or overfitting. BHTF was evaluated on the AI4I 2020 dataset, which includes the co-occurrence of four industrially critical failure types, namely tool wear failure (TWF), heat dissipation failure (HDF), power failure (PWF), and overstrain failure (OSF), demonstrating consistent results across all categories.
The main contributions of this study are as follows:
(i).
Three learning paradigms integration: The proposed Balanced Hoeffding Tree Forest (BHTF) uniquely combines Multi-Label Learning (MLL), Incremental Learning (IL), and Ensemble Learning (EL), within a single framework. This integration allows BHTF to simultaneously handle multiple co-occurring failure modes, continuously update with streaming data, and leverage ensemble strategies for robust predictive performance.
(ii).
Introduction of BHTF for predictive maintenance: BHTF is a novel artificial intelligence-based method that applies both oversampling and undersampling techniques for multi-label classification in manufacturing environments, addressing challenges of data imbalance and real-world complexity for the first time.
(iii).
Multi-label failure mode diagnosis: BHTF predicts multiple failure types simultaneously using the binary relevance strategy, enabling detection of co-occurrence patterns and providing more detailed diagnostic insights for targeted maintenance actions.
(iv).
Hybrid class balancing: The method incorporates a hybrid data preprocessing strategy by proposing a novel undersampling technique, named Proximity-Driven Undersampling (PDU), and combining it with the Synthetic Minority Oversampling Technique (SMOTE), effectively mitigating class imbalance in highly skewed datasets.
(v).
Outperformance of existing methods: BHTF achieved an average accuracy of 97.44% when simultaneously predicting failure modes, an 11% improvement over state-of-the-art approaches. This result underscores its high potential for deployment in industrial predictive maintenance systems, particularly within manufacturing sectors.
The remainder of this article is organized as follows. Section 2 reviews related work on predictive maintenance that uses machine learning methods. Section 3 describes the proposed BHTF method in detail, including the model architecture and sampling strategies. Section 4 presents the experimental setup, dataset characteristics, evaluation metrics, and implementation details. Section 5 gives the results, and Section 6 compares the performance of BHTF with existing state-of-the-art methods. Finally, the last section concludes this study and outlines potential directions for future research.

2. Related Works

Based on the scope of this study, Table 1 presents a collection of representative predictive maintenance works from the past five years [8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32], focusing on various tasks, methods, and applications. The table includes key aspects of each study, organized into the following columns: machine, C, R, label, sampling, and purpose. The machine column indicates the type of equipment or component (e.g., wind turbine, conveyor belt, and bearing) to which the PdM methodology was applied for providing insight into the domain diversity. The C and R columns specify whether the addressed task is a classification or regression problem, respectively. The label column distinguishes between single-label (S) and multi-label (M) prediction tasks. The sampling column describes the data balancing techniques employed, namely oversampling (O) or undersampling (U), indicating whether the studies explicitly deal with class imbalance. The purpose column summarizes the specific goal of each PdM study, including failure prediction, fault detection, or anomaly detection. In addition, these PdM studies have been validated using various evaluation measures such as accuracy, precision, recall, F-measure, mean absolute error (MAE), root mean squared error (RMSE), and others. This structured overview enables a comparative understanding of recent trends and challenges in PdM research.
The methods employed span a wide spectrum of traditional and advanced artificial intelligence algorithms. Ensemble learning techniques like random forest (RF), extreme gradient boosting (XGBoost), LightGBM, and AdaBoost are frequently used [9,10,14,15,19,20,27]. Deep learning models, including convolutional neural networks (CNN), long short-term memory (LSTM), Autoencoders, and ResNet, appear in recent works [16,24,25,26,28,31] to show a shift toward learning complex temporal or image-based signals. In addition, classical models such as logistic regression (LR), support vector machine (SVM), K-nearest neighbors (KNN), decision tree (DT), and naive Bayes (NB) remain widely used across various PdM datasets [10,13,22,23].
The reviewed works apply PdM to a wide variety of machine types to indicate the cross-domain applicability of predictive maintenance. For instance, PdM was applied to vehicle components [8,10,11], conveyor belts [12,15], wind turbines [16,19,31], bearings [21,25], aircraft systems [9,26], and industrial machinery like rotors, gearboxes, electric motors, and pumps [17,20,27,29,30]. Some works focus on current transformers [14], compressors [13], and lumber machines [22] to reveal growing interest in applying PdM to smart and connected environments.
The studies reviewed span a wide range of task types. A significant number of works addressed classification tasks [8,10,11,12,13,14,15,17,18,19,20,21,22,23,24,25,26,27,29,30,31,32], aiming to detect, diagnose, or classify various fault types or failure modes. In contrast, regression tasks [16,28,30] focused on estimating continuous outcomes such as the remaining useful life (RUL) or time deviation. Some studies, such as [9,30], incorporated both classification and regression objectives simultaneously.
Regarding sampling strategies, some studies explicitly addressed data imbalance. Techniques such as SMOTE [8,14,18,24,25,27,31] and ADASYN [19] were used for oversampling minority class instances, while a few works applied undersampling [10,21], typically through random selection. Some works instead employed cost-sensitive learning (e.g., class weighting in [11]) or tackled imbalance through algorithm-level adjustments.
The purposes pursued across these studies vary, yet several key categories emerge. Failure prediction is the most prevalent objective [8,12,22,31], where the models forecast whether a failure will happen in the near future. Others focus on failure or fault detection and classification [10,14,15,17,19,20,21,25,27,30,32], where the goal is to identify the specific type or cause of failure. Anomaly detection [26] appears in cases where failures are rare and abnormal behavior is learned indirectly. Some papers discussed diagnostic monitoring [18] or RUL estimation [9,16,28], which are particularly relevant for long-term asset condition forecasting.
In terms of performance evaluation, the studies adopted a range of classification and regression metrics. For classification, accuracy, precision, recall, F-measure, AUC-ROC, TPR, and FPR are the most commonly reported. Some works also employed confusion matrices (CM), Hamming loss (HL), and the Matthews correlation coefficient (MCC) [17,19,21,27]. For regression, studies used MAE, RMSE, MSE, and R² [16,28,30]. These diverse metrics reflect different emphases on precision, robustness, or time-based performance, depending on the application domain.
Overall, the reviewed literature demonstrates a growing diversity of predictive maintenance methods, broader application domains, and increasing attention to practical challenges such as data imbalance. Nevertheless, regarding label structure, previous studies commonly formulated PdM as a single-label problem, predicting one failure or health state at a time. However, in this study, we explored multi-label learning, reflecting cases where multiple failure types can co-occur or where the system needs to predict several outputs in parallel. Only a few studies have simultaneously tackled the dual challenges of multi-failure diagnosis and severe class imbalance using a combination of both oversampling and undersampling techniques. In contrast, our proposed BHTF framework directly addresses these limitations by coupling multi-label classification with a hybrid resampling strategy—integrating PDU and SMOTE. This approach boosts learning from imbalanced data while preserving failure diversity, making BHTF particularly effective for diagnosing complex failure modes in industrial manufacturing machinery. These priorities differentiate our method and establish the foundation for the contributions detailed in the following sections.
An additional key design choice in our work is the adoption of the Hoeffding Tree (HT) [7] classifier as the base learner in the proposed ensemble. HTs are specifically designed for high-speed data streams and support instance-wise, incremental learning in constant time, making them highly suitable for real-time, artificial intelligence-based predictive maintenance scenarios. A major advantage of HTs is their ability to handle uncertainty in learning time by offering a fixed computational cost per instance, while producing decision trees that closely approximate those built by conventional batch learners. This makes HTs exceptionally efficient for mining continuous or large-scale industrial data.
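The approximation property mentioned above rests on the Hoeffding bound: after n observations of a random variable with range R, the true mean lies within ε = sqrt(R² ln(1/δ) / (2n)) of the observed mean with probability 1 − δ. The following minimal Python sketch shows how a tree node could use this bound to decide when enough instances have been seen to commit to a split. The function names, the default δ, and the gain values in the usage note are illustrative assumptions, not settings reported in this study.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """epsilon = sqrt(R^2 * ln(1/delta) / (2 * n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 n_classes: int, n_seen: int, delta: float = 1e-7) -> bool:
    """Split once the observed gain gap exceeds the Hoeffding bound.

    For information gain the range R is log2(number of classes).
    The default delta is an assumed value chosen for illustration.
    """
    value_range = math.log2(n_classes)
    epsilon = hoeffding_bound(value_range, delta, n_seen)
    return (best_gain - second_gain) > epsilon
```

With a hypothetical gain gap of 0.25 on a binary label, `should_split(0.30, 0.05, 2, 100)` is still false, while `should_split(0.30, 0.05, 2, 200)` is true, because the bound shrinks as more instances arrive. Each instance only updates counters at a leaf, which is what keeps the per-instance cost fixed.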
The effectiveness of the HT classifier has been demonstrated in various studies [33,34,35,36,37,38,39,40,41,42]. A prior study [33] showed that HT outperformed traditional classifiers such as NB, MLP, LR, logistic model trees (LMT), and sequential minimal optimization, owing to its strong generalization ability. It obtained higher accuracy rates than even ensemble methods like AdaBoost and Random Forests in various classification tasks [33,34,35,36,37]. It achieved better results than alternative approaches, including Random Tree, Reduced Error Pruning Tree, Decision Stump, J48, RF, and LMT [36]. Similarly, HT surpassed a range of algorithms (J48, LR, RF, SVM, PART, K-Star, and OneR) in diabetes detection tasks [37]. In [38], a Hoeffding Tree-based model outperformed neural networks, decision trees, SVM, KNN, and ensemble methods in detecting network anomalies. Furthermore, in oil-well drilling applications, HT demonstrated superior predictive accuracy and adaptability to concept drift compared to XGBoost for rate of penetration (ROP) prediction [39]. On a validation dataset, HT attained the highest AUC value (0.802), ahead of the LMT and Bayesian Network models (0.761 and 0.764, respectively) [40]. Likewise, HT achieved superior accuracy compared to models like Logistic Regression in the detection of security attacks for IoT devices [41]. In [42], the Hoeffding Tree was reported to perform best among a series of counterparts, such as J48, DT, RF, NB, Bayesian Network, and KNN.
Further, a critical review of the literature highlights that while prior studies have leveraged multi-label learning, incremental learning, or ensemble learning individually, none have integrated all three paradigms concurrently in predictive maintenance tasks. For instance, one of the few attempts at multi-label learning in PdM is represented in [17], which employs BR, CC, LP, and multi-label KNN to capture co-occurring fault types. Several works have utilized algorithms with incremental learning capabilities to adapt to streaming or sequential data, for example, LSTM [8,9,16,24], DT [10,11,12,15,19,30], NB [10], KNN [30], LR [12,13,15,23], and BGRU [26], yet these studies remained limited to single-label fault prediction. On the other hand, studies have used ensemble learning methods, such as RF [9,10,11,14,15,19,21,27,30,31], boosting techniques, including AdaBoost, XGBoost, CatBoost, LightGBM, and GB [9,10,11,15,20,21,27,29], and Bagging [20], to improve classification robustness, but again within a single-label context. Importantly, none of these prior works integrate all three paradigms simultaneously.
In contrast, the proposed BHTF method uniquely combines multi-label learning, incremental learning, and ensemble learning in a single framework, enabling multi-label fault diagnosis in dynamic data streams while effectively handling imbalance through hybrid sampling. By extending Hoeffding Trees into an ensemble and applying both oversampling and undersampling techniques (SMOTE + PDU), the BHTF method enhances predictive performance while preserving the advantages of incremental learning, offering a more comprehensive solution than existing approaches that rely on only one or two paradigms.

3. Materials and Methods

3.1. Proposed Method

The overall architecture of the proposed Balanced Hoeffding Tree Forest (BHTF) method is illustrated in Figure 1. The model is designed to address classification challenges in the domain of predictive maintenance, where different types of machinery failures can occur simultaneously. The framework consists of several interconnected stages, from data preparation to model training and evaluation, as described below:
  • Data collection: The predictive maintenance dataset usually contains sensor-based data collected from industrial machinery operating under various conditions. The dataset can include input features such as temperature, rotational speed, and torque related to machinery to reflect real-time machine behavior. These data are typically gathered in manufacturing environments, where operational precision is critical.
  • Multi-label dataset construction: The dataset includes several target variables, each representing a different failure type that may occur concurrently. This setup naturally forms a multi-label learning problem, where each machine instance can be associated with multiple failure types. To address this, the classification task is modeled using the binary relevance (BR) [43] strategy, which decomposes the multi-label problem into independent binary classification tasks—one for each failure. For each label, the value is 1 if the corresponding failure occurs, and 0 otherwise.
  • Data preprocessing: To improve data quality and prepare it for modeling, the following preprocessing steps can be performed:
    Cleaning: It involves detecting and removing errors, duplicates, and inconsistencies in the data to ensure that the model is trained on high-quality and reliable input. In addition, it also includes the removal of unique identifiers since they do not contribute predictive value. Some data cleaning techniques can also be applied to handle any missing values in the features.
    Feature selection: Redundant or irrelevant features are removed to reduce overfitting, improve accuracy, and decrease computational cost.
    Oversampling and undersampling (hybrid resampling): Due to the inherent data imbalance in the PdM dataset—where failure cases are considerably underrepresented compared to the healthy class—a hybrid resampling strategy was employed after identifying the minority and majority classes dynamically based on their frequencies.
    First, the synthetic minority over-sampling technique (SMOTE) [44] was applied to generate synthetic samples for minority failure classes. It is a widely used method for mitigating class imbalance by generating synthetic samples for minority classes rather than merely duplicating existing ones. Unlike basic oversampling methods such as random oversampling, which risk overfitting by repeating identical instances, SMOTE generates diverse new samples through interpolation between similar minority class instances in the feature space.
    Then, our proposed PDU method was used to reduce the number of healthy (non-failure) instances, resulting in a more balanced and learnable dataset. These steps are essential to prevent model bias toward the majority class and enhance the model’s ability to identify rare failure types, thereby contributing to more effective fault detection.
  • Label-wise dataset separation: Following data balancing, the multi-label dataset was decomposed into multiple binary datasets, one for each of the four selected failure types. Each dataset contains the same feature set but is independently labeled according to whether the corresponding failure occurred. This decomposition aligns with the binary relevance framework and enables independent model training for each failure mode.
  • Model training (Hoeffding Tree forest construction): Each balanced dataset was used to train several Hoeffding Trees, which are well-suited for efficiently processing large-scale scenarios. Hoeffding Trees inherently support incremental learning, allowing the model to adapt continuously to streaming or sequential data without retraining from scratch. The result is a collection of Hoeffding Tree models, each focused on detecting a specific failure mode. This process leverages the speed, adaptability, and online learning capabilities of lightweight artificial intelligence algorithms in dynamic industrial environments.
  • Model aggregation (BHTF): The individually trained Hoeffding Trees for each failure mode were combined to form BHTF. This ensemble learning structure utilizes the efficiency and scalability of Hoeffding Trees while enabling multi-label predictions across multiple failure types simultaneously. Although each tree is trained independently, the ensemble facilitates an integrated diagnosis of probable co-occurring failures within a single inference step.
  • Prediction and evaluation: The BHTF model is applied to new, unseen data to predict possible failure types. The performance of the model is evaluated using standard classification metrics, including accuracy, precision, recall, F-measure, and the confusion matrix, measured for each label as well as overall. This evaluation framework ensures a comprehensive understanding of the model’s ability to correctly diagnose failure modes in manufacturing systems.
  • Presentation: It involves effective visualization of model outputs and integration with business rules. It includes decision making, where predictions inform or directly drive actions—ranging from automated responses to human-guided choices.
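The label-wise training and aggregation stages listed above can be sketched in a few lines of Python. This is only a structural illustration under stated assumptions: `RunningCentroid` is a hypothetical stand-in incremental learner (it is not a Hoeffding Tree), random instance subsampling (`sample_prob`) is an assumed way to make ensemble members differ, and majority voting is an assumed aggregation rule, since the text above does not specify either.

```python
import random
from collections import defaultdict


class RunningCentroid:
    """Stand-in incremental binary learner (NOT a Hoeffding Tree):
    keeps running feature sums per class and predicts the nearest centroid."""

    def __init__(self):
        self.sums = {}                  # class -> per-feature running sums
        self.counts = defaultdict(int)  # class -> number of instances seen

    def learn_one(self, x, y):
        if y not in self.sums:
            self.sums[y] = [0.0] * len(x)
        self.sums[y] = [s + v for s, v in zip(self.sums[y], x)]
        self.counts[y] += 1

    def predict_one(self, x):
        if not self.sums:
            return 0  # nothing learned yet: default to "no failure"

        def sq_dist(label):
            n = self.counts[label]
            return sum((v - s / n) ** 2 for v, s in zip(x, self.sums[label]))

        return min(self.sums, key=sq_dist)


class LabelwiseForest:
    """Binary relevance ensemble: several incremental learners per label."""

    def __init__(self, labels, n_trees=5, sample_prob=0.8, seed=0):
        self.labels = list(labels)
        self.rng = random.Random(seed)
        self.sample_prob = sample_prob  # assumed diversification mechanism
        self.members = {lab: [RunningCentroid() for _ in range(n_trees)]
                        for lab in self.labels}

    def learn_one(self, x, label_vector):
        # label_vector[i] is 1 if failure self.labels[i] occurred, else 0.
        for lab, y in zip(self.labels, label_vector):
            for member in self.members[lab]:
                if self.rng.random() < self.sample_prob:
                    member.learn_one(x, y)

    def predict_one(self, x):
        # One majority vote per label (assumed aggregation rule).
        prediction = []
        for lab in self.labels:
            votes = sum(m.predict_one(x) for m in self.members[lab])
            prediction.append(1 if 2 * votes > len(self.members[lab]) else 0)
        return prediction
```

Because both classes expose only `learn_one` and `predict_one`, the model is updated one instance at a time, and any incremental base learner offering the same two methods could replace the stand-in.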

3.2. Multi-Label Learning

Given that different failure types in industrial machinery can occur simultaneously, the classification task in this study naturally aligns with a multi-label learning (MLL) framework. In this setting, each instance may be associated with one or more target labels, corresponding to different failure modes. Formally, an MLL problem is defined over a training dataset $D = \{(x_i, Y_i)\}_{i=1}^{n}$, where each $x_i = (x_{i1}, x_{i2}, \ldots, x_{im})$ is a feature vector with $m$ attributes, and $Y_i \subseteq L$ is a label subset drawn from the full label set $L = \{y_1, y_2, \ldots, y_q\}$, with $q$ denoting the total number of possible labels. The objective is to learn a mapping function $G$ that predicts the appropriate label subset $\hat{Y}$ for an unseen instance $x$, i.e., $G(x) = \hat{Y}$.
To manage this complexity, our proposed method utilizes the binary relevance strategy, one of the most widely adopted approaches for multi-label classification. The core idea of BR is to decompose the multi-label task into $q$ independent binary classification problems, one for each label. Each classifier is trained to predict the presence or absence of a particular label, treating all other labels as irrelevant.
This process begins by transforming the original dataset into $q$ binary-labeled datasets $D_{y_j}$ for $j = 1, 2, \ldots, q$. Each dataset retains the same feature vectors as the original data but replaces the multi-label targets with binary labels: a sample is marked positive if it includes the target label $y_j$, and negative otherwise. A separate binary classifier $h_j$ is then trained for each dataset $D_{y_j}$. During inference, a new instance is evaluated across all $q$ models, and the predicted label set $\hat{Y}$ is formed by aggregating the labels for which the corresponding classifiers output a positive result, as in Equation (1):
$$G(x) = \{\, y_j \in L \mid h_j(x) = 1 \,\} \tag{1}$$
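The aggregation in Equation (1) amounts to a set comprehension over the per-label classifiers. In the minimal Python sketch below, the four classifiers are hypothetical threshold rules, and the feature names and thresholds are made up for illustration; they stand in for trained models and are not taken from this study.

```python
from typing import Callable, Dict, Mapping, Set

Instance = Mapping[str, float]


def predict_label_set(x: Instance,
                      classifiers: Dict[str, Callable[[Instance], int]]) -> Set[str]:
    """Equation (1): keep every label whose binary classifier outputs 1."""
    return {label for label, h_j in classifiers.items() if h_j(x) == 1}


# Hypothetical stand-in classifiers; thresholds are invented for this example.
toy_classifiers = {
    "TWF": lambda x: int(x["tool_wear"] > 200),
    "HDF": lambda x: int(x["temp_diff"] < 8.6),
    "PWF": lambda x: int(x["power"] > 9000),
    "OSF": lambda x: int(x["tool_wear"] * x["torque"] > 11000),
}

sample = {"tool_wear": 210, "temp_diff": 9.5, "power": 6000, "torque": 60}
print(sorted(predict_label_set(sample, toy_classifiers)))  # ['OSF', 'TWF']
```

An instance for which no classifier fires yields the empty set, which corresponds to a healthy (non-failure) prediction.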
Table 2 presents a typical multi-label dataset, where each instance $S_i$ is represented by a feature vector $x_i$ and an associated subset of labels $Y_i \subseteq L$. For example, $S_1$ is linked to failure types $y_1$ and $y_3$, while $S_2$ corresponds to all four simultaneous failure modes: $y_1$, $y_2$, $y_3$, and $y_4$. This exemplifies the multi-label nature of the dataset, in which instances can belong to multiple classes concurrently. The label subsets demonstrate how complex failure conditions are encoded in the learning framework.
An example of this transformation is illustrated in Table 3, which shows how a multi-label dataset is converted into multiple binary datasets, one per label, under binary relevance. For instance, if a sample is associated with labels $y_1$ and $y_3$, it will be treated as positive in the binary datasets for $y_1$ and $y_3$, and as negative in the datasets for the remaining labels.
In this setting, each output is encoded as a binary vector (e.g., [0, 1, 0, 1]), where each position corresponds to a specific failure mode: 1 denotes the presence and 0 the absence of that mode. This representation allows classifiers to be effectively used for solving multi-label learning tasks. The binary relevance approach is modular and computationally efficient, with a complexity that scales linearly with the number of labels $q$ and the cost $C$ of the base classifier, i.e., $O(q \times C)$. BR is a simple and scalable technique, making it particularly suitable for our predictive maintenance scenario, where labels are sparse yet well-defined. In this study, we utilized the BR technique by incorporating failure-specific resampling strategies (SMOTE and PDU) and ensemble learning via Hoeffding Tree forest, which together enhance the model’s ability to learn from multi-label imbalanced data.
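The binary relevance transformation and the binary vector encoding described in this section can be sketched as follows. The two instances mirror the S1 and S2 examples from the text, while the feature values and the helper names `to_indicator` and `binary_relevance_split` are hypothetical and used only for illustration.

```python
from typing import Dict, List, Sequence, Set, Tuple


def to_indicator(label_set: Set[str], labels: Sequence[str]) -> List[int]:
    """Encode a label subset as a binary vector, one position per label."""
    return [1 if lab in label_set else 0 for lab in labels]


def binary_relevance_split(
    X: List[List[float]], Y: List[Set[str]], labels: Sequence[str]
) -> Dict[str, Tuple[List[List[float]], List[int]]]:
    """Build one binary dataset per label; all share the same feature vectors."""
    return {lab: (X, [1 if lab in y_set else 0 for y_set in Y]) for lab in labels}


labels = ["y1", "y2", "y3", "y4"]
X = [[0.2, 1.5], [0.9, 3.1]]                  # hypothetical features for S1, S2
Y = [{"y1", "y3"}, {"y1", "y2", "y3", "y4"}]  # label subsets from the text

datasets = binary_relevance_split(X, Y, labels)
print(to_indicator(Y[0], labels))  # [1, 0, 1, 0]
print(datasets["y2"][1])           # [0, 1]
```

The loop over labels is what gives the linear O(q × C) cost: each of the q binary datasets is built and later learned independently.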

3.3. Hybrid Resampling Strategy

One of the fundamental challenges in predictive maintenance is the severe data imbalance in class distribution, where failure events are extremely rare compared to normal operating conditions. This imbalance can undermine the performance of machine learning models, as they tend to be biased toward the majority class (i.e., healthy states), leading to poor sensitivity in detecting rare but critical failure modes.
To address this issue, the proposed BHTF method integrates a hybrid resampling strategy consisting of two stages: oversampling of minority failure classes using the SMOTE algorithm, and undersampling of the majority (healthy) class via PDU as a novel filtering technique over the multi-label dataset. This combined approach enhances the model’s ability to learn from scarce failure data and increases its sensitivity in detecting multiple, potentially co-occurring failure types. The following subsections detail the oversampling and undersampling methodologies applied in our framework:

3.3.1. Oversampling with SMOTE

To address the severe imbalance between healthy and failure states in the dataset, the proposed BHTF method employs SMOTE to increase the representation of failure samples. As discussed, each fault type is modeled separately under a binary relevance transformation. For each binary dataset, SMOTE is applied to the minority class, corresponding to the presence of a specific failure mode. Mainly, the following steps are executed:
Identify the minority class in the current binary-labeled dataset.
For each minority instance $x_i$, find its $k$-nearest neighbors from the same class using the Euclidean distance over numeric features, as given in Equation (2):
$$d(x_i, x_j) = \sqrt{\sum_{l=1}^{d} (x_{il} - x_{jl})^2} \tag{2}$$
Randomly select one neighbor $x_{NN}$, and generate a new synthetic sample using the linear interpolation in Equation (3):
$$x_{syn} = x_i + \delta \times (x_{NN} - x_i) \tag{3}$$
where $x_i$ is the original minority instance, $\delta \in [0, 1]$ is a random number drawn from a uniform distribution, and $x_{NN}$ is the selected neighbor. Alternatively, the synthetic sample can be expressed as a convex combination of the original instance and its neighbor through Equation (4), which highlights that the new instance lies on the line segment connecting two minority samples in the feature space:
$$x_{syn} = (1 - \delta) \times x_i + \delta \times x_{NN} \tag{4}$$
Repeat this process until the minority class size has increased by a predefined oversampling ratio.
This oversampling process is applied to each binary dataset independently, allowing the model to better learn the subtle variations and minority patterns of each failure mode.
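The interpolation in Equations (2) and (3) can be illustrated with a short Python sketch. It is a simplified approximation under stated assumptions rather than the implementation used in this study: the function names and the `per_instance` parameter (number of synthetic samples generated per minority instance) are hypothetical, and only numeric features are handled.

```python
import math
import random
from typing import List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two numeric feature vectors (Equation (2))."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def smote(minority: List[Vector], k: int, per_instance: int,
          rng: random.Random) -> List[List[float]]:
    """Generate synthetic minority samples by linear interpolation (Equation (3))."""
    synthetic: List[List[float]] = []
    for i, x_i in enumerate(minority):
        # k nearest neighbors of x_i within the minority class, excluding x_i itself.
        others = [x for j, x in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda x: euclidean(x_i, x))[:k]
        if not neighbors:
            continue  # a single minority instance has nothing to interpolate with
        for _ in range(per_instance):
            x_nn = rng.choice(neighbors)   # randomly selected neighbor
            delta = rng.random()           # uniform random number in [0, 1)
            synthetic.append([a + delta * (b - a) for a, b in zip(x_i, x_nn)])
    return synthetic


minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority_points, k=2, per_instance=3, rng=random.Random(0))
print(len(new_points))  # 4 minority instances x 3 samples each
```

Because every synthetic point is a convex combination of two minority instances, it always falls on the segment between them, as stated by Equation (4).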

3.3.2. Undersampling with PDU

In parallel with oversampling, the proposed method introduces a novel undersampling technique—Proximity-Driven Undersampling (PDU)—to selectively reduce the number of majority class (i.e., healthy) instances. This step aims to balance the dataset further by removing potentially noisy or less informative majority samples located close to minority class (i.e., failure) instances, thereby preserving critical decision boundaries for accurate fault detection in manufacturing systems.
The PDU technique operates by utilizing local proximity analysis in the feature space. Specifically, for each minority instance (i.e., label = 1), the algorithm identifies its nearest neighbor. If the nearest neighbor belongs to the majority class (i.e., label = 0), it is removed from the training set. This process is repeated iteratively for up to a user-specified number of iterations, allowing the method to clean the immediate surrounding region of each minority instance from potentially ambiguous majority samples. It yields a balanced and locally denoised training dataset with reduced overlap near class boundaries, optimized for improved minority class recognition and addressing data imbalance in predictive maintenance contexts.
The PDU method operates as follows:
  • Take the binary dataset $D$ for a given target class label as input.
  • For a minority instance $x_i \in D_{minority}$ (i.e., label = 1), compute the Euclidean distance according to Equation (2) between $x_i$ and all other instances.
  • Identify its nearest neighbor, denoted by $x_{NN}$.
  • If $x_{NN}$ belongs to the majority class (i.e., label = 0), remove it from the training set.
  • Repeat the previous two steps (nearest-neighbor identification and removal) up to a user-specified number of iterations, denoted by $U$.
  • Return to the distance-computation step and repeat the same process for every remaining minority instance in the dataset.
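The PDU steps listed above can be sketched as follows. This is a minimal interpretation under stated assumptions, not the authors' implementation: the function name `pdu`, the in-memory list representation, and the choice to stop as soon as the nearest remaining neighbor is a minority instance are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Set, Tuple

Instance = Tuple[Sequence[float], int]  # (feature vector, binary label)


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def pdu(data: List[Instance], max_removals: int) -> List[Instance]:
    """Remove up to `max_removals` nearest majority neighbors per minority instance."""
    removed: Set[int] = set()
    for i, (x_i, y_i) in enumerate(data):
        if y_i != 1:  # only minority (label = 1) instances drive the removal
            continue
        for _ in range(max_removals):
            # Search among instances that have not been removed yet.
            candidates = [j for j in range(len(data)) if j != i and j not in removed]
            if not candidates:
                break
            nn = min(candidates, key=lambda j: euclidean(x_i, data[j][0]))
            if data[nn][1] == 0:
                removed.add(nn)  # nearest neighbor is a majority instance: drop it
            else:
                break            # nearest neighbor is a minority instance: stop
    return [inst for j, inst in enumerate(data) if j not in removed]


toy = [([0.0], 1), ([0.1], 0), ([0.2], 0), ([0.3], 1), ([5.0], 0)]
print(pdu(toy, max_removals=7))
```

In the toy data, the two majority points lying between the minority instances are removed, while the distant majority point at 5.0 is kept because it is never the nearest neighbor of a minority instance.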
As visualized in Figure 2, the hybrid resampling framework proceeds in three steps. The original multi-label dataset shows a highly imbalanced distribution, with sparse minority class instances (blue circles) overshadowed by dominant majority samples (green triangles). In the oversampling stage, SMOTE generates synthetic minority samples (yellow circles) using nearest-neighbor interpolation around an example real instance $x_i$, forming a denser minority cluster. Then, in the undersampling stage, PDU examines the local neighborhood around each minority instance. Majority class instances found in this neighborhood (i.e., up to $U = 7$ nearest neighbors) are removed from the training set. Red triangles indicate the removed samples, resulting in the final cleaned dataset with better balance and reduced local class noise.
To clarify the distinctions between SMOTE and PDU, we summarize their differences in Table 4. This table presents a comprehensive comparison of the two methods across multiple features, including their approach to sampling, affected classes, scenarios, use of k-nearest neighbors, risks, goals, sensitivity to noise, effects on decision boundaries, computational cost, and main techniques.

3.4. Hoeffding Tree Classifier

In this study, we adopt the Hoeffding Tree [7] as the base learner for each binary classification task derived via the binary relevance decomposition. The Hoeffding Tree is a streaming decision tree algorithm designed for scalable, online learning from large-scale or continuously arriving data. Unlike traditional decision trees such as C4.5 or classification and regression trees (CART), which require multiple passes through the entire dataset, the Hoeffding Tree incrementally builds its structure by observing one instance at a time. This capability makes it well suited for predictive maintenance scenarios, where real-time sensor data may arrive in high volumes.
The key foundation of the Hoeffding Tree lies in the Hoeffding bound, which provides a statistical guarantee for selecting a splitting attribute based on a finite number of observations. Given that the splitting metric (e.g., information gain or Gini index) is computed on observed data, the Hoeffding bound ensures that the attribute selected using the current sample is, with high probability, the same as the one that would be chosen if the algorithm had access to an infinite dataset. Mathematically, the Hoeffding bound is expressed as Equation (5):
$$\epsilon = \sqrt{\frac{\theta^2 \ln(1/\delta)}{2 \times n}} \tag{5}$$
where $\epsilon$ denotes the maximum difference between the true and estimated values of the splitting criterion, $\theta$ is the range of the splitting function (e.g., for information gain, $\theta = \log_2 c$, where $c$ is the number of classes), $\delta$ is the user-defined confidence parameter (e.g., 0.01 for 99% confidence), and $n$ is the number of observed instances at a given node. Using this bound, the Hoeffding Tree determines when it has seen enough data at a node to confidently choose the best splitting attribute. Specifically, it compares the evaluated scores (e.g., information gain) $G_1$ and $G_2$ of the two top-ranked attributes and splits on the best attribute if the condition in Equation (6) holds:
$$G_1 - G_2 > \epsilon \tag{6}$$
This criterion ensures that the selected split is statistically superior with high confidence: with probability $1 - \delta$, the attribute scoring $G_1$ is indeed the better splitting attribute. If the condition is not met, the algorithm defers the split and waits for more data to accumulate, thereby preventing premature or unreliable splits caused by insufficient data.
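As a rough numerical illustration of Equations (5) and (6), the following Python sketch evaluates the bound and the split test. The function names and the two gain values are hypothetical. The setting $\theta = 1$ corresponds to information gain on a binary task, and $\delta = 10^{-7}$ and $n = 200$ are borrowed from the splitConfidence and gracePeriod values reported in Section 4.3, under the assumption that splitConfidence plays the role of $\delta$.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Equation (5): epsilon = sqrt(theta^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(g1: float, g2: float, value_range: float,
                 delta: float, n: int) -> bool:
    """Equation (6): split only if the best score beats the runner-up by more than epsilon."""
    return (g1 - g2) > hoeffding_bound(value_range, delta, n)


theta = math.log2(2)  # range of information gain for c = 2 classes
epsilon = hoeffding_bound(theta, delta=1e-7, n=200)
print(round(epsilon, 4))  # about 0.2007

# Hypothetical gains: a clear winner triggers a split, a close race does not.
print(should_split(0.30, 0.05, theta, 1e-7, 200))  # True
print(should_split(0.30, 0.15, theta, 1e-7, 200))  # False
```

Since $\epsilon$ shrinks proportionally to $1/\sqrt{n}$, a close race between two attributes is eventually resolved once enough instances have been observed at the node.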
The Hoeffding Tree classifier is particularly well-suited for predictive maintenance tasks due to its key advantages: it supports incremental learning, meaning the model can be updated in real time as new sensor data arrives without retraining from scratch; it is memory-efficient, maintaining only summary statistics at each node instead of storing the entire dataset; it exhibits the anytime property, producing a usable model even in early training stages; and it is robust to missing values, accommodating both nominal and numeric features. These characteristics make it an ideal base learner for large-scale, real-time industrial applications. Accordingly, the Hoeffding Tree forms the foundation of the ensemble learning described in the next subsection.

3.5. Hoeffding Tree Forest

The final stage of the proposed method aggregates multiple Hoeffding Tree classifiers into an ensemble structure to improve robustness and predictive accuracy. This ensemble, referred to as the Hoeffding Tree Forest, forms the backbone of the BHTF architecture. For each failure type identified through the binary relevance transformation, a dedicated and balanced binary dataset is created using SMOTE-based oversampling and the proposed PDU undersampling techniques. On each of these balanced datasets, $T$ Hoeffding Tree classifiers (i.e., $T = 10$) are independently trained. The trees exploit the statistical rigor of the Hoeffding bound to incrementally build reliable models from large-scale data.
To generate predictions, each Hoeffding Tree in an ensemble produces an output for a given instance. The final decision for each failure label is obtained via majority voting among the corresponding $T$ classifiers. Formally, let $H_j = \{h_{j,1}, h_{j,2}, h_{j,3}, \ldots, h_{j,T}\}$ indicate the ensemble of $T$ Hoeffding Trees trained for label $y_j$. Then, the ensemble prediction $\hat{y}_j$ is defined as Equation (7):
$$\hat{y}_j = \mathrm{mode}\left(h_{j,1}(x), h_{j,2}(x), \ldots, h_{j,T}(x)\right) \tag{7}$$
where the mode operator returns the most frequently predicted class (0 or 1) for the instance $x$. This mechanism ensures that the ensemble prediction is determined by the consensus of the classifiers, thereby reducing the influence of individual misclassifications and improving the robustness of the final decision.
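The voting rule in Equation (7) can be sketched in Python with stand-in classifiers. The names `ensemble_vote` and `predict_label_set` and the threshold-based stub trees are hypothetical. The text does not state how a tie among an even number of trees is resolved, so the sketch assumes that ties default to 0 (label absent).

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence

Classifier = Callable[[Sequence[float]], int]  # maps an instance to 0 or 1


def ensemble_vote(trees: Sequence[Classifier], x: Sequence[float]) -> int:
    """Majority vote over the trees of one label (ties go to 0 by assumption)."""
    votes = Counter(tree(x) for tree in trees)
    return 1 if votes[1] > votes[0] else 0


def predict_label_set(forest: Dict[str, Sequence[Classifier]],
                      x: Sequence[float]) -> List[str]:
    """Collect every label whose ensemble votes for presence."""
    return [label for label, trees in forest.items() if ensemble_vote(trees, x) == 1]


# Stub "trees": simple thresholds standing in for trained Hoeffding Trees.
forest = {
    "TWF": [lambda x, t=t: int(x[0] > t) for t in (1.0, 2.0, 3.0)],
    "HDF": [lambda x, t=t: int(x[1] > t) for t in (1.0, 2.0, 3.0)],
}
print(predict_label_set(forest, [2.5, 1.5]))  # ['TWF']
```

For the instance shown, two of the three TWF stubs vote 1 while only one HDF stub does, so only TWF enters the predicted label set.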
The ensemble learning strategy adopted in the proposed BHTF offers several key advantages. By aggregating predictions from multiple Hoeffding Tree classifiers, it improves generalization performance by reducing variance and mitigating overfitting. The use of resampled datasets—achieved through SMOTE and PDU techniques—ensures robust learning from minority class instances, boosting the model’s ability to detect rare and critical failure types. Furthermore, the independence of ensemble members across labels makes BHTF particularly well-suited for industrial predictive maintenance tasks characterized by multi-label failure diagnosis, severe data imbalance, and the need for efficient processing.

3.6. Algorithm

To provide a clear and structured overview of the proposed method, the complete algorithmic process behind the BHTF is presented through Algorithm 1. While the previous subsections have described each individual component in detail—including multi-label decomposition, hybrid resampling, and ensemble learning—the algorithm summarizes how these components are integrated into a unified predictive maintenance framework. Specifically, BHTF enables multi-label fault diagnosis, aggregates multiple Hoeffding Trees into a balanced ensemble, and maintains adaptability to streaming data, thereby addressing the core challenges of PdM. The algorithm outlines the data transformation, training, and inference phases involved in constructing BHTF, enabling reproducibility and better understanding of the method’s implementation.
Algorithm 1: Balanced Hoeffding Tree Forest (BHTF)
Inputs:
       $D$: multi-label dataset $D = \{(x_i, Y_i)\}_{i=1}^{n}$
       $L$: label set $L = \{y_1, y_2, \ldots, y_q\}$
       $T$: number of Hoeffding Trees per label
       $R$: oversampling rate
       $k$: number of nearest neighbors
       $U$: undersampling threshold
       $x$: new instance to be predicted
Outputs:
       $H$: the ensemble of models, i.e., models for the $j$th label $H_j = \{h_{j,1}, h_{j,2}, \ldots, h_{j,T}\}$
       $\hat{Y}$: predicted label set for input instance $x$
        // Step 1: Binary Relevance Decomposition
        // Multi-label learning: Decompose the problem into $q$ binary tasks to capture co-occurring failures.
          for $j = 1$ to $q$
                   $D_{y_j} = \emptyset$                                   // Initialize binary dataset for label $y_j$
                  for each $(x_i, Y_i)$ in $D$
                              if $y_j \in Y_i$
                                       $D_{y_j}$.Add($x_i$, 1)                  // Assign positive label for presence of $y_j$
                              else
                                       $D_{y_j}$.Add($x_i$, 0)                  // Assign negative label for absence of $y_j$
                              end if
                  end for each
          end for
        // Step 2: Hybrid Resampling
        // Hybrid imbalance handling: Integrate SMOTE-based oversampling with proximity-driven undersampling (PDU).
          for $j = 1$ to $q$
                  // Step 2.1: Oversampling
                   $c_{minority}$ = MinorityClass($D_{y_j}$)                // Identify minority class
                  for each $x_i \in D_{y_j}$ where class($x_i$) == $c_{minority}$
                             $N_k(x_i)$ = kNearestNeighbors($x_i$, $k$)       // Find k-nearest neighbors of the same class
                            for $r = 1$ to $R$                              // Generate R synthetic samples
                                     $x_{NN}$ = RandomChoice($N_k(x_i)$)      // Randomly select one neighbor
                                     $x_{syn} = x_i + \delta \times (x_{NN} - x_i)$   // Synthetic instance, $\delta \in [0, 1]$
                                     $D_{y_j}$.Add($x_{syn}$, 1)              // Add synthetic minority sample to dataset
                            end for
                  end for each
                  // Step 2.2: Undersampling
                   $c_{minority}$ = MinorityClass($D_{y_j}$)                // Identify minority class
                   $c_{majority}$ = MajorityClass($D_{y_j}$)                // Identify majority class
                   $F = \emptyset$                                         // Initialize removal set
                  for each $x_i \in D_{y_j}$ where class($x_i$) == $c_{minority}$
                           for $u = 1$ to $U$
                                     $x_{NN}$ = NearestNeighbor($x_i$, $D_{y_j} \setminus F$)   // Find nearest neighbor not yet flagged
                                    if class($x_{NN}$) == $c_{majority}$ then
                                               $F = F \cup \{x_{NN}\}$      // Flag majority neighbor for removal
                                    else break
                                    end if
                           end for
                  end for each
                   $D_j = D_{y_j} \setminus F$                              // Remove flagged majority instances
          end for
        // Step 3: Model Training
        // Ensemble learning: Construct multiple Hoeffding Trees per label to improve robustness.
         $H = \emptyset$                                                   // Initialize ensemble of models
        for $j = 1$ to $q$
                   $H_j = \emptyset$                                       // Initialize ensemble for label $y_j$
                  for $t = 1$ to $T$
                            $D_j^{t}$ = Bootstrapping($D_j$)                // Draw bootstrap sample for tree $t$
                            $h_{j,t}$ = HoeffdingTree($D_j^{t}$)            // Train Hoeffding Tree on resampled data
                            $H_j = H_j \cup \{h_{j,t}\}$                    // Add trained tree to ensemble
                  end for
                   $H = H \cup \{H_j\}$
        end for
        // Step 4: Model Testing
        // Prediction across multiple labels through majority voting within each ensemble.
         $\hat{Y} = \emptyset$                                             // Initialize predicted label set
        for $j = 1$ to $q$
                   $V_j = (h_{j,1}(x), h_{j,2}(x), \ldots, h_{j,T}(x))$     // Collect predictions from ensemble $H_j$
                   $\hat{y}_j$ = mode($V_j$)                                // Compute majority vote for label $y_j$
                  if $\hat{y}_j$ == 1 then $\hat{Y} = \hat{Y} \cup \{y_j\}$  // Add $y_j$ to predicted label set if voted as present
        end for
End Algorithm

4. Experimental Setup

4.1. Dataset Description

To assess the effectiveness of the proposed BHTF method, we utilized the AI4I 2020 predictive maintenance dataset, which is publicly available through the UCI machine learning repository (University of California, Irvine, CA, USA) [45]. This dataset is widely used in predictive maintenance research due to its realistic industrial context and rich collection of sensor-based features. It provides a reliable foundation for modeling multi-label classification tasks within manufacturing systems. A summary of its main characteristics is presented in Table 5.
The dataset comprises 10,000 records and 14 variables, integrating identification fields, sensor measurements, and binary target indicators related to various machine failure types. In detail, the UID and Product ID serve as identifiers, while the Type attribute indicates the quality grade of a product as low (L), medium (M), or high (H). The primary sensor-driven features include air temperature, process temperature, rotational speed, torque, and tool wear, capturing the main operational parameters of the machinery. On the output side, the dataset includes six potential binary target labels: random failures (RNF), tool wear failure (TWF), heat dissipation failure (HDF), power failure (PWF), overstrain failure (OSF), and an aggregated machine failure flag, which signals whether any of the aforementioned failures has occurred. Table 6 provides detailed information on all dataset variables, including their name, category, type, description, and unit of measurement.
In our study, the UID and Product ID columns were removed during preprocessing, as they carry no predictive value for failure analysis. We also excluded the random failures (RNF) label due to its inherently unpredictable nature, which does not align with the structured diagnostic objective of our artificial intelligence-driven approach. Additionally, the general machine failure flag—indicating whether any failure has occurred—was omitted from the target space, since our goal was not to distinguish between failure and no-failure states. Instead, we focused on diagnosing specific failure types. Therefore, our BHTF method was trained exclusively on the four actionable failure modes, namely TWF, HDF, PWF, and OSF.
Each data instance represents the state of a manufacturing machine, defined by its sensor readings as input features and a corresponding multi-label output. The output is encoded as a binary vector, where each position indicates the presence (1) or absence (0) of a specific failure mode in the order of TWF, HDF, PWF, and OSF. A key characteristic of this dataset is that multiple failures can occur simultaneously, making it inherently suited to multi-label classification rather than traditional single-label approaches. This complexity also introduces data imbalance, as certain combinations of failure types are rare yet critical for effective fault detection.
The distributional characteristics of the dataset's continuous features are summarized in Table 7. These statistics (minimum, maximum, mean, and standard deviation) provide a quantitative overview of the sensor readings, which reflect the machine's operational behavior under various conditions. Understanding the variability and range of these inputs is crucial for proper model training in predictive maintenance applications.
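The preprocessing and label encoding described above can be illustrated with a small Python sketch. The dictionary keys and the record values are hypothetical placeholders (the actual column names of the dataset file may differ), and `split_record` is an illustrative helper rather than part of the reported implementation.

```python
from typing import Dict, List, Tuple

TARGET_ORDER = ("TWF", "HDF", "PWF", "OSF")                 # order of the label vector
EXCLUDED = ("UID", "Product ID", "RNF", "Machine failure")  # dropped during preprocessing


def split_record(record: Dict[str, object]) -> Tuple[Dict[str, object], List[int]]:
    """Split one raw record into input features and a 4-position binary label vector."""
    features = {
        name: value
        for name, value in record.items()
        if name not in EXCLUDED and name not in TARGET_ORDER
    }
    label_vector = [int(record[name]) for name in TARGET_ORDER]
    return features, label_vector


# Hypothetical record with made-up values, for illustration only.
record = {
    "UID": 1, "Product ID": "X0001", "Type": "M",
    "Air temperature": 298.1, "Torque": 42.8,
    "Machine failure": 1, "TWF": 0, "HDF": 1, "PWF": 0, "OSF": 1, "RNF": 0,
}
features, label_vector = split_record(record)
print(features)      # identifiers, RNF, and the aggregate flag are gone
print(label_vector)  # [0, 1, 0, 1] -> HDF and OSF occur together
```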

4.2. Evaluation Metrics

4.2.1. Standard Per-Label Metrics

To evaluate the predictive performance of the proposed BHTF model, we employed 10-fold cross-validation, a widely accepted resampling technique that offers a balanced trade-off between bias and variance in performance estimation. Several evaluation metrics were utilized to capture different aspects of classification quality in the context of multi-label and imbalanced learning tasks. The fundamental metrics include accuracy (ACC), precision (PR), recall (R), and F-measure (F), each calculated from the confusion matrix components, namely true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). These metrics are formally defined in Equations (8)–(11):
$$ACC = \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$
$$PR = \frac{TP}{TP + FP} \tag{9}$$
$$R = \frac{TP}{TP + FN} \tag{10}$$
$$F = \frac{2\,TP}{2\,TP + FP + FN} \tag{11}$$
Here,
  • TP refers to the number of correctly predicted positive instances,
  • TN to the correctly predicted negatives,
  • FP to the negative instances incorrectly classified as positive, and
  • FN to the positive instances that were missed by the classifier.
While accuracy provides an overall performance measure, precision and recall give deeper insight into how well the model handles data imbalance, particularly for minority failure classes. The F-measure (or F1-score) balances precision and recall, making it a valuable metric when these two are in tension, as often occurs in imbalanced multi-label scenarios.
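Equations (8)–(11) translate directly into code. The sketch below uses hypothetical confusion-matrix counts (not results from this study) to show how accuracy can remain high on an imbalanced task while precision is considerably lower.

```python
from typing import Dict


def per_label_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, and F-measure from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "ACC": (tp + tn) / total if total else 0.0,
        "PR": tp / (tp + fp) if (tp + fp) else 0.0,
        "R": tp / (tp + fn) if (tp + fn) else 0.0,
        "F": 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0,
    }


# Hypothetical counts: 50 actual failures among 1000 instances.
metrics = per_label_metrics(tp=40, tn=900, fp=50, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, accuracy is 0.94 and recall is 0.80, whereas precision is only about 0.44 because the 50 false positives outnumber the 40 true positives.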

4.2.2. Weighted Metrics

Following the standard per-label metrics, we compute weighted precision (WPR) and weighted recall (WR). These metrics account for the relative importance of each class and are particularly useful in multi-label and imbalanced datasets, as they give more weight to classes with more instances while still reflecting performance on minority classes. The weighted metrics are formally defined as Equations (12) and (13), respectively:
$$WPR = \frac{\sum_{i=1}^{N} PR_i \times S_i}{\sum_{i=1}^{N} S_i} \tag{12}$$
$$WR = \frac{\sum_{i=1}^{N} R_i \times S_i}{\sum_{i=1}^{N} S_i} \tag{13}$$
where $N$ is the number of labels or classes, $PR_i$ and $R_i$ are the precision and recall for class $i$, respectively, and $S_i$ is the support of class $i$, i.e., the sum of $TP_i$ and $FN_i$, representing the total number of true instances of that class in the dataset. This weighted formulation ensures that classes with more instances contribute proportionally more to the overall precision and recall.
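A short Python sketch of Equations (12) and (13) is given below for a single binary task, treating "no failure" and "failure" as the two classes. The counts are hypothetical and the helper name `weighted_average` is an assumption.

```python
from typing import Sequence


def weighted_average(values: Sequence[float], supports: Sequence[int]) -> float:
    """Support-weighted mean, as used for WPR and WR in Equations (12) and (13)."""
    total = sum(supports)
    return sum(v * s for v, s in zip(values, supports)) / total if total else 0.0


# Hypothetical confusion-matrix counts for one binary task.
tp, tn, fp, fn = 40, 900, 50, 10

# Class 0 (no failure) first, then class 1 (failure).
precisions = [tn / (tn + fn), tp / (tp + fp)]
recalls = [tn / (tn + fp), tp / (tp + fn)]
supports = [tn + fp, tp + fn]

wpr = weighted_average(precisions, supports)
wr = weighted_average(recalls, supports)
print(f"WPR = {wpr:.4f}, WR = {wr:.4f}")
```

When the weights are the class supports and the classes are mutually exclusive, the weighted recall reduces to (TP + TN) divided by the number of instances, so it coincides with the accuracy of that task (0.94 in this example).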

4.2.3. Multi-Label Metrics

In multi-label classification, each instance can be associated with multiple labels simultaneously, which makes the evaluation more complex compared to single-label tasks. Fundamental metrics such as accuracy, precision, recall, and F-measure, although informative, may not fully capture the nuances of multi-label performance. To provide a more comprehensive assessment of the proposed BHTF model, we employed additional multi-label metrics, including macro-F1, micro-F1, Hamming loss, Jaccard index, and subset accuracy. These metrics evaluate performance from different perspectives, such as per-label, per-instance, and overall prediction quality, thus providing a balanced and rigorous analysis of the model's effectiveness in multi-label scenarios.
Macro-F1 evaluates how well the model predicts each label individually and then averages the results equally across all labels, regardless of their frequency in the dataset. This metric provides an unbiased view of performance across both majority and minority classes, making it especially valuable in imbalanced multi-label tasks. It is formally defined in Equation (14):
$$Macro\_F1 = \frac{1}{N} \sum_{i=1}^{N} F_i \tag{14}$$
where $N$ is the number of labels and $F_i$ is the F-measure (F1-score) for class $i$.
The micro-precision (MP) and micro-recall (MR) are obtained by aggregating true positives, false positives, and false negatives across all labels before computing the precision and recall values. This approach ensures that the contribution of each label is proportional to its number of instances, making it suitable for datasets with imbalanced label distributions. They are presented in Equations (15) and (16):
$$MP = \frac{\sum_{i=1}^{N} TP_i}{\sum_{i=1}^{N} TP_i + \sum_{i=1}^{N} FP_i} \tag{15}$$
$$MR = \frac{\sum_{i=1}^{N} TP_i}{\sum_{i=1}^{N} TP_i + \sum_{i=1}^{N} FN_i} \tag{16}$$
Using these definitions, the micro-F1-score is expressed as Equation (17):
$$Micro\_F1 = \frac{2 \times MP \times MR}{MP + MR} \tag{17}$$
Unlike macro-F1, which treats all labels equally, micro-F1 places more emphasis on labels with a higher number of instances by aggregating predictions across all classes. This makes micro-F1 a robust and widely adopted metric for evaluating overall model effectiveness in multi-label classification, especially in scenarios with significant class imbalance.
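The difference between the two averages can be seen in a small Python sketch of Equations (14)–(17). The per-label counts are hypothetical: one frequent label that is predicted well and one rare label that is predicted poorly.

```python
from typing import Sequence, Tuple

Counts = Tuple[int, int, int]  # (TP, FP, FN) for one label


def f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(per_label: Sequence[Counts]) -> float:
    """Equation (14): unweighted mean of the per-label F-measures."""
    return sum(f1(*counts) for counts in per_label) / len(per_label)


def micro_f1(per_label: Sequence[Counts]) -> float:
    """Equations (15)-(17): pool TP, FP, FN over labels before computing F1."""
    tp = sum(c[0] for c in per_label)
    fp = sum(c[1] for c in per_label)
    fn = sum(c[2] for c in per_label)
    mp = tp / (tp + fp) if (tp + fp) else 0.0
    mr = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * mp * mr / (mp + mr) if (mp + mr) else 0.0


counts = [(90, 10, 10), (1, 1, 8)]  # frequent label, rare label (hypothetical)
print(f"macro-F1 = {macro_f1(counts):.4f}")
print(f"micro-F1 = {micro_f1(counts):.4f}")
```

Here the macro average (about 0.54) is pulled down by the rare label, while the micro average (about 0.86) is dominated by the frequent one.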
The Hamming loss quantifies the proportion of incorrect label predictions relative to the total number of predictions across all labels. In other words, it evaluates the average number of misclassification errors (false positives and false negatives) per instance per label. This makes it especially suitable for multi-label tasks, as it provides a label-wise error perspective instead of only focusing on the entire label set. If a confusion matrix is available for each label, the Hamming loss can be directly computed from FP and FN, normalized by the total number of predictions, as shown in Equation (18):
$$Hamming\_Loss = \frac{\sum_{i=1}^{N} (FP_i + FN_i)}{N \times M} \tag{18}$$
where $N$ is the total number of labels, $M$ is the total number of instances, and $FP_i$ and $FN_i$ correspond to the false positives and false negatives for label $i$. This formulation demonstrates that Hamming loss essentially captures the ratio of prediction errors to the total number of instance–label decisions, thus offering an interpretable and fine-grained measure of model performance in imbalanced multi-label scenarios.
The Jaccard index measures the proportion of correctly predicted labels relative to all labels that were either predicted or actually present, providing an instance-level assessment of multi-label prediction quality. When confusion matrices are available for each label, the Jaccard index can be computed directly from TP, FP, and FN, as expressed in Equation (19):
$$Jaccard\_Index = \frac{\sum_{i=1}^{N} TP_i}{\sum_{i=1}^{N} TP_i + \sum_{i=1}^{N} FP_i + \sum_{i=1}^{N} FN_i} \tag{19}$$
The subset accuracy, also known as the exact match ratio, evaluates multi-label predictions at the instance level by considering an instance correctly classified only if all its labels are predicted correctly. Using the confusion matrix for each label, this can be represented through Equations (20) and (21):
$$Subset\ Accuracy\ for\ Instance\ i = \begin{cases} 1, & \text{if } TP_i + TN_i = N \\ 0, & \text{otherwise} \end{cases} \tag{20}$$
$$Subset\ Accuracy = \frac{1}{M} \sum_{i=1}^{M} \mathbb{1}\{\text{all labels correct for instance } i\} \tag{21}$$
where $N$ is the total number of labels, $M$ is the number of instances, and $TP_i$ and $TN_i$ denote the numbers of labels of instance $i$ that are correctly predicted as present and as absent, respectively, so that $TP_i + TN_i = N$ holds only when all $N$ labels of the instance are correct. Equation (20) illustrates the instance-level logic: an instance contributes 1 only if every label is correct, and 0 otherwise. Equation (21) then averages these instance-level results across all $M$ instances to compute the overall subset accuracy. Although very strict (a single misclassified label sets the score of an instance to zero), subset accuracy provides valuable insight into the model's ability to achieve fully correct predictions, complementing other metrics such as Hamming loss and Jaccard index.
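The three metrics in Equations (18)–(21) can be computed from binary label matrices as in the following Python sketch. The label matrices are hypothetical, and returning 1.0 for the Jaccard index when no label is present or predicted is an assumption made to avoid division by zero.

```python
from typing import Sequence

Matrix = Sequence[Sequence[int]]  # M instances x N labels, entries 0 or 1


def hamming_loss(y_true: Matrix, y_pred: Matrix) -> float:
    """Equation (18): wrong instance-label decisions divided by N x M."""
    errors = sum(t != p for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr))
    return errors / (len(y_true) * len(y_true[0]))


def jaccard_index(y_true: Matrix, y_pred: Matrix) -> float:
    """Equation (19): pooled TP / (TP + FP + FN)."""
    pairs = [(t, p) for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr)]
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = tp + fp + fn
    return tp / denom if denom else 1.0  # assumption: empty case counts as perfect


def subset_accuracy(y_true: Matrix, y_pred: Matrix) -> float:
    """Equations (20)-(21): fraction of instances with every label correct."""
    exact = sum(list(tr) == list(pr) for tr, pr in zip(y_true, y_pred))
    return exact / len(y_true)


y_true = [[1, 0, 1, 0], [0, 0, 0, 0], [1, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 0, 0, 0], [1, 1, 0, 1]]
print(hamming_loss(y_true, y_pred))     # 2 errors out of 12 decisions
print(jaccard_index(y_true, y_pred))    # 3 / (3 + 1 + 1)
print(subset_accuracy(y_true, y_pred))  # only the second instance matches exactly
```

The example shows how strict subset accuracy is: only one in six label decisions is wrong in the first and third instances, yet both count as fully incorrect.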

4.3. Hyperparameters

To ensure optimal performance of the proposed BHTF method, a comprehensive set of hyperparameters was empirically selected across its key components, including oversampling, undersampling, decision tree induction, ensemble training, feature selection, and feature importance. The entire implementation was developed in Java version 17 (Oracle Corp., Austin, TX, USA) using the Weka version 3.8.6 (University of Waikato, Hamilton, New Zealand) machine learning library [46]. All experiments were conducted on a standard desktop computer equipped with an Intel® Core™ i7 processor (Intel Corp., Santa Clara, CA, USA) running at 1.90 GHz and 8 GB of RAM.
  • SMOTE oversampling: To address class imbalance within each binary decomposition, SMOTE was applied. The number of nearest neighbors for generating synthetic samples was set to k = 5, and an aggressive oversampling ratio of R = 4000% was chosen. This configuration ensured that sufficient synthetic instances were generated for underrepresented failure classes to prevent classifier bias.
  • PDU undersampling: To reduce the overrepresented majority class, the Proximity-Driven Undersampling (PDU) strategy was adopted. This method removes the majority instances located close to each minority instance in the feature space. Specifically, for each minority instance, up to U = 7 nearest majority neighbors were identified using LinearNNSearch with k = 1, a brute-force search algorithm that computes the Euclidean distance to every instance to find the closest neighbor. This step helped refine class boundaries and reduce overlap between majority and minority classes.
  • Ensemble configuration: The multi-label learning approach in BHTF utilized an ensemble of Hoeffding Trees for each label. Specifically, for each binary classification task related to a distinct failure mode, an ensemble of T = 10 trees was trained using a bagging meta-classifier. This strategy was implemented to improve prediction robustness, and configured with the following main parameters:
    numIterations = 10 (number of Hoeffding Trees in each ensemble);
    bagSizePercent = 100 (100% of the training data in each bootstrap sample);
    batchSize = 100 (the batch size for model updates);
    seed = 1 (reproducibility of randomized processes);
    numExecutionSlots = 1 (sequential execution of model training);
    representCopiesUsingWeights = false (explicit sampling of each instance without relying on weighting schemes).
  • Hoeffding Tree settings: Each base classifier in the ensemble was configured using the Hoeffding Tree implementation. The following parameters were set to control tree growth and splitting behavior:
    gracePeriod = 200 (minimum number of instances seen between split attempts);
    hoeffdingTieThreshold = 0.05 (threshold to break ties for close information gains);
    leafPredictionStrategy = Naive Bayes adaptive (Naive Bayes prediction in leaf nodes when beneficial);
    minimumFractionOfWeightInfoGain = 0.01 (minimum fraction of weight required down at least two branches for an information gain split);
    naiveBayesPredictionThreshold = 0.0 (number of instances, by weight, that a leaf must observe before Naive Bayes predictions are allowed);
    splitConfidence = 1.0 × 10^−7 (allowable error in a split decision; values closer to zero take longer to decide);
    splitCriterion = Info gain split (uses information gain as the splitting metric).
  • Feature selection: We employed the Pearson correlation technique [47] as a filter-based supervised attribute selection approach. It was combined with the Ranker search method to measure the predictive relevance of each feature. Multiple configurations were empirically tested, including heuristic strategies based on logarithmic and square root formulas for determining the number of features to retain (e.g., $\log_2(m)$ and $\sqrt{m}$, where $m$ is the number of original features). Among these, selecting all six features provided the most favorable balance between model accuracy and complexity. This optimal configuration (numToSelect = 6) was identified through extensive experimentation and justified further in the results section.
  • Feature importance: To further enhance interpretability and provide visual analysis, we examined the contribution of individual features to each failure mode using Pearson correlation scores. The results are presented in Figure 3, Figure 4, Figure 5 and Figure 6, where features are ranked for each of the four failure modes: TWF, HDF, PWF, and OSF. For instance, torque and tool wear emerge as dominant indicators for OSF and TWF, respectively, while air temperature and rotational speed strongly influence HDF and PWF. These findings not only validate our decision to retain six features during preprocessing but also provide explicit evidence of how different sensors contribute to specific failures. Importantly, these visualizations offer an intuitive understanding of the data characteristics prior to modeling, thereby complementing the performance-driven results of BHTF.
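The correlation-based ranking used for feature selection and feature importance can be approximated with the following Python sketch. It is a simplified illustration, not the Weka evaluator itself: the function names are hypothetical, the toy columns are made up, and ranking by the absolute value of the Pearson coefficient is an assumption.

```python
import math
from typing import Dict, List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_features(columns: Dict[str, Sequence[float]], target: Sequence[int],
                  num_to_select: int) -> List[Tuple[str, float]]:
    """Rank features by |correlation| with a binary target and keep the top ones."""
    scored = [(name, pearson(values, target)) for name, values in columns.items()]
    scored.sort(key=lambda item: abs(item[1]), reverse=True)
    return scored[:num_to_select]


# Toy columns with made-up values and a binary failure label.
columns = {
    "feature_a": [1.0, 2.0, 3.0, 4.0],
    "feature_b": [1.0, 3.0, 2.0, 3.0],
    "feature_c": [5.0, 5.0, 5.0, 5.0],
}
target = [0, 0, 1, 1]
for name, score in rank_features(columns, target, num_to_select=3):
    print(f"{name}: {score:.3f}")
```

Repeating this ranking once per failure label yields per-label importance profiles of the kind discussed above for TWF, HDF, PWF, and OSF.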

5. Results

5.1. Overall BHTF Performance

The performance of the proposed BHTF method was evaluated through four separate binary classification tasks, each corresponding to a distinct failure mode: TWF, HDF, PWF, and OSF. The results are summarized in Table 8. BHTF achieved an overall accuracy of 97.44%, with an average precision of 0.9939, recall of 0.9744, and F-measure of 0.9839 across all failure modes. These metrics indicate a strong and balanced classification capability and underline the efficacy of the method for machine-learning-based fault detection in manufacturing systems. Performance across individual labels also remained consistently high. The accuracy ranged from 93.94% for TWF to 98.87% for PWF. All precision scores exceeded 0.99, demonstrating the model’s ability to correctly identify positive instances with minimal false positives. The recall values ranged from 0.9394 (TWF) to 0.9887 (PWF), indicating strong sensitivity across all classes. Corresponding F-measure scores varied between 0.9663 and 0.9914, confirming the model’s robustness in balancing precision and recall. These results affirm that the BHTF method delivers reliable and accurate multi-label predictions for predictive maintenance, effectively identifying multiple concurrent failure types while maintaining high classification quality.

5.2. Confusion Matrix

To further explore the classification behavior of BHTF, Figure 7 presents confusion matrices for each failure type. Each illustration reports the number of instances predicted as failure or no failure against their actual labels. The majority of true positives and true negatives are correctly captured, with very few false negatives (e.g., 8 (17.39%) for TWF, 3 (2.61%) for HDF) and a reasonable number of false positives (e.g., 598 (6.01%) for TWF, 107 (1.08%) for PWF), which aligns with the data imbalance addressed through resampling. These matrices demonstrate that BHTF can appropriately detect rare failure events while maintaining a low rate of misclassification for healthy instances.
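The percentages quoted for TWF can be reproduced from the raw counts with a short sketch. The class totals used here (46 failure and 9954 healthy instances) are not stated in this section; they are inferred from the reported percentages (8 false negatives at 17.39% and 598 false positives at 6.01%) and should be read as an assumption. The function name is illustrative.

```python
def confusion_rates(tp, fn, fp, tn):
    """Accuracy and error rates derived from one binary confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "false_negative_rate": fn / (tp + fn),  # share of true failures that were missed
        "false_positive_rate": fp / (fp + tn),  # share of healthy instances flagged as failures
    }


# TWF counts: 8 false negatives and 598 false positives are reported in the text;
# the class totals below are inferred from the reported percentages (assumption).
TWF_FAILURES = 46
TWF_HEALTHY = 9954
twf = confusion_rates(tp=TWF_FAILURES - 8, fn=8, fp=598, tn=TWF_HEALTHY - 598)

for name, value in twf.items():
    print(f"{name}: {100 * value:.2f}%")
```

Under these assumed totals, the computed accuracy agrees with the 93.94% reported for TWF.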
To provide a thorough assessment of the proposed BHTF method in a multi-label context, we further evaluated its performance using macro-F1, micro-F1, Hamming loss, Jaccard index, and subset accuracy, all of which rely on the label-wise confusion matrices presented in Figure 7. The macro-F1-score of 0.9839 indicates that BHTF performs consistently well across all failure types, averaging the F1-scores equally without being dominated by the majority labels. The micro-F1-score of 0.9869, which aggregates contributions from all instances and labels, confirms that the method maintains high overall predictive accuracy even in the presence of class imbalance. The Hamming loss of 0.0256 demonstrates that only a small fraction of label predictions are incorrect relative to the total number of label assignments, reflecting the model’s reliability at the individual label level. Complementing this, the Jaccard index of 0.9742 underlines strong overlap between the predicted and true label sets, showcasing that BHTF captures the relevant failure events effectively. Finally, the subset accuracy of 90.07% confirms that in the vast majority of instances, the model predicts all labels correctly, further evidencing its robustness in exact multi-label prediction. Collectively, these results reinforce that BHTF not only achieves high performance in standard metrics such as accuracy, precision, recall, and F-measure (Table 8) but also excels across rigorous multi-label evaluation criteria, demonstrating its suitability for predictive maintenance tasks in imbalanced and multi-label scenarios.
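For readers less familiar with these multi-label criteria, the following sketch computes Hamming loss, subset accuracy, micro-F1, and macro-F1 on a small hypothetical label matrix (rows are instances, columns are the four failure labels). The toy data, the function names, and the convention of returning 0 for an undefined F1 are assumptions; the Jaccard index is omitted because its treatment of instances with no active labels is not specified here. Since every label is evaluated on the same instances, the Hamming loss equals one minus the mean label-wise accuracy, which is consistent with the reported 0.0256 and 97.44%.

```python
def f1(tp, fp, fn):
    """F1 from counts; returns 0.0 when undefined (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def label_counts(y_true, y_pred, label):
    """True positives, false positives, and false negatives for one label column."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[label] == 1 and p[label] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[label] == 0 and p[label] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[label] == 1 and p[label] == 0)
    return tp, fp, fn


def multilabel_metrics(y_true, y_pred):
    n_rows, n_labels = len(y_true), len(y_true[0])
    wrong = sum(t[j] != p[j] for t, p in zip(y_true, y_pred) for j in range(n_labels))
    counts = [label_counts(y_true, y_pred, j) for j in range(n_labels)]
    return {
        "hamming_loss": wrong / (n_rows * n_labels),
        "subset_accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / n_rows,
        "micro_f1": f1(*(sum(c[k] for c in counts) for k in range(3))),
        "macro_f1": sum(f1(*c) for c in counts) / n_labels,
    }


# Hypothetical toy example with label order [TWF, HDF, PWF, OSF].
y_true = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
metrics = multilabel_metrics(y_true, y_pred)
```

In this toy case, the single missed label in the last row pulls macro-F1 well below micro-F1, which illustrates why macro averaging is the stricter criterion for rare failure types.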

5.3. Resampling Performance Across Folds

Following the performance evaluation of BHTF, a deeper analysis was conducted to examine how the class distribution evolved through the hybrid resampling process in each fold of cross-validation, represented in Table 9, Table 10, Table 11 and Table 12. For each fold, 90% of the data was used for training and 10% for testing. The values shown in each table correspond to the number of healthy and failure instances in the training set, represented in the format healthy/failure, at three key stages of the resampling pipeline: (i) before SMOTE (original class imbalance), (ii) after SMOTE or before PDU (after minority class upsampling), and (iii) after PDU (after majority class reduction). This breakdown illustrates the success of the proposed resampling strategy in transforming highly imbalanced binary datasets into more balanced ones, thereby improving the learnability for each failure classification task.
The instance distributions shown in Table 9, Table 10, Table 11 and Table 12 reflect the impact of the hybrid resampling strategy—comprising SMOTE for oversampling and the proposed PDU method for undersampling—across all 10 folds for each independent failure type (TWF, HDF, PWF, and OSF). Initially, all datasets exhibited a pronounced imbalance, with failure instances ranging from as low as 41 (TWF) to 104 (HDF) compared to nearly 9000 healthy instances. After applying SMOTE, the failure class in each fold was increased to a target range (approximately 1681–4264), depending on the specific failure type. Subsequently, the PDU step efficiently reduced the majority (healthy) class to levels closely aligned with the upsampled failure counts. On average, this process resulted in nearly balanced distributions, such as 8832/1697 for TWF, 8843/4244 for HDF, 8854/3506 for PWF, and 8883/3616 for OSF (healthy/failure). This reliable balancing across folds and failure types ensured that each binary classifier was trained on data with minimal class bias, thereby increasing the fairness of BHTF across all failure diagnosis tasks.
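The post-SMOTE failure counts quoted above follow from simple arithmetic, sketched below. The sketch assumes that the oversampling percentage denotes the number of synthetic instances created relative to the original minority count (so 4000% adds 40 synthetic instances per original failure instance); this reading is an assumption, although it reproduces the reported range of 1681 to 4264. The function name is illustrative.

```python
def count_after_smote(minority_count, percentage):
    """Minority class size after oversampling by the given percentage (assumed semantics)."""
    synthetic = minority_count * percentage // 100
    return minority_count + synthetic


# Smallest and largest per-fold failure counts mentioned in the text.
twf_after = count_after_smote(41, 4000)
hdf_after = count_after_smote(104, 4000)
print(twf_after, hdf_after)
```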
In addition to the quantitative distributions reported in Table 9, Table 10, Table 11 and Table 12, t-distributed stochastic neighbor embedding (t-SNE) was employed to provide a visual analysis of how the proposed hybrid resampling strategy reshapes the data space. The high-dimensional feature space was projected into a two-dimensional embedding defined by component 1 and component 2 for both the imbalanced (before resampling) and balanced (after resampling) datasets. As shown in Figure 8a, before resampling the dataset is dominated by the majority class (class 0), with failure samples (red points, class 1) sparsely scattered among a dense cluster of healthy (non-failure) samples (blue points, class 0). This makes the minority class (class 1) difficult to learn. After applying SMOTE followed by the proposed PDU undersampling, as illustrated in Figure 8b, the two classes become more distributed, with minority samples forming clearer clusters and achieving improved separation from the majority ones. This visualization confirms that the hybrid resampling pipeline not only balances the class distributions numerically but also enhances the geometric separability of the classes in feature space, thereby facilitating more effective learning by the BHTF model.

5.4. Sensitivity Analysis

To determine the most effective configuration for the proposed BHTF method, an extensive hyperparameter sensitivity analysis was conducted. This process involved systematic experimentation with various parameter settings, including different SMOTE oversampling ratios, neighborhood sizes for the PDU undersampling, numbers of Hoeffding Trees in the ensemble, and subsets of input features. Although a wide range of hyperparameter combinations were explored through grid search, only representative results are presented here to illustrate the main performance trends, represented in Table 13, Table 14, Table 15 and Table 16. For each tested configuration, standard classification metrics were computed across the four failure types (TWF, HDF, PWF, and OSF) to evaluate the model’s diagnostic success in fault detection under data imbalance conditions. The final configuration adopted in BHTF reflects the best-performing combination, selected to mathematically maximize predictive performance while maintaining model simplicity and generalizability.

5.4.1. Effect of SMOTE Ratio

To illustrate the influence of various SMOTE oversampling ratios (R) on the BHTF model’s performance, representative experiments were conducted using three settings: 4000%, 5000%, and 6000%. The classification accuracy for each failure type is reported in Table 13. Among these settings, the 4000% SMOTE ratio achieved the highest overall accuracy, with an average of 97.44% across all failure types. While HDF and OSF showed slight improvements with larger ratios, the performance on TWF declined as the oversampling rate increased. This indicates that excessive SMOTE can introduce noisy or redundant synthetic samples, particularly harming minority class generalization in sensitive failure types such as TWF. As part of the broader hyperparameter search, the 4000% SMOTE setting was adopted in the final BHTF configuration, as it offered the most favorable balance between classification accuracy and result consistency.
Table 13. Accuracy (%) results for each failure type under different SMOTE oversampling ratios (R).
Failure | R = 4000 | R = 5000 | R = 6000
TWF | 93.94 | 93.08 | 92.77
HDF | 98.12 | 98.15 | 98.28
PWF | 98.87 | 98.84 | 98.80
OSF | 98.82 | 99.07 | 99.07
Average | 97.44 | 97.29 | 97.23
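The selection rule described for the SMOTE ratio (choose the setting with the highest accuracy averaged over the four failure types) can be sketched as follows, using the values from Table 13. The dictionary layout and function name are illustrative, and the same pattern would apply to the other hyperparameters examined in this section.

```python
# Accuracy (%) per failure type for each SMOTE ratio R, taken from Table 13.
table13 = {
    4000: {"TWF": 93.94, "HDF": 98.12, "PWF": 98.87, "OSF": 98.82},
    5000: {"TWF": 93.08, "HDF": 98.15, "PWF": 98.84, "OSF": 99.07},
    6000: {"TWF": 92.77, "HDF": 98.28, "PWF": 98.80, "OSF": 99.07},
}


def mean_accuracy(per_label):
    """Unweighted mean over the failure types."""
    return sum(per_label.values()) / len(per_label)


# Pick the ratio whose mean accuracy is highest.
best_ratio = max(table13, key=lambda r: mean_accuracy(table13[r]))
print(best_ratio, f"{mean_accuracy(table13[best_ratio]):.2f}")
```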

5.4.2. Effect of Number of Neighbors in PDU

The sensitivity of BHTF to the number of neighbors used in the PDU technique was assessed by testing different values of the neighborhood size parameter (u), namely 1, 3, 5, 7, and 9. The resulting classification accuracies across all four failure types are reported in Table 14. The overall accuracy remains consistently high across all tested neighborhood sizes, with only slight fluctuations, ranging narrowly from 97.39% to 97.44%. The best average accuracy (97.44%) was achieved at both u = 7 and u = 9; however, to ensure a more computationally efficient configuration, u = 7 was selected as the final setting. This value struck a favorable trade-off between model performance and the risk of excessive undersampling that can arise from overly large neighborhoods.
Table 14. Accuracy (%) results for each failure type under different numbers of neighbors in PDU (u).
Failure | u = 1 | u = 3 | u = 5 | u = 7 | u = 9
TWF | 93.96 | 93.75 | 93.76 | 93.94 | 93.80
HDF | 98.13 | 98.05 | 98.03 | 98.12 | 98.19
PWF | 98.92 | 98.91 | 99.01 | 98.87 | 98.94
OSF | 98.72 | 98.84 | 98.75 | 98.82 | 98.81
Average | 97.43 | 97.39 | 97.39 | 97.44 | 97.44

5.4.3. Effect of Number of Hoeffding Trees

The impact of ensemble size on the classification performance of the proposed BHTF method was investigated using various numbers of Hoeffding Trees (T), namely 10, 50, and 100. Table 15 shows the accuracy results for each failure type under these configurations. The results show that while increasing the number of trees leads to slight gains for some failure types, such as PWF, there is a negligible or even slightly negative effect on others, such as TWF. The average accuracy across all failure types remains relatively stable, with the highest average of 97.44% achieved when T = 10. This indicates that a larger ensemble does not necessarily improve performance and may add unnecessary computational cost. Based on these findings, T = 10 was selected for the final ensemble model to keep BHTF computationally efficient.
Table 15. Accuracy (%) results for each failure type under different numbers of Hoeffding Trees (T).
Failure | T = 10 | T = 50 | T = 100
TWF | 93.94 | 93.85 | 93.77
HDF | 98.12 | 98.11 | 98.09
PWF | 98.87 | 98.91 | 98.95
OSF | 98.82 | 98.80 | 98.80
Average | 97.44 | 97.42 | 97.40

5.4.4. Effect of Number of Selected Features

To evaluate the impact of feature subset size on BHTF performance, we experimented with varying numbers of selected features (f), from 2 to 6. The corresponding accuracy results for each failure type are reported in Table 16. As the number of features increased, accuracy generally improved across the failure categories. Notably, HDF exhibited a significant gain between 2 and 3 features (from 93.87% to 97.29%), and the overall average accuracy rose from 96.24% (f = 2) to 97.44% (f = 6). Although minor fluctuations were observed beyond 3 features, the highest average performance was obtained when all six original features were retained. Therefore, this configuration was selected for the final model, providing the best observed predictive accuracy without unnecessary feature exclusion.
Table 16. Accuracy (%) results for each failure type under different numbers of features (f).
Failure | f = 2 | f = 3 | f = 4 | f = 5 | f = 6
TWF | 93.46 | 93.69 | 93.83 | 93.80 | 93.94
HDF | 93.87 | 97.29 | 98.02 | 98.11 | 98.12
PWF | 98.86 | 98.95 | 98.93 | 98.90 | 98.87
OSF | 98.78 | 98.76 | 98.75 | 98.79 | 98.82
Average | 96.24 | 97.17 | 97.38 | 97.40 | 97.44

5.5. Computational Cost Analysis

Computational cost analysis is an essential component in evaluating new machine learning methods, particularly regarding training time for real-time or streaming applications. Therefore, in addition to predictive performance, we assess the computational efficiency of the proposed BHTF method based on training time. Each experiment was conducted using 10-fold cross-validation, and the training time was measured for every fold and for each failure type label. The unit of measurement is seconds. The results of the per-fold training times are stated in Table 17, along with the averaged values. These results indicate that the proposed BHTF method achieves high predictive performance while maintaining minimal training costs, with average training times below 0.21 s across all labels.
This analysis confirms that BHTF not only provides superior predictive accuracy across all failure types but also requires minimal computational resources, reinforcing its suitability for real-time and streaming applications.

5.6. Hoeffding Tree Structure Analysis

To gain further insight into the internal decision-making behavior of the proposed BHTF framework, this subsection presents visual analyses of representative Hoeffding Tree structures. Figure 9, Figure 10, Figure 11 and Figure 12 illustrate sample Hoeffding Trees extracted from the constructed forests corresponding to each failure mode label, namely TWF, HDF, PWF, and OSF, respectively. These trees were selected from the full set of 400 Hoeffding Trees (10 folds × 10 trees per fold × 4 labels) and serve as interpretable examples for examining how decision paths are formed under the influence of the hybrid balancing strategy and multi-label decomposition. By analyzing these structures, we aim to reveal how different failure types are distinguished based on feature splits and to assess the interpretability of the proposed model in practical predictive maintenance scenarios.
Figure 9 presents a sample Hoeffding Tree for the TWF label, demonstrating the sequence of decisions the BHTF model uses to discriminate between normal and faulty instances. The root node splits on tool wear, which is thus identified as the most informative feature. When tool wear is below or equal to the root threshold (≤201.636), the model immediately classifies the instance as non-failure (0), supported by a large number of observed instances (8273.677), underlining the importance of low tool wear as a strong indicator of healthy operation. When tool wear exceeds the root threshold (>201.636), the tree evaluates the torque feature, where lower values (≤11.196 Nm) lead to a non-failure outcome based on 25.000 instances. Higher torque prompts deeper inspection through rotational speed, followed by process temperature. Only when both of these features exceed their respective thresholds (>1289.273 rpm and >307.811 K) does the tree output a failure (1) prediction—at a leaf supporting 1384.000 instances. This tree illustrates how TWF failures are detected through a hierarchical combination of features by prioritizing tool wear and torque as strong early indicators, while rotational speed and process temperature act as secondary confirmation to accurately flag true fault detections.
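The decision path described for the TWF tree can be written out as a nested rule, which makes its interpretability concrete. The sketch below only transcribes the thresholds quoted in the text for this single sample tree; the function and argument names are illustrative, and it is not the ensemble's actual prediction code (the forest combines several trees, and leaves may use Naive Bayes predictions).

```python
def predict_twf_sample_tree(tool_wear_min, torque_nm, rot_speed_rpm, process_temp_k):
    """Return 1 (failure) or 0 (non-failure) following the sample TWF tree of Figure 9."""
    if tool_wear_min <= 201.636:
        return 0  # low tool wear: classified as healthy at the root
    if torque_nm <= 11.196:
        return 0  # high wear but very low torque: still healthy
    # High wear and higher torque: failure only if both remaining checks are exceeded.
    if rot_speed_rpm > 1289.273 and process_temp_k > 307.811:
        return 1
    return 0


# Hypothetical sensor readings used only to exercise each branch.
healthy_low_wear = predict_twf_sample_tree(100.0, 40.0, 1500.0, 310.0)
failure_case = predict_twf_sample_tree(210.0, 40.0, 1500.0, 310.0)
healthy_low_speed = predict_twf_sample_tree(210.0, 40.0, 1200.0, 310.0)
```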
Figure 10 presents the Hoeffding Tree structure learned for the HDF label, showing how BHTF separates healthy and faulty instances based on thermal and operational characteristics. The root split occurs on air temperature, signaling its importance in identifying overheating-related failures. When the air temperature is ≤301.845 K, the model checks a secondary air temperature threshold (301.245 K): at or below it, instances are classified as non-failure (0) regardless of torque, whereas for air temperatures between 301.245 K and 301.845 K, the model considers rotational speed, assigning failure (1) to low-speed cases (≤1376 rpm) and non-failure otherwise. When air temperature exceeds 301.845 K, rotational speed becomes the primary discriminator, with lower speeds (≤1399.736 rpm) prompting further inspection of process temperature and tool wear. In this case, process temperatures ≤312.095 K lead to failure (1) predictions, with both low and high tool-wear values confirming the faulty class, while higher process temperatures revert to non-failure. At high rotational speeds (>1399.736 rpm), the model again predicts non-failure. This tree shows that air temperature and rotational speed are the key drivers in detecting HDF faults, with process temperature and tool wear used to pinpoint critical failure scenarios under elevated thermal and mechanical stresses.
Figure 11 shows the Hoeffding Tree constructed for the PWF label, capturing how the BHTF model distinguishes between normal and failure states based mainly on operational torque and speed. The root split occurs on rotational speed, emphasizing its significance for detecting PWF-related faults. When rotational speed is ≤1966.383 rpm, the tree focuses on torque. Low-to-moderate torque levels (≤55.182 Nm and ≤58.711 Nm) lead to non-failure (0) predictions. However, when torque exceeds 58.711 Nm, the tree switches to predicting a failure (1), with distinct leaves for moderate-high torque ranges (58.711–63.513 Nm). Additionally, if the rotational speed itself surpasses 1966.383 rpm, the model directly classifies the instance as a failure (1), confirming that very high speeds are a strong indicator of the PWF failure mode. This structure also underscores that high torque is a strong early indicator of PWF, whereas lower operating regimes are reliably associated with healthy system behavior.
Figure 12 presents the Hoeffding Tree structure created for the OSF label, showing that the BHTF framework uses a combination of wear, torque, speed, and machining type to determine the operational state. The root node splits on tool wear, identifying it as the primary factor: wear values ≤183.206 min lead to a non-failure (0) state, with further refinement based on torque. Within this branch, lower torque values (≤61.186 Nm) are split by rotational speed, but regardless of speed (<1403.727 rpm or >1403.727 rpm), both paths lead to non-failure outcomes. Higher torque values (>61.186 Nm) also produce a non-failure prediction, suggesting that low tool wear is a strong signal of healthy operation, even under high torque. When tool wear exceeds 183.206 min, torque again acts as a decision attribute. Lower torque values (≤49.191 Nm) maintain a non-failure prediction, while higher torque values prompt evaluation of the type of machining operation. Here, instances with type L are assigned a failure, at a leaf supported by 2890 instances, whereas types M and H remain in the non-failure class. This tree indicates that high tool wear combined with high torque during type L operations is a strong signature of OSF, whereas low wear or alternative operation types generally reflect healthy behavior.

6. Discussion

To rigorously evaluate the performance of the proposed BHTF method, it was compared against 58 state-of-the-art predictive maintenance approaches drawn from 23 recent studies [48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70], including traditional classifiers, ensemble-based learners, deep neural network architectures, hybrid techniques, and augmentation-based strategies for fault detection in manufacturing systems. The results of this comparison, conducted over the same AI4I 2020 dataset, are summarized in Table 18, which reports for each method the reference, year, method type, training protocol, dataset split, hyperparameter settings, and evaluation metrics.
The BHTF method achieved the highest overall accuracy (97.44%), precision (0.9939), recall (0.9744), and F-measure (0.9839), outperforming all other models tested on the AI4I 2020 dataset. These results underline its robust diagnostic capability in handling complex and imbalanced industrial failure patterns. Compared to the average performance of all state-of-the-art methods (accuracy: 88.94%), BHTF demonstrated a significant improvement of 11% in accuracy. These gains reflect the method’s ability to deliver consistent and reliable predictions across both majority and minority failure categories.
When benchmarked against ensemble models such as EFNC-Exp (97.30%) [61], RF (96.81%) [64], and CatBoost-based techniques [62], BHTF not only slightly improved accuracy but also delivered notably higher recall and F-measure. This performance advantage is largely attributed to BHTF’s integrated balancing strategy, which combines SMOTE-based oversampling with PDU undersampling, effectively mitigating the skew introduced by class imbalance. For instance, while RF [64] achieved high precision (0.9740), it fell short on recall (0.7639), indicating a tendency to miss minority class failures—a problem that BHTF addresses successfully.
In comparison with DNN-based approaches—including CAST, SE-ResNet 18, GE-ResNet 18, and SE-SCNet 18 [58]—as well as standard neural networks [64] and DNNs in [52], BHTF outperformed all on every metric. Although these deep models produced moderately good F-measures (e.g., SE-SCNet 18: 0.8600), they struggled with data imbalance and generalization to complex failure patterns. Similarly, TTML-based hybrid models [63], despite their conceptual innovation, achieved limited accuracy (65–78%), showing a lack of robustness in multi-failure classification without dedicated balancing.
Beyond classical ensembles and deep learning, several advanced frameworks were also considered. For example, metaheuristic-optimized ELM with PLSCO [49] and DNN models coupled with simulated annealing (SA) [57] showed competitive results (up to 97.09% accuracy), while Byzantine fault-tolerant federated learning [50] provided robustness under adversarial conditions, albeit with lower accuracy (about 89%). Likewise, LSTM models with SMOTE resampling [55] and interpretable RF + XAI approaches integrating SHAP and LIME [56] emphasized temporal dynamics and explainability, respectively. However, despite their novelty, these methods were still surpassed by the accuracy of BHTF, which also offered stronger balance across precision, recall, and F-measure.
Data augmentation methods such as SMOTENC + ctGAN + CatBoost [62] improved recall (0.9068) but failed to match BHTF’s overall diagnostic strength. In contrast, BHTF’s hybrid balancing mechanism avoids overfitting and synthetic noise—challenges commonly seen in GAN-based augmentation—while achieving both high recall and precision.
Traditional models such as SVM, KNN, decision trees, and logistic regression [52,53,54,66,67,69] generally underperformed. While decision trees reached a reasonable F-measure of 0.7766 [66], other models such as KNN and NN yielded very low recall (0.2970 and 0.2178, respectively), showing weak detection of rare failures. Likewise, Bayesian logistic regression (BLR) [65], despite high precision (0.9950), had extremely low recall (0.2830), indicating an over-reliance on majority-class prediction. BHTF, in contrast, maintained a balanced performance across all metrics, achieving both high precision and recall.
Additional hybrid and specialized frameworks, such as DFPAIS and SDFIS [60], hyperplane-based methods [68], and RUSBoost trees [69], showed limited generalization. For example, RUSBoost trees attained a recall of 0.9085 but suffered from very low precision (0.3071), resulting in a weak F-measure. Although data-blind machine learning [70] achieved a competitive accuracy of 97.30%, it lacked reported precision and recall values, making it difficult to verify its balanced performance under class-imbalanced conditions.
In conclusion, BHTF not only outperformed individual models but also exceeded the average performance of the entire group of state-of-the-art methods by a substantial margin. The improvement—11% in accuracy—strongly validates BHTF’s reliability in real-world predictive maintenance scenarios characterized by data imbalance and multiple failure modes. The strength of BHTF lies in its simultaneous integration of three complementary paradigms of learning—multi-label learning for handling concurrent failure modes, incremental learning for adaptive knowledge acquisition in streaming contexts, and ensemble learning for enhanced generalization—augmented by hybrid oversampling and undersampling techniques within a single framework. While prior studies address these aspects individually or partially, none of the compared state-of-the-art methods incorporate all three paradigms together. This comprehensive design ensures not only superior predictive accuracy but also scalability and adaptability to evolving industrial conditions, thereby positioning BHTF as a distinctive and practical solution for complex predictive maintenance environments.
To statistically validate the superior performance of our proposed BHTF method over these state-of-the-art approaches listed in Table 18, the Wilcoxon signed-rank test [71] was employed. This non-parametric test is particularly suitable for comparing paired data and does not assume normality, instead relying on the symmetry of the distribution of differences. The proposed method achieved an average improvement of 11% over the competing approaches. To assess the significance of this improvement, the null hypothesis (H0) assumes that there is no significant difference between the median performance of BHTF and the competing methods. The test yielded a p-value of 2.39 × 10−9, which is substantially lower than the conventional significance threshold of 0.05. This result provides strong evidence against the null hypothesis and confirms that the performance improvements achieved by the BHTF method are statistically significant. Therefore, the proposed approach not only outperforms individual methods in terms of accuracy, precision, recall, and F-measure but also demonstrates consistent and statistically significant superiority across the evaluated models.
The mathematical expression for the Wilcoxon test is shown in Equation (22):
$$W = \sum_{i=1}^{n} R_i^{+} \qquad (22)$$
where
  • $n$: the total number of non-zero matched differences used in the analysis;
  • $R_i^{+}$: the rank given to each positive difference between matched pairs to represent the contribution of that pair to the overall test statistic;
  • $W$: the Wilcoxon signed-rank statistic, determined by summing the ranks of all positive differences observed in the paired data.
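As a minimal illustration of the statistic defined above, the following sketch computes the sum of positive ranks for paired scores. Zero differences are discarded and tied absolute differences receive their average rank, which is the usual convention but an assumption with respect to the exact procedure used in the study. The accuracy values are hypothetical and are not the entries of Table 18; the p-value computation is not covered.

```python
def wilcoxon_w_plus(scores_a, scores_b):
    """Sum of the ranks of the positive paired differences (a minus b)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]  # drop zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next absolute difference is tied with this one.
        while end + 1 < len(order) and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1  # ranks are 1-based
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    return sum(rank for rank, diff in zip(ranks, diffs) if diff > 0)


# Hypothetical paired accuracies (%): a proposed method versus five competitors.
proposed = [97.5, 97.5, 97.5, 97.5, 97.5]
competitors = [96.0, 98.0, 95.5, 94.5, 97.5]
w_plus = wilcoxon_w_plus(proposed, competitors)
```

Here the differences are 1.5, -0.5, 2.0, and 3.0 (the zero difference is dropped), so the positive differences hold ranks 2, 3, and 4.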

7. External Validation Across Diverse Datasets

In this study, the AI4I 2020 dataset served as the primary benchmark for developing and evaluating our proposed BHTF method extensively. To strengthen the validity and generalizability of our approach, we conducted an external validation, i.e., evaluating the model on entirely different datasets that were not used during model development. External validation is a critical step in predictive maintenance research as it assesses whether the method maintains consistent performance across different industrial contexts, data distributions, and failure characteristics. To this end, we selected 4 additional multi-label predictive maintenance datasets, each containing 16 distinct failure types, and evaluated BHTF across them. Below, we briefly introduce the datasets, describe the hyperparameter tuning carried out for each to ensure fair adaptation, and present the main results.
The selected datasets are derived from [72] and relate to predictive maintenance in steel manufacturing. These datasets were generated to simulate the tandem cold rolling (TCM) process, which is crucial in steel production. For the purposes of multi-label PdM evaluation, we selected 4 datasets that include all 16 anomaly labels, corresponding to the diverse failure types across the rolling stands. A summary of the selected datasets is presented in Table 19, in which the observations column indicates the total number of data points, while anomalies shows the number of labeled failure events. Features refer to the total number of measured parameters for each observation, and anomaly types represent the number of distinct failure labels included in the dataset. The products column specifies the number of steel product types processed, and data drift indicates whether the dataset includes shifts or changes in the underlying data distribution over time.
Each dataset is generated as a chronological data stream, with observations ordered by increasing work roll mileage. A total of 51 features are recorded across the 5 rolling stands, including entry and exit thickness, width, yield strength, work roll diameter and mileage, thickness reduction, interstand tension, roll speed, rolling force, torque, stand gap, and motor power. Anomalies were introduced based on four types of failures: reduction scheme, electric motor, bearing, and work roll friction. Apart from the reduction anomaly, all other anomalies are stand specific, resulting in 16 distinct anomaly labels. Table 20 summarizes these features and anomaly labels for each dataset.
For the external validation on the four selected TCM datasets, the hyperparameters of the proposed BHTF method were adjusted to account for the increased dataset size and higher class imbalance. While the AI4I 2020 dataset contained 10,000 instances, each of the TCM datasets has approximately 20,000 instances. To address the more pronounced imbalance, the SMOTE oversampling ratio was doubled from R = 4000% (used for AI4I 2020) to R = 8000%, with k = 5 nearest neighbors for synthetic sample generation. Similarly, the PDU parameter u was increased from 7 to 14, meaning that for each minority instance, up to 14 nearest majority neighbors were identified using a LinearNNSearch with k = 1. Feature selection was also optimized, selecting numToSelect = 3 attributes, while the number of labels was set to 16 to match the multi-label structure of the TCM datasets. Other hyperparameters, such as the number of folds (folds = 10) and the number of iterations (setNumIterations = 10), were maintained from the original settings. All parameter values were determined through iterative tuning and evaluation to achieve the best performance across the new datasets.
Table 21 reports the accuracy of the proposed BHTF method on the four TCM datasets for 16 failure types. The method achieves high accuracy overall, with average values of 98.47%, 97.80%, 98.34%, and 96.40% for tcm5_dataset_3, tcm5_dataset_4, tcm5_dataset_5, and tcm5_dataset_6, respectively. The results highlight the robustness of BHTF in handling large, imbalanced, and multi-label datasets.

8. Conclusions and Future Works

In this study, a novel ensemble-based approach, Balanced Hoeffding Tree Forest (BHTF), was proposed to address the challenges of predictive maintenance in complex industrial settings. By combining artificial intelligence with industrial IoT data, BHTF aims to forecast equipment failures before they occur, thereby reducing unplanned downtime, lowering maintenance costs, and enhancing operational safety. Unlike traditional models that often struggle with data imbalance and limited failure representation, BHTF introduces a tailored solution that integrates advanced techniques across both modeling and preprocessing stages. The core innovation of BHTF lies in its multi-label fault detection framework, which employs binary relevance to model each failure type independently while preserving co-occurrence relationships. This design enables a more realistic and actionable diagnosis of equipment conditions in manufacturing environments. To further boost the model, a hybrid class balancing strategy was developed, combining SMOTE oversampling with the PDU undersampling technique. This dual-phase preprocessing pipeline addresses the intrinsic imbalance in real-world maintenance datasets, where failure events are significantly rarer than normal operation.
The proposed BHTF was extensively evaluated on the AI4I 2020 predictive maintenance dataset, which includes four critical industrial failure categories: tool wear failure (TWF), heat dissipation failure (HDF), power failure (PWF), and overstrain failure (OSF). The model achieved an average accuracy of 97.44%, with corresponding precision (0.9939), recall (0.9744), and F-measure (0.9839) scores, indicating strong diagnostic capability across all four failure types. Furthermore, a comparative analysis against a diverse set of state-of-the-art methods, ranging from traditional machine learning to deep learning and hybrid ensemble models, confirmed the superiority of BHTF, which improved accuracy by 11% on average. The strength of BHTF stems from its simultaneous integration of multi-label learning, incremental learning, and ensemble learning, three paradigms that have not been collectively addressed by the compared methods. While prior works typically focus on one or two of these dimensions in isolation, BHTF unifies them within a single framework. These results underscore the real-world applicability of BHTF for predictive maintenance in dynamic industrial environments, where accurate and timely failure detection is mission-critical.
Despite the comprehensive analysis and promising results, this study has limitations that point to directions for future work. One potential direction is the creation of a software platform or intelligent service based on BHTF that can operate on real-time industrial data streams. Such a system would not only monitor equipment health continuously but also automatically integrate newly collected transactional data into the historical dataset, enabling incremental learning and regular model updates. This continuous adaptation would improve diagnostic accuracy and ensure that the system remains effective in dynamic manufacturing environments where fault detection is critical.
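Hoeffding Trees fit the streaming scenario described above because a split decision relies only on running statistics and the Hoeffding bound, so past instances need not be stored. The following minimal Python sketch evaluates that bound, $\epsilon = \sqrt{R^2 \ln(1/\delta) / (2n)}$. The value range R = 1 (the maximum information gain for a binary label) and the confidence parameter δ = 1e-7 are illustrative assumptions, not settings reported in this study.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon such that, with probability 1 - delta, the true mean of a
    variable with the given range lies within epsilon of the mean of n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


# Assumed, illustrative values (not taken from the study).
VALUE_RANGE = 1.0  # information gain with two classes is at most log2(2) = 1
DELTA = 1e-7       # allowed probability of selecting the wrong split attribute

for n in (100, 1_000, 10_000):
    print(n, round(hoeffding_bound(VALUE_RANGE, DELTA, n), 4))
```

A leaf is split once the observed gain difference between its two best attributes exceeds ε, and ε shrinks as more instances arrive from the stream.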
Additionally, future implementations could focus on embedding the BHTF model into industrial IoT platforms such as monitoring dashboards, edge-computing devices, or mobile applications. These platforms could provide real-time alerts to operators, identifying which specific failure types are predicted to occur and enabling preemptive actions to alleviate potential damage. The system’s predictions could be visualized in a user-friendly format, improving interpretability and enabling even non-expert users to make informed maintenance decisions. Developing mechanisms to prioritize machines based on failure likelihood would help organizations allocate resources more efficiently and implement condition-based maintenance strategies powered by artificial intelligence.
Furthermore, an emerging direction for future work involves adapting the BHTF framework to TinyML environments to enable lightweight and energy-efficient deployment on resource-constrained edge devices. Integrating BHTF with TinyML would allow predictive maintenance models to operate directly on microcontrollers or embedded sensors, reducing latency, enhancing privacy, and minimizing reliance on centralized infrastructure. This would be particularly advantageous in remote or bandwidth-limited manufacturing settings where real-time decision making is essential. Exploring model compression, pruning, or quantization techniques to adapt BHTF for low-power hardware could considerably broaden its applicability and contribute to the development of intelligent and autonomous maintenance systems. Collectively, these future extensions would move the BHTF approach closer to full deployment within factories and support intelligent and human-centric maintenance ecosystems.

Author Contributions

Conceptualization, B.G.; methodology, B.G.; software, B.G.; validation, B.G.; formal analysis, B.G.; investigation, B.G., R.A.K., D.B. and R.Y.; resources, B.G., R.A.K., D.B. and R.Y.; data curation, B.G., R.A.K., D.B. and R.Y.; writing—original draft preparation, B.G.; writing—review and editing, R.A.K., D.B. and R.Y.; visualization, B.G.; supervision, R.A.K. and D.B.; project administration, R.A.K., D.B. and R.Y.; funding acquisition, R.A.K. and R.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The “AI4I 2020 Predictive Maintenance” dataset [45] is publicly available in the UCI Machine Learning Repository (University of California, Irvine, CA, USA) (https://archive.ics.uci.edu/ml/datasets/AI4I+2020+Predictive+Maintenance+Dataset, accessed on 22 May 2025) for predictive modeling tasks. Furthermore, the “TCM: Benchmark Datasets for Predictive Maintenance in Steel Manufacturing” [72] datasets, including tcm5_dataset_3, tcm5_dataset_4, tcm5_dataset_5, and tcm5_dataset_6, are publicly available in the Zenodo repository (CERN Research Institute, Geneva, Switzerland) (https://zenodo.org/records/11469702, accessed on 20 August 2025), a general-purpose platform for sharing research outputs.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

AdaBoost: Adaptive boosting
ADASYN: Adaptive synthetic sampling
AI: Artificial intelligence
ANN: Artificial neural network
AUC-ROC: Area under the receiver operating characteristic curve
BFT: Byzantine fault tolerant
BGRU: Bidirectional gated recurrent unit
BHTF: Balanced Hoeffding Tree Forest
BLR: Binary logistic regression
BR: Binary relevance
CART: Classification and regression trees
CAST: Channel-spatial attention-based temporal
CatBoost: Categorical boosting
CC: Classifier chain
CNN: Convolutional neural network
ctGAN: Conditional tabular generative adversarial network
DFPAIS: Data filling approach based on probability analysis in incomplete soft sets
DNN: Deep neural network
DT: Decision tree
EFNC-Exp: Evolving fuzzy neural classifier with expert rules
EL: Ensemble learning
ELM: Extreme learning machine
FD: Fault detection
FPR: False-positive rate
GB: Gradient boosting
HDF: Heat dissipation failure
IL: Incremental learning
KNN: K-nearest neighbors
LDA: Linear discriminant analysis
LightGBM: Light gradient boosting machine
LIME: Local interpretable model-agnostic explanations
LOF: Local outlier factor
LP: Label powerset
LR: Logistic regression
LSTM: Long short-term memory
MAE: Mean absolute error
MCC: Matthews correlation coefficient
ML: Machine learning
MLL: Multi-label learning
MLP: Multi-layer perceptron
MRMR: Minimum redundancy maximum relevance
MSE: Mean squared error
NB: Naive Bayes
NN: Neural network
OSF: Overstrain failure
PART: Partial decision tree
PCA: Principal component analysis
PdM: Predictive maintenance
PDU: Proximity-driven undersampling
PLSCO: Polar lights salp cooperative optimizer
PWF: Power failure
QDA: Quadratic discriminant analysis
RAKEL D: Random k-labelsets D
RAKEL O: Random k-labelsets O
ResNet: Residual neural network
RF: Random forest
RMSE: Root mean squared error
RNF: Random failures
RUL: Remaining useful life
RUS: Random undersampling
RUSBoost: Random undersampling boosting
SA: Simulated annealing
SDFIS: Simplified approach for data filling in incomplete soft sets
Self-ONN: Self-organized operational neural network
SHAP: Shapley additive explanations
SMOTE: Synthetic minority over-sampling technique
SMOTENC: Synthetic minority over-sampling technique for nominal and continuous
SODA: Self-organized direction-aware data partitioning
SVM: Support vector machine
TPR: True-positive rate
TTML: Tensor trains-based machine learning
TWF: Tool wear failure
t-SNE: t-distributed stochastic neighbor embedding
XAI: Explainable artificial intelligence
XGBoost: Extreme gradient boosting

References

  1. Tsallis, C.; Papageorgas, P.; Piromalis, D.; Munteanu, R.A. Application-Wise Review of Machine Learning-Based Predictive Maintenance: Trends, Challenges, and Future Directions. Appl. Sci. 2025, 15, 4898.
  2. Khattach, O.; Moussaoui, O.; Hassine, M. End-to-End Architecture for Real-Time IoT Analytics and Predictive Maintenance Using Stream Processing and ML Pipelines. Sensors 2025, 25, 2945.
  3. Ucar, A.; Karakose, M.; Kırımça, N. Artificial Intelligence for Predictive Maintenance Applications: Key Components, Trustworthiness, and Future Trends. Appl. Sci. 2024, 14, 898.
  4. Esteban, A.; Zafra, A.; Ventura, S. Data Mining in Predictive Maintenance Systems: A Taxonomy and Systematic Review. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2022, 12, e1471.
  5. Altalhan, M.; Algarni, A.; Alouane, M.T.-H. Imbalanced Data Problem in Machine Learning: A Review. IEEE Access 2025, 13, 13686–13699.
  6. Sajid, N.A.; Rahman, A.; Ahmad, M.; Musleh, D.; Basheer Ahmed, M.I.; Alassaf, R.; Chabani, S.; Ahmed, M.S.; Salam, A.A.; AlKhulaifi, D. Single vs. Multi-Label: The Issues, Challenges and Insights of Contemporary Classification Schemes. Appl. Sci. 2023, 13, 6804.
  7. Hulten, G.; Spencer, L.; Domingos, P. Mining Time-Changing Data Streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 26–29 August 2001; pp. 97–106.
  8. Lins, R.G.; Nascimento de Freitas, T.; Gaspar, R. Methodology for Commercial Vehicle Mechanical Systems Maintenance: Data-Driven and Deep-Learning-Based Prediction. IEEE Access 2025, 13, 33799–33812.
  9. Lin, K.Y.; Hong, Y.H.; Li, M.H.; Shi, Y.; Matsuno, K. Predictive maintenance in industrial systems: An XGBoost-based approach for failure time estimation and resource optimization. J. Ind. Prod. Eng. 2025, 1–24.
  10. Aydın, C.; Evrentuğ, B. Evaluation of Predictive Maintenance Efficiency with the Comparison of Machine Learning Models in Machining Production Process in Brake Industry. PeerJ Comput. Sci. 2025, 11, e2999.
  11. Yıldırım, Ş.; Yücekaya, A.D.; Hekimoğlu, M.; Ucal, M.; Aydin, M.N.; Kalafat, İ. AI-Driven Predictive Maintenance for Workforce and Service Optimization in the Automotive Sector. Appl. Sci. 2025, 15, 6282.
  12. Gunckel, P.; Lobos, G.; Rodríguez, F.; Bustos, R.; Godoy, D. Methodology proposal for the development of failure prediction models applied to conveyor belts of mining material using machine learning. Reliab. Eng. Syst. Saf. 2025, 256, 110709.
  13. Aminzadeh, A.; Sattarpanah Karganroudi, S.; Majidi, S.; Dabompre, C.; Azaiez, K.; Mitride, C.; Sénéchal, E. A Machine Learning Implementation to Predictive Maintenance and Monitoring of Industrial Compressors. Sensors 2025, 25, 1006.
  14. Hu, H.; Xu, K.; Zhang, X.; Li, F.; Zhu, L.; Xu, R.; Li, D. Research on Predictive Maintenance Methods for Current Transformers with Iron Core Structures. Electronics 2025, 14, 625.
  15. Wu, M.; Goh, K.W.; Chaw, K.H.; Koh, Y.S.; Dares, M.; Yeong, C.F.; Zhang, Y. An Intelligent Predictive Maintenance System Based on Random Forest for Addressing Industrial Conveyor Belt Challenges. Front. Mech. Eng. 2024, 10, 1383202.
  16. Shah, S.S.; Daoliang, T.; Kumar, S.C.H. RUL forecasting for wind turbine predictive maintenance based on deep learning. Heliyon 2024, 10, e39268.
  17. Yu, B.; Kim, Y.; Lee, T.; Cho, Y.; Park, J.; Lee, J.; Park, J. Study on Methods Using Multi-Label Learning for the Classification of Compound Faults in Auxiliary Equipment Pumps of Marine Engine Systems. Processes 2024, 12, 2161.
  18. Qureshi, U.R.; Rashid, A.; Altini, N.; Bevilacqua, V.; La Scala, M. Radiometric Infrared Thermography of Solar Photovoltaic Systems: An Explainable Predictive Maintenance Approach for Remote Aerial Diagnostic Monitoring. Smart Cities 2024, 7, 1261–1288.
  19. Maldonado-Correa, J.; Valdiviezo-Condolo, M.; Artigao, E.; Martín-Martínez, S.; Gómez-Lázaro, E. Classification of Highly Imbalanced Supervisory Control and Data Acquisition Data for Fault Detection of Wind Turbine Generators. Energies 2024, 17, 1590.
  20. Khalil, A.F.; Rostam, S. Machine Learning-Based Predictive Maintenance for Fault Detection in Rotating Machinery: A Case Study. Eng. Technol. Appl. Sci. Res. 2024, 14, 13181–13189.
  21. Hadi, R.H.; Hady, H.N.; Hasan, A.M.; Al-Jodah, A.; Humaidi, A.J. Improved Fault Classification for Predictive Maintenance in Industrial IoT Based on AutoML: A Case Study of Ball-Bearing Faults. Processes 2023, 11, 1507.
  22. Fordal, J.M.; Schjølberg, P.; Helgetun, H.; Skjermo, T.Ø.; Wang, Y.; Wang, C. Application of Sensor Data Based Predictive Maintenance and Artificial Neural Networks to Enable Industry 4.0. Adv. Manuf. 2023, 11, 248–263.
  23. Muideen, A.A.; Lee, C.K.M.; Chan, J.; Pang, B.; Alaka, H. Broad Embedded Logistic Regression Classifier for Prediction of Air Pressure Systems Failure. Mathematics 2023, 11, 1014.
  24. Berghout, T.; Bentrcia, T.; Lim, W.H.; Benbouzid, M. A Neural Network Weights Initialization Approach for Diagnosing Real Aircraft Engine Inter-Shaft Bearing Faults. Machines 2023, 11, 1089.
  25. Zhang, Y.; Liu, B.; Wang, C. A Fault Diagnosis Method for Electrical Equipment With Imbalanced SCADA Data Based on SMOTE Oversampling and Domain Adaptation. In Proceedings of the 2023 8th International Conference on Power and Renewable Energy (ICPRE), Shanghai, China, 22–25 September 2023; IEEE: New York, NY, USA, 2023; pp. 195–202.
  26. Dangut, M.D.; Jennions, I.K.; King, S.; Skaf, Z. A Rare Failure Detection Model for Aircraft Predictive Maintenance Using a Deep Hybrid Learning Approach. Neural Comput. Appl. 2023, 35, 2991–3009.
  27. Hung, Y.-H. Developing an Improved Ensemble Learning Approach for Predictive Maintenance in the Textile Manufacturing Process. Sensors 2022, 22, 9065.
  28. Mihigo, I.N.; Zennaro, M.; Uwitonze, A.; Rwigema, J.; Rovai, M. On-Device IoT-Based Predictive Maintenance Analytics Model: Comparing TinyLSTM and TinyModel from Edge Impulse. Sensors 2022, 22, 5174.
  29. Abdalla, R.; Samara, H.; Perozo, N.; Carvajal, C.P.; Jaeger, P. Machine learning approach for predictive maintenance of the electrical submersible pumps (ESPs). ACS Omega 2022, 7, 17641–17651.
  30. Ouadah, A.; Zemmouchi-Ghomari, L.; Salhi, N. Selecting an appropriate supervised machine learning algorithm for predictive maintenance. Int. J. Adv. Manuf. Technol. 2022, 119, 4277–4301.
  31. Chen, H.; Hsu, J.Y.; Hsieh, J.Y.; Hsu, H.Y.; Chang, C.H.; Lin, Y.J. Predictive maintenance of abnormal wind turbine events by using machine learning based on condition monitoring for anomaly detection. J. Mech. Sci. Technol. 2021, 35, 5323–5333.
  32. Ince, T.; Malik, J.; Devecioglu, O.C.; Kiranyaz, S.; Avci, O.; Eren, L.; Gabbouj, M. Early Bearing Fault Diagnosis of Rotating Machinery by 1D Self-Organized Operational Neural Networks. arXiv 2021, arXiv:2109.14873.
  33. Arora, A.; Tsigelny, I.F.; Kouznetsova, V.L. Laryngeal cancer diagnosis via miRNA-based decision tree model. Eur. Arch. Oto-Rhino-Laryngol. 2024, 281, 1391–1399.
  34. Iqbal, N.; Kumar, P. Coronavirus Disease Predictor: An RNA-Seq Based Pipeline for Dimension Reduction and Prediction of COVID-19. J. Phys. Conf. Ser. 2021, 2089, 012025.
  35. Mercaldo, F.; Nardone, V.; Santone, A. Diabetes Mellitus Affected Patients Classification and Diagnosis through Machine Learning Techniques. Procedia Comput. Sci. 2017, 112, 2519–2528.
  36. Thaiparnit, S.; Kritsanasung, S.; Chumuang, N. A Classification for Patients with Heart Disease Based on Hoeffding Tree. In Proceedings of the International Joint Conference on Computer Science and Software Engineering, Chonburi, Thailand, 10–12 July 2019; pp. 352–357.
  37. Pramkeaw, P.; Chumuang, N.; Ketcham, M.; Ganokratanaa, T.; Yimyam, W.; Kwansomkid, K.; Makararpong, D. A Machine Learning Framework for Diabetes Detection Using Hoeffding Tree. In Proceedings of the 2025 IEEE International Conference on Cybernetics and Innovations (ICCI), Chonburi, Thailand, 2–4 April 2025; pp. 1–6.
  38. Mohammad, M.A.; Kolahkaj, M. Detecting Network Anomalies Using the Rain Optimization Algorithm and Hoeffding Tree-Based Autoencoder. In Proceedings of the 2024 10th International Conference on Web Research (ICWR), Tehran, Iran, 24–25 April 2024; pp. 137–141.
  39. Rezki, D.; Mouss, L.-H.; Baaziz, A.; Bentrcia, T. Adaptive Prediction of Rate of Penetration While Oil-Well Drilling: A Hoeffding Tree Based Approach. Eng. Appl. Artif. Intell. 2025, 159, 111465.
  40. Chen, W.; Zhang, S. GIS-based comparative study of Bayes network, Hoeffding tree and logistic model tree for landslide susceptibility modeling. Catena 2021, 203, 105344.
  41. de Araújo Josephik, J.G.A.; Siqueira, Y.; Machado, K.G.; Terada, R.; dos Santos, A.L.; Nogueira, M.; Batista, D.M. Applying Hoeffding Tree Algorithms for Effective Stream Learning in IoT DDoS Detection. In Proceedings of the Latin-American Conference on Communications (LATINCOM), Panama City, Panama, 15–17 November 2023; pp. 1–6.
  42. Soares, D.; Dewan, M.A.A.; Lin, O. A Hoeffding Decision Tree Based Approach for Soil Classification. In Proceedings of the 35th Canadian Conference on Artificial Intelligence, Toronto, Ontario, Canada, 30 May–3 June 2022; pp. 1–12.
  43. Zhang, M.L.; Li, Y.K.; Liu, X.Y.; Geng, X. Binary relevance for multi-label learning: An overview. Front. Comput. Sci. 2018, 12, 191–202.
  44. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority oversampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
  45. AI4I 2020 Predictive Maintenance Dataset; UCI Machine Learning Repository: Irvine, CA, USA, 2020.
  46. Witten, I.H.; Frank, E.; Hall, M.A.; Pal, C.J. Data Mining: Practical Machine Learning Tools and Techniques, 4th ed.; Morgan Kaufmann: Cambridge, MA, USA, 2016; pp. 1–664. Available online: https://ml.cms.waikato.ac.nz/weka (accessed on 22 May 2025).
  47. Pearson, K. Notes on regression and inheritance in the case of two parents. In Proceedings of the Royal Society of London, London, UK, 20 June 1895; Volume 58, pp. 240–242.
  48. Chandu, H.S. A Study of Machine Learning Techniques for Predicting Equipment Failures in Industrial Maintenance. In Proceedings of the 2025 IEEE International Conference on Emerging Technologies and Applications (MPSec ICETA), Gwalior, India, 21–23 February 2025; pp. 1–6.
  49. Besha, A.R.M.A.; Ojekemi, O.S.; Oz, T.; Adegboye, O. PLSCO: An Optimization-Driven Approach for Enhancing Predictive Maintenance Accuracy in Intelligent Manufacturing. Processes 2025, 13, 2707.
  50. Jahani, K.; Moshiri, B.; Hossein Khalaj, B. Secure PDM: A Novel Byzantine Fault Tolerant Federated Learning Framework Using a Robust PCA-Based Anomaly Detection Approach. Int. J. Ind. Electron. Control Optim. 2025.
  51. Araujo, S.A.d.; Bomfim, S.L.; Boukouvalas, D.T.; Lourenço, S.R.; Ibusuki, U.; Oliveira Neto, G.C.d. Integration of Data Analytics and Data Mining for Machine Failure Mitigation and Decision Support in Metal–Mechanical Industry. Logistics 2025, 9, 109.
  52. Prashanth, B.S.; Manoj Kumar, M.V.; Almuraqab, N.; Puneetha, B.H. Leveraging Safe and Secure AI for Predictive Maintenance of Mechanical Devices Using Incremental Learning and Drift Detection. Comput. Mater. Contin. 2025, 83, 4979–4998.
  53. Özdemir, K.; Işık, G. Üretim Süreçlerinde Yapay Zekâ Destekli Hatalı Parça Tahminine Yönelik Bir Uygulama. In Proceedings of the 1. Bilsel Uluslararası Anı Bilimsel Araştırmalar Kongresi, Kars, Turkey, 28–29 June 2025; pp. 175–182.
  54. Kumar, S.; Panchal, A.; Rawat, U.; Bhattacharya, P.; Kumar, K. Optimizing Grid Equipment Maintenance through Robust Machine Learning. In Proceedings of the 2025 International Conference on Next Generation Communication & Information Processing (INCIP), Bangalore, India, 23–24 January 2025; pp. 194–199.
  55. Misaii, H.; Fouladirad, M.; Ponchet-Durupt, A.; Askari, B. Predictive Degradation Modelling Using Artificial Intelligence: Milling Machine Case Study. In Proceedings of the European Safety and Reliability Conference ESREL 2024, Cracow, Poland, 23–27 June 2024; Jagiellonian University: Cracow, Poland, 2024; pp. 193–200. Available online: https://hal.science/hal-04564828v1 (accessed on 22 May 2025).
  56. Presciuttini, A.; Cantini, A.; Portioli-Staudacher, A. From Explanations to Actions: Leveraging SHAP, LIME, and Counterfactual Analysis for Operational Excellence in Maintenance Decisions. In Proceedings of the 4th International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME), Male, Maldives, 4–6 November 2024; pp. 1–6.
  57. Hung, Y.-H.; Huang, M.-L.; Wang, W.-P.; Chen, G.-L. Hybrid Approach Combining Simulated Annealing and Deep Neural Network Models for Diagnosing and Predicting Potential Failures in Smart Manufacturing. Sens. Mater. 2024, 36, 49–65.
  58. Liu, C.-L.; Su, H.-C. Temporal learning in predictive health management using channel-spatial attention-based deep neural networks. Adv. Eng. Inform. 2024, 62, 102604.
  59. Ghadekar, P.; Manakshe, A.; Madhikar, S.; Patil, S.; Mukadam, M.; Gambhir, T. Predictive Maintenance for Industrial Equipment: Using XGBoost and Local Outlier Factor with Explainable AI for Analysis. In Proceedings of the 2024 14th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 18–19 January 2024; pp. 25–30.
  60. Kong, Z.; Lu, Q.; Wang, L.; Guo, G. A Simplified Approach for Data Filling in Incomplete Soft Sets. Expert Syst. Appl. 2023, 213, 119248.
  61. Souza, P.V.C.; Lughofer, E. EFNC-Exp: An evolving fuzzy neural classifier integrating expert rules and uncertainty. Fuzzy Sets Syst. 2023, 466, 108438.
  62. Chen, C.-H.; Tsung, C.-K.; Yu, S.-S. Designing a Hybrid Equipment-Failure Diagnosis Mechanism under Mixed-Type Data with Limited Failure Samples. Appl. Sci. 2022, 12, 9286.
  63. Vandereycken, B.; Voorhaar, R. TTML: Tensor trains for general supervised machine learning. arXiv 2022, arXiv:2203.04352.
  64. Falla, B.F.; Ortega, D.A. Evaluación De Algoritmos De Inteligencia Artificial Aplicados Al Mantenimiento Predictivo. Ph.D. Thesis, Corporación Universitaria Autónoma de Nariño (AUNAR), Nariño, Colombia, 3 June 2022. Available online: http://repositorio.aunar.edu.co:8080/xmlui/handle/20.500.12276/1258 (accessed on 22 May 2025).
  65. Iantovics, L.B.; Enachescu, C. Method for Data Quality Assessment of Synthetic Industrial Data. Sensors 2022, 22, 1608.
  66. Vuttipittayamongkol, P.; Arreeras, T. Data-driven Industrial Machine Failure Detection in Imbalanced Environments. In Proceedings of the IEEE International Conference on Industrial Engineering and Engineering Management, Kuala Lumpur, Malaysia, 7–10 December 2022; pp. 1224–1227.
  67. Mota, B.; Faria, P.; Ramos, C. Predictive Maintenance for Maintenance-Effective Manufacturing Using Machine Learning Approaches. In Lecture Notes in Networks and Systems, Proceedings of 17th International Conference on Soft Computing Models in Industrial and Environmental Applications, Salamanca, Spain, 5–7 September 2022; Springer International Publishing AG: Cham, Switzerland, 2022; Volume 531, pp. 13–22.
  68. Diao, L.; Deng, M.; Gao, J. Clustering by Constructing Hyper-Planes. IEEE Access 2021, 9, 70167–70181.
  69. Torcianti, A.; Matzka, S. Explainable Artificial Intelligence for Predictive Maintenance Applications using a Local Surrogate Model. In Proceedings of the 4th International Conference on Artificial Intelligence for Industries, Laguna Hills, CA, USA, 20–22 September 2021; pp. 86–88.
  70. Pastorino, J.; Biswas, A.K. Data-Blind ML: Building privacy-aware machine learning models without direct data access. In Proceedings of the IEEE Fourth International Conference on Artificial Intelligence and Knowledge Engineering, Laguna Hills, CA, USA, 1–3 December 2021; pp. 95–98.
  71. Zimmerman, D.W.; Zumbo, B.D. Relative Power of the Wilcoxon Test, the Friedman Test, and Repeated-Measures ANOVA on Ranks. J. Exp. Educ. 1993, 62, 75–86.
  72. Jakubowski, J.; Bobek, S.; Nalepa, G.J. TCM: Benchmark Datasets for Predictive Maintenance in Steel Manufacturing; Zenodo: Geneva, Switzerland, 2024.
Figure 1. The general architecture of the proposed BHTF method for failure mode diagnosis.
Figure 2. Illustration of the proposed hybrid resampling strategy combining SMOTE oversampling and PDU undersampling.
Figure 3. Feature importance ranking for TWF.
Figure 4. Feature importance ranking for HDF.
Figure 5. Feature importance ranking for PWF.
Figure 6. Feature importance ranking for OSF.
Figure 7. Confusion matrices: (a) TWF failure type, (b) HDF failure type, (c) PWF failure type, and (d) OSF failure type.
Figure 8. t-SNE visualization of the dataset before (a) and after (b) applying the proposed hybrid resampling strategy.
Figure 9. Sample Hoeffding Tree structure for the TWF label generated by the BHTF framework.
Figure 10. Sample Hoeffding Tree structure for the HDF label generated by the BHTF framework.
Figure 11. Sample Hoeffding Tree structure for the PWF label generated by the BHTF framework.
Figure 12. Sample Hoeffding Tree structure for the OSF label generated by the BHTF framework.
Table 1. Overview of recent predictive maintenance studies.
Ref | Year | Method | Machine | C/R | Label | Sampling | Purpose
[8] | 2025 | LSTM | Vehicle | - | S | O | Failure prediction
[9] | 2025 | XGBoost, RF, LSTM | Aircraft engine | | S | - | RUL prediction
[10] | 2025 | DT, NB, KNN, SVM, AdaBoost, RF, CatBoost, XGBoost, LightGBM, MLP | Braking component | - | S | U | Failure classification
[11] | 2025 | DT, RF, LightGBM, XGBoost | Vehicle | - | S | - | Service prediction
[12] | 2025 | ARIMA, LR, ANN, SVM, PCA, DT, LDA, QDA | Conveyor belt | - | S | - | Failure prediction
[13] | 2025 | LR | Compressor | - | S | - | Monitoring of equipment health
[14] | 2025 | RF | Current transformers | - | S | O | Fault classification
[15] | 2024 | RF, LR, ANN, DT, GB | Conveyor belt | - | S | - | Fault classification
[16] | 2024 | CNN, LSTM, ResNet | Wind turbine | - | S | - | RUL prediction
[17] | 2024 | CNN, BR, CC, LP, RAKEL D, RAKEL O, Multi-label KNN | Pump | - | M | - | Fault detection
[18] | 2024 | CNN | Solar panels | - | S | O | Diagnostic monitoring
[19] | 2024 | RF, DT, MLP | Wind turbine | - | S | O | Fault detection
[20] | 2024 | SVM, AdaBoost, Bagging, MLP | Rotating machinery | - | S | - | Fault detection
[21] | 2023 | RF, XGBoost, LightGBM, Auto DNN | Ball bearing | - | S | U | Failure classification
[22] | 2023 | ANN | Lumber machinery | - | S | - | Failure prediction
[23] | 2023 | LR | Air pressure system | - | S | O | Failure prediction
[24] | 2023 | LSTM | Aircraft engine | - | S | O | Fault diagnosis
[25] | 2023 | ResNet, CNN | Hydraulic system, generator bearing, gearbox | - | S | O | Fault diagnosis
[26] | 2022 | Autoencoder, BGRU, CNN | Aircraft | - | S | - | Rare failure prediction
[27] | 2022 | LightGBM, XGBoost, RF | Textile machinery | - | S | O | Defect classification
[28] | 2022 | TinyLSTM, DNN | Autoclave sterilizer | - | S | - | RUL prediction
[29] | 2022 | XGBoost | Pump | - | S | - | Failure classification
[30] | 2022 | RF, DT, KNN | Oil consumption system | | S | - | Fault diagnosis
[31] | 2021 | DNN, RF, SMOTE, PCA | Wind turbine | - | S | O | Failure prediction
[32] | 2021 | Self-ONN | Rotating machinery | - | S | - | Fault diagnosis
Proposed | | BHTF | Industrial machinery | - | M | O, U | Failure diagnosis
Table 2. A sample multi-label dataset.
Sample | X | Y
$S_1$ | $x_{11}, x_{12}, \ldots, x_{1m}$ | $Y_1 = \{y_1, y_3\}$
$S_2$ | $x_{21}, x_{22}, \ldots, x_{2m}$ | $Y_2 = \{y_1, y_2, y_3, y_4\}$
$S_n$ | $x_{n1}, x_{n2}, \ldots, x_{nm}$ | $Y_n = \{y_2\}$
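Table 3. Binary relevance transformation for Table 2.
$D_{y_1}$ | X | Y | $D_{y_2}$ | X | Y | $D_{y_3}$ | X | Y | $D_{y_4}$ | X | Y
$S_1$ | $[x_{11} \ldots x_{1m}]$ | $y_1$ | $S_1$ | $[x_{11} \ldots x_{1m}]$ | $\neg y_2$ | $S_1$ | $[x_{11} \ldots x_{1m}]$ | $y_3$ | $S_1$ | $[x_{11} \ldots x_{1m}]$ | $\neg y_4$
$S_2$ | $[x_{21} \ldots x_{2m}]$ | $y_1$ | $S_2$ | $[x_{21} \ldots x_{2m}]$ | $y_2$ | $S_2$ | $[x_{21} \ldots x_{2m}]$ | $y_3$ | $S_2$ | $[x_{21} \ldots x_{2m}]$ | $y_4$
$S_n$ | $[x_{n1} \ldots x_{nm}]$ | $\neg y_1$ | $S_n$ | $[x_{n1} \ldots x_{nm}]$ | $y_2$ | $S_n$ | $[x_{n1} \ldots x_{nm}]$ | $\neg y_3$ | $S_n$ | $[x_{n1} \ldots x_{nm}]$ | $\neg y_4$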
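The transformation from Table 2 to Table 3 can be sketched in a few lines of Python. This is a minimal illustration of binary relevance, not the implementation used in the study; the function name binary_relevance and the feature values are placeholders.

```python
def binary_relevance(samples, labels):
    """Split a multi-label dataset into one binary dataset per label.

    samples: list of (features, label_set) pairs
    labels:  ordered list of all label names
    Returns {label: [(features, 1 if the label applies else 0), ...]}.
    """
    return {
        label: [(x, int(label in y)) for x, y in samples]
        for label in labels
    }


# Label sets mirror Table 2; the feature values are placeholders.
samples = [
    ([0.1, 0.2], {"y1", "y3"}),
    ([0.3, 0.4], {"y1", "y2", "y3", "y4"}),
    ([0.5, 0.6], {"y2"}),
]
datasets = binary_relevance(samples, ["y1", "y2", "y3", "y4"])
for label, rows in datasets.items():
    print(label, [target for _, target in rows])
```

Each resulting dataset keeps every sample and all features; only the target column changes, so one binary classifier can be trained per failure type.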
Table 4. Differences between SMOTE and PDU across several features.
Feature | SMOTE | PDU
Type | Oversampling | Undersampling
Add samples? | Yes | No
Remove samples? | No | Yes
Which Class is Affected? | Minority class (Adds to it) | Majority class (Removes some if noisy or misclassified)
Scenario | When the minority class is underrepresented | When data has noise or overlapping classes
Uses k-Nearest Neighbors? | Yes (to generate data) | Yes (to remove misclassified points)
Risk | Overfitting if overused | Underfitting if too aggressive
Goal | Balance the dataset by adding more representative samples | Clean and balance the dataset by removing noisy or borderline samples
Sensitivity to Noise | High: may synthesize noisy or borderline instances | Low: helps eliminate noisy or ambiguous instances
Effect on Decision Boundary | Expands the decision region of the minority class | Sharpens or clarifies the decision boundary by removing overlapping samples
Computational Cost | Moderate: needs distance computations and synthetic generation | Moderate: distance computations for each minority instance
Main Technique | Feature-space interpolation | Disagreement with the classes of neighbors
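The feature-space interpolation listed for SMOTE in Table 4 can be illustrated with a short Python sketch. It follows the standard SMOTE idea of placing a synthetic point on the segment between a minority instance and one of its k nearest minority neighbors; the helper name smote_sample and the toy points are hypothetical, and this is not the filter used in the experiments.

```python
import math
import random


def smote_sample(minority, index, k=5, rng=random):
    """Create one synthetic minority point by SMOTE-style interpolation."""
    x = minority[index]
    # Nearest minority neighbors of x (excluding x itself), by Euclidean distance.
    others = [p for i, p in enumerate(minority) if i != index]
    neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # random position on the segment from x to the neighbor
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]


# Hypothetical 2-D minority points, for illustration only.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
rng = random.Random(0)
synthetic = [smote_sample(minority, i % len(minority), k=2, rng=rng) for i in range(8)]
print(len(synthetic), "synthetic points")
```

Because every synthetic point lies between two existing minority points, SMOTE expands the minority region without leaving the area already covered by that class.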
Table 5. Summary of the AI4I 2020 predictive maintenance dataset characteristics.
Dataset Type | Attribute Types | Learning Tasks | #Instances | #Variables | Missing Values | Subject Area | Release Year | View Counts
Time Series, Multivariate | Real, Boolean | Regression, Classification, Causal Discovery | 10,000 | 14 | None | Computer Science | 2020 | 77,511
Table 6. Variables of the AI4I 2020 predictive maintenance dataset.

| Variable Name | Category | Type | Description | Unit |
|---|---|---|---|---|
| UID | Identifier | Integer | Unique identifier | |
| Product ID | Identifier | Categorical | Product variant identifier | |
| Type | Feature | Categorical | Product quality level (low, medium, high) | |
| Air temperature | Feature | Continuous | Air temperature | K |
| Process temperature | Feature | Continuous | Process temperature | K |
| Rotational speed | Feature | Integer | Rotational speed | rpm |
| Torque | Feature | Continuous | Torque | Nm |
| Tool wear | Feature | Integer | Tool wear | min |
| Machine failure | Target | Boolean | Indicates any failure occurrence | |
| RNF | Target | Boolean | Random failures | |
| TWF | Target | Boolean | Tool wear failure | |
| HDF | Target | Boolean | Heat dissipation failure | |
| PWF | Target | Boolean | Power failure | |
| OSF | Target | Boolean | Overstrain failure | |

Table 7. Statistics of the continuous features in the AI4I 2020 predictive maintenance dataset.

| Variable Name | Min | Max | Mean | Standard Deviation |
|---|---|---|---|---|
| Air temperature | 295.3 | 304.5 | 300.0 | 2.000 |
| Process temperature | 305.7 | 313.8 | 310.0 | 1.484 |
| Rotational speed | 1168 | 2886 | 1538.8 | 179.284 |
| Torque | 3.8 | 76.6 | 39.9 | 9.969 |
| Tool wear | 0 | 253 | 107.9 | 63.654 |

Table 8. Performance of BHTF for each failure type using accuracy, precision, recall, and F-measure.

| Failure Type | Accuracy | Precision | Recall | F-Measure |
|---|---|---|---|---|
| TWF | 93.94% | 0.9948 | 0.9394 | 0.9663 |
| HDF | 98.12% | 0.9925 | 0.9812 | 0.9868 |
| PWF | 98.87% | 0.9942 | 0.9887 | 0.9914 |
| OSF | 98.82% | 0.9941 | 0.9882 | 0.9911 |
| Average | 97.44% | 0.9939 | 0.9744 | 0.9839 |

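
As a quick consistency check on Table 8, the sketch below recomputes each F-measure as the harmonic mean of the reported precision and recall, and recomputes the "Average" row as the unweighted mean over the four failure types. The harmonic-mean definition is the standard one and is assumed here rather than quoted from the article; the variable names are illustrative.

```python
# Values copied from Table 8: (accuracy %, precision, recall, F-measure).
rows = {
    "TWF": (93.94, 0.9948, 0.9394, 0.9663),
    "HDF": (98.12, 0.9925, 0.9812, 0.9868),
    "PWF": (98.87, 0.9942, 0.9887, 0.9914),
    "OSF": (98.82, 0.9941, 0.9882, 0.9911),
}

def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# F-measure recomputed from the reported precision and recall.
recomputed_f = {
    name: round(f_measure(p, r), 4) for name, (_, p, r, _) in rows.items()
}

# Unweighted column means, rounded like the table (2 decimals for accuracy).
columns = list(zip(*rows.values()))
averages = [round(sum(col) / len(col), d) for col, d in zip(columns, (2, 4, 4, 4))]

print(recomputed_f)
print(averages)
```

Under these assumptions the recomputed values agree with the table to the reported precision, so the "Average" row is a plain macro-average over the four labels.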
Table 9. Instance distribution for TWF before and after applying SMOTE and PDU across 10-fold cross-validation.

| Fold | Before SMOTE | After SMOTE (Before PDU) | After PDU |
|---|---|---|---|
| 1 | 8958/42 | 8958/1722 | 8827/1722 |
| 2 | 8958/42 | 8958/1722 | 8826/1722 |
| 3 | 8958/42 | 8958/1722 | 8835/1722 |
| 4 | 8958/42 | 8958/1722 | 8822/1722 |
| 5 | 8959/41 | 8959/1681 | 8839/1681 |
| 6 | 8959/41 | 8959/1681 | 8815/1681 |
| 7 | 8959/41 | 8959/1681 | 8833/1681 |
| 8 | 8959/41 | 8959/1681 | 8864/1681 |
| 9 | 8959/41 | 8959/1681 | 8831/1681 |
| 10 | 8959/41 | 8959/1681 | 8830/1681 |
| Average | 8959/41 | 8959/1697 | 8832/1697 |

Table 10. Instance distribution for HDF before and after applying SMOTE and PDU across 10-fold cross-validation.

| Fold | Before SMOTE | After SMOTE (Before PDU) | After PDU |
|---|---|---|---|
| 1 | 8896/104 | 8896/4264 | 8843/4264 |
| 2 | 8896/104 | 8896/4264 | 8851/4264 |
| 3 | 8896/104 | 8896/4264 | 8842/4264 |
| 4 | 8896/104 | 8896/4264 | 8841/4264 |
| 5 | 8896/104 | 8896/4264 | 8839/4264 |
| 6 | 8897/103 | 8897/4223 | 8838/4223 |
| 7 | 8897/103 | 8897/4223 | 8851/4223 |
| 8 | 8897/103 | 8897/4223 | 8841/4223 |
| 9 | 8897/103 | 8897/4223 | 8838/4223 |
| 10 | 8897/103 | 8897/4223 | 8842/4223 |
| Average | 8897/104 | 8897/4244 | 8843/4244 |

Table 11. Instance distribution for PWF before and after applying SMOTE and PDU across 10-fold cross-validation.

| Fold | Before SMOTE | After SMOTE (Before PDU) | After PDU |
|---|---|---|---|
| 1 | 8914/86 | 8914/3526 | 8856/3526 |
| 2 | 8914/86 | 8914/3526 | 8849/3526 |
| 3 | 8914/86 | 8914/3526 | 8851/3526 |
| 4 | 8914/86 | 8914/3526 | 8850/3526 |
| 5 | 8914/86 | 8914/3526 | 8854/3526 |
| 6 | 8915/85 | 8915/3485 | 8857/3485 |
| 7 | 8915/85 | 8915/3485 | 8865/3485 |
| 8 | 8915/85 | 8915/3485 | 8854/3485 |
| 9 | 8915/85 | 8915/3485 | 8859/3485 |
| 10 | 8915/85 | 8915/3485 | 8843/3485 |
| Average | 8915/86 | 8915/3506 | 8854/3506 |

Table 12. Instance distribution for OSF before and after applying SMOTE and PDU across 10-fold cross-validation.

| Fold | Before SMOTE | After SMOTE (Before PDU) | After PDU |
|---|---|---|---|
| 1 | 8911/89 | 8911/3649 | 8884/3649 |
| 2 | 8911/89 | 8911/3649 | 8885/3649 |
| 3 | 8912/88 | 8912/3608 | 8882/3608 |
| 4 | 8912/88 | 8912/3608 | 8879/3608 |
| 5 | 8912/88 | 8912/3608 | 8886/3608 |
| 6 | 8912/88 | 8912/3608 | 8873/3608 |
| 7 | 8912/88 | 8912/3608 | 8888/3608 |
| 8 | 8912/88 | 8912/3608 | 8887/3608 |
| 9 | 8912/88 | 8912/3608 | 8885/3608 |
| 10 | 8912/88 | 8912/3608 | 8883/3608 |
| Average | 8912/88 | 8912/3616 | 8883/3616 |

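
The minority counts in Tables 9 to 12 can be reproduced with simple arithmetic if the SMOTE setting R = 4000% listed for BHTF in Table 18 is read as "add 40 synthetic instances per original minority instance". That reading is an assumption, as is the helper name `minority_after_smote`; the before/after pairs are copied from the tables.

```python
def minority_after_smote(n_minority, rate_percent=4000):
    """Minority size after adding rate_percent % synthetic instances."""
    synthetic = n_minority * rate_percent // 100
    return n_minority + synthetic

# (minority before SMOTE, minority after SMOTE) pairs from Tables 9-12.
observed = {
    "TWF": [(42, 1722), (41, 1681)],
    "HDF": [(104, 4264), (103, 4223)],
    "PWF": [(86, 3526), (85, 3485)],
    "OSF": [(89, 3649), (88, 3608)],
}

for label, pairs in observed.items():
    for before, after in pairs:
        print(label, before, minority_after_smote(before), after)
```

Under this reading the minority class grows by a factor of 41 in every fold, while the majority count changes only in the PDU step.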
Table 17. Training times across 10 folds for each failure type label in seconds.

| Fold | TWF | HDF | PWF | OSF |
|---|---|---|---|---|
| 1 | 0.547 | 0.225 | 0.185 | 0.209 |
| 2 | 0.235 | 0.215 | 0.184 | 0.195 |
| 3 | 0.164 | 0.204 | 0.183 | 0.193 |
| 4 | 0.203 | 0.210 | 0.188 | 0.194 |
| 5 | 0.145 | 0.202 | 0.183 | 0.193 |
| 6 | 0.150 | 0.209 | 0.186 | 0.195 |
| 7 | 0.148 | 0.201 | 0.182 | 0.193 |
| 8 | 0.148 | 0.201 | 0.180 | 0.192 |
| 9 | 0.148 | 0.196 | 0.179 | 0.191 |
| 10 | 0.145 | 0.202 | 0.179 | 0.190 |
| Average | 0.203 | 0.207 | 0.183 | 0.195 |

Table 18. Comparison of BHTF with the state-of-the-art methods on the same AI4I 2020 dataset. N/A: Not Available.

| Reference | Year | Method | Training Protocol | Dataset Split | Hyperparameter Settings | Accuracy (%) | Precision | Recall | F-Measure |
|---|---|---|---|---|---|---|---|---|---|
| Chandu [48] | 2025 | GB | Feature selection; SMOTE; min–max normalization; outlier removal | Train/test (not specified ratios) | N/A | 90.00 | 0.9200 | 0.9000 | 0.8569 |
| | | MLP | | | | 61.00 | 0.7300 | 0.6100 | 0.6000 |
| | | KNN | | | | 71.68 | 0.6709 | 0.6831 | 0.6769 |
| Besha et al. [49] | 2025 | ELM + PLSCO | Optimization-driven training with metaheuristic hybrid (PLO + CSO + SSA) | 70–30% | m = 100, a = [1,1.5], c1 = [2/e,2] | 95.47 | 0.8679 | 0.8659 | 0.8669 |
| Jahani et al. [50] | 2025 | BFT + PCA (Byzantine = 0.2) | Federated learning | N/A | N/A | 89.90 | - | - | - |
| | | BFT + PCA (Byzantine = 0.4) | | | | 89.83 | - | - | - |
| | | BFT + PCA (Byzantine = 0.6) | | | | 89.00 | - | - | - |
| Araujo et al. [51] | 2025 | CART | SMOTE; categorical encoding; MRMR feature selection | Five-fold cross-validation | criterion = entropy, splitter = best, max_depth = 5, min_samples_split = 2, min_samples_leaf = 1, num_features_split = none, max_leaf_nodes = none, random_state = 42 | 82.10 | - | - | - |
| Prashanth et al. [52] | 2025 | DNN | Incremental and dynamic learning | Hold-out validation | 3 layers (64, 32, 1), ReLU, sigmoid | 84.00 | - | - | - |
| | | SVM | N/A | N/A | N/A | 89.00 | - | - | - |
| Özdemir et al. [53] | 2025 | LR | SMOTE; categorical encoding; supervised learning | N/A | N/A | 88.00 | 0.4200 | 0.6100 | 0.5000 |
| | | RF | | | | 94.00 | 0.4500 | 0.6800 | 0.5400 |
| | | XGBoost | | | | 97.00 | 0.4700 | 0.7400 | 0.5800 |
| Kumar1 et al. [54] | 2025 | KNN | Nearest-neighbor voting | Train/validation/test (not specified ratios) | k = 1, Euclidean distance | 94.00 | - | - | 0.9400 |
| | | SVM | Kernel-based supervised learning | | C = 100, gamma = 1, kernel = RBF | 95.00 | - | - | 0.9500 |
| | | RF | Ensemble learning | | Max depth = 10, number of trees = 500 | 96.00 | - | - | 0.9600 |
| | | XGBoost | Gradient boosting | | learning rate = 0.1, max depth = 5, n_estimators = 500 | 97.00 | - | - | 0.9700 |
| Misaii et al. [55] | 2024 | LSTM | Sequential deep learning; SMOTE; binary cross-entropy loss function | 80–20% | N/A | 80.00 | 0.96 | 0.83 | 0.89 |
| Presciuttini et al. [56] | 2024 | RF + XAI (SHAP, LIME, counterfactual) | Supervised learning | 80–20% | number of trees = 100, random_state = 42 | 95.00 | - | - | - |
| Hung et al. [57] | 2024 | DNN+Adam SingleHL Model I | Models trained for 100 epochs; batch size 400; single- and double-hidden-layer architectures | 90–10% | number of neurons per hidden layer = 100, activation function (hidden layers) = ReLU, activation function (output layer) = Softmax, output classes = 6, input neurons = 5 | 93.58 | - | 0.9400 | 0.9300 |
| | | DNN+Adam SingleHL Model II | | | | 95.37 | - | 0.9600 | 0.9600 |
| | | DNN+SA DoubleHL Model III | | | | 96.54 | - | 0.9500 | 0.9500 |
| | | DNN+SA DoubleHL Model IV | | | | 97.09 | - | 0.9700 | 0.9700 |
| Liu and Su [58] | 2024 | CAST | Early stopping if validation loss does not improve for 3 iterations | Five-fold cross-validation | Epochs = 100, batch size = 64, learning rate = 0.001, hidden size = 512, optimizer = AdamW | - | - | - | 0.8800 |
| | | SE-ResNet 18 | | | | - | - | - | 0.8400 |
| | | GE-ResNet 18 | | | | - | - | - | 0.8100 |
| | | SE-SCNet 18 | | | | - | - | - | 0.8600 |
| Ghadekar et al. [59] | 2024 | XGBoost | SMOTE | N/A | N/A | 96.00 | 0.9800 | 0.9600 | 0.9690 |
| | | RF | | | | 95.50 | 0.9760 | 0.9550 | 0.9640 |
| | | LOF | | | | 91.70 | 0.9510 | 0.9170 | 0.9330 |
| | | One-class SVM | | | | 91.20 | 0.9530 | 0.9120 | 0.9300 |
| Kong et al. [60] | 2023 | DFPAIS | Iterative data filling | N/A | N/A | 83.74 | - | - | - |
| | | SDFIS | Simplified data filling | | | 82.03 | - | - | - |
| Souza and Lughofer [61] | 2023 | EFNC-Exp | Sequential stream-based updating; fuzzification; expert rules | 70–30% | γ for DA plane (single hyperparameter) | 97.30 | - | - | - |
| | | SODA | Incremental clustering; dynamically updating clouds and feature weights | | No separate hyperparameters beyond γ | 96.80 | - | - | - |
| Chen et al. [62] | 2022 | CatBoost | Ordered boosting and gradient descent | Three-fold cross-validation | Hyperparameters optimized using Optuna | 64.23 | - | 0.2868 | - |
| | | SMOTENC + CatBoost | SMOTE | | | 88.09 | - | 0.7881 | - |
| | | ctGAN + CatBoost | Data normalization; learning distribution; oversampling | | | 87.08 | - | 0.8305 | - |
| | | SMOTENC + ctGAN + CatBoost | SMOTE, GAN | | | 88.83 | - | 0.9068 | - |
| Vandereycken and Voorhaar [63] | 2022 | XGBoost | Training with different TT initializations | 70% train/15% validation/15% test | Optimized via validation set | 95.74 | - | - | - |
| | | RF | | | | 95.10 | - | - | - |
| | | TTML + XGBoost | | | | 77.00 | - | - | - |
| | | TTML + RF | | | | 78.00 | - | - | - |
| | | TTML + MLP 1 | | | | 76.20 | - | - | - |
| | | TTML + MLP 2 | | | | 65.00 | - | - | - |
| Falla and Ortega [64] | 2022 | RF | Supervised learning, oversampling | 70–30% | Sklearn default parameters; random seed = 42 | 96.81 | 0.9740 | 0.7639 | 0.8563 |
| | | Neural Networks | | | Hidden layers = 10, max iterations = 500, penalty = 0.001, random seed = 21, other defaults | 91.50 | 0.9166 | 0.8611 | 0.8880 |
| Iantovics and Enachescu [65] | 2022 | BLR | Mathematical modeling for data quality assessment | N/A | Standard β coefficients | 97.10 | 0.9950 | 0.2830 | 0.4407 |
| Vuttipittayamongkol and Arreeras [66] | 2022 | SVM | Standard supervised learning | 70–30% | Default caret parameters | - | 0.7229 | 0.5941 | 0.6522 |
| | | DT | | | | - | 0.8391 | 0.7228 | 0.7766 |
| | | KNN | | | | - | 0.8108 | 0.2970 | 0.4348 |
| | | RF | | | | - | 0.8267 | 0.6139 | 0.7045 |
| | | NN | | | | - | 0.7333 | 0.2178 | 0.3359 |
| Mota et al. [67] | 2022 | GB, SVM | Batch training; preprocessing with data aggregation, min–max normalization, imputation, feature engineering, oversampling, and undersampling | 80–20% | Automatic hyperparameter tuning using five-fold cross-validation | 94.55 | - | 0.9200 | - |
| Diao et al. [68] | 2021 | Constructing Hyper-Planes | Unsupervised learning; mean-shift and min–max normalization | N/A | H = TL (set of hyper-planes); δ determined automatically | - | - | - | 0.6200 |
| Torcianti and Matzka [69] | 2021 | RUSBoost Trees | Decision tree-based learning | 20 dataset points | Kernel width σ = 0.05, iteratively decreased by 0.01 down to 0.01; feature importance threshold ≥ 25% | 92.74 | 0.3071 | 0.9085 | 0.4590 |
| Pastorino and Biswas [70] | 2021 | Data-Blind Machine Learning | Simple NN training for tabular datasets, CNN training for MNIST | Train/test (not specified ratios) | CTGAN settings for generative model; no special tuning for MNIST | 97.30 | - | - | - |
| Average | | | | | | 88.94 | 0.7845 | 0.7492 | 0.7793 |
| Proposed Method | | Balanced Hoeffding Tree Forest (BHTF) | Multi-label learning (MLL); incremental learning (IL); ensemble learning (EL); oversampling; undersampling | 10-fold cross-validation | SMOTE: k = 5, R = 4000%; PDU: u = 7; numIterations = 10 | 97.44 | 0.9939 | 0.9744 | 0.9839 |

Table 19. Overview of selected multi-label TCM datasets.

| Dataset | Observations | Anomalies | Features | Anomaly Types | Products | Data Drift |
|---|---|---|---|---|---|---|
| tcm5_dataset_3 | 20003 | 981 | 51 | 4 (16) | 4 | False |
| tcm5_dataset_4 | 20001 | 925 | 51 | 4 (16) | 20 | False |
| tcm5_dataset_5 | 20005 | 1031 | 51 | 4 (16) | 5 | True |
| tcm5_dataset_6 | 20008 | 954 | 51 | 4 (16) | 25 | True |

Table 20. Variables of selected TCM datasets.

| Variable Name | Category | Type | Description | Unit |
|---|---|---|---|---|
| thickness_entry | Feature | Continuous | Steel entry thickness before rolling | mm |
| thickness_exit | Feature | Continuous | Steel exit thickness after rolling | mm |
| width | Feature | Continuous | Steel width | mm |
| ys_entry | Feature | Continuous | Steel yield strength at entry | MPa |
| ys_exit | Feature | Continuous | Steel yield strength at exit | MPa |
| work_roll_diam | Feature | Continuous | Work roll diameter for stands 1–5 | mm |
| work_roll_mileage | Feature | Continuous | Work roll mileage for stands 1–5 | km |
| reduction | Feature | Continuous | Thickness reduction per stand (1–5) | |
| tension | Feature | Continuous | Interstand tension (0: before stand 1, 1–5: after stands 1–5) | N |
| roll_speed | Feature | Continuous | Linear work roll speed for stands 1–5 | NaN |
| force | Feature | Continuous | Rolling force for stands 1–5 | N |
| torque | Feature | Continuous | Rolling torque for stands 1–5 | Nm |
| gap | Feature | Continuous | Stand gap for stands 1–5 | mm |
| motor_power | Feature | Continuous | Electric motor power for stands 1–5 | kW |
| Anomaly_Reduction | Target | Boolean | Label for anomaly in reduction scheme | |
| Anomaly_Electric | Target | Boolean | Label for anomaly in electric motor per stand | |
| Anomaly_Bearing | Target | Boolean | Label for anomaly in stand bearings | |
| Anomaly_WorkRoll | Target | Boolean | Label for anomaly in work roll friction per stand | |

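
Table 20 lists variable families rather than individual columns, so it may help to see how they could expand into a flat feature vector. The sketch below assumes that the first five variables are single values per observation, that each "stands 1–5" variable contributes one column per stand, and that tension contributes six columns (positions 0 to 5). The suffixed column names are hypothetical and are not taken from the dataset files.

```python
# Variables described in Table 20 without any per-stand qualifier.
scalar_features = ["thickness_entry", "thickness_exit", "width", "ys_entry", "ys_exit"]

# Variables described as "for stands 1-5" or "per stand (1-5)".
per_stand_features = [
    "work_roll_diam", "work_roll_mileage", "reduction", "roll_speed",
    "force", "torque", "gap", "motor_power",
]

columns = list(scalar_features)
for name in per_stand_features:
    # Hypothetical naming scheme: <variable>_<stand number>.
    columns += [f"{name}_{stand}" for stand in range(1, 6)]

# Tension is described for position 0 (before stand 1) and after stands 1-5.
columns += [f"tension_{pos}" for pos in range(0, 6)]

print(len(columns))
```

Under these assumptions the expansion yields 5 + 8 × 5 + 6 = 51 feature columns per observation.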
Table 21. Accuracy results of the proposed BHTF method on TCM benchmark datasets.

| Label | Failure Type | tcm5_dataset_3 | tcm5_dataset_4 | tcm5_dataset_5 | tcm5_dataset_6 |
|---|---|---|---|---|---|
| 1 | Anomaly Reduction | 93.24 | 90.81 | 98.05 | 86.15 |
| 2 | Anomaly Electric 1 | 97.63 | 99.55 | 97.93 | 95.96 |
| 3 | Anomaly Bearing 1 | 99.66 | 99.46 | 99.75 | 98.08 |
| 4 | Anomaly WorkRoll 1 | 98.50 | 97.41 | 99.34 | 95.94 |
| 5 | Anomaly Electric 2 | 96.48 | 99.07 | 97.22 | 99.14 |
| 6 | Anomaly Bearing 2 | 99.28 | 98.04 | 99.86 | 99.62 |
| 7 | Anomaly WorkRoll 2 | 98.29 | 96.69 | 97.93 | 95.14 |
| 8 | Anomaly Electric 3 | 99.32 | 99.18 | 98.32 | 98.21 |
| 9 | Anomaly Bearing 3 | 99.61 | 99.75 | 99.69 | 96.33 |
| 10 | Anomaly WorkRoll 3 | 99.08 | 94.83 | 95.52 | 96.35 |
| 11 | Anomaly Electric 4 | 99.95 | 99.84 | 98.27 | 97.90 |
| 12 | Anomaly Bearing 4 | 99.73 | 99.35 | 99.73 | 99.36 |
| 13 | Anomaly WorkRoll 4 | 96.80 | 97.58 | 95.66 | 93.80 |
| 14 | Anomaly Electric 5 | 99.89 | 99.80 | 99.75 | 98.66 |
| 15 | Anomaly Bearing 5 | 99.26 | 98.52 | 99.61 | 99.72 |
| 16 | Anomaly WorkRoll 5 | 98.84 | 94.95 | 96.73 | 92.10 |
| Average | | 98.47 | 97.80 | 98.34 | 96.40 |

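
The 16 rows of Table 21 follow a regular pattern: one mill-wide reduction label, then an electric, bearing, and work-roll label for each of the five stands. A short sketch that regenerates the label order is given below; the list-building code is illustrative, while the label strings are the ones printed in the table.

```python
# One mill-wide label, then three anomaly kinds for each of the five stands.
labels = ["Anomaly Reduction"]
for stand in range(1, 6):
    for kind in ("Electric", "Bearing", "WorkRoll"):
        labels.append(f"Anomaly {kind} {stand}")

for number, name in enumerate(labels, start=1):
    print(number, name)
```

Under binary relevance each of these 16 labels would be handled as a separate binary task, which is why the table reports one accuracy value per label and dataset.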
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Ghasemkhani, B.; Kut, R.A.; Birant, D.; Yilmaz, R. Balanced Hoeffding Tree Forest (BHTF): A Novel Multi-Label Classification with Oversampling and Undersampling Techniques for Failure Mode Diagnosis in Predictive Maintenance. Mathematics 2025, 13, 3019. https://doi.org/10.3390/math13183019
