Balanced Hoeffding Tree Forest (BHTF): A Novel Multi-Label Classification with Oversampling and Undersampling Techniques for Failure Mode Diagnosis in Predictive Maintenance

Ghasemkhani, Bita; Kut, Recep Alp; Birant, Derya; Yilmaz, Reyat

doi:10.3390/math13183019


¹ Graduate School of Natural and Applied Sciences, Dokuz Eylul University, Izmir 35390, Turkey

² Department of Computer Engineering, Dokuz Eylul University, Izmir 35390, Turkey

³ Department of Electrical and Electronics Engineering, Dokuz Eylul University, Izmir 35390, Turkey

* Author to whom correspondence should be addressed.

Mathematics 2025, 13(18), 3019; https://doi.org/10.3390/math13183019

Submission received: 5 August 2025 / Revised: 4 September 2025 / Accepted: 9 September 2025 / Published: 18 September 2025

(This article belongs to the Special Issue Artificial Intelligence for Fault Detection in Manufacturing)


Abstract

Predictive maintenance (PdM) is essential for reducing equipment downtime and enhancing operational efficiency. However, PdM datasets frequently suffer from significant class imbalance and are often limited to single-label classification, which fails to reflect the complexity of real-world industrial systems where multiple failure modes can occur simultaneously. As the main contribution, we propose the Balanced Hoeffding Tree Forest (BHTF), a novel multi-label classification framework that combines oversampling and undersampling strategies to effectively mitigate data imbalance. BHTF leverages the binary relevance method to decompose the multi-label problem into multiple binary tasks and utilizes an ensemble of Hoeffding Trees to ensure scalability and adaptability to streaming data. In particular, BHTF unifies three learning paradigms, namely multi-label learning (MLL), ensemble learning (EL), and incremental learning (IL), providing a comprehensive and scalable approach for predictive maintenance applications. The key contribution of the proposed method is that it incorporates a hybrid data preprocessing strategy, introducing a novel undersampling technique, named Proximity-Driven Undersampling (PDU), and combining it with the Synthetic Minority Oversampling Technique (SMOTE) to effectively deal with the class imbalance issue in highly skewed datasets. Experimental results on the benchmark AI4I 2020 dataset showed that BHTF achieved an average classification accuracy of 97.44%, outperforming the state-of-the-art methods (88.94%) with an improvement of 11% on average. These findings highlight the potential of BHTF as a robust artificial intelligence-based solution for complex fault detection in manufacturing predictive maintenance applications.

Keywords: machine learning; predictive maintenance; multi-label classification; ensemble learning; incremental learning; data imbalance; fault detection

MSC: 68T01

1. Introduction

Predictive maintenance (PdM) has emerged as a critical strategy in modern industrial systems to ensure operational reliability, minimize unplanned downtime, and reduce maintenance costs. Unlike traditional maintenance approaches—either reactive (performed after a failure) or preventive (conducted at scheduled intervals)—PdM adopts a proactive approach by analyzing real-time operational data to detect early signs of equipment degradation. This enables maintenance interventions to be scheduled precisely when needed, thereby extending equipment lifespan and refining overall productivity. The increasing availability of sensor data, coupled with advancements in artificial intelligence (AI), has considerably enriched the applicability of PdM solutions across various industrial domains, particularly in fault prediction [1].

Modern predictive maintenance frameworks employ data-driven methods to continuously monitor machinery conditions, detect operational anomalies, and make informed, real-time decisions regarding maintenance scheduling. These systems process vast amounts of high-frequency sensor data embedded within industrial equipment, uncovering patterns and insights that are often imperceptible to human operators. A typical PdM workflow encompasses several critical stages: data acquisition through IoT-enabled sensors, preprocessing to clean and transform raw signals, machine learning (ML)-driven fault detection (FD) and prognosis to detect current failures and predict future breakdowns, and maintenance planning based on predictive analytics [2]. By seamlessly integrating these components, PdM solutions improve equipment reliability, minimize downtime, and enable proactive maintenance in complex industrial environments [3].

Despite notable advances in predictive maintenance, current PdM models predominantly rely on single-label classification frameworks, where each data instance is assigned only one failure mode or a binary label such as failure versus no failure. This simplification ignores the complexity of real-world industrial environments, where multiple failure modes often occur simultaneously—such as concurrent tool wear and thermal faults—making single-label models insufficient for capturing the true condition of equipment. Moreover, PdM datasets typically exhibit severe data imbalance, with failure events being rare compared to normal operation, which can lead to biased models and reduced diagnostic accuracy, in a mathematical sense [4,5]. These challenges underscore the need for advanced multi-label classification methods that can handle class imbalance and support robust predictive maintenance. Our study focuses on filling these gaps by introducing a novel multi-label classification method with both oversampling and undersampling techniques.

Unlike traditional single-label methods, multi-label classification models enable each data instance to be associated with multiple categories simultaneously [6]. Multi-label learning (MLL) is the machine learning methodology designed for such cases, where an observation may belong to more than one class at the same time, making it particularly suitable for complex domains like predictive maintenance. In our study, this capability allows for more nuanced modeling, captures interdependencies among failure modes, and supports targeted maintenance strategies. For instance, a machine component might simultaneously exhibit symptoms of both overheating and vibration-related wear—treating these as separate but co-occurring failure modes allows maintenance teams to apply a more accurate and efficient intervention. Incorporating multi-label learning into predictive maintenance frameworks can significantly boost their capacity to respond to the intricate failure patterns commonly encountered in modern industrial environments.

Although multi-label classification can provide a more accurate and comprehensive framework for failure diagnosis in predictive maintenance, its efficacy can be hindered by the prevalent issue of data imbalance. In industrial manufacturing datasets, certain failure modes appear far less frequently than others, resulting in biased learning, where rare but critical failure types may be overlooked. This problem becomes even more pronounced in multi-label scenarios, where some combinations of various classes are scarcely represented. To mitigate these challenges, we employed data-level balancing strategies—namely oversampling and undersampling. The oversampling technique increases the presence of minority labels by generating synthetic or duplicated instances, thereby increasing the model’s sensitivity to rare conditions. Conversely, undersampling reduces the dominance of majority classes by selectively removing redundant examples, contributing to preventing overfitting. When integrated appropriately, these artificial intelligence-powered techniques can significantly improve the performance of multi-label models, leading to more balanced and reliable fault detection in predictive maintenance applications.

To bridge the gap between the complexity of real-world failure scenarios and the limitations of current PdM approaches, this study introduces a novel method called the Balanced Hoeffding Tree Forest (BHTF). Tailored for multi-label learning in predictive maintenance, BHTF builds on the Hoeffding Tree algorithm [7], a fast, incremental learning (IL)-based decision tree that is well-suited for high-volume data because it continuously updates the model as new data streams in, without requiring complete retraining. BHTF extends this algorithm into an ensemble learning (EL)-based framework, where multiple classifiers are combined to improve stability, robustness, and predictive accuracy. To model multiple co-occurring failure modes, BHTF applies the binary relevance strategy, decomposing the multi-label problem into a set of independent binary classification tasks. This decomposition allows the ensemble to learn each failure type separately, while still capturing their potential co-occurrence patterns, thus improving both interpretability and diagnostic detail. Another key innovation of BHTF lies in its integrated handling of imbalanced classes, a common challenge in PdM datasets, where certain failure types are significantly underrepresented. By incorporating both oversampling and a novel undersampling technique at the data preprocessing stage, BHTF achieves a more balanced class distribution, improving model performance without introducing excessive noise or overfitting. BHTF was evaluated on the AI4I 2020 dataset, which includes the co-occurrence of four industrially critical failure types, namely tool wear failure (TWF), heat dissipation failure (HDF), power failure (PWF), and overstrain failure (OSF), demonstrating consistent results across all categories.

The main contributions of this study are as follows:

(i) Integration of three learning paradigms: The proposed Balanced Hoeffding Tree Forest (BHTF) uniquely combines Multi-Label Learning (MLL), Incremental Learning (IL), and Ensemble Learning (EL) within a single framework. This integration allows BHTF to simultaneously handle multiple co-occurring failure modes, continuously update with streaming data, and leverage ensemble strategies for robust predictive performance.
(ii) Introduction of BHTF for predictive maintenance: BHTF is a novel artificial intelligence-based method that applies both oversampling and undersampling techniques for multi-label classification in manufacturing environments, addressing challenges of data imbalance and real-world complexity for the first time.
(iii) Multi-label failure mode diagnosis: BHTF predicts multiple failure types simultaneously using the binary relevance strategy, enabling detection of co-occurrence patterns and providing more detailed diagnostic insights for targeted maintenance actions.
(iv) Hybrid class balancing: The method incorporates a hybrid data preprocessing strategy by proposing a novel undersampling technique, named Proximity-Driven Undersampling (PDU), and combining it with the Synthetic Minority Oversampling Technique (SMOTE), effectively mitigating class imbalance in highly skewed datasets.
(v) Outperformance of existing methods: BHTF achieved an average accuracy of 97.44% in simultaneously predicting failure modes, an 11% improvement over state-of-the-art approaches. This result underscores its high potential for deployment in industrial predictive maintenance systems, particularly within manufacturing sectors.

The remainder of this article is organized as follows. Section 2 reviews related work on predictive maintenance that uses machine learning methods. Section 3 describes the proposed BHTF method in detail, including the model architecture and sampling strategies. Section 4 presents the experimental setup, dataset characteristics, evaluation metrics, and implementation details. Section 5 gives the results, and Section 6 compares the performance of BHTF with existing state-of-the-art methods. Finally, the last section concludes this study and outlines potential directions for future research.

2. Related Works

Based on the scope of this study, Table 1 presents a collection of representative predictive maintenance works from the past five years [8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32], focusing on various tasks, methods, and applications. The table includes key aspects of each study, organized into the following columns: machine, C, R, label, sampling, and purpose. The machine column indicates the type of equipment or component (e.g., wind turbine, conveyor belt, and bearing) to which the PdM methodology was applied for providing insight into the domain diversity. The C and R columns specify whether the addressed task is a classification or regression problem, respectively. The label column distinguishes between single-label (S) and multi-label (M) prediction tasks. The sampling column describes the data balancing techniques employed, namely oversampling (O) or undersampling (U), indicating whether the studies explicitly deal with class imbalance. The purpose column summarizes the specific goal of each PdM study, including failure prediction, fault detection, or anomaly detection. In addition, these PdM studies have been validated using various evaluation measures such as accuracy, precision, recall, F-measure, mean absolute error (MAE), root mean squared error (RMSE), and others. This structured overview enables a comparative understanding of recent trends and challenges in PdM research.

The methods employed span a wide spectrum of traditional and advanced artificial intelligence algorithms. Ensemble learning techniques like random forest (RF), extreme gradient boosting (XGBoost), LightGBM, and AdaBoost are frequently used [9,10,14,15,19,20,27]. Deep learning models, including convolutional neural networks (CNN), long short-term memory (LSTM), Autoencoders, and ResNet, appear in recent works [16,24,25,26,28,31] to show a shift toward learning complex temporal or image-based signals. In addition, classical models such as logistic regression (LR), support vector machine (SVM), K-nearest neighbors (KNN), decision tree (DT), and naive Bayes (NB) remain widely used across various PdM datasets [10,13,22,23].

The reviewed works apply PdM to a wide variety of machine types to indicate the cross-domain applicability of predictive maintenance. For instance, PdM was applied to vehicle components [8,10,11], conveyor belts [12,15], wind turbines [16,19,31], bearings [21,25], aircraft systems [9,26], and industrial machinery like rotors, gearboxes, electric motors, and pumps [17,20,27,29,30]. Some works focus on current transformers [14], compressors [13], and lumber machines [22] to reveal growing interest in applying PdM to smart and connected environments.

The studies reviewed span a wide range of task types. A significant number of works addressed classification tasks [8,10,11,12,13,14,15,17,18,19,20,21,22,23,24,25,26,27,29,30,31,32], aiming to detect, diagnose, or classify various fault types or failure modes. In contrast, regression tasks [16,28,30] focused on estimating continuous outcomes such as the remaining useful life (RUL) or time deviation. Some studies, such as [9,30], incorporated both classification and regression objectives simultaneously.

Regarding sampling strategies, some studies explicitly addressed data imbalance. Techniques such as SMOTE [8,14,18,24,25,27,31] and ADASYN [19] were used for oversampling minority class instances, while a few works applied undersampling [10,21], typically through random selection. Some works instead employed cost-sensitive learning (e.g., class weighting in [11]) or tackled imbalance through algorithm-level adjustments.

The purposes pursued across these studies vary, yet several key categories emerge. Failure prediction is the most prevalent objective [8,12,22,31], where the models forecast whether a failure will happen in the near future. Others focus on failure or fault detection and classification [10,14,15,17,19,20,21,25,27,30,32], where the goal is to identify the specific type or cause of failure. Anomaly detection [26] appears in cases where failures are rare and abnormal behavior is learned indirectly. Some papers discussed diagnostic monitoring [18] or RUL estimation [9,16,28], which are particularly relevant for long-term asset condition forecasting.

In terms of performance evaluation, the studies adopted a range of classification and regression metrics. For classification, accuracy, precision, recall, F-measure, AUC-ROC, TPR, and FPR are the most commonly reported. Some works also employed confusion matrices (CM), Hamming loss (HL), and the Matthews correlation coefficient (MCC) [17,19,21,27]. For regression, studies used MAE, RMSE, MSE, and R² [16,28,30]. These diverse metrics reflect different emphases on precision, robustness, or time-based performance, depending on the application domain.

Overall, the reviewed literature demonstrates a growing diversity of predictive maintenance methods, broader application domains, and increasing attention to practical challenges such as data imbalance. Nevertheless, regarding label structure, previous studies commonly formulated PdM as a single-label problem, predicting one failure or health state at a time. However, in this study, we explored multi-label learning, reflecting cases where multiple failure types can co-occur or where the system needs to predict several outputs in parallel. Only a few studies have simultaneously tackled the dual challenges of multi-failure diagnosis and severe class imbalance using a combination of both oversampling and undersampling techniques. In contrast, our proposed BHTF framework directly addresses these limitations by coupling multi-label classification with a hybrid resampling strategy—integrating PDU and SMOTE. This approach boosts learning from imbalanced data while preserving failure diversity, making BHTF particularly effective for diagnosing complex failure modes in industrial manufacturing machinery. These priorities differentiate our method and establish the foundation for the contributions detailed in the following sections.

An additional key design choice in our work is the adoption of the Hoeffding Tree (HT) [7] classifier as the base learner in the proposed ensemble. HTs are specifically designed for high-speed data streams and support instance-wise, incremental learning in constant time, making them highly suitable for real-time, artificial intelligence-based predictive maintenance scenarios. A major advantage of HTs is their ability to handle uncertainty in learning time by offering a fixed computational cost per instance, while producing decision trees that closely approximate those built by conventional batch learners. This makes HTs exceptionally efficient for mining continuous or large-scale industrial data.
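The paragraph above does not spell out how a Hoeffding Tree decides that it has seen enough instances to split a node. As background, the standard algorithm [7] relies on the Hoeffding bound: after n observations of a quantity with range R, the observed mean is within ε = sqrt(R² ln(1/δ) / (2n)) of the true mean with probability 1 − δ. The sketch below is illustrative only; the function names and the example values (δ = 1e-7, R = 1, and the gain figures) are assumptions and are not taken from this study.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon = sqrt(R^2 * ln(1/delta) / (2 * n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Split when the best attribute beats the runner-up by more than epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Illustrative values only (assumed, not from the paper).
few = should_split(0.30, 0.05, value_range=1.0, delta=1e-7, n=50)
many = should_split(0.30, 0.05, value_range=1.0, delta=1e-7, n=200)
print(few, many)  # False True
```

Because ε shrinks as n grows, the same gain difference that is inconclusive after 50 instances becomes sufficient after 200, which is why each instance can be processed once and then discarded.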

The effectiveness of the HT classifier has been demonstrated in various studies [33,34,35,36,37,38,39,40,41,42]. A prior study [33] showed that HT outperformed traditional classifiers such as NB, MLP, LR, logistic model trees (LMT), and sequential minimal optimization, owing to its strong generalization ability. It obtained higher accuracy rates even than ensemble methods like AdaBoost and Random Forests in various classification tasks [33,34,35,36,37]. It also achieved better results than alternative approaches, including Random Tree, Reduced Error Pruning Tree, Decision Stump, J48, RF, and LMT [36]. Similarly, HT surpassed a range of algorithms (J48, LR, RF, SVM, PART, K-Star, and OneR) in diabetes detection tasks [37]. In [38], a Hoeffding Tree-based model outperformed neural networks, decision trees, SVM, KNN, and ensemble methods in detecting network anomalies. Furthermore, in oil-well drilling applications, HT demonstrated superior predictive accuracy and adaptability to concept drift compared to XGBoost for rate of penetration (ROP) prediction [39]. In [40], HT obtained the highest AUC value on the validation dataset (0.802), whereas the LMT and Bayesian Network models produced lower values (0.761 and 0.764, respectively). Likewise, HT achieved superior accuracy compared to models like Logistic Regression in the detection of security attacks for IoT devices [41]. In [42], the Hoeffding Tree was reported to perform best among a series of counterparts, such as J48, DT, RF, NB, Bayesian Network, and KNN.

Further, a critical review of the literature highlights that while prior studies have leveraged multi-label learning, incremental learning, or ensemble learning individually, none have integrated all three paradigms concurrently in predictive maintenance tasks. For instance, one of the few attempts at multi-label learning in PdM is represented in [17], which employs BR, CC, LP, and multi-label KNN to capture co-occurring fault types. Several works have utilized algorithms with incremental learning capabilities to adapt to streaming or sequential data, for example, LSTM [8,9,16,24], DT [10,11,12,15,19,30], NB [10], KNN [30], LR [12,13,15,23], and BGRU [26], yet these studies remained limited to single-label fault prediction. On the other hand, studies have used ensemble learning methods, such as RF [9,10,11,14,15,19,21,27,30,31], boosting techniques, including AdaBoost, XGBoost, CatBoost, LightGBM, and GB [9,10,11,15,20,21,27,29], and Bagging [20], to improve classification robustness, but again within a single-label context. Importantly, none of these prior works integrates all three paradigms simultaneously.

In contrast, the proposed BHTF method uniquely combines multi-label learning, incremental learning, and ensemble learning in a single framework, enabling multi-label fault diagnosis in dynamic data streams while effectively handling imbalance through hybrid sampling. By extending Hoeffding Trees into an ensemble and applying both oversampling and undersampling techniques (SMOTE + PDU), the BHTF method enhances predictive performance while preserving the advantages of incremental learning, offering a more comprehensive solution than existing approaches that rely on only one or two paradigms.

3. Materials and Methods

3.1. Proposed Method

The overall architecture of the proposed Balanced Hoeffding Tree Forest (BHTF) method is illustrated in Figure 1. The model is designed to address classification challenges in the domain of predictive maintenance, where different types of machinery failures can occur simultaneously. The framework consists of several interconnected stages, from data preparation to model training and evaluation, as described below:

Data collection: The predictive maintenance dataset usually contains sensor-based data collected from industrial machinery operating under various conditions. The dataset can include input features such as temperature, rotational speed, and torque related to machinery to reflect real-time machine behavior. These data are typically gathered in manufacturing environments, where operational precision is critical.
Multi-label dataset construction: The dataset includes several target variables, each representing a different failure type that may occur concurrently. This setup naturally forms a multi-label learning problem, where each machine instance can be associated with multiple failure types. To address this, the classification task is modeled using the binary relevance (BR) [43] strategy, which decomposes the multi-label problem into independent binary classification tasks—one for each failure. For each label, the value is 1 if the corresponding failure occurs, and 0 otherwise.
Data preprocessing: To improve data quality and prepare it for modeling, the following preprocessing steps can be performed:
– Cleaning: It involves detecting and removing errors, duplicates, and inconsistencies in the data to ensure that the model is trained on high-quality and reliable input. In addition, it also includes the removal of unique identifiers since they do not contribute predictive value. Some data cleaning techniques can also be applied to handle any missing values in the features.
– Feature selection: The redundant or irrelevant features were removed to reduce overfitting, improve accuracy, and decrease computational cost.
– Oversampling and undersampling (hybrid resampling): Due to the inherent data imbalance in the PdM dataset, where failure cases are considerably underrepresented compared to the healthy class, a hybrid resampling strategy was employed after identifying the minority and majority classes dynamically based on their frequencies.
  ○ First, the synthetic minority over-sampling technique (SMOTE) [44] was applied to generate synthetic samples for minority failure classes. It is a widely used method for mitigating class imbalance by generating synthetic samples for minority classes rather than merely duplicating existing ones. Unlike basic oversampling methods such as random oversampling, which risk overfitting by repeating identical instances, SMOTE generates diverse new samples through interpolation between similar minority class instances in the feature space.
  ○ Then, our proposed PDU method was used to reduce the number of healthy (non-failure) instances, resulting in a more balanced and learnable dataset. These steps are essential to prevent model bias toward the majority class and enhance the model’s ability to identify rare failure types, thereby contributing to more effective fault detection.
Label-wise dataset separation: Following data balancing, the multi-label dataset was decomposed into multiple binary datasets, one for each of the four selected failure types. Each dataset contains the same feature set but is independently labeled according to whether the corresponding failure occurred. This decomposition aligns with the binary relevance framework and enables independent model training for each failure mode.
Model training (Hoeffding Tree forest construction): Each balanced dataset was used to train several Hoeffding Trees, which are well-suited for efficiently processing large-scale scenarios. Hoeffding Trees inherently support incremental learning, allowing the model to adapt continuously to streaming or sequential data without retraining from scratch. The result is a collection of Hoeffding Tree models, each focused on detecting a specific failure mode. This process leverages the speed, adaptability, and online learning capabilities of lightweight artificial intelligence algorithms in dynamic industrial environments.
Model aggregation (BHTF): The individually trained Hoeffding Trees for each failure mode were combined to form BHTF. This ensemble learning structure utilizes the efficiency and scalability of Hoeffding Trees while enabling multi-label predictions across multiple failure types simultaneously. Although each tree is trained independently, the ensemble facilitates an integrated diagnosis of probable co-occurring failures within a single inference step.
Prediction and evaluation: The BHTF model is applied to new, unseen data to predict possible failure types. The performance of the model is evaluated using standard classification metrics, including accuracy, precision, recall, F-measure and confusion matrix, measured for each label as well as overall. This mathematical evaluation framework ensures a comprehensive understanding of the model’s ability to correctly diagnose failure modes in manufacturing systems.
Presentation: It involves effective visualization of model outputs and integration with business rules. It includes decision making, where predictions inform or directly drive actions—ranging from automated responses to human-guided choices.
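
The interpolation idea behind the SMOTE step in the preprocessing stage above can be illustrated with a minimal sketch. It is written in plain Python and is not the implementation used in this study; the function names (`euclidean`, `smote_sample`), the toy points, and the choice k = 2 are assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two numeric feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def smote_sample(minority: List[Point], i: int, k: int,
                 rng: random.Random) -> Point:
    """Create one synthetic point from minority[i] and one of its k nearest
    minority neighbours by linear interpolation."""
    x_i = minority[i]
    others = [x for j, x in enumerate(minority) if j != i]
    neighbours = sorted(others, key=lambda x: euclidean(x_i, x))[:k]
    x_nn = rng.choice(neighbours)
    delta = rng.random()  # uniform in [0, 1)
    return tuple(a + delta * (b - a) for a, b in zip(x_i, x_nn))


# Toy minority class (assumed data): three close points and one far point.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
synthetic = smote_sample(minority, i=0, k=2, rng=random.Random(0))
print(synthetic)
```

With k = 2, the far point (10, 10) is never selected as a neighbour, so the synthetic sample always lies on the segment from (0, 0) to either (1, 0) or (0, 1).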

3.2. Multi-Label Learning

Given that different failure types in industrial machinery can occur simultaneously, the classification task in this study naturally aligns with a multi-label learning (MLL) framework. In this setting, each instance may be associated with one or more target labels, corresponding to different failure modes. Formally, an MLL problem is defined over a training dataset $D = \{(x_i, Y_i)\}_{i=1}^{n}$, where each $x_i = (x_{i1}, x_{i2}, \dots, x_{im})$ is a feature vector with $m$ attributes, and $Y_i \subseteq L$ is a label subset drawn from the full label set $L = \{y_1, y_2, \dots, y_q\}$, with $q$ denoting the total number of possible labels. The objective is to learn a mapping function $G$ that predicts the appropriate label subset $\hat{Y}$ for an unseen instance $x$, i.e., $G(x) \to \hat{Y}$.

To manage this complexity, our proposed method utilizes the binary relevance strategy, one of the most widely adopted approaches for multi-label classification. The core idea of BR is to decompose the multi-label task into $q$ independent binary classification problems, one for each label. Each classifier is trained to predict the presence or absence of a particular label, treating all other labels as irrelevant.

This process begins by transforming the original dataset into $q$ binary-labeled datasets $D_{y_j}$ for $j = 1, 2, \dots, q$. Each dataset retains the same feature vectors as the original data but replaces the multi-label targets with binary labels: a sample is marked positive if it includes the target label $y_j$, and negative otherwise. A separate binary classifier $h_j$ is then trained for each dataset $D_{y_j}$. During inference, a new instance is evaluated across all $q$ models, and the predicted label set $\hat{Y}$ is formed by aggregating the labels for which the corresponding classifiers output a positive result, as given in Equation (1):

$$G(x) = \{\, y_j \in L \mid h_j(x) = 1 \,\} \tag{1}$$
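
Equation (1) amounts to collecting the labels whose binary classifier fires. The sketch below assumes each trained classifier is available as a Python callable returning 0 or 1; the threshold rules standing in for trained models and the two-element feature vectors are hypothetical placeholders, not anything learned from the AI4I 2020 dataset.

```python
from typing import Callable, Dict, Sequence, Set

BinaryClassifier = Callable[[Sequence[float]], int]


def predict_label_set(x: Sequence[float],
                      classifiers: Dict[str, BinaryClassifier]) -> Set[str]:
    """G(x): the set of labels y_j whose classifier h_j outputs 1."""
    return {label for label, h_j in classifiers.items() if h_j(x) == 1}


# Hypothetical stand-ins for four trained binary models (one per label).
classifiers: Dict[str, BinaryClassifier] = {
    "y1": lambda x: int(x[0] > 0.5),
    "y2": lambda x: int(x[1] > 0.5),
    "y3": lambda x: int(x[0] + x[1] > 1.5),
    "y4": lambda x: int(x[0] < 0.0),
}

print(sorted(predict_label_set([0.9, 0.8], classifiers)))  # ['y1', 'y2', 'y3']
print(sorted(predict_label_set([0.1, 0.2], classifiers)))  # []
```

An empty set is a valid prediction under this scheme: it corresponds to an instance for which no failure mode is detected.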

Table 2 presents a typical multi-label dataset, where each instance $S_i$ is represented by a feature vector $x_i$ and an associated subset of labels $Y_i \subseteq L$. For example, $S_1$ is linked to failure types $y_1$ and $y_3$, while $S_2$ corresponds to all four simultaneous failure modes: $y_1$, $y_2$, $y_3$, and $y_4$. This exemplifies the multi-label nature of the dataset, in which instances can belong to multiple classes concurrently. The label subsets demonstrate how complex failure conditions are encoded in the learning framework.

An example of this transformation is illustrated in Table 3, which shows how a multi-label dataset is converted into multiple binary datasets, one per label in terms of binary relevance. For instance, if a sample is associated with labels $y_1$ and $y_3$, it will be treated as positive in the binary datasets for $y_1$ and $y_3$, and as negative in the datasets for the remaining labels.
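
The conversion illustrated in Table 3 can be sketched as follows. The helper name `to_binary_datasets` and the placeholder feature vectors are assumptions for illustration; the label subsets mirror the two example instances described for Table 2.

```python
from typing import Dict, List, Sequence, Set, Tuple

Features = Sequence[float]


def to_binary_datasets(
    features: List[Features],
    label_sets: List[Set[str]],
    labels: List[str],
) -> Dict[str, List[Tuple[Features, int]]]:
    """Build one binary dataset per label: the target is 1 if the label is in
    the instance's label subset and 0 otherwise. Features are left unchanged."""
    return {
        label: [(x, int(label in y)) for x, y in zip(features, label_sets)]
        for label in labels
    }


labels = ["y1", "y2", "y3", "y4"]
features = [[0.2, 1.4], [0.7, 0.3]]  # placeholder vectors for two instances
label_sets = [{"y1", "y3"}, {"y1", "y2", "y3", "y4"}]

datasets = to_binary_datasets(features, label_sets, labels)
for label in labels:
    print(label, [target for _, target in datasets[label]])
```

Each of the four resulting datasets has the same two feature vectors; only the 0/1 target column differs from label to label.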

In this setting, each output is encoded as a binary vector (e.g., [0, 1, 0, 1]), where each position corresponds to a specific failure mode: 1 denotes the presence and 0 the absence of that mode. This representation allows classifiers to be effectively used for solving multi-label learning tasks. The binary relevance approach is modular and computationally efficient, with a complexity that scales linearly with the number of labels $q$ and the cost $C$ of the base classifier, i.e., $O(q \times C)$. BR is a simple and scalable technique, making it particularly suitable for our predictive maintenance scenario, where labels are sparse yet well-defined. In this study, we utilized the BR technique by incorporating failure-specific resampling strategies (SMOTE and PDU) and ensemble learning via Hoeffding Tree forest, which together enhance the model’s ability to learn from multi-label imbalanced data.

3.3. Hybrid Resampling Strategy

One of the fundamental challenges in predictive maintenance is the severe data imbalance in class distribution, where failure events are extremely rare compared to normal operating conditions. This imbalance can undermine the performance of machine learning models, as they tend to be biased toward the majority class (i.e., healthy states), leading to poor sensitivity in detecting rare but critical failure modes.

To address this issue, the proposed BHTF method integrates a hybrid resampling strategy consisting of two stages: oversampling of minority failure classes using the SMOTE algorithm, and undersampling of the majority (healthy) class via PDU as a novel filtering technique over the multi-label dataset. This combined approach enhances the model’s ability to learn from scarce failure data and increases its sensitivity in detecting multiple, potentially co-occurring failure types. The following subsections detail the oversampling and undersampling methodologies applied in our framework:

3.3.1. Oversampling with SMOTE

To address the severe imbalance between healthy and failure states in the dataset, the proposed BHTF method employs SMOTE to increase the representation of failure samples. As discussed, each fault type is modeled separately under a binary relevance transformation. For each binary dataset, SMOTE is applied to the minority class, which corresponds to the presence of a specific failure mode. The following steps are executed:

1. Identify the minority class in the current binary-labeled dataset.

2. For each minority instance $x_{i}$, find its $k$ nearest neighbors from the same class using the Euclidean distance over the numeric features, as given in Equation (2):

$$d(x_{i}, x_{j}) = \sqrt{\sum_{l = 1}^{d} (x_{i l} - x_{j l})^{2}} \tag{2}$$

where the index $l$ runs over the $d$ numeric features.

3. Randomly select one neighbor $x_{NN}$ and generate a new synthetic sample by linear interpolation, as in Equation (3):

$$x_{syn} = x_{i} + \delta \times (x_{NN} - x_{i}) \tag{3}$$

where $x_{i}$ is the original minority instance, $\delta \in [0, 1]$ is a random number drawn from a uniform distribution, and $x_{NN}$ is the selected neighbor. Equivalently, the synthetic sample can be expressed as a convex combination of the original instance and its neighbor, as in Equation (4), which highlights that the new instance lies on the line segment connecting two minority samples in the feature space:

$$x_{syn} = (1 - \delta) \times x_{i} + \delta \times x_{NN} \tag{4}$$

4. Repeat this process until the minority class size increases by a predefined oversampling ratio.

This oversampling process is applied to each binary dataset independently, allowing the model to better learn the subtle variations and minority patterns of each failure mode.
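The interpolation in Equations (2) and (3) can be sketched as follows. This is a simplified, standard-library illustration and not the Weka SMOTE filter used in the study; the function name `smote`, the parameter `n_per_instance`, and the toy points are assumptions.

```python
import math
import random
from typing import List, Sequence


def smote(
    minority: Sequence[Sequence[float]],
    k: int,
    n_per_instance: int,
    rng: random.Random,
) -> List[List[float]]:
    """Generate synthetic minority samples by interpolating toward neighbors."""
    synthetic: List[List[float]] = []
    for i, x_i in enumerate(minority):
        # k nearest same-class neighbors by Euclidean distance (Equation (2)).
        others = [x for j, x in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda x: math.dist(x_i, x))[:k]
        if not neighbors:
            continue
        for _ in range(n_per_instance):
            x_nn = rng.choice(neighbors)  # randomly selected neighbor
            delta = rng.random()          # uniform draw in [0, 1)
            # Equation (3): x_syn = x_i + delta * (x_NN - x_i)
            synthetic.append([a + delta * (b - a) for a, b in zip(x_i, x_nn)])
    return synthetic


# Hypothetical minority points in a 2-D feature space.
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
samples = smote(points, k=2, n_per_instance=3, rng=random.Random(0))
```

Because every synthetic point lies on a segment between two real minority points, no sample is created outside the region already spanned by the minority class.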

3.3.2. Undersampling with PDU

Complementing oversampling, the proposed method introduces a novel undersampling technique, Proximity-Driven Undersampling (PDU), to selectively reduce the number of majority class (i.e., healthy) instances. This step aims to balance the dataset further by removing potentially noisy or less informative majority samples located close to minority class (i.e., failure) instances, thereby sharpening the decision boundaries that are critical for accurate fault detection in manufacturing systems.

The PDU technique operates by utilizing local proximity analysis in the feature space. Specifically, for each minority instance (i.e., label = 1), the algorithm identifies its nearest neighbor. If the nearest neighbor belongs to the majority class (i.e., label = 0), it is removed from the training set. This process is repeated iteratively for up to a user-specified number of iterations, allowing the method to clean the immediate surrounding region of each minority instance from potentially ambiguous majority samples. It yields a balanced and locally denoised training dataset with reduced overlap near class boundaries, optimized for improved minority class recognition and addressing data imbalance in predictive maintenance contexts.

The PDU method operates as follows:

1. Take the dataset $D$ for a given target class label as input.
2. Compute the Euclidean distance according to Equation (2) between a minority instance $x_{i} \in D_{minority}$ (i.e., label = 1) and all other instances.
3. Identify its nearest neighbor, denoted by $x_{NN}$.
4. If $x_{NN}$ belongs to the majority class (i.e., label = 0), remove it from the training set.
5. Repeat steps 3 and 4 for up to a user-specified number of iterations, denoted by $U$.
6. Return to step 2 and repeat the same process for all minority instances in the dataset.
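The steps above can be sketched in Python as follows. This is one possible reading of the procedure, not the authors' code: the function name `pdu` is hypothetical, and two details are assumptions, namely that removed instances are excluded from later neighbor searches and that the inner loop stops as soon as the nearest neighbor is a minority instance.

```python
import math
from typing import List, Sequence


def pdu(
    instances: Sequence[Sequence[float]],
    labels: Sequence[int],
    max_iterations: int,
) -> List[int]:
    """Return the indices kept after proximity-driven undersampling."""
    removed = set()
    minority = [i for i, y in enumerate(labels) if y == 1]
    for i in minority:
        for _ in range(max_iterations):  # at most U removals per minority instance
            candidates = [
                j for j in range(len(instances)) if j != i and j not in removed
            ]
            if not candidates:
                break
            # Nearest remaining neighbor by Euclidean distance (Equation (2)).
            nn = min(candidates, key=lambda j: math.dist(instances[i], instances[j]))
            if labels[nn] == 0:
                removed.add(nn)  # majority neighbor: drop it
            else:
                break            # nearest neighbor is a minority instance: stop
    return [j for j in range(len(instances)) if j not in removed]


# Hypothetical 1-D data: two failure instances (label 1) among healthy ones.
xs = [[0.0], [0.1], [0.2], [5.0], [0.3]]
ys = [1, 0, 0, 0, 1]
kept = pdu(xs, ys, max_iterations=2)
```

In this toy example the two healthy points closest to the first failure instance are removed, while the distant healthy point at 5.0 is untouched.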

As visualized in Figure 2, the hybrid resampling framework proceeds in three steps. The original multi-label dataset shows a highly imbalanced distribution, with sparse minority class instances (blue circles) overshadowed by dominant majority samples (green triangles). In the oversampling stage, SMOTE generates synthetic minority samples (yellow circles) using nearest-neighbor interpolation around an example real instance $x_{i}$, forming a denser minority cluster. Then, in the undersampling stage, PDU examines the local neighborhood around each minority instance. Majority class instances found in close proximity (i.e., up to $U = 7$ nearest neighbors) are removed from the training set. Red triangles indicate the removed samples, resulting in the final cleaned dataset with better balance and reduced local class noise.

To clarify the distinctions between SMOTE and PDU, we summarize their differences in Table 4. This table presents a comprehensive comparison of the two methods across multiple features, including their approach to sampling, affected classes, scenarios, use of k-nearest neighbors, risks, goals, sensitivity to noise, effects on decision boundaries, computational cost, and main techniques.

3.4. Hoeffding Tree Classifier

In this study, we adopt the Hoeffding Tree [7] as the base learner for each binary classification task derived via the binary relevance decomposition. The Hoeffding Tree is a streaming decision tree algorithm designed for scalable, online learning from large-scale or continuously arriving data. Unlike traditional decision trees such as C4.5 or classification and regression trees (CART), which require multiple passes through the entire dataset, the Hoeffding Tree incrementally builds its structure by observing one instance at a time. This capability makes it well suited for predictive maintenance scenarios, where real-time sensor data may arrive in high volumes.

The key foundation of the Hoeffding Tree lies in the Hoeffding bound, which provides a statistical guarantee for selecting a splitting attribute based on a finite number of observations. Given that the splitting metric (e.g., information gain or Gini index) is computed on observed data, the Hoeffding bound ensures that the attribute selected using the current sample is, with high probability, the same as the one that would be chosen if the algorithm had access to an infinite dataset. Mathematically, the Hoeffding bound is expressed as Equation (5):

$$\epsilon = \sqrt{\frac{\theta^{2} \ln(1 / \delta)}{2 n}} \tag{5}$$

where $\epsilon$ denotes the maximum difference between the true and estimated values of the splitting criterion, $\theta$ is the range of the splitting function (e.g., for information gain, $\theta = \log_{2} c$, where $c$ is the number of classes), $\delta$ is the user-defined confidence parameter (e.g., 0.01 for 99% confidence), and $n$ is the number of observed instances at a given node. Using this bound, the Hoeffding Tree determines when it has seen enough data at a node to confidently choose the best splitting attribute. Specifically, it compares the two top-ranked attributes, whose evaluated scores (e.g., information gain) are $G_{1}$ and $G_{2}$, and chooses to split on the best one if Equation (6) holds:

$$G_{1} - G_{2} > \epsilon \tag{6}$$

This criterion ensures that the selected split is statistically superior: with probability $1 - \delta$, the attribute scoring $G_{1}$ is indeed the better splitting attribute. If the condition is not met, the algorithm defers the split and waits for more data to accumulate, thereby preventing premature or unreliable splits caused by insufficient data.
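To make Equations (5) and (6) concrete, the short sketch below evaluates the bound for a binary task. The helper names `hoeffding_bound` and `should_split` are illustrative, the values $\delta = 10^{-7}$ and $n = 200$ are borrowed from the splitConfidence and gracePeriod settings reported in Section 4.3, and the two gain values are invented.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Equation (5): epsilon = sqrt(theta^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(g1: float, g2: float, value_range: float, delta: float, n: int) -> bool:
    """Equation (6): split when the gap between the two best scores exceeds epsilon."""
    return (g1 - g2) > hoeffding_bound(value_range, delta, n)


theta = math.log2(2)  # range of information gain for c = 2 classes
epsilon = hoeffding_bound(theta, delta=1e-7, n=200)  # roughly 0.20
decision = should_split(0.30, 0.05, theta, delta=1e-7, n=200)  # hypothetical gains
```

With these settings a gain gap of 0.25 is large enough to split after 200 instances, whereas a gap of 0.10 would require more data, since $\epsilon$ shrinks only in proportion to $1/\sqrt{n}$.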

The Hoeffding Tree classifier is particularly well-suited for predictive maintenance tasks due to its key advantages: it supports incremental learning, meaning the model can be updated in real time as new sensor data arrives without retraining from scratch; it is memory-efficient, maintaining only summary statistics at each node instead of storing the entire dataset; it exhibits the anytime property, producing a usable model even in early training stages; and it is robust to missing values, accommodating both nominal and numeric features. These characteristics make it an ideal base learner for large-scale, real-time industrial applications. Accordingly, the Hoeffding Tree forms the foundation of the ensemble learning described in the next subsection.

3.5. Hoeffding Tree Forest

The final stage of the proposed method aggregates multiple Hoeffding Tree classifiers into an ensemble structure to improve robustness and predictive accuracy. This ensemble, referred to as the Hoeffding Tree Forest, forms the backbone of the BHTF architecture. For each failure type identified through the binary relevance transformation, a dedicated and balanced binary dataset is created using SMOTE-based oversampling and the proposed PDU undersampling technique. On each of these balanced datasets, $T$ Hoeffding Tree classifiers (here, $T = 10$) are independently trained. The trees exploit the statistical rigor of the Hoeffding bound to incrementally build reliable models from large-scale data.

To generate predictions, each Hoeffding Tree in an ensemble produces an output for a given instance. The final decision for each failure label is obtained via majority voting among the corresponding $T$ classifiers. Formally, let $H_{j} = \{h_{j, 1}, h_{j, 2}, h_{j, 3}, \dots, h_{j, T}\}$ denote the ensemble of $T$ Hoeffding Trees trained for label $y_{j}$. Then, the ensemble prediction $\hat{y}_{j}$ is defined as Equation (7):

$$\hat{y}_{j} = \operatorname{mode}(h_{j, 1}(x), h_{j, 2}(x), \dots, h_{j, T}(x)) \tag{7}$$

where the mode operator returns the most frequently predicted class (0 or 1) for the instance $x$. This mechanism ensures that the ensemble prediction is determined by the consensus of the classifiers, thereby reducing the influence of individual misclassifications and improving the robustness of the final decision.
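The voting rule in Equation (7) can be illustrated with a small Python sketch. The function names and the example votes are hypothetical; in particular, resolving a tie toward class 0 is an assumption made here, since the text does not state how ties are handled when $T$ is even.

```python
from collections import Counter
from typing import Dict, List, Sequence


def majority_vote(votes: Sequence[int]) -> int:
    """Equation (7): return the most frequent class among the T tree outputs."""
    counts = Counter(votes)
    # Assumption: a tie (possible for even T) is resolved toward 0 (no failure).
    return 1 if counts[1] > counts[0] else 0


def predict_label_set(votes_per_label: Dict[str, Sequence[int]]) -> List[str]:
    """Collect every label whose ensemble votes for presence."""
    return [name for name, votes in votes_per_label.items() if majority_vote(votes) == 1]


# Hypothetical outputs of T = 10 trees for two labels on one instance.
example_votes = {"TWF": [0] * 10, "HDF": [1] * 6 + [0] * 4}
predicted = predict_label_set(example_votes)
```

Because each label has its own ensemble, the predicted label set can contain zero, one, or several failure modes for the same instance.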

The ensemble learning strategy adopted in the proposed BHTF offers several key advantages. By aggregating predictions from multiple Hoeffding Tree classifiers, it improves generalization performance by reducing variance and mitigating overfitting. The use of resampled datasets—achieved through SMOTE and PDU techniques—ensures robust learning from minority class instances, boosting the model’s ability to detect rare and critical failure types. Furthermore, the independence of ensemble members across labels makes BHTF particularly well-suited for industrial predictive maintenance tasks characterized by multi-label failure diagnosis, severe data imbalance, and the need for efficient processing.

3.6. Algorithm

To provide a clear and structured overview of the proposed method, the complete algorithmic process behind the BHTF is presented through Algorithm 1. While the previous subsections have described each individual component in detail—including multi-label decomposition, hybrid resampling, and ensemble learning—the algorithm summarizes how these components are integrated into a unified predictive maintenance framework. Specifically, BHTF enables multi-label fault diagnosis, aggregates multiple Hoeffding Trees into a balanced ensemble, and maintains adaptability to streaming data, thereby addressing the core challenges of PdM. The algorithm outlines the data transformation, training, and inference phases involved in constructing BHTF, enabling reproducibility and better understanding of the method’s implementation.

Algorithm 1: Balanced Hoeffding Tree Forest (BHTF)

Inputs:
    $D$: multi-label dataset, $D = \{(x_{i}, Y_{i})\}_{i = 1}^{n}$
    $L$: label set, $L = \{y_{1}, y_{2}, \dots, y_{q}\}$
    $T$: number of Hoeffding Trees per label
    $R$: oversampling rate
    $k$: number of nearest neighbors
    $U$: undersampling threshold
    $x$: new instance to be predicted
Outputs:
    $H$: the ensemble of models, where $H_{j} = \{h_{j, 1}, h_{j, 2}, \dots, h_{j, T}\}$ contains the models for the $j$th label
    $\hat{Y}$: predicted label set for input instance $x$

// Step 1: Binary Relevance Decomposition
// Multi-label learning: decompose the problem into $q$ binary tasks to capture co-occurring failures.
for $j = 1$ to $q$
    $D_{y_{j}} = \emptyset$    // Initialize binary dataset for label $y_{j}$
    for each $(x_{i}, Y_{i})$ in $D$
        if $y_{j} \in Y_{i}$
            $D_{y_{j}}$.Add($(x_{i}, 1)$)    // Assign positive label for presence of $y_{j}$
        else
            $D_{y_{j}}$.Add($(x_{i}, 0)$)    // Assign negative label for absence of $y_{j}$
        end if
    end for each
end for

// Step 2: Hybrid Resampling
// Hybrid imbalance handling: integrate SMOTE-based oversampling with proximity-driven undersampling (PDU).
for $j = 1$ to $q$
    // Step 2.1: Oversampling
    $c_{minority} = \text{MinorityClass}(D_{y_{j}})$    // Identify minority class
    for each $x_{i} \in D_{y_{j}}$ where class($x_{i}$) == $c_{minority}$
        $N_{k}(x_{i}) = \text{kNearestNeighbors}(x_{i}, k)$    // Find k-nearest neighbors
        for $r = 1$ to $R$    // Generate $R$ synthetic samples
            $x_{NN} = \text{RandomSelect}(N_{k}(x_{i}))$    // Randomly select one neighbor (Section 3.3.1)
            $x_{syn} = x_{i} + \delta \times (x_{NN} - x_{i})$    // Synthetic instance, $\delta \in [0, 1]$
            $D_{y_{j}}$.Add($(x_{syn}, 1)$)    // Add synthetic minority sample to dataset
        end for
    end for each
    // Step 2.2: Undersampling
    $c_{minority} = \text{MinorityClass}(D_{y_{j}})$    // Identify minority class
    $c_{majority} = \text{MajorityClass}(D_{y_{j}})$    // Identify majority class
    $F = \emptyset$    // Initialize removal set
    for each $x_{i} \in D_{y_{j}}$ where class($x_{i}$) == $c_{minority}$
        for $u = 1$ to $U$
            $x_{NN} = \text{NearestNeighbor}(x_{i}, D_{y_{j}} \setminus F)$    // Find nearest neighbor not yet flagged
            if class($x_{NN}$) == $c_{majority}$ then
                $F = F \cup \{x_{NN}\}$    // Flag majority neighbor for removal
            else break
            end if
        end for
    end for each
    $D_{j} = D_{y_{j}}$.Remove($F$)    // Remove flagged majority instances
end for

// Step 3: Model Training
// Ensemble learning: construct multiple Hoeffding Trees per label to improve robustness.
$H = \emptyset$    // Initialize ensemble of models
for $j = 1$ to $q$
    $H_{j} = \emptyset$    // Initialize ensemble for label $y_{j}$
    for $t = 1$ to $T$
        $D_{j}^{t} = \text{Bootstrapping}(D_{j})$
        $h_{j, t} = \text{HoeffdingTree}(D_{j}^{t})$    // Train Hoeffding Tree on resampled data
        $H_{j} = H_{j} \cup \{h_{j, t}\}$    // Add trained tree to ensemble
    end for
    $H = H \cup \{H_{j}\}$
end for

// Step 4: Model Testing
// Prediction across multiple labels through majority voting within each ensemble.
$\hat{Y} = \emptyset$    // Initialize predicted label set
for $j = 1$ to $q$
    $V_{j} = \{h_{j, 1}(x), h_{j, 2}(x), \dots, h_{j, T}(x)\}$    // Collect predictions from ensemble $H_{j}$
    $\hat{y}_{j} = \operatorname{mode}(V_{j})$    // Compute majority vote for label $y_{j}$
    if $\hat{y}_{j} == 1$ then
        $\hat{Y} = \hat{Y} \cup \{y_{j}\}$    // Add $y_{j}$ to predicted label set if voted as present
    end if
end for
End Algorithm

4. Experimental Setup

4.1. Dataset Description

To assess the effectiveness of the proposed BHTF method, we utilized the AI4I 2020 predictive maintenance dataset, which is publicly available through the UCI machine learning repository (University of California, Irvine, CA, USA) [45]. This dataset is widely used in predictive maintenance research due to its realistic industrial context and rich collection of sensor-based features. It provides a reliable foundation for modeling multi-label classification tasks within manufacturing systems. A summary of its main characteristics is presented in Table 5.

The dataset comprises 10,000 records and 14 variables, integrating identification fields, sensor measurements, and binary target indicators related to various machine failure types. The UID and Product ID serve as identifiers, while the Type attribute indicates the quality grade of a product as low (L), medium (M), or high (H). The primary sensor-driven features are air temperature, process temperature, rotational speed, torque, and tool wear, which capture the main operational parameters of the machinery. On the output side, the dataset includes six binary target labels: random failures (RNF), tool wear failure (TWF), heat dissipation failure (HDF), power failure (PWF), overstrain failure (OSF), and an aggregated machine failure flag, which signals whether any of the aforementioned failures has occurred. Table 6 provides detailed information on all dataset variables, including their name, category, type, description, and unit of measurement.

In our study, the UID and Product ID columns were removed during preprocessing, as they carry no predictive value for failure analysis. We also excluded the random failures (RNF) label due to its inherently unpredictable nature, which does not align with the structured diagnostic objective of our artificial intelligence-driven approach. Additionally, the general machine failure flag—indicating whether any failure has occurred—was omitted from the target space, since our goal was not to distinguish between failure and no-failure states. Instead, we focused on diagnosing specific failure types. Therefore, our BHTF method was trained exclusively on the four actionable failure modes, namely TWF, HDF, PWF, and OSF.

Each data instance represents the state of a manufacturing machine, defined by its sensor readings as input features and a corresponding multi-label output. The output is encoded as a binary vector, where each position indicates the presence (1) or absence (0) of a specific failure mode in the order of TWF, HDF, PWF, and OSF. A key characteristic of this dataset is that multiple failures can occur simultaneously, making it inherently suited to multi-label classification rather than traditional single-label approaches. This complexity also introduces data imbalance, as certain combinations of failure types are rare yet critical for effective fault detection.

The distributional characteristics of the dataset's continuous features are summarized in Table 7. These statistics (minimum, maximum, mean, and standard deviation) provide a quantitative overview of the sensor readings, which reflect the machine's operational behavior under various conditions. Understanding the variability and range of these inputs is crucial for proper model training in predictive maintenance applications.

4.2. Evaluation Metrics

4.2.1. Standard Per-Label Metrics

To evaluate the predictive performance of the proposed BHTF model, we employed 10-fold cross-validation, a widely accepted resampling technique that offers a balanced trade-off between bias and variance in performance estimation. Several evaluation metrics were utilized to capture different aspects of classification quality in the context of multi-label and imbalanced learning tasks. The fundamental metrics include accuracy (ACC), precision (PR), recall (R), and F-measure (F), each calculated from the confusion matrix components: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). These metrics are formally defined in Equations (8)–(11):

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$

$$\mathrm{PR} = \frac{TP}{TP + FP} \tag{9}$$

$$\mathrm{R} = \frac{TP}{TP + FN} \tag{10}$$

$$\mathrm{F} = \frac{2\,TP}{2\,TP + FP + FN} \tag{11}$$

Here, TP refers to the number of correctly predicted positive instances, TN to the correctly predicted negatives, FP to the negative instances incorrectly classified as positive, and FN to the positive instances that were missed by the classifier.

While accuracy provides an overall performance measure, precision and recall give deeper insight into how well the model handles data imbalance, particularly for minority failure classes. The F-measure (or F1-score) balances precision and recall, making it a valuable metric when these two are in tension, as often occurs in imbalanced multi-label scenarios.
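Equations (8)–(11) translate directly into code. The sketch below is illustrative only: the function name `per_label_metrics` and the confusion counts are invented and do not come from the paper's tables.

```python
from typing import Dict


def per_label_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute ACC, PR, R, and F from confusion-matrix counts (Equations (8)-(11))."""
    total = tp + tn + fp + fn
    f_denominator = 2 * tp + fp + fn
    return {
        "ACC": (tp + tn) / total if total else 0.0,
        "PR": tp / (tp + fp) if (tp + fp) else 0.0,
        "R": tp / (tp + fn) if (tp + fn) else 0.0,
        "F": 2 * tp / f_denominator if f_denominator else 0.0,
    }


# Hypothetical counts for one imbalanced binary task (1000 instances, 50 positives).
metrics = per_label_metrics(tp=40, tn=900, fp=50, fn=10)
```

In this invented example accuracy is 0.94 while precision on the positive class is only about 0.44, which shows why accuracy alone can be misleading when positives are rare.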

4.2.2. Weighted Metrics

Following the standard per-label metrics, we compute weighted precision (WPR) and weighted recall (WR). These metrics account for the relative frequency of each class and are particularly useful in multi-label and imbalanced datasets, as they give more weight to classes with more instances while still reflecting performance on minority classes. The weighted metrics are formally defined as Equations (12) and (13), respectively:

$$\mathrm{WPR} = \frac{\sum_{i = 1}^{N} PR_{i} \times S_{i}}{\sum_{i = 1}^{N} S_{i}} \tag{12}$$

$$\mathrm{WR} = \frac{\sum_{i = 1}^{N} R_{i} \times S_{i}}{\sum_{i = 1}^{N} S_{i}} \tag{13}$$

where $N$ is the number of labels or classes, $PR_{i}$ and $R_{i}$ are the precision and recall for class $i$, respectively, and $S_{i}$ is the sum of $TP_{i}$ and $FN_{i}$, representing the total number of true instances of class $i$ in the dataset. This weighted formulation ensures that classes with more instances contribute proportionally more to the overall precision and recall.
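A support-weighted average as in Equations (12) and (13) can be sketched as below. The helper name `weighted_average` and the numbers are hypothetical and chosen only to show how a large class dominates the result.

```python
from typing import Sequence


def weighted_average(values: Sequence[float], supports: Sequence[int]) -> float:
    """Equations (12)/(13): average per-class values weighted by S_i = TP_i + FN_i."""
    total_support = sum(supports)
    if total_support == 0:
        return 0.0
    return sum(v * s for v, s in zip(values, supports)) / total_support


# Hypothetical binary task: failure class (10 instances) and healthy class (990).
precisions = [0.5, 1.0]
supports = [10, 990]
wpr = weighted_average(precisions, supports)
```

Here the weighted precision is 0.995 even though the precision on the rare class is 0.5, so weighted scores are best read alongside the per-class values.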

4.2.3. Multi-Label Metrics

In multi-label classification, each instance can be associated with multiple labels simultaneously, which makes the evaluation more complex compared to single-label tasks. Fundamental metrics such as accuracy, precision, recall, and F-measure, although informative, may not fully capture the nuances of multi-label performance. To provide a more comprehensive assessment of the proposed BHTF model, we employed additional multi-label metrics, including macro-F1, micro-F1, Hamming loss, Jaccard index, and subset accuracy. These metrics evaluate performance from different perspectives, such as per-label, per-instance, and overall prediction quality, thus supporting a balanced and rigorous analysis of the model's effectiveness in multi-label scenarios.

Macro-F1 evaluates how well the model predicts each label individually and then averages the results equally across all labels, regardless of their frequency in the dataset. Because every label carries the same weight, this metric is not dominated by frequent labels, making it especially valuable in imbalanced multi-label tasks. It is formally defined in Equation (14):

$$\mathrm{Macro\_F1} = \frac{1}{N} \sum_{i = 1}^{N} F_{i} \tag{14}$$

where $N$ is the number of labels and $F_{i}$ is the F-measure (F1-score) for class $i$.

The micro-precision (MP) and micro-recall (MR) are obtained by aggregating true positives, false positives, and false negatives across all labels before computing the precision and recall values. This approach ensures that the contribution of each label is proportional to its number of instances, making it suitable for datasets with imbalanced label distributions. They are presented in Equations (15) and (16):

$$\mathrm{MP} = \frac{\sum_{i = 1}^{N} TP_{i}}{\sum_{i = 1}^{N} TP_{i} + \sum_{i = 1}^{N} FP_{i}} \tag{15}$$

$$\mathrm{MR} = \frac{\sum_{i = 1}^{N} TP_{i}}{\sum_{i = 1}^{N} TP_{i} + \sum_{i = 1}^{N} FN_{i}} \tag{16}$$

Using these definitions, the micro-F1-score is expressed as Equation (17):

$$\mathrm{Micro\_F1} = 2 \times \frac{\mathrm{MP} \times \mathrm{MR}}{\mathrm{MP} + \mathrm{MR}} \tag{17}$$

Unlike macro-F1, which treats all labels equally, micro-F1 places more emphasis on labels with a higher number of instances by aggregating predictions across all classes. This makes micro-F1 a widely adopted metric for evaluating overall model effectiveness in multi-label classification, especially in scenarios with significant class imbalance.
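The difference between the two averages in Equations (14)–(17) is easiest to see on a small example. The sketch below uses invented per-label counts, and the function names are illustrative.

```python
from typing import Sequence, Tuple

Counts = Tuple[int, int, int]  # (TP, FP, FN) for one label


def f1(tp: int, fp: int, fn: int) -> float:
    """Equation (11) for a single label."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(counts: Sequence[Counts]) -> float:
    """Equation (14): unweighted mean of the per-label F1-scores."""
    return sum(f1(*c) for c in counts) / len(counts)


def micro_f1(counts: Sequence[Counts]) -> float:
    """Equations (15)-(17): pool TP, FP, FN over labels, then compute F1."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    mp = tp / (tp + fp) if (tp + fp) else 0.0
    mr = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * mp * mr / (mp + mr) if (mp + mr) else 0.0


# Hypothetical counts: a rare label and a frequent label.
label_counts = [(8, 2, 2), (90, 10, 0)]
macro = macro_f1(label_counts)
micro = micro_f1(label_counts)
```

With these counts the micro average is higher than the macro average because the frequent, well-predicted label contributes most of the pooled true positives.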

The Hamming loss quantifies the proportion of incorrect label predictions relative to the total number of predictions across all labels. In other words, it evaluates the average number of misclassification errors (false positives and false negatives) per instance per label. This makes it especially suitable for multi-label tasks, as it provides a label-wise error perspective instead of only focusing on the entire label set. If a confusion matrix is available for each label, the Hamming loss can be directly computed from FP and FN, normalized by the total number of predictions, as shown in Equation (18):

$$\mathrm{Hamming\_Loss} = \frac{\sum_{i = 1}^{N} (FP_{i} + FN_{i})}{N \times M} \tag{18}$$

where $N$ is the total number of labels, $M$ is the total number of instances, and $FP_{i}$ and $FN_{i}$ correspond to the false positives and false negatives for label $i$. This formulation demonstrates that Hamming loss essentially captures the ratio of prediction errors to the total number of instance–label decisions, thus offering an interpretable and fine-grained measure of model performance in imbalanced multi-label scenarios.
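Equation (18) can be computed from the per-label error counts alone, as in the following sketch. The function name `hamming_loss` and the error counts are hypothetical.

```python
from typing import Sequence, Tuple


def hamming_loss(errors_per_label: Sequence[Tuple[int, int]], n_instances: int) -> float:
    """Equation (18): total (FP + FN) over labels divided by N * M."""
    n_labels = len(errors_per_label)
    total_errors = sum(fp + fn for fp, fn in errors_per_label)
    return total_errors / (n_labels * n_instances)


# Hypothetical (FP, FN) pairs for four labels evaluated on 100 instances.
errors = [(3, 1), (0, 2), (5, 0), (1, 0)]
loss = hamming_loss(errors, n_instances=100)
```

Twelve wrong decisions out of 400 instance–label pairs give a loss of 0.03 in this toy case.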

The Jaccard index measures the proportion of correctly predicted labels relative to all labels that were either predicted or actually present, providing an instance-level assessment of multi-label prediction quality. When confusion matrices are available for each label, the Jaccard index can be computed directly from TP, FP, and FN, as expressed in Equation (19):

$$\mathrm{Jaccard\_Index} = \frac{\sum_{i = 1}^{N} TP_{i}}{\sum_{i = 1}^{N} TP_{i} + \sum_{i = 1}^{N} FP_{i} + \sum_{i = 1}^{N} FN_{i}} \tag{19}$$

The subset accuracy, also known as the exact match ratio, evaluates multi-label predictions at the instance level by considering an instance correctly classified only if all its labels are predicted correctly. This can be represented through Equations (20) and (21):

$$\mathrm{Subset\_Accuracy}_{i} = \begin{cases} 1, & \text{if } TP_{i} + TN_{i} = N \\ 0, & \text{otherwise} \end{cases} \tag{20}$$

$$\mathrm{Subset\_Accuracy} = \frac{1}{M} \sum_{i = 1}^{M} \mathbb{1}\{\text{all labels correct for instance } i\} \tag{21}$$

where $N$ is the total number of labels, $M$ is the number of instances, and, in Equation (20), $TP_{i}$ and $TN_{i}$ denote the numbers of true-positive and true-negative label predictions for instance $i$, so that $TP_{i} + TN_{i} = N$ holds exactly when every label of that instance is correct. Equation (20) illustrates the instance-level logic: an instance contributes 1 only if every label is correct, and 0 otherwise. Equation (21) then averages these instance-level results across all $M$ instances to compute the overall subset accuracy. Although very strict, since a single misclassified label sets the score of an instance to zero, subset accuracy provides valuable insight into the model's ability to achieve fully correct predictions, complementing other metrics such as Hamming loss and Jaccard index.
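The Jaccard index of Equation (19) and the subset accuracy of Equations (20) and (21) can both be computed from true and predicted label vectors, as sketched below. The function names and the three example instances are invented; returning 1.0 when no label is present or predicted anywhere is an assumption for the degenerate case.

```python
from typing import Sequence


def jaccard_index(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Equation (19): pooled TP / (TP + FP + FN) over all labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denominator = tp + fp + fn
    # Assumption: with no positives at all, treat the prediction as perfect.
    return tp / denominator if denominator else 1.0


def subset_accuracy(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Equations (20)-(21): fraction of instances whose label vector matches exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)


# Hypothetical label vectors in the order TWF, HDF, PWF, OSF.
truth = [[0, 1, 0, 1], [0, 0, 0, 0], [1, 0, 0, 0]]
preds = [[0, 1, 0, 1], [0, 0, 1, 0], [1, 0, 0, 0]]
jac = jaccard_index(truth, preds)
sub = subset_accuracy(truth, preds)
```

The single false positive in the second instance lowers the Jaccard index to 0.75 but costs that instance its entire credit under subset accuracy, which drops to 2/3.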

4.3. Hyperparameters

To ensure optimal performance of the proposed BHTF method, a comprehensive set of hyperparameters was empirically selected across its key components, including oversampling, undersampling, decision tree induction, ensemble training, feature selection, and feature importance. The entire implementation was developed in Java version 17 (Oracle Corp., Austin, TX, USA) using the Weka version 3.8.6 (University of Waikato, Hamilton, New Zealand) machine learning library [46]. All experiments were conducted on a standard desktop computer equipped with an Intel® Core™ i7 processor (Intel Corp., Santa Clara, CA, USA) running at 1.90 GHz and 8 GB of RAM.

SMOTE oversampling: To address class imbalance within each binary decomposition, SMOTE was applied. The number of nearest neighbors for generating synthetic samples was set to k = 5, and an aggressive oversampling ratio of R = 4000% was chosen. This configuration ensured that sufficient synthetic instances were generated for underrepresented failure classes to prevent classifier bias.
PDU undersampling: To reduce the overrepresented majority class, the proposed Proximity-Driven Undersampling strategy was applied. This method removes the majority instances located close to each minority instance in the feature space. Specifically, for each minority instance, up to U = 7 nearest majority neighbors were identified using LinearNNSearch with k = 1, a brute-force search that computes the Euclidean distance to every instance to find the closest neighbor. This step helped refine class boundaries and reduce overlap between majority and minority classes.
Ensemble configuration: The multi-label learning approach in BHTF utilized an ensemble of Hoeffding Trees for each label. Specifically, for each binary classification task related to a distinct failure mode, an ensemble of T = 10 trees was trained using a bagging meta-classifier. This strategy was implemented to improve prediction robustness and was configured with the following main parameters:
– numIterations = 10 (number of Hoeffding Trees in each ensemble);
– bagSizePercent = 100 (100% of the training data in each bootstrap sample);
– batchSize = 100 (the batch size for model updates);
– seed = 1 (reproducibility of randomized processes);
– numExecutionSlots = 1 (sequential execution of model training);
– representCopiesUsingWeights = false (explicit sampling of each instance without relying on weighting schemes).
Hoeffding Tree settings: Each base classifier in the ensemble was configured using the Hoeffding Tree implementation. The following parameters were set to control tree growth and splitting behavior:
– gracePeriod = 200 (number of instances a leaf observes between split attempts);
– hoeffdingTieThreshold = 0.05 (threshold below which a split is forced to break ties between close information gains);
– leafPredictionStrategy = Naive Bayes adaptive (Naive Bayes prediction in leaf nodes when beneficial);
– minimumFractionOfWeightInfoGain = 0.01 (minimum fraction of weight required down at least two branches for an information gain split);
– naiveBayesPredictionThreshold = 0.0 (number of instances a leaf must observe before Naive Bayes predictions are allowed);
– splitConfidence = 1.0 × 10⁻⁷ (allowable error in a split decision, i.e., $\delta$ in the Hoeffding bound);
– splitCriterion = Info gain split (uses information gain as the splitting metric).
Feature selection: We employed the Pearson correlation technique [47] as a filter-based supervised attribute selection approach. It was combined with the Ranker search method to measure the predictive relevance of each feature. Multiple configurations were empirically tested, including heuristic strategies based on logarithmic and square root formulas for determining the number of features to retain (e.g., $\log_{2} (m)$ , $\sqrt{m}$ , where $m$ is the number of original features). Among these, selecting all six features provided the most favorable balance between model accuracy and complexity. This optimal configuration (numToSelect = 6) was identified through extensive experimentation and justified further in the results section.
Feature importance: To further enhance interpretability and provide visual analysis, we examined the contribution of individual features to each failure mode using Pearson correlation scores. The results are presented in Figure 3, Figure 4, Figure 5 and Figure 6, where features are ranked for each of the four failure modes: TWF, HDF, PWF, and OSF. For instance, torque and tool wear emerge as dominant indicators for OSF and TWF, respectively, while air temperature and rotational speed strongly influence HDF and PWF. These findings not only validate our decision to retain six features during preprocessing but also provide explicit evidence of how different sensors contribute to specific failures. Importantly, these visualizations offer an intuitive understanding of the data characteristics prior to modeling, thereby complementing the performance-driven results of BHTF.
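The correlation-based ranking described in the last two items can be illustrated with a small standard-library sketch. This is not the Weka attribute evaluator used in the study; the function names, the toy columns, and the choice to rank by absolute correlation are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denominator = math.sqrt(var_x * var_y)
    return cov / denominator if denominator else 0.0  # constant column -> 0


def rank_features(
    columns: Dict[str, Sequence[float]], target: Sequence[int]
) -> List[Tuple[str, float]]:
    """Rank features by absolute correlation with a binary failure label."""
    scores = [(name, abs(pearson(values, target))) for name, values in columns.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Invented toy columns and a binary failure label for four instances.
toy_columns = {"feature_a": [1.0, 2.0, 3.0, 4.0], "feature_b": [1.0, 0.0, 1.0, 0.0]}
toy_target = [0, 0, 1, 1]
ranking = rank_features(toy_columns, toy_target)
```

Repeating such a ranking once per failure label yields one ordered feature list per label, which is the kind of per-label view reported in Figures 3–6.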

5. Results

5.1. Overall BHTF Performance

The performance of the proposed BHTF method was evaluated through four separate binary classification tasks, each corresponding to a distinct failure mode: TWF, HDF, PWF, and OSF. The results are summarized in Table 8. BHTF achieved an overall accuracy of 97.44%, with an average precision of 0.9939, recall of 0.9744, and F-measure of 0.9839 across all failure modes. These metrics indicate a strong and balanced classification capability and underline the efficacy of the method for fault detection in manufacturing systems. Performance across individual labels also remained consistently high. The accuracy ranged from 93.94% for TWF to 98.87% for PWF. All precision scores exceeded 0.99. The recall values ranged from 0.9394 (TWF) to 0.9887 (PWF), indicating strong sensitivity across all classes. Corresponding F-measure scores varied between 0.9663 and 0.9914, confirming the model's robustness in balancing precision and recall. These results indicate that the BHTF method delivers reliable and accurate multi-label predictions for predictive maintenance, effectively identifying multiple concurrent failure types while maintaining high classification quality.

5.2. Confusion Matrix

To further explore the classification behavior of BHTF, Figure 7 presents confusion matrices for each failure type. Each matrix reports the number of instances predicted as failure or no failure against their actual labels. The majority of failure and healthy instances are correctly classified, with few false negatives (e.g., 8 (17.39%) for TWF, 3 (2.61%) for HDF) and a moderate number of false positives (e.g., 598 (6.01%) for TWF, 107 (1.08%) for PWF), which is consistent with the data imbalance addressed through resampling. These matrices demonstrate that BHTF can detect rare failure events while maintaining a low rate of misclassification for healthy instances.
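The percentages quoted above appear to be class-conditional error rates: false negatives divided by the actual failures, and false positives divided by the actual healthy instances. The following minimal sketch reproduces the two TWF figures under that reading. The total of 10,000 instances is stated later in the article; the count of 46 TWF failures is an assumption inferred from the quoted 17.39%, and the function name `error_rates` is illustrative.

```python
def error_rates(fn, fp, n_failure, n_total):
    """Return (false-negative rate, false-positive rate) in percent."""
    n_healthy = n_total - n_failure
    fnr = 100.0 * fn / n_failure   # missed failures / actual failures
    fpr = 100.0 * fp / n_healthy   # false alarms / actual healthy instances
    return round(fnr, 2), round(fpr, 2)

# TWF: 8 false negatives and 598 false positives are taken from the text;
# 46 actual failures among 10,000 instances is an inferred assumption.
twf_fnr, twf_fpr = error_rates(fn=8, fp=598, n_failure=46, n_total=10_000)
print(twf_fnr, twf_fpr)  # 17.39 6.01
```

Read this way, a false-negative count that looks small in absolute terms can still be a sizeable share of a rare class, which is why both the count and the percentage are worth reporting.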

To provide a thorough assessment of the proposed BHTF method in a multi-label context, we further evaluated its performance using macro-F1, micro-F1, Hamming loss, Jaccard index, and subset accuracy, all of which rely on the label-wise confusion matrices presented in Figure 7. The macro-F1-score of 0.9839 indicates that BHTF performs consistently well across all failure types, averaging the F1-scores equally without being dominated by the majority labels. The micro-F1-score of 0.9869, which aggregates contributions from all instances and labels, confirms that the method maintains high overall predictive accuracy even in the presence of class imbalance. The Hamming loss of 0.0256 demonstrates that only a small fraction of label predictions are incorrect relative to the total number of label assignments, reflecting the model’s reliability at the individual label level. Complementing this, the Jaccard index of 0.9742 underlines strong overlap between the predicted and true label sets, showcasing that BHTF captures the relevant failure events effectively. Finally, the subset accuracy of 90.07% confirms that in the vast majority of instances, the model predicts all labels correctly, further evidencing its robustness in exact multi-label prediction. Collectively, these results reinforce that BHTF not only achieves high performance in standard metrics such as accuracy, precision, recall, and F-measure (Table 8) but also excels across rigorous multi-label evaluation criteria, demonstrating its suitability for predictive maintenance tasks in imbalanced and multi-label scenarios.
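As a rough illustration of how these multi-label metrics are derived from label-wise counts, the sketch below computes them for a tiny hand-made label matrix. It uses the common positive-class definitions; the exact averaging conventions behind the values reported above are not specified in this section, so the code is not claimed to reproduce them. The toy data and the function name `multilabel_metrics` are assumptions for illustration only.

```python
def multilabel_metrics(y_true, y_pred):
    """Compute common multi-label metrics from 0/1 label matrices (lists of rows)."""
    n, q = len(y_true), len(y_true[0])
    tp, fp, fn = [0] * q, [0] * q, [0] * q
    wrong_cells = exact_rows = 0
    for yt, yp in zip(y_true, y_pred):
        exact_rows += int(yt == yp)          # all labels of the instance correct
        for j, (t, p) in enumerate(zip(yt, yp)):
            wrong_cells += int(t != p)       # one wrong label assignment
            tp[j] += int(t == 1 and p == 1)
            fp[j] += int(t == 0 and p == 1)
            fn[j] += int(t == 1 and p == 0)

    def f1(a, b, c):
        denom = 2 * a + b + c
        return 2 * a / denom if denom else 0.0

    TP, FP, FN = sum(tp), sum(fp), sum(fn)
    return {
        "hamming_loss": wrong_cells / (n * q),
        "subset_accuracy": exact_rows / n,
        "macro_f1": sum(f1(tp[j], fp[j], fn[j]) for j in range(q)) / q,
        "micro_f1": f1(TP, FP, FN),
        "micro_jaccard": TP / (TP + FP + FN) if (TP + FP + FN) else 0.0,
    }

# Toy example (not data from the study): 5 instances, 4 labels.
y_true = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0]]
metrics = multilabel_metrics(y_true, y_pred)
print(metrics)
```

When every label is evaluated on the same number of instances, the Hamming loss equals one minus the mean label-wise accuracy, which is consistent with the reported pair of 0.0256 and 97.44%.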

5.3. Resampling Performance Across Folds

Following the performance evaluation of BHTF, a deeper analysis was conducted to examine how the class distribution evolved through the hybrid resampling process in each fold of cross-validation, represented in Table 9, Table 10, Table 11 and Table 12. For each fold, 90% of the data was used for training and 10% for testing. The values shown in each table correspond to the number of healthy and failure instances in the training set, represented in the format healthy/failure, at three key stages of the resampling pipeline: (i) before SMOTE (original class imbalance), (ii) after SMOTE or before PDU (after minority class upsampling), and (iii) after PDU (after majority class reduction). This breakdown illustrates the success of the proposed resampling strategy in transforming highly imbalanced binary datasets into more balanced ones, thereby improving the learnability for each failure classification task.

The instance distributions shown in Table 9, Table 10, Table 11 and Table 12 reflect the impact of the hybrid resampling strategy (SMOTE for oversampling and the proposed PDU method for undersampling) across all 10 folds for each independent failure type (TWF, HDF, PWF, and OSF). Initially, all datasets exhibited a pronounced imbalance, with failure instances ranging from as low as 41 (TWF) to 104 (HDF) compared to nearly 9000 healthy instances. After applying SMOTE, the failure class in each fold was increased to a target range (approximately 1681–4264), depending on the specific failure type. Subsequently, the PDU step reduced the majority (healthy) class, bringing the two class sizes closer together. On average, this process resulted in considerably more balanced distributions, such as 8832/1697 for TWF, 8843/4244 for HDF, 8854/3506 for PWF, and 8883/3616 for OSF (healthy/failure). This consistent rebalancing across folds and failure types ensured that each binary classifier was trained on data with reduced class bias, thereby increasing the fairness of BHTF across all failure diagnosis tasks.
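The quoted target range can be checked with simple arithmetic. The sketch below assumes that the SMOTE ratio is expressed as the number of synthetic samples relative to the original minority count, in percent; this convention is an assumption, although it matches the counts given above for the 4000% setting used in the final configuration. The function name is illustrative.

```python
def smote_minority_size(n_minority, ratio_percent):
    """Minority class size after adding ratio_percent% synthetic samples."""
    synthetic = n_minority * ratio_percent // 100
    return n_minority + synthetic

for n in (41, 104):  # smallest (TWF) and largest (HDF) per-fold failure counts
    print(n, "->", smote_minority_size(n, 4000))
# 41 -> 1681
# 104 -> 4264

# Healthy-to-failure ratio for the averaged TWF counts after resampling.
twf_ratio_after = round(8832 / 1697, 1)
print(twf_ratio_after)  # 5.2
```

For TWF, the healthy-to-failure ratio therefore drops from roughly 9000/41 (about 220 to 1) before resampling to about 5.2 to 1 afterwards.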

In addition to the quantitative distributions reported in Table 9, Table 10, Table 11 and Table 12, t-distributed stochastic neighbor embedding (t-SNE) was employed to provide a visual analysis of how the proposed hybrid resampling strategy reshapes the data space. The high-dimensional feature space was projected into a two-dimensional embedding defined by component 1 and component 2 for both the imbalanced (before resampling) and balanced (after resampling) datasets. As shown in Figure 8a, before resampling the dataset is dominated by the majority class (class 0), with failure samples (red points, class 1) sparsely scattered among a dense cluster of healthy (non-failure) samples (blue points, class 0). This makes the minority class (class 1) difficult to learn. After applying SMOTE followed by the proposed PDU undersampling, as illustrated in Figure 8b, the two classes become more distributed, with minority samples forming clearer clusters and achieving improved separation from the majority ones. This visualization confirms that the hybrid resampling pipeline not only balances the class distributions numerically but also enhances the geometric separability of the classes in feature space, thereby facilitating more effective learning by the BHTF model.

5.4. Sensitivity Analysis

To determine the most effective configuration for the proposed BHTF method, an extensive hyperparameter sensitivity analysis was conducted. This process involved systematic experimentation with various parameter settings, including different SMOTE oversampling ratios, neighborhood sizes for the PDU undersampling, numbers of Hoeffding Trees in the ensemble, and subsets of input features. Although a wide range of hyperparameter combinations were explored through grid search, only representative results are presented here to illustrate the main performance trends, represented in Table 13, Table 14, Table 15 and Table 16. For each tested configuration, standard classification metrics were computed across the four failure types (TWF, HDF, PWF, and OSF) to evaluate the model’s diagnostic success in fault detection under data imbalance conditions. The final configuration adopted in BHTF reflects the best-performing combination, selected to mathematically maximize predictive performance while maintaining model simplicity and generalizability.
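A grid search of this kind can be sketched in a few lines. The candidate values below are the ones shown in Tables 13 to 16; since those tables report only representative results, the full grid actually explored may have been larger. The `evaluate` callback is a hypothetical stand-in for training and cross-validating BHTF, and the toy scoring function is a placeholder, not a result from the study.

```python
from itertools import product

# Candidate values as they appear in Tables 13-16 (representative, possibly incomplete).
GRID = {
    "smote_ratio": [4000, 5000, 6000],
    "pdu_neighbors": [1, 3, 5, 7, 9],
    "n_trees": [10, 50, 100],
    "n_features": [2, 3, 4, 5, 6],
}

def grid_search(grid, evaluate):
    """Return the configuration with the highest score returned by evaluate(cfg)."""
    names = list(grid)
    best_cfg, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        cfg = dict(zip(names, values))
        score = evaluate(cfg)
        if score > best_score:          # first best configuration wins on ties
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

def toy_evaluate(cfg):
    # Placeholder objective for demonstration only; a real run would return
    # the cross-validated average accuracy of BHTF for this configuration.
    return -cfg["n_trees"] - abs(cfg["pdu_neighbors"] - 5)

n_configs = len(list(product(*GRID.values())))
best_cfg, best_score = grid_search(GRID, toy_evaluate)
print(n_configs, best_cfg, best_score)
```

Even this small grid contains 225 configurations, each requiring a full 10-fold run, which helps explain why only representative settings are tabulated.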

5.4.1. Effect of SMOTE Ratio

To illustrate the influence of various SMOTE oversampling ratios (R) on the BHTF model’s performance, representative experiments were conducted using three settings: 4000%, 5000%, and 6000%. The classification accuracy for each failure type is reported in Table 13. Among these settings, the 4000% SMOTE ratio achieved the highest overall accuracy, with an average of 97.44% across all failure types. While HDF and OSF showed slight improvements with larger ratios, the performance on TWF declined as the oversampling rate increased. This indicates that excessive SMOTE can introduce noisy or redundant synthetic samples, particularly harming minority class generalization in sensitive failure types such as TWF. As part of the broader hyperparameter search, the 4000% SMOTE setting was adopted in the final BHTF configuration, as it offered the most favorable balance between classification accuracy and result consistency.

Table 13. Accuracy results (%) for each failure type under different SMOTE oversampling ratios (R).

Failure	R = 4000%	R = 5000%	R = 6000%
TWF	93.94	93.08	92.77
HDF	98.12	98.15	98.28
PWF	98.87	98.84	98.80
OSF	98.82	99.07	99.07
Average	97.44	97.29	97.23

5.4.2. Effect of Number of Neighbors in PDU

The sensitivity of BHTF to the number of neighbors used in the PDU technique was assessed with different values of the neighborhood size parameter (u): 1, 3, 5, 7, and 9. The resulting classification accuracies across all four failure types are reported in Table 14. The overall accuracy remains consistently high across all tested neighborhood sizes, with the average varying only between 97.39% and 97.44%. The best average accuracy (97.44%) was achieved at both u = 7 and u = 9; to keep the configuration computationally cheaper, u = 7 was selected as the final setting. This value strikes a favorable trade-off between model performance and the risk of excessive undersampling that can arise from overly large neighborhoods.

Table 14. Accuracy results (%) for each failure type under different numbers of neighbors in PDU (u).

Failure	u = 1	u = 3	u = 5	u = 7	u = 9
TWF	93.96	93.75	93.76	93.94	93.80
HDF	98.13	98.05	98.03	98.12	98.19
PWF	98.92	98.91	99.01	98.87	98.94
OSF	98.72	98.84	98.75	98.82	98.81
Average	97.43	97.39	97.39	97.44	97.44

5.4.3. Effect of Number of Hoeffding Trees

The impact of ensemble size on the classification performance of the proposed BHTF method is investigated by using various numbers of Hoeffding Trees (T), including 10, 50, and 100. Table 15 shows the accuracy results for each failure type under these configurations. The results confirm that while increasing the number of trees leads to slight gains for some failure types, such as PWF, there is a negligible or even slightly negative effect on others, such as TWF. The average accuracy across all failure types remains relatively stable, with the highest average of 97.44% achieved when T = 10. This indicates that a larger ensemble does not necessarily improve performance and may introduce redundant computational complexity. Based on these findings, T = 10 was selected for the final ensemble model to confirm the efficiency of BHTF.

Table 15. Accuracy results (%) for each failure type under different numbers of Hoeffding Trees (T).

Failure	T = 10	T = 50	T = 100
TWF	93.94	93.85	93.77
HDF	98.12	98.11	98.09
PWF	98.87	98.91	98.95
OSF	98.82	98.80	98.80
Average	97.44	97.42	97.40

5.4.4. Effect of Number of Selected Features

To evaluate the impact of feature subset size on BHTF performance, we experimented with varying numbers of selected features (f), from 2 to 6. The corresponding accuracy results for each failure type are reported in Table 16. As the number of features increased, the average accuracy improved steadily. Notably, HDF exhibited a significant gain between 2 and 3 features (from 93.87% to 97.29%), and the overall average accuracy rose from 96.24% (f = 2) to 97.44% (f = 6). Although minor fluctuations were observed for individual failure types beyond 3 features, the highest average performance was obtained when all six original features were retained. Therefore, this configuration was selected for the final model, providing the best observed accuracy without unnecessary feature exclusion.

Table 16. Accuracy results (%) for each failure type under different numbers of features (f).

Failure	f = 2	f = 3	f = 4	f = 5	f = 6
TWF	93.46	93.69	93.83	93.80	93.94
HDF	93.87	97.29	98.02	98.11	98.12
PWF	98.86	98.95	98.93	98.90	98.87
OSF	98.78	98.76	98.75	98.79	98.82
Average	96.24	97.17	97.38	97.40	97.44

5.5. Computational Cost Analysis

Computational cost analysis is an essential component in evaluating new machine learning methods, particularly regarding training time for real-time or streaming applications. Therefore, in addition to predictive performance, we assess the computational efficiency of the proposed BHTF method based on training time. Each experiment was conducted using 10-fold cross-validation, and the training time was measured for every fold and for each failure type label. The unit of measurement is seconds. The results of the per-fold training times are stated in Table 17, along with the averaged values. These results indicate that the proposed BHTF method achieves high predictive performance while maintaining minimal training costs, with average training times below 0.21 s across all labels.

This analysis confirms that BHTF not only provides superior predictive accuracy across all failure types but also requires minimal computational resources, reinforcing its suitability for real-time and streaming applications.
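Per-fold training times of the kind reported in Table 17 can be collected with a simple wall-clock timer around the training call. The sketch below is only an illustration of that measurement pattern: `dummy_train` is a hypothetical stand-in for fitting one label's forest on a training fold, and the folds are synthetic lists rather than the AI4I 2020 data.

```python
import time

def time_training(train_fn, folds):
    """Return per-fold wall-clock training times (seconds) and their mean."""
    times = []
    for fold in folds:
        start = time.perf_counter()
        train_fn(fold)                      # only the training call is timed
        times.append(time.perf_counter() - start)
    return times, sum(times) / len(times)

def dummy_train(fold):
    # Hypothetical placeholder workload standing in for model fitting.
    return sum(x * x for x in fold)

folds = [list(range(1000)) for _ in range(10)]   # 10 synthetic "folds"
times, mean_time = time_training(dummy_train, folds)
print(len(times), mean_time)
```

Using `time.perf_counter` rather than `time.time` avoids artifacts from system clock adjustments; repeating the measurement per label would give one column of timings per failure type.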

5.6. Hoeffding Tree Structure Analysis

To gain further insight into the internal decision-making behavior of the proposed BHTF framework, this subsection presents visual analyses of representative Hoeffding Tree structures. Figure 9, Figure 10, Figure 11 and Figure 12 illustrate sample Hoeffding Trees extracted from the constructed forests for the TWF, HDF, PWF, and OSF labels, respectively. These trees were selected from the 400 Hoeffding Trees built in total (10 folds × 10 trees per fold × 4 labels) and serve as interpretable examples for examining how decision paths are formed under the influence of the hybrid balancing strategy and multi-label decomposition. By analyzing these structures, we aim to reveal how different failure types were distinguished based on feature splits and to assess the interpretability of the proposed model in practical predictive maintenance scenarios.

Figure 9 presents a sample Hoeffding Tree for the TWF label, demonstrating the sequence of decisions the BHTF model uses to discriminate between normal and faulty instances. The root node splits on tool wear, which is thus identified as the most informative feature. When tool wear is below or equal to the root threshold (≤201.636), the model immediately classifies the instance as non-failure (0), supported by a large number of observed instances (8273.677), underlining the importance of low tool wear as a strong indicator of healthy operation. When tool wear exceeds the root threshold (>201.636), the tree evaluates the torque feature, where lower values (≤11.196 Nm) lead to a non-failure outcome based on 25.000 instances. Higher torque prompts deeper inspection through rotational speed, followed by process temperature. Only when both of these features exceed their respective thresholds (>1289.273 rpm and >307.811 K) does the tree output a failure (1) prediction—at a leaf supporting 1384.000 instances. This tree illustrates how TWF failures are detected through a hierarchical combination of features by prioritizing tool wear and torque as strong early indicators, while rotational speed and process temperature act as secondary confirmation to accurately flag true fault detections.
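The decision path described above can be written as a short chain of threshold rules. The sketch below transcribes only what the text states about this single sample tree; the function and parameter names are illustrative, and the rule set represents one tree of the forest, not the full BHTF ensemble prediction.

```python
def predict_twf(tool_wear_min, torque_nm, rot_speed_rpm, process_temp_k):
    """Return 1 (failure) or 0 (non-failure) following the Figure 9 description."""
    if tool_wear_min <= 201.636:
        return 0                      # low tool wear: non-failure leaf
    if torque_nm <= 11.196:
        return 0                      # high wear but very low torque
    if rot_speed_rpm > 1289.273 and process_temp_k > 307.811:
        return 1                      # both secondary thresholds exceeded
    return 0

# Illustrative inputs (not instances from the dataset).
print(predict_twf(150.0, 40.0, 1500.0, 310.0))  # 0: tool wear below root threshold
print(predict_twf(210.0, 40.0, 1500.0, 310.0))  # 1: all failure conditions met
```

Written this way, the single failure leaf corresponds to one conjunction of four conditions, which makes the rule easy to audit against domain knowledge.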

Figure 10 presents the Hoeffding Tree structure learned for the HDF label, showing how BHTF separates healthy and faulty instances based on thermal and operational characteristics. The root split occurs on air temperature, signaling its importance in identifying overheating-related failures. When the air temperature is ≤301.845 K, the model checks a secondary air temperature threshold (301.245 K): instances at or below this threshold are classified as non-failure (0) regardless of torque, whereas for air temperatures between 301.245 K and 301.845 K the model considers rotational speed, assigning failure (1) to low-speed cases (≤1376 rpm) and non-failure otherwise. When air temperature exceeds 301.845 K, rotational speed becomes the primary discriminator, with lower speeds (≤1399.736 rpm) prompting further inspection of process temperature and tool wear. In this case, process temperatures ≤ 312.095 K lead to failure (1) predictions, with both low and high tool-wear values confirming the faulty class, while higher process temperatures revert to non-failure. At high rotational speeds (>1399.736 rpm), the model again predicts non-failure. This tree shows that air temperature and rotational speed are key drivers in detecting HDF faults, with process temperature and tool wear used to pinpoint critical failure scenarios under elevated thermal and mechanical stresses.

Figure 11 shows the Hoeffding Tree constructed for the PWF label, capturing how the BHTF model distinguishes between normal and failure states based mainly on operational torque and speed. The root split occurs on rotational speed, emphasizing its significance for detecting PWF-related faults. When rotational speed is ≤1966.383 rpm, the tree focuses on torque. Low-to-moderate torque levels (≤55.182 Nm and ≤58.711 Nm) lead to non-failure (0) predictions. However, when torque exceeds 58.711 Nm, the tree switches to predicting a failure (1), with distinct leaves for moderate-high torque ranges (58.711–63.513 Nm). Additionally, if the rotational speed itself surpasses 1966.383 rpm, the model directly classifies the instance as a failure (1), confirming that very high speeds are a strong indicator of the PWF failure mode. This structure also underscores that high torque is a strong early indicator of PWF, whereas lower operating regimes are reliably associated with healthy system behavior.

Figure 12 presents the Hoeffding Tree structure created for the OSF label, showing that the BHTF framework uses a combination of wear, torque, speed, and machining type to determine the operational state. The root node splits on tool wear, identifying it as the primary factor: wear values ≤ 183.206 min lead to a non-failure (0) state, with further refinement based on torque. Within this branch, lower torque values (≤61.186 Nm) are split by rotational speed, but both paths (<1403.727 rpm or >1403.727 rpm) lead to non-failure outcomes. Higher torque values (>61.186 Nm) also produce a non-failure prediction, suggesting that low tool wear is a strong signal of healthy operation, even under high torque. When tool wear exceeds 183.206 min, torque again acts as a decision attribute. Lower torque values (≤49.191 Nm) maintain a non-failure prediction, while higher torque values prompt evaluation of the type of machining operation. Here, instances with type L are assigned to the failure class (a leaf supported by 2890 instances), whereas types M and H remain in the non-failure class. This tree indicates that high tool wear combined with high torque during type L operations is a strong signature of OSF failure, whereas low wear or alternative operation types generally reflect healthy behavior.
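Because every path under the low-wear branch ends in the same class, the OSF tree described above collapses to three effective conditions. The sketch below shows this simplified form as an illustration; it assumes the splits are exactly as stated in the text, and the function and parameter names (including `op_type` for the L/M/H type attribute) are hypothetical.

```python
def predict_osf(tool_wear_min, torque_nm, op_type):
    """Simplified OSF rule: 1 (failure) or 0 (non-failure), per the Figure 12 text."""
    if tool_wear_min <= 183.206:
        return 0      # torque and speed sub-splits all end in non-failure
    if torque_nm <= 49.191:
        return 0      # high wear but lower torque
    return 1 if op_type == "L" else 0   # only type L is flagged as failure

# Illustrative inputs (not instances from the dataset).
print(predict_osf(200.0, 60.0, "L"))  # 1
print(predict_osf(200.0, 60.0, "M"))  # 0
print(predict_osf(100.0, 70.0, "L"))  # 0
```

Note that rotational speed drops out of the simplified rule entirely: the tree splits on it, but the split does not change the predicted class.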

6. Discussion

To rigorously evaluate the performance of the proposed BHTF method, it was compared against 58 state-of-the-art predictive maintenance approaches drawn from 23 recent studies [48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70], including traditional classifiers, ensemble-based learners, deep neural network architectures, hybrid techniques, and augmentation-based strategies for fault detection in manufacturing systems. The results of this comparison, conducted over the same AI4I 2020 dataset, are summarized in Table 18, which reports for each method the reference, year, method type, training protocol, dataset split, hyperparameter settings, and evaluation metrics.

The BHTF method achieved the highest overall accuracy (97.44%), precision (0.9939), recall (0.9744), and F-measure (0.9839), outperforming all other models tested on the AI4I 2020 dataset. These results underline its robust diagnostic capability in handling complex and imbalanced industrial failure patterns. Compared to the average performance of all state-of-the-art methods (accuracy: 88.94%), BHTF demonstrated a significant improvement of 11% in accuracy. These gains reflect the method’s ability to deliver consistent and reliable predictions across both majority and minority failure categories.

When benchmarked against ensemble models such as EFNC-Exp (97.30%) [61], RF (96.81%) [64], and CatBoost-based techniques [62], BHTF not only slightly improved accuracy but also delivered notably higher recall and F-measure. This performance advantage is largely attributed to BHTF’s integrated balancing strategy, which combines SMOTE-based oversampling with PDU undersampling, effectively mitigating the skew introduced by class imbalance. For instance, while RF [64] achieved high precision (0.9740), it fell short on recall (0.7639), indicating a tendency to miss minority class failures—a problem that BHTF addresses successfully.

In comparison with DNN-based approaches—including CAST, SE-ResNet 18, GE-ResNet 18, and SE-SCNet 18 [58]—as well as standard neural networks [64] and DNNs in [52], BHTF outperformed all on every metric. Although these deep models produced moderately good F-measures (e.g., SE-SCNet 18: 0.8600), they struggled with data imbalance and generalization to complex failure patterns. Similarly, TTML-based hybrid models [63], despite their conceptual innovation, achieved limited accuracy (65–78%), showing a lack of robustness in multi-failure classification without dedicated balancing.

Beyond classical ensembles and deep learning, several advanced frameworks were also considered. For example, metaheuristic-optimized ELM with PLSCO [49] and DNN models coupled with simulated annealing (SA) [57] showed competitive results (up to 97.09% accuracy), while Byzantine fault-tolerant federated learning [50] provided robustness under adversarial conditions, albeit with lower accuracy (about 89%). Likewise, LSTM models with SMOTE resampling [55] and interpretable RF + XAI approaches integrating SHAP and LIME [56] emphasized temporal dynamics and explainability, respectively. However, despite their novelty, these methods were still surpassed by the accuracy of BHTF, which also offered stronger balance across precision, recall, and F-measure.

Data augmentation methods such as SMOTENC + ctGAN + CatBoost [62] improved recall (0.9068) but failed to match BHTF’s overall diagnostic strength. In contrast, BHTF’s hybrid balancing mechanism avoids overfitting and synthetic noise—challenges commonly seen in GAN-based augmentation—while achieving both high recall and precision.

Traditional models such as SVM, KNN, decision trees, and logistic regression [52,53,54,66,67,69] generally underperformed. While decision trees reached a reasonable F-measure of 0.7766 [66], other models such as KNN and NN yielded very low recall (0.2970 and 0.2178, respectively), showing weak detection of rare failures. Likewise, Bayesian logistic regression (BLR) [65], despite high precision (0.9950), had extremely low recall (0.2830), indicating an over-reliance on majority-class prediction. BHTF, in contrast, maintained a balanced performance across all metrics, achieving both high precision and recall.

Additional hybrid and specialized frameworks, such as DFPAIS and SDFIS [60], hyperplane-based methods [68], and RUSBoost trees [69], showed limited generalization. For example, RUSBoost trees attained a recall of 0.9085 but suffered from very low precision (0.3071), resulting in a weak F-measure. Although data-blind machine learning [70] achieved a competitive accuracy of 97.30%, it lacked reported precision and recall values, making it difficult to verify its balanced performance under class-imbalanced conditions.

In conclusion, BHTF not only outperformed individual models but also exceeded the average performance of the entire group of state-of-the-art methods by a substantial margin. The improvement—11% in accuracy—strongly validates BHTF’s reliability in real-world predictive maintenance scenarios characterized by data imbalance and multiple failure modes. The strength of BHTF lies in its simultaneous integration of three complementary paradigms of learning—multi-label learning for handling concurrent failure modes, incremental learning for adaptive knowledge acquisition in streaming contexts, and ensemble learning for enhanced generalization—augmented by hybrid oversampling and undersampling techniques within a single framework. While prior studies address these aspects individually or partially, none of the compared state-of-the-art methods incorporate all three paradigms together. This comprehensive design ensures not only superior predictive accuracy but also scalability and adaptability to evolving industrial conditions, thereby positioning BHTF as a distinctive and practical solution for complex predictive maintenance environments.

To statistically validate the superior performance of our proposed BHTF method over these state-of-the-art approaches listed in Table 18, the Wilcoxon signed-rank test [71] was employed. This non-parametric test is particularly suitable for comparing paired data and does not assume normality, instead relying on the symmetry of the distribution of differences. The proposed method achieved an average improvement of 11% over the competing approaches. To assess the significance of this improvement, the null hypothesis (H₀) assumes that there is no significant difference between the median performance of BHTF and the competing methods. The test yielded a p-value of 2.39 × 10⁻⁹, which is substantially lower than the conventional significance threshold of 0.05. This result provides strong evidence against the null hypothesis and confirms that the performance improvements achieved by the BHTF method are statistically significant. Therefore, the proposed approach not only outperforms individual methods in terms of accuracy, precision, recall, and F-measure but also demonstrates consistent and statistically significant superiority across the evaluated models.

The mathematical expression for the Wilcoxon test is shown in Equation (22):

$$W = \sum_{i=1}^{n} {R_{i}}^{+} \tag{22}$$

where

$n$ : the total number of non-zero matched differences used in the analysis;
${R_{i}}^{+}$ : the rank given to each positive difference between matched pairs to represent the contribution of that pair to the overall test statistic;
$W$ : the Wilcoxon signed-rank statistic, determined by summing the ranks of all positive differences observed in the paired data.
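To make the statistic in Equation (22) concrete, the sketch below computes the sum of positive ranks for paired samples in plain Python, dropping zero differences and assigning average ranks to ties. It covers only the test statistic, not the p-value, and the example numbers are illustrative rather than accuracies from Table 18; the function name is an assumption.

```python
def wilcoxon_w_plus(x, y):
    """Sum of the ranks of positive paired differences x_i - y_i."""
    diffs = [a - b for a, b in zip(x, y) if a != b]      # drop zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1                        # mean of 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return sum(r for r, d in zip(ranks, diffs) if d > 0)

# Illustrative paired values (not results from the study).
w_mixed = wilcoxon_w_plus([5, 7, 3, 9], [4, 9, 3, 6])
w_all_positive = wilcoxon_w_plus([10, 10, 10, 10], [1, 2, 3, 4])
print(w_mixed, w_all_positive)  # 4.0 10.0
```

When every difference is positive, the statistic reaches its maximum of n(n + 1)/2, which is the situation that arises when one method beats every competitor it is paired with.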

7. External Validation Across Diverse Datasets

In this study, the AI4I 2020 dataset served as the primary benchmark for developing and evaluating our proposed BHTF method extensively. To strengthen the validity and generalizability of our approach, we conducted an external validation, i.e., evaluating the model on entirely different datasets that were not used during model development. External validation is a critical step in predictive maintenance research as it assesses whether the method maintains consistent performance across different industrial contexts, data distributions, and failure characteristics. To this end, we selected 4 additional multi-label predictive maintenance datasets, each containing 16 distinct failure types, and evaluated BHTF across them. Below, we briefly introduce the datasets, describe the hyperparameter tuning carried out for each to ensure fair adaptation, and present the main results.

The selected datasets are taken from [72] and relate to predictive maintenance in steel manufacturing. These datasets were generated to simulate the tandem cold rolling (TCM) process, which is crucial in steel production. For the purposes of multi-label PdM evaluation, we selected 4 datasets that include all 16 anomaly labels, corresponding to the diverse failure types across the rolling stands. The summary of the selected datasets is presented in Table 19, in which the observations column indicates the total number of data points, while anomalies shows the number of labeled failure events. Features refer to the total number of measured parameters for each observation, and anomaly types represent the number of distinct failure labels included in the dataset. The products column specifies the number of steel product types processed, and data drift indicates whether the dataset includes shifts or changes in the underlying data distribution over time.

Each dataset is generated as a chronological data stream, with observations ordered by increasing work roll mileage. A total of 51 features are recorded across the 5 rolling stands, including entry and exit thickness, width, yield strength, work roll diameter and mileage, thickness reduction, interstand tension, roll speed, rolling force, torque, stand gap, and motor power. Anomalies were introduced based on four types of failures: reduction scheme, electric motor, bearing, and work roll friction. Apart from the reduction anomaly, all other anomalies are stand specific, resulting in 16 distinct anomaly labels. Table 20 summarizes these features and anomaly labels for each dataset.

For the external validation on the four selected TCM datasets, the hyperparameters of the proposed BHTF method were adjusted to account for the increased dataset size and higher class imbalance. While the AI4I 2020 dataset contained 10,000 instances, each of the TCM datasets has approximately 20,000 instances. To address the more pronounced imbalance, the SMOTE oversampling ratio was doubled from R = 4000% (used for AI4I 2020) to R = 8000%, with k = 5 nearest neighbors for synthetic sample generation. Similarly, the PDU parameter u was increased from 7 to 14, meaning that for each minority instance, up to 14 nearest majority neighbors were identified using a LinearNNSearch with k = 1. Feature selection was also optimized, selecting numToSelect = 3 attributes, while the number of labels was set to 16 to match the multi-label structure of the TCM datasets. Other hyperparameters, such as the number of folds (folds = 10) and the number of iterations (setNumIterations = 10), were maintained from the original settings. All parameter values were determined through iterative tuning and evaluation to achieve the best performance across the new datasets.
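The two parameter sets can be placed side by side in a small configuration object. The sketch below is only a convenient summary of the values stated in the text; the class and field names are hypothetical, and the SMOTE neighbor count for AI4I 2020 is left unset because it is not stated in this section.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class BHTFConfig:
    smote_ratio_percent: int      # R
    smote_k: Optional[int]        # nearest neighbors used by SMOTE
    pdu_neighbors: int            # u
    num_features: int             # numToSelect
    num_labels: int
    folds: int = 10               # unchanged between the two settings
    num_iterations: int = 10      # setNumIterations, unchanged

# smote_k for AI4I 2020 is not given in this section, so it is left as None.
AI4I_2020 = BHTFConfig(smote_ratio_percent=4000, smote_k=None,
                       pdu_neighbors=7, num_features=6, num_labels=4)
TCM = BHTFConfig(smote_ratio_percent=8000, smote_k=5,
                 pdu_neighbors=14, num_features=3, num_labels=16)

print(AI4I_2020)
print(TCM)
```

Laid out this way, the adaptation amounts to doubling R and u, halving the number of selected features, and quadrupling the number of labels, with the cross-validation setup unchanged.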

Table 21 reports the accuracy of the proposed BHTF method on the four TCM datasets for 16 failure types. The method achieves high accuracy overall, with average values of 98.47%, 97.80%, 98.34%, and 96.40% for tcm5_dataset_3, tcm5_dataset_4, tcm5_dataset_5, and tcm5_dataset_6, respectively. The results highlight the robustness of BHTF in handling large, imbalanced, and multi-label datasets.

8. Conclusions and Future Works

In this study, a novel ensemble-based approach, Balanced Hoeffding Tree Forest (BHTF), was proposed to address the challenges of predictive maintenance in complex industrial settings. By combining artificial intelligence with industrial IoT data, BHTF aims to forecast equipment failures before they occur, thereby reducing unplanned downtime, lowering maintenance costs, and enhancing operational safety. Unlike traditional models that often struggle with data imbalance and limited failure representation, BHTF introduces a tailored solution that integrates advanced techniques across both modeling and preprocessing stages. The core innovation of BHTF lies in its multi-label fault detection framework, which employs binary relevance to model each failure type independently while preserving co-occurrence relationships. This design enables a more realistic and actionable diagnosis of equipment conditions in manufacturing environments. To further boost the model, a hybrid class balancing strategy was developed, combining SMOTE oversampling with the PDU undersampling technique. This dual-phase preprocessing pipeline addresses the intrinsic imbalance in real-world maintenance datasets, where failure events are significantly rarer than normal operation.

The proposed BHTF was extensively evaluated on the AI4I 2020 predictive maintenance dataset, which includes four critical industrial failure categories, encompassing tool wear failure (TWF), heat dissipation failure (HDF), power failure (PWF), and overstrain failure (OSF). The model achieved an average accuracy of 97.44%, with corresponding precision (0.9939), recall (0.9744), and F-measure (0.9839) scores, indicating strong diagnostic capabilities across all classes. Furthermore, comparative analysis against a diverse set of state-of-the-art methods—ranging from traditional machine learning to deep learning and hybrid ensemble models—verified the superiority of BHTF. Remarkably, it achieved an improvement of 11% in accuracy, significantly outperforming existing solutions. Notably, the strength of BHTF stems from its simultaneous integration of multi-label learning, incremental learning, and ensemble learning—three innovative paradigms that have not been collectively addressed by the compared methods. While prior works typically focus on one or two of these dimensions in isolation, BHTF unifies them within a single framework. These results underscore the real-world applicability of BHTF for predictive maintenance in dynamic industrial environments, where accurate and timely failure detection is mission critical.

Despite the comprehensive analysis and promising results, this study has limitations that point to directions for future work. One such direction is the creation of a software platform or intelligent service based on BHTF that operates on real-time industrial data streams. Such a system would not only monitor equipment health continuously but also automatically integrate newly collected transactional data into the historical dataset, enabling incremental learning and regular model updates. This continuous adaptation would improve diagnostic accuracy and help the system remain effective in dynamic manufacturing environments where fault detection is critical.
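To make the link between streaming data and incremental model updates concrete, the following minimal sketch computes the Hoeffding bound, $\epsilon = \sqrt{R^{2} \ln(1/\delta) / (2n)}$, which Hoeffding Trees use to decide when enough instances have arrived to commit to a split. The function names, the value of $\delta$, and the example gain values are illustrative assumptions, not settings reported in this study.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon such that, with probability 1 - delta, the true mean of a
    variable with the given range lies within epsilon of the mean of n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Split once the gap between the two best attributes exceeds the bound."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Hypothetical example: binary target, so information gain has range 1 bit.
early = should_split(0.30, 0.22, value_range=1.0, delta=1e-7, n=200)
later = should_split(0.30, 0.22, value_range=1.0, delta=1e-7, n=5000)
print(early, later)
```

With the same observed gain gap, the decision changes only because more instances have been seen, which is why a leaf can keep absorbing stream data until a split becomes statistically justified.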

Additionally, future implementations could focus on embedding the BHTF model into industrial IoT platforms such as monitoring dashboards, edge-computing devices, or mobile applications. These platforms could provide real-time alerts to operators, identifying which specific failure types are predicted to occur and enabling preemptive actions to alleviate potential damage. The system’s predictions could be visualized in a user-friendly format, improving interpretability and enabling even non-expert users to make informed maintenance decisions. Developing mechanisms to prioritize machines based on failure likelihood would help organizations allocate resources more efficiently and implement condition-based maintenance strategies powered by artificial intelligence.
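One possible way to prioritize machines is sketched below. It assumes that each per-label classifier outputs a failure probability and that the labels can be treated as independent, which mirrors how binary relevance handles them; the machine identifiers, probabilities, and function names are hypothetical.

```python
from math import prod
from typing import Dict, List, Tuple


def any_failure_probability(label_probs: Dict[str, float]) -> float:
    """Probability that at least one failure mode occurs, assuming independence."""
    return 1.0 - prod(1.0 - p for p in label_probs.values())


def rank_machines(predictions: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Sort machines from most to least likely to fail."""
    scores = {m: any_failure_probability(p) for m, p in predictions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-label probabilities for three machines.
predictions = {
    "M1": {"TWF": 0.05, "HDF": 0.10, "PWF": 0.02, "OSF": 0.01},
    "M2": {"TWF": 0.60, "HDF": 0.05, "PWF": 0.30, "OSF": 0.02},
    "M3": {"TWF": 0.01, "HDF": 0.01, "PWF": 0.01, "OSF": 0.01},
}
ranking = rank_machines(predictions)
for machine, score in ranking:
    print(machine, round(score, 4))
```

A deployed system might instead weight failure modes by repair cost or downtime; the independence assumption is used here only to keep the sketch short.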

Furthermore, an emerging direction for future work involves adapting the BHTF framework to TinyML environments to enable lightweight and energy-efficient deployment on resource-constrained edge devices. Integrating BHTF with TinyML would allow predictive maintenance models to operate directly on microcontrollers or embedded sensors, reducing latency, enhancing privacy, and minimizing reliance on centralized infrastructure. This would be particularly advantageous in remote or bandwidth-limited manufacturing settings where real-time decision making is essential. Exploring model compression, pruning, or quantization techniques to adapt BHTF for low-power hardware could considerably broaden its applicability and contribute to the development of intelligent and autonomous maintenance systems. Collectively, these future extensions would move the BHTF approach closer to full deployment within factories and support intelligent and human-centric maintenance ecosystems.
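As an illustration of the kind of quantization mentioned above, the sketch below maps a continuous feature onto 8-bit integer levels so that tree split thresholds could be stored and compared as single bytes. The air temperature range is taken from Table 7; the 8-bit width and the helper names are assumptions, and this is not an evaluated BHTF deployment.

```python
from typing import Callable, Tuple


def make_quantizer(lo: float, hi: float, levels: int = 256
                   ) -> Tuple[Callable[[float], int], Callable[[int], float], float]:
    """Build affine quantize/dequantize functions for the interval [lo, hi]."""
    step = (hi - lo) / (levels - 1)

    def quantize(x: float) -> int:
        q = round((x - lo) / step)
        return max(0, min(levels - 1, q))  # clamp out-of-range readings

    def dequantize(q: int) -> float:
        return lo + q * step

    return quantize, dequantize, step


# Air temperature range (K) from Table 7.
quantize, dequantize, step = make_quantizer(295.3, 304.5)
threshold_q = quantize(300.0)          # a hypothetical split threshold
print(threshold_q, dequantize(threshold_q), step)
```

The reconstruction error is at most half a quantization step, so whether 8 bits suffice depends on how close the learned thresholds are to the decision-relevant sensor values.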

Author Contributions

Conceptualization, B.G.; methodology, B.G.; software, B.G.; validation, B.G.; formal analysis, B.G.; investigation, B.G., R.A.K., D.B. and R.Y.; resources, B.G., R.A.K., D.B. and R.Y.; data curation, B.G., R.A.K., D.B. and R.Y.; writing—original draft preparation, B.G.; writing—review and editing, R.A.K., D.B. and R.Y.; visualization, B.G.; supervision, R.A.K. and D.B.; project administration, R.A.K., D.B. and R.Y.; funding acquisition, R.A.K. and R.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The “AI4I 2020 Predictive Maintenance” dataset [45] is publicly available in the UCI Machine Learning Repository (University of California, Irvine, CA, USA) (https://archive.ics.uci.edu/ml/datasets/AI4I+2020+Predictive+Maintenance+Dataset, accessed on 22 May 2025) for predictive modeling tasks. Furthermore, the “TCM: Benchmark Datasets for Predictive Maintenance in Steel Manufacturing” [72] datasets, including tcm5_dataset_3, tcm5_dataset_4, tcm5_dataset_5, and tcm5_dataset_6, are publicly available in the Zenodo repository (CERN Research Institute, Geneva, Switzerland) (https://zenodo.org/records/11469702, accessed on 20 August 2025), a general-purpose platform for sharing research outputs.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

AdaBoost	Adaptive boosting
ADASYN	Adaptive synthetic sampling
AI	Artificial Intelligence
ANN	Artificial neural network
AUC-ROC	Area under the receiver operating characteristic curve
BFT	Byzantine fault tolerant
BGRU	Bidirectional gated recurrent unit
BHTF	Balanced Hoeffding Tree Forest
BLR	Binary logistic regression
BR	Binary relevance
CART	Classification and regression trees
CAST	Channel-spatial attention-based temporal
CatBoost	Categorical boosting
CC	Classifier chain
CNN	Convolutional neural network
ctGAN	Conditional tabular generative adversarial network
DFPAIS	Data filling approach based on probability analysis in incomplete soft sets
DNN	Deep neural network
DT	Decision tree
EFNC-Exp	Evolving fuzzy neural classifier with expert rules
EL	Ensemble learning
ELM	Extreme learning machine
FD	Fault detection
FPR	False-positive rate
GB	Gradient boosting
HDF	Heat dissipation failure
IL	Incremental learning
KNN	K-nearest neighbors
LDA	Linear discriminant analysis
LightGBM	Light gradient boosting machine
LIME	Local interpretable model-agnostic explanations
LOF	Local outlier factor
LP	Label powerset
LR	Logistic regression
LSTM	Long short-term memory
MAE	Mean absolute error
MCC	Matthews correlation coefficient
ML	Machine learning
MLL	Multi-label learning
MLP	Multi-layer perceptron
MRMR	Minimum redundancy maximum relevance
MSE	Mean squared error
NB	Naive Bayes
NN	Neural network
OSF	Overstrain failure
PART	Partial decision tree
PCA	Principal component analysis
PdM	Predictive maintenance
PDU	Proximity-driven undersampling
PLSCO	Polar lights salp cooperative optimizer
PWF	Power failure
QDA	Quadratic discriminant analysis
RAKEL D	Random k-labelsets D
RAKEL O	Random k-labelsets O
ResNet	Residual neural network
RF	Random forest
RMSE	Root mean squared error
RNF	Random failures
RUL	Remaining useful life
RUS	Random under sampling
RUSBoost	Random undersampling boosting
SA	Simulated annealing
SDFIS	Simplified approach for data filling in incomplete soft sets
Self-ONN	Self-organized operational neural network
SHAP	Shapley additive explanations
SMOTE	Synthetic minority over-sampling technique
SMOTENC	Synthetic minority over-sampling technique for nominal and continuous
SODA	Self-organized direction-aware data partitioning
SVM	Support vector machine
TPR	True-positive rate
TTML	Tensor trains-based machine learning
TWF	Tool wear failure
t-SNE	t-distributed stochastic neighbor embedding
XAI	Explainable artificial intelligence
XGBoost	Extreme gradient boosting

References

1. Tsallis, C.; Papageorgas, P.; Piromalis, D.; Munteanu, R.A. Application-Wise Review of Machine Learning-Based Predictive Maintenance: Trends, Challenges, and Future Directions. Appl. Sci. 2025, 15, 4898.
2. Khattach, O.; Moussaoui, O.; Hassine, M. End-to-End Architecture for Real-Time IoT Analytics and Predictive Maintenance Using Stream Processing and ML Pipelines. Sensors 2025, 25, 2945.
3. Ucar, A.; Karakose, M.; Kırımça, N. Artificial Intelligence for Predictive Maintenance Applications: Key Components, Trustworthiness, and Future Trends. Appl. Sci. 2024, 14, 898.
4. Esteban, A.; Zafra, A.; Ventura, S. Data Mining in Predictive Maintenance Systems: A Taxonomy and Systematic Review. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2022, 12, e1471.
5. Altalhan, M.; Algarni, A.; Alouane, M.T.-H. Imbalanced Data Problem in Machine Learning: A Review. IEEE Access 2025, 13, 13686–13699.
6. Sajid, N.A.; Rahman, A.; Ahmad, M.; Musleh, D.; Basheer Ahmed, M.I.; Alassaf, R.; Chabani, S.; Ahmed, M.S.; Salam, A.A.; AlKhulaifi, D. Single vs. Multi-Label: The Issues, Challenges and Insights of Contemporary Classification Schemes. Appl. Sci. 2023, 13, 6804.
7. Hulten, G.; Spencer, L.; Domingos, P. Mining Time-Changing Data Streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 26–29 August 2001; pp. 97–106.
8. Lins, R.G.; Nascimento de Freitas, T.; Gaspar, R. Methodology for Commercial Vehicle Mechanical Systems Maintenance: Data-Driven and Deep-Learning-Based Prediction. IEEE Access 2025, 13, 33799–33812.
9. Lin, K.Y.; Hong, Y.H.; Li, M.H.; Shi, Y.; Matsuno, K. Predictive maintenance in industrial systems: An XGBoost-based approach for failure time estimation and resource optimization. J. Ind. Prod. Eng. 2025, 1–24.
10. Aydın, C.; Evrentuğ, B. Evaluation of Predictive Maintenance Efficiency with the Comparison of Machine Learning Models in Machining Production Process in Brake Industry. PeerJ Comput. Sci. 2025, 11, e2999.
11. Yıldırım, Ş.; Yücekaya, A.D.; Hekimoğlu, M.; Ucal, M.; Aydin, M.N.; Kalafat, İ. AI-Driven Predictive Maintenance for Workforce and Service Optimization in the Automotive Sector. Appl. Sci. 2025, 15, 6282.
12. Gunckel, P.; Lobos, G.; Rodríguez, F.; Bustos, R.; Godoy, D. Methodology proposal for the development of failure prediction models applied to conveyor belts of mining material using machine learning. Reliab. Eng. Syst. Saf. 2025, 256, 110709.
13. Aminzadeh, A.; Sattarpanah Karganroudi, S.; Majidi, S.; Dabompre, C.; Azaiez, K.; Mitride, C.; Sénéchal, E. A Machine Learning Implementation to Predictive Maintenance and Monitoring of Industrial Compressors. Sensors 2025, 25, 1006.
14. Hu, H.; Xu, K.; Zhang, X.; Li, F.; Zhu, L.; Xu, R.; Li, D. Research on Predictive Maintenance Methods for Current Transformers with Iron Core Structures. Electronics 2025, 14, 625.
15. Wu, M.; Goh, K.W.; Chaw, K.H.; Koh, Y.S.; Dares, M.; Yeong, C.F.; Zhang, Y. An Intelligent Predictive Maintenance System Based on Random Forest for Addressing Industrial Conveyor Belt Challenges. Front. Mech. Eng. 2024, 10, 1383202.
16. Shah, S.S.; Daoliang, T.; Kumar, S.C.H. RUL forecasting for wind turbine predictive maintenance based on deep learning. Heliyon 2024, 10, e39268.
17. Yu, B.; Kim, Y.; Lee, T.; Cho, Y.; Park, J.; Lee, J.; Park, J. Study on Methods Using Multi-Label Learning for the Classification of Compound Faults in Auxiliary Equipment Pumps of Marine Engine Systems. Processes 2024, 12, 2161.
18. Qureshi, U.R.; Rashid, A.; Altini, N.; Bevilacqua, V.; La Scala, M. Radiometric Infrared Thermography of Solar Photovoltaic Systems: An Explainable Predictive Maintenance Approach for Remote Aerial Diagnostic Monitoring. Smart Cities 2024, 7, 1261–1288.
19. Maldonado-Correa, J.; Valdiviezo-Condolo, M.; Artigao, E.; Martín-Martínez, S.; Gómez-Lázaro, E. Classification of Highly Imbalanced Supervisory Control and Data Acquisition Data for Fault Detection of Wind Turbine Generators. Energies 2024, 17, 1590.
20. Khalil, A.F.; Rostam, S. Machine Learning-Based Predictive Maintenance for Fault Detection in Rotating Machinery: A Case Study. Eng. Technol. Appl. Sci. Res. 2024, 14, 13181–13189.
21. Hadi, R.H.; Hady, H.N.; Hasan, A.M.; Al-Jodah, A.; Humaidi, A.J. Improved Fault Classification for Predictive Maintenance in Industrial IoT Based on AutoML: A Case Study of Ball-Bearing Faults. Processes 2023, 11, 1507.
22. Fordal, J.M.; Schjølberg, P.; Helgetun, H.; Skjermo, T.Ø.; Wang, Y.; Wang, C. Application of Sensor Data Based Predictive Maintenance and Artificial Neural Networks to Enable Industry 4.0. Adv. Manuf. 2023, 11, 248–263.
23. Muideen, A.A.; Lee, C.K.M.; Chan, J.; Pang, B.; Alaka, H. Broad Embedded Logistic Regression Classifier for Prediction of Air Pressure Systems Failure. Mathematics 2023, 11, 1014.
24. Berghout, T.; Bentrcia, T.; Lim, W.H.; Benbouzid, M. A Neural Network Weights Initialization Approach for Diagnosing Real Aircraft Engine Inter-Shaft Bearing Faults. Machines 2023, 11, 1089.
25. Zhang, Y.; Liu, B.; Wang, C. A Fault Diagnosis Method for Electrical Equipment With Imbalanced SCADA Data Based on SMOTE Oversampling and Domain Adaptation. In Proceedings of the 2023 8th International Conference on Power and Renewable Energy (ICPRE), Shanghai, China, 22–25 September 2023; IEEE: New York, NY, USA, 2023; pp. 195–202.
26. Dangut, M.D.; Jennions, I.K.; King, S.; Skaf, Z. A Rare Failure Detection Model for Aircraft Predictive Maintenance Using a Deep Hybrid Learning Approach. Neural Comput. Appl. 2023, 35, 2991–3009.
27. Hung, Y.-H. Developing an Improved Ensemble Learning Approach for Predictive Maintenance in the Textile Manufacturing Process. Sensors 2022, 22, 9065.
28. Mihigo, I.N.; Zennaro, M.; Uwitonze, A.; Rwigema, J.; Rovai, M. On-Device IoT-Based Predictive Maintenance Analytics Model: Comparing TinyLSTM and TinyModel from Edge Impulse. Sensors 2022, 22, 5174.
29. Abdalla, R.; Samara, H.; Perozo, N.; Carvajal, C.P.; Jaeger, P. Machine learning approach for predictive maintenance of the electrical submersible pumps (ESPs). ACS Omega 2022, 7, 17641–17651.
30. Ouadah, A.; Zemmouchi-Ghomari, L.; Salhi, N. Selecting an appropriate supervised machine learning algorithm for predictive maintenance. Int. J. Adv. Manuf. Technol. 2022, 119, 4277–4301.
31. Chen, H.; Hsu, J.Y.; Hsieh, J.Y.; Hsu, H.Y.; Chang, C.H.; Lin, Y.J. Predictive maintenance of abnormal wind turbine events by using machine learning based on condition monitoring for anomaly detection. J. Mech. Sci. Technol. 2021, 35, 5323–5333.
32. Ince, T.; Malik, J.; Devecioglu, O.C.; Kiranyaz, S.; Avci, O.; Eren, L.; Gabbouj, M. Early Bearing Fault Diagnosis of Rotating Machinery by 1D Self-Organized Operational Neural Networks. arXiv 2021, arXiv:2109.14873.
33. Arora, A.; Tsigelny, I.F.; Kouznetsova, V.L. Laryngeal cancer diagnosis via miRNA-based decision tree model. Eur. Arch. Oto-Rhino-Laryngol. 2024, 281, 1391–1399.
34. Iqbal, N.; Kumar, P. Coronavirus Disease Predictor: An RNA-Seq Based Pipeline for Dimension Reduction and Prediction of COVID-19. J. Phys. Conf. Ser. 2021, 2089, 012025.
35. Mercaldo, F.; Nardone, V.; Santone, A. Diabetes Mellitus Affected Patients Classification and Diagnosis through Machine Learning Techniques. Procedia Comput. Sci. 2017, 112, 2519–2528.
36. Thaiparnit, S.; Kritsanasung, S.; Chumuang, N. A Classification for Patients with Heart Disease Based on Hoeffding Tree. In Proceedings of the International Joint Conference on Computer Science and Software Engineering, Chonburi, Thailand, 10–12 July 2019; pp. 352–357.
37. Pramkeaw, P.; Chumuang, N.; Ketcham, M.; Ganokratanaa, T.; Yimyam, W.; Kwansomkid, K.; Makararpong, D. A Machine Learning Framework for Diabetes Detection Using Hoeffding Tree. In Proceedings of the 2025 IEEE International Conference on Cybernetics and Innovations (ICCI), Chonburi, Thailand, 2–4 April 2025; pp. 1–6.
38. Mohammad, M.A.; Kolahkaj, M. Detecting Network Anomalies Using the Rain Optimization Algorithm and Hoeffding Tree-Based Autoencoder. In Proceedings of the 2024 10th International Conference on Web Research (ICWR), Tehran, Iran, 24–25 April 2024; pp. 137–141.
39. Rezki, D.; Mouss, L.-H.; Baaziz, A.; Bentrcia, T. Adaptive Prediction of Rate of Penetration While Oil-Well Drilling: A Hoeffding Tree Based Approach. Eng. Appl. Artif. Intell. 2025, 159, 111465.
40. Chen, W.; Zhang, S. GIS-based comparative study of Bayes network, Hoeffding tree and logistic model tree for landslide susceptibility modeling. Catena 2021, 203, 105344.
41. de Araújo Josephik, J.G.A.; Siqueira, Y.; Machado, K.G.; Terada, R.; dos Santos, A.L.; Nogueira, M.; Batista, D.M. Applying Hoeffding Tree Algorithms for Effective Stream Learning in IoT DDoS Detection. In Proceedings of the Latin-American Conference on Communications (LATINCOM), Panama City, Panama, 15–17 November 2023; pp. 1–6.
42. Soares, D.; Dewan, M.A.A.; Lin, O. A Hoeffding Decision Tree Based Approach for Soil Classification. In Proceedings of the 35th Canadian Conference on Artificial Intelligence, Toronto, Ontario, Canada, 30 May–3 June 2022; pp. 1–12.
43. Zhang, M.L.; Li, Y.K.; Liu, X.Y.; Geng, X. Binary relevance for multi-label learning: An overview. Front. Comput. Sci. 2018, 12, 191–202.
44. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority oversampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
45. AI4I 2020 Predictive Maintenance Dataset; UCI Machine Learning Repository: Irvine, CA, USA, 2020.
46. Witten, I.H.; Frank, E.; Hall, M.A.; Pal, C.J. Data Mining: Practical Machine Learning Tools and Techniques, 4th ed.; Morgan Kaufmann: Cambridge, MA, USA, 2016; pp. 1–664. Available online: https://ml.cms.waikato.ac.nz/weka (accessed on 22 May 2025).
47. Pearson, K. Notes on regression and inheritance in the case of two parents. In Proceedings of the Royal Society of London, London, UK, 20 June 1895; Volume 58, pp. 240–242.
48. Chandu, H.S. A Study of Machine Learning Techniques for Predicting Equipment Failures in Industrial Maintenance. In Proceedings of the 2025 IEEE International Conference on Emerging Technologies and Applications (MPSec ICETA), Gwalior, India, 21–23 February 2025; pp. 1–6.
49. Besha, A.R.M.A.; Ojekemi, O.S.; Oz, T.; Adegboye, O. PLSCO: An Optimization-Driven Approach for Enhancing Predictive Maintenance Accuracy in Intelligent Manufacturing. Processes 2025, 13, 2707.
50. Jahani, K.; Moshiri, B.; Hossein Khalaj, B. Secure PDM: A Novel Byzantine Fault Tolerant Federated Learning Framework Using a Robust PCA-Based Anomaly Detection Approach. Int. J. Ind. Electron. Control Optim. 2025.
51. Araujo, S.A.d.; Bomfim, S.L.; Boukouvalas, D.T.; Lourenço, S.R.; Ibusuki, U.; Oliveira Neto, G.C.d. Integration of Data Analytics and Data Mining for Machine Failure Mitigation and Decision Support in Metal–Mechanical Industry. Logistics 2025, 9, 109.
52. Prashanth, B.S.; Manoj Kumar, M.V.; Almuraqab, N.; Puneetha, B.H. Leveraging Safe and Secure AI for Predictive Maintenance of Mechanical Devices Using Incremental Learning and Drift Detection. Comput. Mater. Contin. 2025, 83, 4979–4998.
53. Özdemir, K.; Işık, G. Üretim Süreçlerinde Yapay Zekâ Destekli Hatalı Parça Tahminine Yönelik Bir Uygulama. In Proceedings of the 1. Bilsel Uluslararası Anı Bilimsel Araştırmalar Kongresi, Kars, Turkey, 28–29 June 2025; pp. 175–182.
54. Kumar, S.; Panchal, A.; Rawat, U.; Bhattacharya, P.; Kumar, K. Optimizing Grid Equipment Maintenance through Robust Machine Learning. In Proceedings of the 2025 International Conference on Next Generation Communication & Information Processing (INCIP), Bangalore, India, 23–24 January 2025; pp. 194–199.
55. Misaii, H.; Fouladirad, M.; Ponchet-Durupt, A.; Askari, B. Predictive Degradation Modelling Using Artificial Intelligence: Milling Machine Case Study. In Proceedings of the European Safety and Reliability Conference ESREL 2024, Cracow, Poland, 23–27 June 2024; Jagiellonian University: Cracow, Poland, 2024; pp. 193–200. Available online: https://hal.science/hal-04564828v1 (accessed on 22 May 2025).
56. Presciuttini, A.; Cantini, A.; Portioli-Staudacher, A. From Explanations to Actions: Leveraging SHAP, LIME, and Counterfactual Analysis for Operational Excellence in Maintenance Decisions. In Proceedings of the 4th International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME), Male, Maldives, 4–6 November 2024; pp. 1–6.
57. Hung, Y.-H.; Huang, M.-L.; Wang, W.-P.; Chen, G.-L. Hybrid Approach Combining Simulated Annealing and Deep Neural Network Models for Diagnosing and Predicting Potential Failures in Smart Manufacturing. Sens. Mater. 2024, 36, 49–65.
58. Liu, C.-L.; Su, H.-C. Temporal learning in predictive health management using channel-spatial attention-based deep neural networks. Adv. Eng. Inform. 2024, 62, 102604.
59. Ghadekar, P.; Manakshe, A.; Madhikar, S.; Patil, S.; Mukadam, M.; Gambhir, T. Predictive Maintenance for Industrial Equipment: Using XGBoost and Local Outlier Factor with Explainable AI for Analysis. In Proceedings of the 2024 14th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 18–19 January 2024; pp. 25–30.
60. Kong, Z.; Lu, Q.; Wang, L.; Guo, G. A Simplified Approach for Data Filling in Incomplete Soft Sets. Expert Syst. Appl. 2023, 213, 119248.
61. Souza, P.V.C.; Lughofer, E. EFNC-Exp: An evolving fuzzy neural classifier integrating expert rules and uncertainty. Fuzzy Sets Syst. 2023, 466, 108438.
62. Chen, C.-H.; Tsung, C.-K.; Yu, S.-S. Designing a Hybrid Equipment-Failure Diagnosis Mechanism under Mixed-Type Data with Limited Failure Samples. Appl. Sci. 2022, 12, 9286.
63. Vandereycken, B.; Voorhaar, R. TTML: Tensor trains for general supervised machine learning. arXiv 2022, arXiv:2203.04352.
64. Falla, B.F.; Ortega, D.A. Evaluación De Algoritmos De Inteligencia Artificial Aplicados Al Mantenimiento Predictivo. Ph.D. Thesis, Corporación Universitaria Autónoma de Nariño (AUNAR), Nariño, Colombia, 3 June 2022. Available online: http://repositorio.aunar.edu.co:8080/xmlui/handle/20.500.12276/1258 (accessed on 22 May 2025).
65. Iantovics, L.B.; Enachescu, C. Method for Data Quality Assessment of Synthetic Industrial Data. Sensors 2022, 22, 1608.
66. Vuttipittayamongkol, P.; Arreeras, T. Data-driven Industrial Machine Failure Detection in Imbalanced Environments. In Proceedings of the IEEE International Conference on Industrial Engineering and Engineering Management, Kuala Lumpur, Malaysia, 7–10 December 2022; pp. 1224–1227.
67. Mota, B.; Faria, P.; Ramos, C. Predictive Maintenance for Maintenance-Effective Manufacturing Using Machine Learning Approaches. In Lecture Notes in Networks and Systems, Proceedings of 17th International Conference on Soft Computing Models in Industrial and Environmental Applications, Salamanca, Spain, 5–7 September 2022; Springer International Publishing AG: Cham, Switzerland, 2022; Volume 531, pp. 13–22.
68. Diao, L.; Deng, M.; Gao, J. Clustering by Constructing Hyper-Planes. IEEE Access 2021, 9, 70167–70181.
69. Torcianti, A.; Matzka, S. Explainable Artificial Intelligence for Predictive Maintenance Applications using a Local Surrogate Model. In Proceedings of the 4th International Conference on Artificial Intelligence for Industries, Laguna Hills, CA, USA, 20–22 September 2021; pp. 86–88.
70. Pastorino, J.; Biswas, A.K. Data-Blind ML: Building privacy-aware machine learning models without direct data access. In Proceedings of the IEEE Fourth International Conference on Artificial Intelligence and Knowledge Engineering, Laguna Hills, CA, USA, 1–3 December 2021; pp. 95–98.
71. Zimmerman, D.W.; Zumbo, B.D. Relative Power of the Wilcoxon Test, the Friedman Test, and Repeated-Measures ANOVA on Ranks. J. Exp. Educ. 1993, 62, 75–86.
72. Jakubowski, J.; Bobek, S.; Nalepa, G.J. TCM: Benchmark Datasets for Predictive Maintenance in Steel Manufacturing; Zenodo: Geneva, Switzerland, 2024.

Figure 1. The general architecture of the proposed BHTF method for failure mode diagnosis.

Figure 2. Illustration of the proposed hybrid resampling strategy combining SMOTE oversampling and PDU undersampling.

Figure 3. Feature importance ranking for TWF.

Figure 4. Feature importance ranking for HDF.

Figure 5. Feature importance ranking for PWF.

Figure 6. Feature importance ranking for OSF.

Figure 7. Confusion matrices: (a) TWF failure type, (b) HDF failure type, (c) PWF failure type, and (d) OSF failure type.

Figure 8. t-SNE visualization of the dataset before (a) and after (b) applying the proposed hybrid resampling strategy.

Figure 9. Sample Hoeffding Tree structure for the TWF label generated by the BHTF framework.

Figure 10. Sample Hoeffding Tree structure for the HDF label generated by the BHTF framework.

Figure 11. Sample Hoeffding Tree structure for the PWF label generated by the BHTF framework.

Figure 12. Sample Hoeffding Tree structure for the OSF label generated by the BHTF framework.

Table 1. Overview of recent predictive maintenance studies.

Ref	Year	Method	Machine	Classification (C)	Regression (R)	Label (S: single, M: multi)	Sampling (O: over, U: under)	Purpose
[8]	2025	LSTM	Vehicle	√	-	S	O	Failure prediction
[9]	2025	XGBoost, RF, LSTM	Aircraft engine	√	√	S	-	RUL prediction
[10]	2025	DT, NB, KNN, SVM, AdaBoost, RF, CatBoost, XGBoost, LightGBM, MLP	Braking component	√	-	S	U	Failure classification
[11]	2025	DT, RF, LightGBM, XGBoost	Vehicle	√	-	S	-	Service prediction
[12]	2025	ARIMA, LR, ANN, SVM, PCA, DT, LDA, QDA	Conveyor belt	√	-	S	-	Failure prediction
[13]	2025	LR	Compressor	-	√	S	-	Monitoring of equipment health
[14]	2025	RF	Current transformers	√	-	S	O	Fault classification
[15]	2024	RF, LR, ANN, DT, GB	Conveyor belt	√	-	S	-	Fault classification
[16]	2024	CNN, LSTM, ResNet	Wind turbine	-	√	S	-	RUL prediction
[17]	2024	CNN, BR, CC, LP, RAKEL D, RAKEL O, Multi-label KNN	Pump	√	-	M	-	Fault detection
[18]	2024	CNN	Solar panels	√	-	S	O	Diagnostic monitoring
[19]	2024	RF, DT, MLP	Wind turbine	√	-	S	O	Fault detection
[20]	2024	SVM, AdaBoost, Bagging, MLP	Rotating machinery	√	-	S	-	Fault detection
[21]	2023	RF, XGBoost, LightGBM, Auto DNN	Ball bearing	√	-	S	U	Failure classification
[22]	2023	ANN	Lumber machinery	√	-	S	-	Failure prediction
[23]	2023	LR	Air pressure system	√	-	S	O	Failure prediction
[24]	2023	LSTM	Aircraft engine	√	-	S	O	Fault diagnosis
[25]	2023	ResNet, CNN	Hydraulic system, generator bearing, gearbox	√	-	S	O	Fault diagnosis
[26]	2022	Autoencoder, BGRU, CNN	Aircraft	√	-	S	-	Rare failure prediction
[27]	2022	LightGBM, XGBoost, RF	Textile machinery	√	-	S	O	Defect classification
[28]	2022	TinyLSTM, DNN	Autoclave sterilizer	-	√	S	-	RUL prediction
[29]	2022	XGBoost	Pump	√	-	S	-	Failure classification
[30]	2022	RF, DT, KNN	Oil consumption system	√	√	S	-	Fault diagnosis
[31]	2021	DNN, RF, SMOTE, PCA	Wind turbine	√	-	S	O	Failure prediction
[32]	2021	Self-ONN	Rotating machinery	√	-	S	-	Fault diagnosis
Proposed		BHTF	Industrial machinery	√	-	M	O, U	Failure diagnosis

Table 2. A sample multi-label dataset.

Sample	X				Y
$S_{1}$	$x_{11}$	$x_{12}$	…	$x_{1 m}$	$Y_{1} = \{y_{1}, y_{3}\}$
$S_{2}$	$x_{21}$	$x_{22}$	…	$x_{2 m}$	$Y_{2} = \{y_{1}, y_{2}, y_{3}, y_{4}\}$
…	…	…	…	…	…
$S_{n}$	$x_{n 1}$	$x_{n 2}$	…	$x_{n m}$	$Y_{n} = \{y_{2}\}$

Table 3. Binary relevance transformation for Table 2.

$D_{y_{1}}$	X	Y	$D_{y_{2}}$	X	Y	$D_{y_{3}}$	X	Y	$D_{y_{4}}$	X	Y
$S_{1}$	$[x_{11} \dots x_{1 m}]$	$y_{1}$	$S_{1}$	$[x_{11} \dots x_{1 m}]$	$\neg y_{2}$	$S_{1}$	$[x_{11} \dots x_{1 m}]$	$y_{3}$	$S_{1}$	$[x_{11} \dots x_{1 m}]$	$\neg y_{4}$
$S_{2}$	$[x_{21} \dots x_{2 m}]$	$y_{1}$	$S_{2}$	$[x_{21} \dots x_{2 m}]$	$y_{2}$	$S_{2}$	$[x_{21} \dots x_{2 m}]$	$y_{3}$	$S_{2}$	$[x_{21} \dots x_{2 m}]$	$y_{4}$
…	…	…	…	…	…	…	…	…	…	…	…
$S_{n}$	$[x_{n 1} \dots x_{n m}]$	$\neg y_{1}$	$S_{n}$	$[x_{n 1} \dots x_{n m}]$	$y_{2}$	$S_{n}$	$[x_{n 1} \dots x_{n m}]$	$\neg y_{3}$	$S_{n}$	$[x_{n 1} \dots x_{n m}]$	$\neg y_{4}$
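
The transformation from Table 2 to Table 3 can be sketched in a few lines. The function name and the toy feature vectors are illustrative assumptions; only the label sets follow the rows shown in Table 2.

```python
from typing import Dict, List, Sequence, Set, Tuple

Features = Sequence[float]


def binary_relevance(samples: List[Features],
                     label_sets: List[Set[str]],
                     labels: List[str]) -> Dict[str, List[Tuple[Features, int]]]:
    """Build one binary dataset per label: target 1 if the label is present, else 0."""
    datasets: Dict[str, List[Tuple[Features, int]]] = {label: [] for label in labels}
    for x, y in zip(samples, label_sets):
        for label in labels:
            datasets[label].append((x, 1 if label in y else 0))
    return datasets


# Toy rows standing in for S_1, S_2, and S_n of Table 2 (feature values are made up).
samples = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
label_sets = [{"y1", "y3"}, {"y1", "y2", "y3", "y4"}, {"y2"}]
labels = ["y1", "y2", "y3", "y4"]

datasets = binary_relevance(samples, label_sets, labels)
targets = {label: [t for _, t in rows] for label, rows in datasets.items()}
print(targets)
```

Each resulting dataset keeps every sample and its full feature vector; only the target column differs, which is what allows an independent classifier to be trained per failure type.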

Table 4. Differences between SMOTE and PDU across several features.

Feature	SMOTE	PDU
Type	Oversampling	Undersampling
Add samples?	Yes	No
Remove samples?	No	Yes
Which Class is Affected?	Minority class (Adds to it)	Majority class (Removes some if noisy or misclassified)
Scenario	When the minority class is underrepresented	When data has noise or overlapping classes
Uses k-Nearest Neighbors?	Yes (to generate data)	Yes (to remove misclassified points)
Risk	Overfitting if overused	Underfitting if too aggressive
Goal	Balance the dataset by adding more representative samples	Clean and balance the dataset by removing noisy or borderline samples
Sensitivity to Noise	High—may synthesize noisy or borderline instances	Low—helps eliminate noisy or ambiguous instances
Effect on Decision Boundary	Expands the decision region of the minority class	Sharpens or clarifies the decision boundary by removing overlapping samples
Computational Cost	Moderate—needs distance computations and synthetic generation	Moderate—distance computations for each minority instance
Main Technique	Feature-space interpolation	Disagreement with the classes of neighbors
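
The two "Main Technique" entries of Table 4 can be illustrated with a minimal sketch. The first function performs SMOTE-style feature-space interpolation between a minority point and one of its nearest minority neighbors. The second removes majority points whose nearest neighbors mostly belong to the minority class; this rule is an assumption chosen to illustrate neighbor disagreement only, and is not the PDU algorithm itself. All names, data points, and parameter values are hypothetical.

```python
import math
import random
from typing import List

Point = List[float]


def smote(minority: List[Point], n_per_instance: int, k: int,
          rng: random.Random) -> List[Point]:
    """Create synthetic points on segments between minority points and their neighbors."""
    synthetic = []
    for i, x in enumerate(minority):
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(n_per_instance):
            nb = rng.choice(neighbors)
            lam = rng.random()  # interpolation factor in [0, 1)
            synthetic.append([xi + lam * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic


def remove_disagreeing_majority(majority: List[Point], minority: List[Point],
                                k: int) -> List[Point]:
    """Keep majority points unless most of their k nearest neighbors are minority."""
    labelled = [(p, 0) for p in majority] + [(p, 1) for p in minority]
    kept = []
    for i, x in enumerate(majority):
        others = [item for j, item in enumerate(labelled) if j != i]
        nearest = sorted(others, key=lambda item: math.dist(x, item[0]))[:k]
        minority_votes = sum(label for _, label in nearest)
        if minority_votes * 2 <= k:
            kept.append(x)
    return kept


majority = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [2.0, 2.1]]
minority = [[2.0, 2.0], [2.2, 2.1], [2.1, 2.3]]

synthetic = smote(minority, n_per_instance=2, k=2, rng=random.Random(0))
kept = remove_disagreeing_majority(majority, minority + synthetic, k=3)
print(len(synthetic), kept)
```

In this toy example the majority point at (2.0, 2.1) sits inside the minority cluster and is removed, while the four points near the origin are kept.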

Table 5. Summary of the AI4I 2020 predictive maintenance dataset characteristics.

Dataset Type	Attribute Types	Learning Tasks	#Instances	#Variables	Missing Values	Subject Area	Release Year	View Counts
Time Series, Multivariate	Real, Boolean	Regression, Classification, Causal Discovery	10,000	14	None	Computer Science	2020	77,511

Table 6. Variables of the AI4I 2020 predictive maintenance dataset.

Variable Name	Category	Type	Description	Unit
UID	Identifier	Integer	Unique identifier	–
Product ID	Identifier	Categorical	Product variant identifier	–
Type	Feature	Categorical	Product quality level (low, medium, high)	–
Air temperature	Feature	Continuous	Air temperature	K
Process temperature	Feature	Continuous	Process temperature	K
Rotational speed	Feature	Integer	Rotational speed	rpm
Torque	Feature	Continuous	Torque	Nm
Tool wear	Feature	Integer	Tool wear	min
Machine failure	Target	Boolean	Indicates any failure occurrence	–
RNF	Target	Boolean	Random failures	–
TWF	Target	Boolean	Tool wear failure	–
HDF	Target	Boolean	Heat dissipation failure	–
PWF	Target	Boolean	Power failure	–
OSF	Target	Boolean	Overstrain failure	–

Table 7. Statistics of the continuous features in the AI4I 2020 predictive maintenance dataset.

Variable Name	Min	Max	Mean	Standard Deviation
Air temperature	295.3	304.5	300.0	2.000
Process temperature	305.7	313.8	310.0	1.484
Rotational speed	1168	2886	1538.8	179.284
Torque	3.8	76.6	39.9	9.969
Tool wear	0	253	107.9	63.654

Table 8. Performance of BHTF for each failure type using accuracy, precision, recall, and F-measure.

Failure Type	Accuracy	Precision	Recall	F-Measure
TWF	93.94%	0.9948	0.9394	0.9663
HDF	98.12%	0.9925	0.9812	0.9868
PWF	98.87%	0.9942	0.9887	0.9914
OSF	98.82%	0.9941	0.9882	0.9911
Average	97.44%	0.9939	0.9744	0.9839
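
As a consistency check on Table 8, the sketch below recomputes each F-measure as the harmonic mean of the tabulated precision and recall, and the Average row as the plain column mean. Both relationships are assumptions about how the table was produced; the tabulated values agree with them to four decimal places.

```python
# (accuracy %, precision, recall) per failure type, copied from Table 8.
table8 = {
    "TWF": (93.94, 0.9948, 0.9394),
    "HDF": (98.12, 0.9925, 0.9812),
    "PWF": (98.87, 0.9942, 0.9887),
    "OSF": (98.82, 0.9941, 0.9882),
}


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


f_scores = {label: f_measure(p, r) for label, (_, p, r) in table8.items()}
mean_accuracy = sum(a for a, _, _ in table8.values()) / len(table8)
mean_precision = sum(p for _, p, _ in table8.values()) / len(table8)
mean_f = sum(f_scores.values()) / len(f_scores)

for label, f in f_scores.items():
    print(label, round(f, 4))
print(round(mean_accuracy, 2), round(mean_precision, 4), round(mean_f, 4))
```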

Table 9. Instance distribution for TWF before and after applying SMOTE and PDU across 10-fold cross-validation.

Fold	Before SMOTE	After SMOTE Before PDU	After PDU
1	8958/42	8958/1722	8827/1722
2	8958/42	8958/1722	8826/1722
3	8958/42	8958/1722	8835/1722
4	8958/42	8958/1722	8822/1722
5	8959/41	8959/1681	8839/1681
6	8959/41	8959/1681	8815/1681
7	8959/41	8959/1681	8833/1681
8	8959/41	8959/1681	8864/1681
9	8959/41	8959/1681	8831/1681
10	8959/41	8959/1681	8830/1681
Average	8959/41	8959/1697	8832/1697

Table 10. Instance distribution for HDF before and after applying SMOTE and PDU across 10-fold cross-validation.

Fold	Before SMOTE	After SMOTE Before PDU	After PDU
1	8896/104	8896/4264	8843/4264
2	8896/104	8896/4264	8851/4264
3	8896/104	8896/4264	8842/4264
4	8896/104	8896/4264	8841/4264
5	8896/104	8896/4264	8839/4264
6	8897/103	8897/4223	8838/4223
7	8897/103	8897/4223	8851/4223
8	8897/103	8897/4223	8841/4223
9	8897/103	8897/4223	8838/4223
10	8897/103	8897/4223	8842/4223
Average	8897/104	8897/4244	8843/4244

Table 11. Instance distribution for PWF before and after applying SMOTE and PDU across 10-fold cross-validation.

Fold	Before SMOTE	After SMOTE Before PDU	After PDU
1	8914/86	8914/3526	8856/3526
2	8914/86	8914/3526	8849/3526
3	8914/86	8914/3526	8851/3526
4	8914/86	8914/3526	8850/3526
5	8914/86	8914/3526	8854/3526
6	8915/85	8915/3485	8857/3485
7	8915/85	8915/3485	8865/3485
8	8915/85	8915/3485	8854/3485
9	8915/85	8915/3485	8859/3485
10	8915/85	8915/3485	8843/3485
Average	8915/86	8915/3506	8854/3506

Table 12. Instance distribution for OSF before and after applying SMOTE and PDU across 10-fold cross-validation. Each cell gives majority/minority instance counts.

Fold	Before SMOTE	After SMOTE, Before PDU	After PDU
1	8911/89	8911/3649	8884/3649
2	8911/89	8911/3649	8885/3649
3	8912/88	8912/3608	8882/3608
4	8912/88	8912/3608	8879/3608
5	8912/88	8912/3608	8886/3608
6	8912/88	8912/3608	8873/3608
7	8912/88	8912/3608	8888/3608
8	8912/88	8912/3608	8887/3608
9	8912/88	8912/3608	8885/3608
10	8912/88	8912/3608	8883/3608
Average	8912/88	8912/3616	8883/3616
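
To summarize Tables 9–12, the sketch below takes the Average rows and computes, per label, the imbalance ratio (majority count divided by minority count) before resampling and after SMOTE plus PDU, together with the average number of majority instances removed by PDU. The imbalance ratio is a common summary statistic used here for illustration; it is not a quantity reported in the tables, and all names are hypothetical.

```python
# Average rows of Tables 9-12 as (majority, minority) at each stage.
averages = {
    "TWF": {"before": (8959, 41), "after_smote": (8959, 1697), "after_pdu": (8832, 1697)},
    "HDF": {"before": (8897, 104), "after_smote": (8897, 4244), "after_pdu": (8843, 4244)},
    "PWF": {"before": (8915, 86), "after_smote": (8915, 3506), "after_pdu": (8854, 3506)},
    "OSF": {"before": (8912, 88), "after_smote": (8912, 3616), "after_pdu": (8883, 3616)},
}


def imbalance_ratio(majority, minority):
    """Majority-to-minority ratio; 1.0 would mean perfectly balanced classes."""
    return majority / minority


summary = {}
for label, stages in averages.items():
    summary[label] = {
        "ir_before": imbalance_ratio(*stages["before"]),
        "ir_after": imbalance_ratio(*stages["after_pdu"]),
        # PDU only touches the majority class, so compare majority counts.
        "removed_by_pdu": stages["after_smote"][0] - stages["after_pdu"][0],
    }

for label, stats in summary.items():
    print(
        f"{label}: IR {stats['ir_before']:.1f} -> {stats['ir_after']:.2f}, "
        f"majority instances removed by PDU: {stats['removed_by_pdu']}"
    )
```

Most of the rebalancing comes from SMOTE, while PDU removes a comparatively small number of majority instances per fold.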

Table 17. Training times (in seconds) across 10 folds for each failure type label.

Fold	TWF	HDF	PWF	OSF
1	0.547	0.225	0.185	0.209
2	0.235	0.215	0.184	0.195
3	0.164	0.204	0.183	0.193
4	0.203	0.210	0.188	0.194
5	0.145	0.202	0.183	0.193
6	0.150	0.209	0.186	0.195
7	0.148	0.201	0.182	0.193
8	0.148	0.201	0.180	0.192
9	0.148	0.196	0.179	0.191
10	0.145	0.202	0.179	0.190
Average	0.203	0.207	0.183	0.195

Table 18. Comparison of BHTF with the state-of-the-art methods on the same AI4I 2020 dataset. N/A: Not Available.

Reference	Year	Method	Training Protocol	Dataset Split	Hyperparameter Settings	Accuracy (%)	Precision	Recall	F-Measure
Chandu [48]	2025	GB	Feature selection; SMOTE; min–max normalization; outlier removal	Train/test (ratios not specified)	N/A	90.00	0.9200	0.9000	0.8569
		MLP				61.00	0.7300	0.6100	0.6000
		KNN				71.68	0.6709	0.6831	0.6769
Besha et al. [49]	2025	ELM + PLSCO	Optimization-driven training with metaheuristic hybrid (PLO + CSO + SSA)	70–30%	m = 100, a = [1,1.5], c1 = [2/e,2]	95.47	0.8679	0.8659	0.8669
Jahani et al. [50]	2025	BFT + PCA (Byzantine = 0.2)	Federated learning	N/A	N/A	89.90	-	-	-
		BFT + PCA (Byzantine = 0.4)				89.83	-	-	-
		BFT + PCA (Byzantine = 0.6)				89.00	-	-	-
Araujo et al. [51]	2025	CART	SMOTE; categorical encoding; MRMR feature selection	Five-fold cross-validation	criterion = entropy, splitter = best, max_depth = 5, min_samples_split = 2, min_samples_leaf = 1, num_features_split = none, max_leaf_nodes = none, random_state = 42	82.10	-	-	-
Prashanth et al. [52]	2025	DNN	Incremental and dynamic learning	Hold-out validation	3 layers (64, 32, 1), ReLU, sigmoid	84.00	-	-	-
Prashanth et al. [52]	2025	SVM	N/A	N/A	N/A	89.00	-	-	-
Özdemir et al. [53]	2025	LR	SMOTE; categorical encoding; supervised learning	N/A	N/A	88.00	0.4200	0.6100	0.5000
		RF				94.00	0.4500	0.6800	0.5400
		XGBoost				97.00	0.4700	0.7400	0.5800
Kumar et al. [54]	2025	KNN	Nearest-neighbor voting	Train/validation/test (ratios not specified)	k = 1, Euclidean distance	94.00	-	-	0.9400
		SVM	Kernel-based supervised learning		C = 100, gamma = 1, kernel = RBF	95.00	-	-	0.9500
		RF	Ensemble learning		Max depth = 10, number of trees = 500	96.00	-	-	0.9600
		XGBoost	Gradient boosting		learning rate = 0.1, max depth = 5, n_estimators = 500	97.00	-	-	0.9700
Misaii et al. [55]	2024	LSTM	Sequential deep learning; SMOTE; binary cross-entropy loss function	80–20%	N/A	80.00	0.96	0.83	0.89
Presciuttini et al. [56]	2024	RF + XAI (SHAP, LIME, counterfactual)	Supervised learning	80–20%	number of trees = 100, random_state = 42	95.00	-	-	-
Hung et al. [57]	2024	DNN+Adam SingleHL Model I	Models trained for 100 epochs; batch size 400; single- and double-hidden-layer architectures	90–10%	number of neurons per hidden layer = 100, activation function (hidden layers) = ReLU, activation function (output layer) = Softmax, output classes = 6, input neurons = 5	93.58	-	0.9400	0.9300
		DNN+Adam SingleHL Model II				95.37	-	0.9600	0.9600
		DNN+SA DoubleHL Model III				96.54	-	0.9500	0.9500
		DNN+SA DoubleHL Model IV				97.09	-	0.9700	0.9700
Liu and Su [58]	2024	CAST	Early stopping if validation loss does not improve for 3 iterations	Five-fold cross-validation	Epochs = 100, batch size = 64, learning rate = 0.001, hidden size = 512, optimizer = AdamW	-	-	-	0.8800
		SE-ResNet 18				-	-	-	0.8400
		GE-ResNet 18				-	-	-	0.8100
		SE-SCNet 18				-	-	-	0.8600
Ghadekar et al. [59]	2024	XGBoost	SMOTE	N/A	N/A	96.00	0.9800	0.9600	0.9690
		RF				95.50	0.9760	0.9550	0.9640
		LOF				91.70	0.9510	0.9170	0.9330
		One-class SVM				91.20	0.9530	0.9120	0.9300
Kong et al. [60]	2023	DFPAIS	Iterative data filling	N/A	N/A	83.74	-	-	-
Kong et al. [60]	2023	SDFIS	Simplified data filling	N/A	N/A	82.03	-	-	-
Souza and Lughofer [61]	2023	EFNC-Exp	Sequential stream-based updating; fuzzification; expert rules	70–30%	γ for DA plane (single hyperparameter)	97.30	-	-	-
Souza and Lughofer [61]	2023	SODA	Incremental clustering; dynamically updating clouds and feature weights	70–30%	No separate hyperparameters beyond γ	96.80	-	-	-
Chen et al. [62]	2022	CatBoost	Ordered boosting and gradient descent	Three-fold cross-validation	Hyperparameters optimized using Optuna	64.23	-	0.2868	-
		SMOTENC + CatBoost	SMOTE			88.09	-	0.7881	-
		ctGAN + CatBoost	Data normalization; learning distribution; oversampling			87.08	-	0.8305	-
		SMOTENC + ctGAN + CatBoost	SMOTE, GAN			88.83	-	0.9068	-
Vandereycken and Voorhaar [63]	2022	XGBoost	Training with different TT initializations	70% train/15% validation/15% test	Optimized via validation set	95.74	-	-	-
		RF				95.10	-	-	-
		TTML + XGBoost				77.00	-	-	-
		TTML + RF				78.00	-	-	-
		TTML + MLP 1				76.20	-	-	-
		TTML + MLP 2				65.00	-	-	-
Falla and Ortega [64]	2022	RF	Supervised learning, oversampling	70–30%	Sklearn default parameters; random seed = 42	96.81	0.9740	0.7639	0.8563
Falla and Ortega [64]	2022	Neural Networks	Supervised learning, oversampling	70–30%	Hidden layers = 10, max iterations = 500, penalty = 0.001, random seed = 21, other defaults	91.50	0.9166	0.8611	0.8880
Iantovics and Enachescu [65]	2022	BLR	Mathematical modeling for data quality assessment	N/A	Standard β coefficients	97.10	0.9950	0.2830	0.4407
Vuttipittayamongkol and Arreeras [66]	2022	SVM	Standard supervised learning	70–30%	Default caret parameters	-	0.7229	0.5941	0.6522
		DT				-	0.8391	0.7228	0.7766
		KNN				-	0.8108	0.2970	0.4348
		RF				-	0.8267	0.6139	0.7045
		NN				-	0.7333	0.2178	0.3359
Mota et al. [67]	2022	GB, SVM	Batch training; preprocessing with data aggregation, min–max normalization, imputation, feature engineering, oversampling, and undersampling	80–20%	Automatic hyperparameter tuning using five-fold cross-validation	94.55	-	0.9200	-
Diao et al. [68]	2021	Constructing Hyper-Planes	Unsupervised learning; mean-shift and min–max normalization	N/A	H = TL (set of hyper-planes); δ determined automatically	-	-	-	0.6200
Torcianti and Matzka [69]	2021	RUSBoost Trees	Decision tree-based learning	20 dataset points	Kernel width σ = 0.05, iteratively decreased by 0.01 down to 0.01; feature importance threshold ≥ 25%	92.74	0.3071	0.9085	0.4590
Pastorino and Biswas [70]	2021	Data-Blind Machine Learning	Simple NN training for tabular datasets, CNN training for MNIST	Train/test (ratios not specified)	CTGAN settings for generative model; no special tuning for MNIST	97.30	-	-	-
		Average				88.94	0.7845	0.7492	0.7793
Proposed Method		Balanced Hoeffding Tree Forest (BHTF)	Multi-label learning (MLL); incremental learning (IL); ensemble learning (EL); oversampling; undersampling	10-fold cross-validation	SMOTE: k = 5, R = 4000%; PDU: u = 7; numIterations = 10	97.44	0.9939	0.9744	0.9839
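
The gap between the Proposed Method row and the Average row of Table 18 can be expressed either in absolute percentage points or relative to the baseline average. The short sketch below computes both from the two accuracy values in the table; the variable names are illustrative. Because the compared studies use different dataset splits and training protocols, as the table itself shows, these figures are best read as indicative rather than as a controlled comparison.

```python
bhtf_accuracy = 97.44     # Proposed Method row, Table 18 (accuracy, %)
baseline_average = 88.94  # Average row, Table 18 (accuracy, %)

# Difference in percentage points.
absolute_gap = bhtf_accuracy - baseline_average

# Difference expressed as a percentage of the baseline average.
relative_gain = 100 * absolute_gap / baseline_average

print(f"absolute gap: {absolute_gap:.2f} percentage points")
print(f"relative gain: {relative_gain:.2f}% of the baseline average")
```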

Table 19. Overview of selected multi-label TCM datasets.

Dataset	Observations	Anomalies	Features	Anomaly Types	Products	Data Drift
tcm5_dataset_3	20003	981	51	4 (16)	4	False
tcm5_dataset_4	20001	925	51	4 (16)	20	False
tcm5_dataset_5	20005	1031	51	4 (16)	5	True
tcm5_dataset_6	20008	954	51	4 (16)	25	True
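
The degree of imbalance in the TCM datasets follows directly from the Observations and Anomalies columns of Table 19. The sketch below computes the anomaly rate for each dataset; it assumes that the Anomalies column counts anomalous observations, and the dictionary name is illustrative.

```python
# (observations, anomalies) transcribed from Table 19.
table19 = {
    "tcm5_dataset_3": (20003, 981),
    "tcm5_dataset_4": (20001, 925),
    "tcm5_dataset_5": (20005, 1031),
    "tcm5_dataset_6": (20008, 954),
}

# Share of anomalous observations per dataset, in percent.
anomaly_rate = {
    name: 100 * anomalies / observations
    for name, (observations, anomalies) in table19.items()
}

for name, rate in anomaly_rate.items():
    print(f"{name}: {rate:.2f}% anomalous observations")
```

In all four datasets, anomalies account for roughly 5% of the observations.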

Table 20. Variables of selected TCM datasets.

Variable Name	Category	Type	Description	Unit
thickness_entry	Feature	Continuous	Steel entry thickness before rolling	mm
thickness_exit	Feature	Continuous	Steel exit thickness after rolling	mm
width	Feature	Continuous	Steel width	mm
ys_entry	Feature	Continuous	Steel yield strength at entry	MPa
ys_exit	Feature	Continuous	Steel yield strength at exit	MPa
work_roll_diam	Feature	Continuous	Work roll diameter for stands 1–5	mm
work_roll_mileage	Feature	Continuous	Work roll mileage for stands 1–5	km
reduction	Feature	Continuous	Thickness reduction per stand (1–5)	–
tension	Feature	Continuous	Interstand tension (0: before stand 1, 1–5: after stands 1–5)	N
roll_speed	Feature	Continuous	Linear work roll speed for stands 1–5	Not specified
force	Feature	Continuous	Rolling force for stands 1–5	N
torque	Feature	Continuous	Rolling torque for stands 1–5	Nm
gap	Feature	Continuous	Stand gap for stands 1–5	mm
motor_power	Feature	Continuous	Electric motor power for stands 1–5	kW
Anomaly_Reduction	Target	Boolean	Label for anomaly in reduction scheme	–
Anomaly_Electric	Target	Boolean	Label for anomaly in electric motor per stand	–
Anomaly_Bearing	Target	Boolean	Label for anomaly in stand bearings	–
Anomaly_WorkRoll	Target	Boolean	Label for anomaly in work roll friction per stand	–

Table 21. Accuracy results of the proposed BHTF method on TCM benchmark datasets.

Label	Failure Type	tcm5_dataset_3	tcm5_dataset_4	tcm5_dataset_5	tcm5_dataset_6
1	Anomaly Reduction	93.24	90.81	98.05	86.15
2	Anomaly Electric 1	97.63	99.55	97.93	95.96
3	Anomaly Bearing 1	99.66	99.46	99.75	98.08
4	Anomaly WorkRoll 1	98.50	97.41	99.34	95.94
5	Anomaly Electric 2	96.48	99.07	97.22	99.14
6	Anomaly Bearing 2	99.28	98.04	99.86	99.62
7	Anomaly WorkRoll 2	98.29	96.69	97.93	95.14
8	Anomaly Electric 3	99.32	99.18	98.32	98.21
9	Anomaly Bearing 3	99.61	99.75	99.69	96.33
10	Anomaly WorkRoll 3	99.08	94.83	95.52	96.35
11	Anomaly Electric 4	99.95	99.84	98.27	97.90
12	Anomaly Bearing 4	99.73	99.35	99.73	99.36
13	Anomaly WorkRoll 4	96.80	97.58	95.66	93.80
14	Anomaly Electric 5	99.89	99.80	99.75	98.66
15	Anomaly Bearing 5	99.26	98.52	99.61	99.72
16	Anomaly WorkRoll 5	98.84	94.95	96.73	92.10
	Average	98.47	97.80	98.34	96.40
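
The 16 labels in Table 21 fall into four anomaly families (Reduction, Electric, Bearing, and WorkRoll), so a per-family average can make the results easier to compare. The sketch below groups the tcm5_dataset_3 column by family and also re-derives the column average; the grouping by the second word of each label name is an illustrative choice, not part of the authors' method.

```python
from collections import defaultdict
from statistics import mean

# Accuracy (%) for tcm5_dataset_3, transcribed from Table 21.
dataset3 = [
    ("Anomaly Reduction", 93.24),
    ("Anomaly Electric 1", 97.63),
    ("Anomaly Bearing 1", 99.66),
    ("Anomaly WorkRoll 1", 98.50),
    ("Anomaly Electric 2", 96.48),
    ("Anomaly Bearing 2", 99.28),
    ("Anomaly WorkRoll 2", 98.29),
    ("Anomaly Electric 3", 99.32),
    ("Anomaly Bearing 3", 99.61),
    ("Anomaly WorkRoll 3", 99.08),
    ("Anomaly Electric 4", 99.95),
    ("Anomaly Bearing 4", 99.73),
    ("Anomaly WorkRoll 4", 96.80),
    ("Anomaly Electric 5", 99.89),
    ("Anomaly Bearing 5", 99.26),
    ("Anomaly WorkRoll 5", 98.84),
]

# Group accuracies by anomaly family (second word of the label name).
by_family = defaultdict(list)
for label, accuracy in dataset3:
    family = label.split()[1]
    by_family[family].append(accuracy)

family_means = {family: mean(values) for family, values in by_family.items()}
overall_mean = mean(accuracy for _, accuracy in dataset3)

for family, value in family_means.items():
    print(f"{family}: {value:.2f}")
print(f"overall: {overall_mean:.2f}")
```

The overall mean reproduces the Average row for tcm5_dataset_3, and the same grouping can be applied to the other three columns.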

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Ghasemkhani, B.; Kut, R.A.; Birant, D.; Yilmaz, R. Balanced Hoeffding Tree Forest (BHTF): A Novel Multi-Label Classification with Oversampling and Undersampling Techniques for Failure Mode Diagnosis in Predictive Maintenance. Mathematics 2025, 13, 3019. https://doi.org/10.3390/math13183019

