Article

A Partition-Based Hybrid Algorithm for Effective Imbalanced Classification

by
Kittipong Theephoowiang
and
Anantaporn Hanskunatai
*
Computer Science, School of Science, King Mongkut’s Institute of Technology Ladkrabang, Bangkok 10152, Thailand
*
Author to whom correspondence should be addressed.
Data 2025, 10(4), 54; https://doi.org/10.3390/data10040054
Submission received: 14 January 2025 / Revised: 19 March 2025 / Accepted: 13 April 2025 / Published: 18 April 2025
(This article belongs to the Section Information Systems and Data Management)

Abstract

Imbalanced classification presents a significant challenge in real-world datasets, requiring innovative solutions to enhance performance. This study introduces a hybrid binary classification algorithm designed to effectively address this challenge. The algorithm identifies different data types, pairs them, and trains multiple models, which then vote on predictions using weighted strategies to ensure stable performance and minimize overfitting. Unlike some methods, it is designed to work consistently with both noisy and noise-free datasets, prioritizing overall stability rather than specific noise adjustments. The algorithm’s effectiveness is evaluated using Recall, G-Mean, and AUC, measuring its ability to detect the minority class while maintaining balance. The results reveal notable improvements in minority class detection, with Recall outperforming other methods in 16 out of 22 datasets, supported by paired t-tests. The algorithm also shows promising improvements in G-Mean and AUC, ranking first in 17 and 18 datasets, respectively. To further evaluate its performance, the study compares the proposed algorithm with previous methods using G-Mean. The comparison confirms that the proposed algorithm also exhibits strong performance, further highlighting its potential. These findings emphasize the algorithm’s versatility in handling diverse datasets and its ability to balance minority class detection with overall accuracy.

1. Introduction

Classification is a fundamental problem in machine learning. Models perform well when trained on datasets whose class frequencies are balanced or approximately equal (e.g., a 1:1 ratio between the majority and minority classes in the binary case) [1]. However, this balance is uncommon in real-world datasets, which often exhibit large differences between classes, leading to imbalanced datasets. Examples include datasets related to medical diagnosis [2,3,4,5], fraud detection [6,7,8], or credit evaluation [9,10,11]. Therefore, improving accuracy in imbalanced classification is a crucial challenge.
Imbalanced datasets have several disadvantages for classification, such as bias towards the majority class, where models tend to overlook the minority class, leading to poor performance in critical applications such as medical diagnosis or fraud detection. Additionally, common evaluation metrics, such as accuracy, can be misleading: a model that predicts only the majority class may achieve high accuracy yet fail to correctly classify minority class instances. Because biased learning from the majority class leads to poor generalization, the model develops skewed predictions, such as failing to detect minority class instances, which reduces its effectiveness in real-world predictions. For example, in medical diagnosis, a model trained on imbalanced data might excel at identifying common diseases (majority class) but fail to recognize rare conditions (minority class). This limitation stems from the model’s decision boundaries being overly influenced by the majority class, causing it to misclassify or overlook underrepresented patterns in real-world scenarios. Furthermore, it is often challenging and costly to collect sufficient and representative data for the minority class, making it difficult to create balanced datasets.
Although many real-world datasets involve multiclass problems, binary classification is simpler to optimize due to reduced complexity in decision boundaries and class interactions. Decomposing multiclass tasks into binary subproblems, such as one-vs.-rest or one-vs.-one strategies, simplifies learning and improves interpretability. While multiclass models use a single framework, binary decomposition lowers algorithmic complexity and mitigates overfitting. Given these advantages, this study focuses on binary classification, particularly in imbalanced scenarios.
In a binary imbalanced dataset, there are two classes, known as the minority and majority classes. Typically, the goal of imbalanced classification is to detect the minority class, which is critical in applications such as fraud detection or rare disease diagnosis. Three main strategies address this challenge: data-level methods, algorithm-level methods, and hybrid methods [12].
Data-level methods adjust class distribution by oversampling the minority class (e.g., SMOTE [13]) or undersampling the majority class (e.g., Tomek Links [14]). Some advanced techniques are applied to remove noise or overlapped data (i.e., removing the ambiguous instances near class boundaries). These methods can enhance minority class detection but may lead to overfitting through synthetic data duplication, whereas the model might lose some information by undersampling the majority.
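As a concrete illustration (not drawn from this paper’s implementation), both kinds of resampling are available off the shelf in the imbalanced-learn library; the sketch below applies SMOTE and Tomek links to a synthetic dataset:

```python
# Illustrative sketch: data-level resampling with imbalanced-learn.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks

# Synthetic binary dataset with roughly a 9:1 class ratio.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))

# Oversampling: SMOTE interpolates synthetic minority instances.
X_os, y_os = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_os))  # classes are now balanced

# Undersampling: Tomek links remove ambiguous majority instances near the boundary.
X_us, y_us = TomekLinks().fit_resample(X, y)
print(Counter(y_us))
```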
Algorithm-level methods modify learning algorithms to prioritize minority instances, including cost-sensitive learning (assigning a higher cost to misclassifying minority instances) and ensemble methods such as AdaBoost [15], which iteratively adjusts weights for misclassified samples. These approaches can enhance model robustness, but they demand significant computational resources and parameter tuning.
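The sketch below illustrates these two algorithm-level adjustments with scikit-learn on synthetic data; the specific parameters are illustrative assumptions, not settings from this study:

```python
# Illustrative sketch: algorithm-level handling of imbalance with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Cost-sensitive learning: 'balanced' weights classes inversely to their
# frequencies, so errors on the minority class are penalized more heavily.
cost_sensitive = SVC(class_weight="balanced").fit(X, y)

# Boosting: AdaBoost reweights misclassified (often minority) samples each round.
boosted = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)
```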
Hybrid methods integrate data-level and algorithm-level methods, leveraging the strengths of both. For instance, SMOTEBoost [16] combines synthetic oversampling with boosting. While hybrid methods can provide balanced and robust solutions, they inherit complexity from both components and are resource intensive to implement.
To improve the accuracy of imbalanced classification, this work proposes a novel partition-based hybrid algorithm that strategically addresses class overlap and class imbalance. This approach operates in two phases: data partitioning, and dynamic training and voting.
Data partitioning divides the dataset to categorize the data into four distinct groups:
  • $D_{min}^{over}$: Minority instances overlapping with the majority class,
  • $D_{min}^{non}$: Minority instances in distinct regions,
  • $D_{maj}^{over}$: Majority instances overlapping with the minority class,
  • $D_{maj}^{non}$: Majority instances in distinct regions.
For example, in medical data, $D_{min}^{over}$ might represent patients with symptoms common to both rare and common diseases.
The dynamic training and voting phase constructs five different datasets by pairing the four subsets from the data-partitioning phase (e.g., $D_{min}^{over}$ vs. $D_{maj}^{non}$) to isolate specific learning challenges (overlap vs. separability). Each dataset is then balanced using adaptive oversampling (e.g., SMOTE for $D_{min}^{over}$) to avoid overfitting. Five diverse models (e.g., SVM, Random Forest) are trained, and predictions are aggregated via weighted voting, prioritizing metrics such as Recall for minority detection.
This approach reduces bias by separately addressing overlap and imbalance, while the ensemble structure enhances robustness. For instance, models trained on $D_{min}^{non}$ focus on pure minority patterns, improving rare-class detection without majority-class interference.
In model evaluation, this study uses Recall, G-Mean, and AUC. Recall prioritizes minority-class detection, G-Mean ensures a balance between sensitivity and specificity, and AUC measures overall class separability. Together, these metrics overcome the limitations of the F1-score, which is sensitive to false positives in imbalanced data, and the Matthews Correlation Coefficient (MCC), which is less interpretable in binary contexts. Our experimental results demonstrate that this combination provides a comprehensive assessment of model performance, particularly in imbalanced and overlapping datasets.
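For reference, the three metrics can be computed as in the following sketch, which uses scikit-learn and imbalanced-learn on placeholder predictions (the arrays are illustrative, not experimental data):

```python
# Illustrative sketch: computing Recall, G-Mean, and AUC on placeholder outputs.
import numpy as np
from sklearn.metrics import recall_score, roc_auc_score
from imblearn.metrics import geometric_mean_score

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])   # minority class = 1
y_pred = np.array([0, 0, 1, 0, 0, 0, 0, 1, 1, 0])
y_score = np.array([0.1, 0.2, 0.6, 0.3, 0.1, 0.2, 0.3, 0.9, 0.8, 0.4])

recall = recall_score(y_true, y_pred)           # minority-class sensitivity
g_mean = geometric_mean_score(y_true, y_pred)   # sqrt(sensitivity * specificity)
auc = roc_auc_score(y_true, y_score)            # threshold-independent separability
print(recall, g_mean, auc)
```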
The rest of this paper is arranged as follows: Section 2 reviews existing methods for handling imbalanced classification. Section 3 presents the motivation behind this work. Section 4 provides a detailed description of the proposed hybrid method. Section 5 shows the experimental design and the results, including baseline comparisons using resampling techniques and model training with both ensemble (bagging and AdaBoost) and non-ensemble approaches, utilizing Random Forest and Support Vector Classifier (SVC) with linear, RBF, and polynomial kernels as base estimators. Section 6 concludes the paper, summarizing the key findings, highlighting the advantages of the proposed method, and suggesting future research directions.

2. Related Works

The challenge of imbalanced classification has been extensively researched, leading to the development of various approaches. These can be classified into three techniques: data-level techniques, algorithm-level techniques, and hybrid techniques [12]. The examples are shown in Figure 1.
Data-level techniques balance class distribution by modifying the dataset and are divided into three types: oversampling, undersampling, and hybrid sampling. For example, random undersampling (RUS) removes instances from the majority class randomly, which may lead to the loss of important data [17,18]. This issue is mitigated by selectively removing majority-class instances using methods such as Tomek Link and clustering-based undersampling [14,19]. Conversely, oversampling techniques, such as random oversampling (ROS), add minority class examples randomly to prevent information loss, but they may cause overfitting [20,21]. The Synthetic Minority Oversampling Technique (SMOTE) creates synthetic minority instances, though it can distort the marginal data distribution [13]. Alternative techniques, such as Borderline-SMOTE and Safe-SMOTE, have been proposed to overcome these limitations [31,32]. Hybrid sampling approaches combine the strengths of oversampling and undersampling to create balanced datasets [22,23,24]. These techniques simplify class distribution, making it easier for models to learn, and can be applied to any classification algorithm without modification. However, oversampling can cause overfitting by duplicating minority class samples, while undersampling can lead to the loss of valuable majority class information. Noise removal techniques may also inadvertently remove important data.
Algorithm-level techniques improve imbalanced data handling by modifying the learning algorithm. The basic techniques are cost-sensitive learning and ensemble learning [25,26]. Cost-sensitive learning gives more importance to the minority class, making it a popular choice for imbalanced data. Ensemble learning combines multiple models to enhance classification performance. For example, AdaBoost adapts by sequentially training weak learners and adjusting weights based on misclassification errors to emphasize difficult instances, thereby improving the handling of imbalanced data [15]. Random forests construct multiple decision trees during training and aggregate their predictions to improve accuracy and robustness [27]. Weighted SVM modifies the SVM algorithm by assigning different weights to classes, prioritizing the minority class [28]. These techniques help models handle imbalanced data better, preserve all data points, and improve robustness and accuracy. However, they can be computationally intensive, complex to implement, require extensive tuning, and may not generalize well if the underlying data distribution changes.
Hybrid techniques combine the advantages of data-level and algorithm-level methods, resulting in high performance for handling imbalanced datasets, which makes them popular. These techniques integrate ensemble models with resampling methods to improve performance. For example, SMOTEBoost combines SMOTE with the AdaBoost algorithm [16]. RUSBoost combines random undersampling with boosting techniques [29]. XGBoost is a powerful method that uses gradient boosting with regularization to improve performance and manage class imbalance [30]. Balanced random forest combines resampling techniques with the random forest algorithm. Hybrid techniques offer a balanced approach, mitigating the disadvantages of individual methods, often resulting in better performance and more robust models. However, their complexity increases due to the combination of multiple techniques, requiring significant computational resources, and the integration of different methods can be challenging and may not always lead to improvement.
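As an illustration only, two of the hybrid methods mentioned above have off-the-shelf implementations in imbalanced-learn; the sketch below fits them on synthetic data (SMOTEBoost has no implementation there, so it is omitted):

```python
# Illustrative sketch: two hybrid methods available in imbalanced-learn.
from sklearn.datasets import make_classification
from imblearn.ensemble import BalancedRandomForestClassifier, RUSBoostClassifier

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# RUSBoost: random undersampling of the majority class inside each boosting round.
rusboost = RUSBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

# Balanced random forest: each tree is grown on a bootstrap sample balanced
# by undersampling the majority class.
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```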
In conclusion, various approaches have been developed to address imbalanced classification, each with its own benefits and drawbacks. The summary of three techniques is shown in Table 1. Hybrid techniques, combining data-level and algorithm-level methods, have significant potential to improve classification performance. They benefit from the strengths of both approaches, including improved class balance and enhanced model learning. However, challenges similar to those of individual techniques remain, such as overfitting, implementation complexity, and the need for careful tuning. Addressing these challenges is essential to optimize hybrid techniques in imbalanced classification. The next section will propose strategies to enhance the effectiveness of hybrid techniques, aiming to improve these drawbacks and further improve performance.

3. Motivation

To avoid biased predictions, numerous imbalanced classification techniques have been developed. However, many of these techniques fail to account for critical factors such as data density variations and overlapping class regions. These oversights can severely degrade classification performance. For example, traditional resampling methods may overgeneralize minority class features or inadequately resolve overlapping regions, leading to poor Recall for minority instances and increased misclassification rates.
Recently, density-based and latent space mapping techniques have emerged as promising directions for addressing this issue. For example, the hybrid imbalanced classification model based on data density (HICD) leverages density-aware partitioning to enhance model performance [35]. This approach segments the dataset into distinct density regions, allowing for more targeted resampling and better identification of minority-class instances. However, HICD does not fully mitigate class overlap and may struggle with noisy data, potentially introducing classification errors.
Similarly, techniques that normalize data points under class discrepancy constraints attempt to map data into latent spaces to reduce classification complexity and enhance separability [36]. By transforming the original feature space into a latent representation, these methods aim to form distinct subclusters, facilitating more effective classification. However, they may struggle to maintain the original data structure, leading to information loss and potential within-class imbalances. This limitation arises because the mapping process may not consistently preserve crucial feature relationships across different density regions.
Additionally, Mayabadi and Saadatfar proposed two density-based sampling algorithms: one that employed undersampling to remove high-density samples from the majority class and another that combined undersampling and oversampling [37]. While these methods aimed to balance class distributions and reduce noise, they lacked a robust mechanism to distinguish between noise and valuable minority instances, potentially leading to information loss and reduced generalization ability.
In 2023, Tao et al. introduced self-adaptive oversampling methods, which dynamically adjust the resampling process based on minority class complexity [38]. This approach generates synthetic minority instances within adaptive hyperspheres while avoiding majority class instances, thereby reducing class overlap and enhancing minority class recognition. However, while this technique effectively minimizes overlap and mitigates outliers, it may still struggle to generate sufficiently diverse and representative synthetic samples, potentially limiting generalization.
Although recent advances have improved imbalanced classification, a significant gap remains in developing an integrated approach that effectively handles both class imbalance and data overlap while preserving the dataset structure. Motivated by these limitations, this study proposes a novel algorithm that explicitly considers class overlap by incorporating data density insights. The proposed hybrid algorithm integrates density-based resampling, data partitioning, and adaptive oversampling strategies. This approach aims to enhance minority-class recognition while maintaining structural integrity and minimizing the impact of overlapping instances. The next section details the methodology of the proposed hybrid algorithm, outlining the specific steps and techniques employed in data partitioning and data matching.

4. Designed Algorithms

A hybrid algorithm has been developed to address imbalanced binary classification, with the majority class denoted by $D_{maj}$ and the minority class by $D_{min}$. This algorithm consists of two main components: data characterization and data matching. In the data characterization stage, the data are identified into four types based on both the radius and the number of neighboring points. These types are Majority Overlap ($D_{maj}^{over}$), Minority Overlap ($D_{min}^{over}$), Minority Non-Overlap ($D_{min}^{non}$), and Majority Non-Overlap ($D_{maj}^{non}$). In the data-matching stage, the algorithm combines these four types into five distinct sets:
  • Set 0: Original (all parts combined),
  • Set 1: Minority Overlap vs. Majority Non-Overlap,
  • Set 2: Majority Overlap vs. Minority Non-Overlap,
  • Set 3: Minority Overlap vs. Majority Overlap,
  • Set 4: Minority Non-Overlap vs. Majority Non-Overlap.
The overall process of the proposed algorithm is illustrated in Figure 2.

4.1. Data Characterization

In the data characterization stage, the dataset is categorized into four distinct groups based on the radius and the number of neighboring points. The details are described in Algorithms 1 through 4. The four types resulting from this stage are:
  • $D_{maj}^{non}$: Majority class data points that do not overlap with the minority class.
  • $D_{min}^{non}$: Minority class data points that do not overlap with the majority class.
  • $D_{maj}^{over}$: Majority class data points that overlap with the minority class.
  • $D_{min}^{over}$: Minority class data points that overlap with the majority class.
To facilitate readability, the notations of all variables are summarized in Table 2.
Algorithm 1 shows the overall process of the data characterization stage, which uses the radius and minimum neighborhood. The radius and minimum neighborhood are calculated by Algorithm 2 and Algorithm 3, respectively. These computed values are then used to determine the type of data, with the process described in Algorithm 4: Data Typing for Overlapping Instances.
Algorithm 1: Data characterization
Input: Original dataset
Output: $D_{maj}^{non}$, $D_{min}^{non}$, $D_{maj}^{over}$, $D_{min}^{over}$
Pseudo Code:
  //1. Separate the dataset by class into two separate sets:
     $D_{maj}$ = SeparateDatasetByClass(dataset, majority)
     $D_{min}$ = SeparateDatasetByClass(dataset, minority)
  //2. Calculate the radius and minimum neighbors for minority class instances:
     $r_{min}$ = CalculateRadius($D_{maj}$, $D_{min}$) //Compute the radius for minority instances
     $minnei_{min}$ = CalculateMinimumNeighbors($D_{maj}$, $D_{min}$, $r_{min}$) //Determine minimum neighbors
  //3. Execute the function to type overlapping instances:
     $D_{maj}^{non}$, $D_{min}^{non}$, $D_{maj}^{over}$, $D_{min}^{over}$ = TypeOverlappingInstances($D_{maj}$, $D_{min}$, $minnei_{min}$, $r_{min}$)
  //4. Return the updated feature matrices and overlapping instances:
     Return $D_{maj}^{non}$, $D_{min}^{non}$, $D_{maj}^{over}$, $D_{min}^{over}$
Return:
   $D_{maj}^{non}$: Non-overlapping instances for the majority class
   $D_{min}^{non}$: Non-overlapping instances for the minority class
   $D_{maj}^{over}$: Overlapping instances for the majority class
   $D_{min}^{over}$: Overlapping instances for the minority class
Algorithm 1 serves as the overall framework for classifying data, incorporating several subfunctions explained in Algorithms 2–4. The SeparateDatasetByClass function splits the dataset into majority ($D_{maj}$) and minority ($D_{min}$) classes based on their labels. The CalculateRadius function determines the radius threshold, as described in Algorithm 2. The CalculateMinimumNeighbors function computes the minimum number of neighbors required to classify each instance, as explained in Algorithm 3. Finally, the TypeOverlappingInstances function identifies overlapping instances using the radius and neighbor thresholds, following the approach outlined in Algorithm 4.
The radius threshold $r_{min}$ is computed using pairwise distances between minority and majority class instances. For each minority instance, the distances to all majority instances are calculated (e.g., Euclidean distance), and the $n$-th percentile (e.g., the 75th percentile) of these distances is derived. The final $r_{min}$ is defined as the average of these percentile values across all minority instances:
$$r_{min} = \frac{1}{n_{min}} \sum_{i=1}^{n_{min}} P_{75}(d_i),$$
where $n_{min}$ is the number of minority instances and $P_{75}(d_i)$ is the 75th percentile of the distances from minority instance $i$ to all majority instances. While this process, implemented in the ComputePairwiseDistances function, introduces a computational complexity of $O(n_{min} \cdot n_{maj})$, it ensures robust identification of overlapping regions. The complete procedure is formalized in Algorithm 2.
Algorithm 2: Radius calculation
Input: Minority and majority class instances ($D_{min}$, $D_{maj}$)
Output: Radius ($r_{min}$)
Pseudo Code:
  //1. Compute all pairwise distances between instances of the minority and majority classes using the distance metric (e.g., Euclidean distance):
     distanceMatrix = ComputePairwiseDistances($D_{min}$, $D_{maj}$)
  //2. Calculate the percentile distances for each instance in the distance matrix (e.g., 75th percentile):
     percentileDistances = CalculatePercentileDistances(distanceMatrix)
  //3. Determine the mean of these percentile distances to obtain the radius ($r_{min}$) for the minority class instances:
     $r_{min}$ = CalculateMean(percentileDistances)
Return:
   Radius ($r_{min}$) for the minority class instances.
To better understand the radius calculation, consider the following example.
Let $(1,2) \in D_{min}$ and the majority class set be
$$D_{maj} = \{(3,3), (2,5), (6,2), (5,4), (1,6), (4,1), (7,3), (3,5), (2,1)\}.$$
Since the Euclidean distance is defined as
$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2},$$
we compute the distances between $(1,2)$ and each point in $D_{maj}$ as follows:
$d((1,2),(3,3)) = \sqrt{(3-1)^2 + (3-2)^2} = \sqrt{4+1} = \sqrt{5} \approx 2.24$
$d((1,2),(2,5)) = \sqrt{(2-1)^2 + (5-2)^2} = \sqrt{1+9} = \sqrt{10} \approx 3.16$
$d((1,2),(6,2)) = \sqrt{(6-1)^2 + (2-2)^2} = \sqrt{25+0} = \sqrt{25} = 5.00$
$d((1,2),(5,4)) = \sqrt{(5-1)^2 + (4-2)^2} = \sqrt{16+4} = \sqrt{20} \approx 4.47$
$d((1,2),(1,6)) = \sqrt{(1-1)^2 + (6-2)^2} = \sqrt{0+16} = \sqrt{16} = 4.00$
$d((1,2),(4,1)) = \sqrt{(4-1)^2 + (1-2)^2} = \sqrt{9+1} = \sqrt{10} \approx 3.16$
$d((1,2),(7,3)) = \sqrt{(7-1)^2 + (3-2)^2} = \sqrt{36+1} = \sqrt{37} \approx 6.08$
$d((1,2),(3,5)) = \sqrt{(3-1)^2 + (5-2)^2} = \sqrt{4+9} = \sqrt{13} \approx 3.61$
$d((1,2),(2,1)) = \sqrt{(2-1)^2 + (1-2)^2} = \sqrt{1+1} = \sqrt{2} \approx 1.41$
Then, to find the 75th percentile, sort the distances in ascending order:
$$[1.41, 2.24, 3.16, 3.16, 3.61, 4.00, 4.47, 5.00, 6.08]$$
The percentile rank is
$$P = \frac{75}{100} \times 9 = 6.75.$$
Rounding up, the 7th smallest distance is 4.47. Therefore, the radius corresponding to the 75th percentile for the point $(1,2)$ is 4.47.
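A minimal sketch of this radius calculation, assuming Euclidean distance and the round-up (nearest-rank) percentile rule used above, reproduces the worked example:

```python
# A sketch of Algorithm 2, assuming Euclidean distance and the round-up
# (nearest-rank) 75th-percentile rule from the worked example above.
import numpy as np
from scipy.spatial.distance import cdist

D_min = np.array([[1, 2]])
D_maj = np.array([[3, 3], [2, 5], [6, 2], [5, 4], [1, 6],
                  [4, 1], [7, 3], [3, 5], [2, 1]])

dist_matrix = cdist(D_min, D_maj)               # shape (n_min, n_maj)
sorted_d = np.sort(dist_matrix, axis=1)         # each row in ascending order

rank = int(np.ceil(0.75 * sorted_d.shape[1]))   # 0.75 * 9 = 6.75 -> rank 7
percentiles = sorted_d[:, rank - 1]             # 7th smallest distance per row

r_min = percentiles.mean()                      # average over minority instances
print(round(r_min, 2))                          # 4.47, matching the example
```

For the single minority point $(1,2)$, the script prints 4.47, matching the hand computation; with more minority instances, $r_{min}$ becomes the mean of their per-instance percentile distances.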
Algorithm 3 describes the process of calculating the minimum number of neighbors, which utilizes the radius obtained from Algorithm 2. This process involves finding the minimum number of neighbors required based on the radius by counting the number of points within this radius.
Algorithm 3: Minimum Neighbor Calculation
Input: Minority class instances ($D_{min}$), Radius ($r_{min}$)
Output: Minimum number of neighbors ($minnei_{min}$)
Pseudo Code:
  //1. Compute the distance of each minority instance to its neighbors within the radius ($r_{min}$):
     neighborDistances = ComputeDistancesWithinRadius($D_{min}$, $r_{min}$)
  //2. Determine the minimum number of neighbors required for each minority instance based on the calculated radius ($r_{min}$):
     $minnei_{min}$ = CalculateMinimumNeighbors(neighborDistances)
Return:
   Minimum number of neighbors ($minnei_{min}$) for the minority class instances.
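Because Algorithm 3 leaves the exact aggregation of neighbor counts open, the following sketch shows one plausible reading: count each minority instance’s minority neighbors within $r_{min}$ and collapse the counts into a single threshold. Both the choice of minority-only neighbors and the floored-mean aggregation are assumptions, not details from the paper:

```python
# One plausible reading of Algorithm 3 (the aggregation rule is an assumption).
import numpy as np
from scipy.spatial.distance import cdist

def calculate_minimum_neighbors(D_min: np.ndarray, r_min: float) -> int:
    """Count each minority instance's minority neighbors within r_min,
    then collapse the counts into one threshold (assumed: floored mean)."""
    dist = cdist(D_min, D_min)             # minority-to-minority distances
    np.fill_diagonal(dist, np.inf)         # an instance is not its own neighbor
    counts = (dist <= r_min).sum(axis=1)   # neighbors within the radius
    return int(np.floor(counts.mean()))    # assumed aggregation rule
```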
To classify the dataset into four types, Algorithm 4 processes each minority class instance using the radius calculated by Algorithm 2 ($r_{min}$). It first counts the majority class neighbors of each minority instance within this radius. It then considers the majority neighbors that lie within half of the radius ($\frac{r_{min}}{2}$). If the number of these close majority neighbors is less than or equal to the minimum required neighborhood size ($minnei_{min}$), the minority instance is classified as overlapping, and the close majority neighbors are also considered overlapping. Thus, the algorithm identifies instances as either overlapping or non-overlapping based on these criteria.
Algorithm 4: Data Typing for Overlapping Instances
Input: Dataset, Minority class instances ($D_{min}$), Majority class instances ($D_{maj}$), Minimum number of neighbors for the minority class ($minnei_{min}$), Radius for the minority class ($r_{min}$)
Output: $D_{maj}^{non}$, $D_{min}^{non}$, $D_{maj}^{over}$, $D_{min}^{over}$
Pseudo Code:
  //Initialize sets for non-overlapping and overlapping instances
   $D_{min}^{non}$ = Set()
   $D_{maj}^{non}$ = Set()
   $D_{min}^{over}$ = Set()
   $D_{maj}^{over}$ = Set()
  //1. For each instance in the minority class ($D_{min}$):
   For each minorityInstance in $D_{min}$:
     //Calculate the distance to all instances in the majority class ($D_{maj}$) within the radius ($r_{min}$)
       distances = ComputeDistances(minorityInstance, $D_{maj}$, $r_{min}$)
     //Identify instances in the majority class that are within half of the radius from the minority instance
       closeNeighbors = IdentifyCloseNeighbors(distances, $\frac{r_{min}}{2}$)
     //If the number of close majority neighbors is less than or equal to the minimum required neighborhood size ($minnei_{min}$)
     If Count(closeNeighbors) ≤ $minnei_{min}$:
       //The minority instance is considered an overlap
         $D_{min}^{over}$.Add(minorityInstance)
       //Add the close majority neighbors to the majority overlap set
         $D_{maj}^{over}$.AddAll(closeNeighbors)
     Else:
       //Otherwise, add the minority instance to non-overlapping
         $D_{min}^{non}$.Add(minorityInstance)
       //Also add the close majority instances to non-overlapping if not in the overlap set
       For each majorityInstance in closeNeighbors:
         If majorityInstance not in $D_{maj}^{over}$:
           $D_{maj}^{non}$.Add(majorityInstance)
  //2. Return updated feature matrices for the minority and majority classes, and sets of overlapping instances
   Return $D_{min}^{non}$, $D_{maj}^{non}$, $D_{min}^{over}$, $D_{maj}^{over}$
Return:
   $D_{maj}^{non}$: Non-overlapping instances for the majority class
   $D_{min}^{non}$: Non-overlapping instances for the minority class
   $D_{maj}^{over}$: Overlapping instances for the majority class
   $D_{min}^{over}$: Overlapping instances for the minority class
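A minimal Python rendering of Algorithm 4 is sketched below. One detail is an assumption on top of the pseudocode: if a majority instance is typed into both sets across iterations, overlap membership takes priority after the loop; majority instances that never appear among any minority instance’s close neighbors remain untyped, as in the pseudocode:

```python
# A minimal sketch of Algorithm 4; indices refer to rows of D_min and D_maj.
import numpy as np
from scipy.spatial.distance import cdist

def type_overlapping_instances(D_min, D_maj, minnei_min, r_min):
    min_over, min_non = set(), set()
    maj_over, maj_non = set(), set()
    dist = cdist(D_min, D_maj)                      # minority-to-majority distances
    for i in range(len(D_min)):
        # Majority neighbors within half the radius of this minority instance.
        close = set(np.where(dist[i] <= r_min / 2)[0])
        if len(close) <= minnei_min:
            min_over.add(i)                         # minority instance overlaps,
            maj_over |= close                       # and so do its close neighbors
        else:
            min_non.add(i)
            maj_non |= close
    # Assumption: overlap membership takes priority when an instance was
    # typed both ways in different iterations.
    maj_non -= maj_over
    return min_non, maj_non, min_over, maj_over
```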
The data comparison is shown in Figure 3. In Figure 3a, the example dataset is presented, with the majority class represented by red stars and the minority class by blue dots. Figure 3b shows the result of identified data, where orange stars represent the majority class and purple dots indicate the minority class in overlapping regions. Understanding the distribution and overlap of the minority and majority classes helps the model differentiate between them more effectively. By categorizing the data into minority overlap, majority overlap, minority non-overlap, and majority non-overlap, the model gains valuable insights into the dataset’s structure. This enables the model to learn from specific patterns, improving its ability to handle class imbalances and enhancing classification accuracy. The next stage involves data matching, which prepares the data to train five distinct models. This process allows the model to focus on nuanced aspects, such as overlapping or distinctly separated regions, improving generalization and prediction accuracy.

4.2. Data Matching

In the data matching stage, the four defined types are combined to form five distinct sets (Set 0–Set 4). This step is crucial for understanding the similarities and differences between the minority and majority classes in each type. The process is illustrated in Figure 4.
Since Figure 4 illustrates the steps of data matching, Figure 5 provides a detailed example for better understanding. It presents the matched data, which correspond to the same data shown in Figure 3.
After matching the data, each set is resampled using various techniques to balance the classes. The resampling methods include SMOTE, Borderline-SMOTE (b-SMOTE), Safe-SMOTE (s-SMOTE), and Random Oversampling (ROS). This resampling ensures that the models learn more effectively without bias, allowing them to better generalize the characteristics of each dataset (Set 0–Set 4).
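A sketch of this resampling step is shown below on placeholder data. Safe-SMOTE is not shipped with imbalanced-learn, so SMOTE stands in for all sets here; the per-set choice of sampler is illustrative (and, per the model list that follows, Set 0 is in fact used unresampled):

```python
# Illustrative sketch: balancing each matched set independently before training.
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Placeholder stand-ins for Set 0-Set 4 from the data-matching stage.
matched_sets = [
    make_classification(n_samples=300, weights=[0.85, 0.15], random_state=s)
    for s in range(5)
]

balanced_sets = []
for X_set, y_set in matched_sets:
    # The resampler can be chosen per set (SMOTE, Borderline-SMOTE, ROS, ...).
    X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_set, y_set)
    balanced_sets.append((X_bal, y_bal))
```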
The proposed method was designed to train five machine learning models, each addressing a unique aspect of the data:
  • Baseline Model: The first model serves as the baseline, trained using the original dataset (Set 0) without any resampling.
  • Overlap Differentiation Models: The second and third models focus on distinguishing overlapping from non-overlapping subsets, specifically minority overlap versus majority non-overlap (Set 1) and majority overlap versus minority non-overlap (Set 2).
  • In-depth Overlap Analysis Model: The fourth model is dedicated to an in-depth analysis of the overlapping subsets, specifically minority overlap versus majority overlap (Set 3).
  • Non-overlapping Subset Model: The fifth model examines the dataset after excluding the overlapping elements, focusing on minority non-overlap versus majority non-overlap (Set 4).
This approach, which utilizes distinct training sets, effectively captures the dataset’s complex patterns. To reduce bias in predictions, this study employs a weighted voting strategy based on Recall, AUC, and G-Mean, ensuring that models with stronger performance have a greater influence on the final decision.
The weighting mechanism follows a weighted average method, calculated as follows:
$$w_i = \frac{M_i}{\sum_{j=1}^{n} M_j}$$
where $w_i$ is the weight assigned to model $i$, and $M_i$ represents Recall, AUC, or G-Mean, depending on the weighting strategy. When weighting by the average of all three metrics, the weight is computed as:
$$w_i = \frac{Recall_i + G\text{-}Mean_i + AUC_i}{\sum_{j=1}^{n} (Recall_j + G\text{-}Mean_j + AUC_j)}$$
The voting mechanism is validated in five ways:
  • Non-weighted (Non–W): A simple majority vote ensures fairness by treating all models equally, which is beneficial when each metric is non-prioritized and every model has a similar performance level.
  • Weighted by Recall (W–R): Models with higher Recall receive greater influence. This approach is effective for imbalanced datasets, where detecting rare cases is crucial.
  • Weighted by G-Mean (W–G): Models with higher G-Mean contribute more to the final decision. This method balances sensitivity and specificity, ensuring both classes are well represented in the final prediction.
  • Weighted by AUC (W–A): Models with higher AUC scores have stronger voting power. This enhances class distinction across various thresholds.
  • Weighted by Average (W–avg): The model’s influence is determined by the average of its Recall, AUC, and G-Mean scores. This method provides a balanced approach by considering multiple performance metrics.
This setup ensures a thorough evaluation of each model’s impact on overall performance.
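The voting step can be rendered as the following sketch, where the per-model metric scores and 0/1 predictions are placeholders and the 0.5 decision threshold for the weighted vote is an assumption:

```python
# A sketch of weighted voting; predictions and metric scores are placeholders.
import numpy as np

def weighted_vote(predictions: np.ndarray, metric_scores: np.ndarray) -> np.ndarray:
    """predictions: (n_models, n_samples) array of 0/1 votes;
    metric_scores: (n_models,) Recall, G-Mean, AUC, or their average.
    Returns the weighted-majority label per sample (threshold 0.5 assumed)."""
    weights = metric_scores / metric_scores.sum()   # w_i = M_i / sum_j M_j
    support = weights @ predictions                 # weighted fraction voting "1"
    return (support >= 0.5).astype(int)

# Five models, four samples; weighting by Recall.
preds = np.array([[1, 0, 1, 1],
                  [1, 0, 0, 1],
                  [0, 0, 1, 1],
                  [1, 1, 1, 0],
                  [0, 0, 1, 1]])
recalls = np.array([0.90, 0.70, 0.80, 0.60, 0.85])
print(weighted_vote(preds, recalls))
```

Passing uniform scores (e.g., all ones) recovers the non-weighted majority vote.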
The next section will present the results of the designed algorithm, comparing its performance to that of baseline models, which utilize resampling based on SMOTE and employ Random Forest and SVC as base classifiers.

5. Experimental Results and Discussion

This section presents the results and discussion of the proposed algorithm. The algorithm is compared with baseline models using a process that involves resampling and training machine learning models. A paired t-test is conducted to identify any significant differences between the results.

5.1. Experimental Design

This study compares the proposed algorithm with baseline models, which follow the same workflow: applying resampling techniques, training machine learning models, and then making predictions. Resampling methods include SMOTE, Borderline-SMOTE, Safe-SMOTE, and Random Over Sampling (ROS). Machine learning models are Random Forest (RF) and Support Vector Classifier (SVC) with a linear kernel.
To expand the comparison, ensemble methods are also considered, including Bagging and AdaBoost, using Random Forest and SVC as base classifiers. Models are evaluated with five-fold cross-validation using Recall, G-Mean, and AUC.
The proposed algorithm incorporates five voting mechanisms: non-weighted and weighted voting based on Recall, G-Mean, AUC, and their combined average. A paired t-test is used to determine significant differences between the best results (highest Recall, G-Mean, and AUC) of the baseline models and the proposed algorithm.
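A sketch of this comparison is given below; the per-fold Recall arrays are placeholders, not results from Section 5.3:

```python
# Sketch of the significance test: paired t-test over per-fold scores.
import numpy as np
from scipy.stats import ttest_rel

# Placeholder per-fold Recall values for baseline vs. proposed (five folds).
baseline_recall = np.array([0.70, 0.68, 0.72, 0.69, 0.71])
proposed_recall = np.array([0.80, 0.79, 0.83, 0.78, 0.82])

t_stat, p_value = ttest_rel(proposed_recall, baseline_recall)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")   # p < 0.05 -> significant difference
```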
The experimental methodology is illustrated in Figure 6.

5.2. Datasets

This study uses 22 datasets with varying imbalance ratios, including yeast, wine quality, stroke, microcalcification, and water quality from UCI, Keel, and Kaggle [39,40,41,42,43,44,45]. The multiclass Yeast and Wine datasets are converted to binary classification, and the imbalance ratio is labeled. For example, “Yeast 143” has an imbalance ratio of 14.3. Detailed dataset information is in Table 3.

5.3. Results of Proposed Algorithm

Table 4, Table 5, Table 6 and Table 7 show the best-performing models for both the baseline and proposed algorithms, including Recall, G-Mean, and AUC values. Bold values indicate the best result, while bold values marked with an asterisk (*) represent a statistically significant difference based on the paired t-test at a 0.05 significance level, with the corresponding p-value shown in parentheses.
For readability, Table 4, Table 5, Table 6 and Table 7 use abbreviated names for the resampling methods, machine learning models, and weighting strategies. The resampling methods include SMOTE, Safe-SMOTE (S-SMOTE), Borderline-SMOTE (B-SMOTE), and Random Oversampling (ROS). The machine learning models are Support Vector Classification (SVC) and Random Forest (RF). The weighting strategies are non-weighted (Non-W), weighted by Recall (W–R), weighted by G-Mean (W–G), weighted by AUC (W–A), and weighted by average (W–Avg).
The results show that the proposed algorithm improves performance in several metrics. The non-weighted voting mechanism often matches or outperforms others, suggesting weighted techniques have a minimal impact on performance.
Table 4, Table 5, Table 6 and Table 7 demonstrate the superior performance of the proposed algorithm across multiple datasets, with consistent gains in Recall, G-Mean, and AUC compared to the baseline, though statistical significance varies.
For the Yeast datasets (Table 4), the proposed algorithm outperforms the baseline in seven out of ten datasets for Recall, eight for G-Mean, and nine for AUC. Significant Recall improvements are seen in Yeast 246 and Yeast 914, while G-Mean shows notable gains in Yeast 908 and Yeast 3273. AUC trends align closely with G-Mean. Non-weighted voting yields the best G-Mean and AUC, whereas Recall-based weighting enhances Recall.
For the Wine Red datasets (Table 5), the proposed algorithm significantly improves Recall in WineRedQ5–Q7, with WineRedQ6 achieving the highest gains. G-Mean and AUC improvements are minor, with the baseline slightly outperforming in WineRedQ4 and WineRedQ5. Weighted voting by Recall enhances Recall, while non-weighted voting performs best for G-Mean and AUC.
For the Wine White datasets (Table 6), Recall improvements are consistent across all cases, with significant gains in WineWhiteQ5–Q8. G-Mean and AUC improvements vary, with notable increases in WineWhiteQ7 and WineWhiteQ8. Weighting by Recall enhances Recall, while G-Mean-based weighting proves effective for G-Mean and AUC, particularly in WineWhiteQ7 and WineWhiteQ8.
For the Stroke, Microcal, and Water datasets (Table 7), the proposed algorithm significantly outperforms the baseline in Water, increasing Recall from 0.70400 to 0.96400. G-Mean and AUC show strong improvements in Stroke and Water, while gains in Microcal are minimal. Non-weighted voting yields the best G-Mean and AUC results in Stroke and Microcal, while Recall-based weighting proves crucial for Recall improvements in Water.
The proposed algorithm consistently improves Recall across most datasets while achieving competitive performance in G-Mean and AUC. The impact of voting mechanisms varies as follows: non-weighted voting is optimal for G-Mean and AUC, while Recall-based weighting significantly enhances Recall. Performance gains are dataset-dependent, with particularly strong improvements observed in the Water, Stroke, and Yeast subsets. These results highlight the effectiveness of the proposed algorithm in handling imbalanced datasets while demonstrating adaptability across different data distributions.
From Table 4, Table 5, Table 6 and Table 7, where some results are equal, Figure 7, Figure 8, Figure 9 and Figure 10 expand the comparison by illustrating the best outcomes for each weighted technique. These results highlight the proposed algorithm’s effectiveness, particularly with the Recall-based weighting strategy, which outperforms the baseline and other methods in several instances. Notably, Recall achieves the highest performance in 16 out of 22 datasets, G-Mean in 17 datasets, and AUC in 18 datasets.
Figure 7 compares the proposed algorithm with the baseline model for the Yeast dataset in terms of three metrics: (a) Recall, (b) G-Mean, and (c) AUC. Seven out of ten datasets show that the suggested algorithm performs better than the baseline, with strong Recall performance shown in Figure 7a. On the other hand, Figure 7b,c demonstrate superior outcomes for methods that do not employ metric weighting, implying that weighting could introduce bias and cause overfitting or decreased generalization. Weighting may disrupt the balance between class-wise performance (as measured by G-Mean and AUC), whereas non-weighted approaches maintain a more comprehensive balance.
The Yeast dataset, a well-known benchmark for imbalanced classification, is used to compare the proposed algorithm with previous methods, with G-Mean as the primary evaluation metric. G-Mean is chosen because it effectively balances sensitivity and specificity. The comparison results are presented in Table 8. Note that “CW [X]” refers to the comparative work with reference [X], and the values in the table represent the best performance of the proposed algorithm in each respective study, averaged over five folds. The missing data indicate that the paper did not use that particular dataset. Bold values in the table indicate the best performance among all compared methods for each Yeast dataset.
From Table 8, the proposed method demonstrates robust and consistent performance across the Yeast datasets, frequently surpassing or matching the results of prior studies (e.g., CW [35], CW [36], and others). For example, on Yeast 143, the proposed method achieves 0.82228, significantly outperforming CW [35] (0.6555) and CW [36] (0.7593). It also exhibits competitive performance against CW [38], with results of 0.70828 for Yeast 246 compared to 0.743, 0.94212 for Yeast 908 compared to 0.937, and 0.98711 for Yeast 3273, exceeding CW [38]’s 0.962. While the proposed method occasionally performs lower than some comparative approaches on specific datasets, it consistently outperforms most others overall, showcasing its reliability and adaptability.
The absence of results in some comparative works highlights that certain datasets were not used in those studies. This emphasizes the broader applicability of the proposed method, as it provides valuable performance metrics even where prior works lack data. By consistently delivering high G-Mean values across diverse datasets and conditions, the proposed algorithm proves to be a more comprehensive and dependable solution for imbalanced classification challenges compared to previous methods.
Comparable patterns can be seen in the Wine datasets (Figure 8 and Figure 9), where G-Mean and AUC improve only marginally, while Recall, particularly in weighted models, significantly exceeds the baseline. In the Wine Red results, Recall wins in three of the four datasets, while G-Mean and AUC win in two. In the Wine White results, Recall wins in all five sets, while G-Mean and AUC do so in four of the five. However, some datasets, such as Wine White Q6 and Wine Red Q6, have lower AUC and G-Mean values, suggesting that although weighting by Recall improves true positive detection, it can sometimes upset the balance between classes.
The Stroke dataset (Figure 10) exhibits slightly lower Recall than the baseline, but higher G-Mean and AUC, suggesting better overall balance even with a few positives missed. The Microcal and Water datasets show similar results. The Microcal dataset demonstrates slight gains across all metrics. The Water dataset, on the other hand, shows notable gains in Recall, G-Mean, and AUC, indicating that the model successfully strikes a balance between positive and negative class detection, resulting in a strong predictive model.

5.4. Discussion

The results demonstrate that the proposed algorithm effectively enhances Recall, particularly in highly imbalanced datasets, such as Yeast 914 ($IR = 9.14$), WineWhiteQ8 ($IR = 26.99$), Water ($IR = 7.77$), and WineRedQ7 ($IR = 7.04$), where Recall-weighted strategies significantly outperform the baseline ($p < 0.05$; Table 4, Table 5, Table 6 and Table 7). However, in some datasets, such as Yeast 143 and Stroke, the improvements are negligible, highlighting that Recall-based weighting does not always yield substantial benefits. Additionally, statistical tests could not be performed for Yeast 1225 and Yeast 3273 due to identical values or low variance.
In terms of G-Mean, non-weighted methods generally maintain a better balance and avoid overfitting. However, improvements are observed in datasets such as Yeast 3273, Stroke, and Water, where G-Mean-weighted strategies significantly outperform the baseline ($p < 0.05$; Table 4 and Table 7). Conversely, datasets such as Yeast 914 and WineRedQ4 exhibit no significant differences, suggesting that G-Mean weighting does not always enhance model performance.
For AUC, weighting strategies improve performance in datasets such as Yeast 3273, WineWhiteQ8, Stroke, and Water ($p < 0.05$; Table 4, Table 5, Table 6 and Table 7). However, many datasets, including Yeast 143 and WineRedQ5, show high p-values, indicating that AUC-based weighting does not consistently lead to significant improvements. This highlights the importance of dataset characteristics when selecting weighting strategies.
The comparative analysis with previous works confirms the stability and effectiveness of the proposed algorithm. For example, in the Yeast datasets (Table 4), the proposed method achieves notable Recall improvements, outperforming CW [38] on Yeast 3273, with a G-Mean of 0.98711 compared to 0.962 in prior methods. These results demonstrate the algorithm’s adaptability across diverse datasets, addressing limitations observed in previous approaches.
Dataset characteristics strongly influence model performance. Highly imbalanced datasets (e.g., Yeast 3057 and Microcalcification) pose challenges for Recall improvement, while larger datasets (e.g., Water) exhibit more stable performance gains. Simpler datasets with fewer features, such as Microcalcification, tend to yield stronger performance across all metrics. Achieving optimal performance requires balancing the imbalance ratio, dataset size, and feature count.
For voting techniques, Recall-weighted strategies excel at improving Recall, whereas non-weighted methods better maintain the balance of G-Mean and AUC. However, in some cases, comparable results across methods (as seen in Figure 7, Figure 8, Figure 9 and Figure 10) suggest potential areas for optimization. Balancing Recall, G-Mean, and AUC remains a challenge, as excessive weighting can disrupt the overall metric balance. Smaller datasets benefit from robust validation techniques, such as stratified k-fold cross-validation, to enhance result stability. While resampling techniques such as ROS and Safe-SMOTE help moderate imbalances, feature-rich datasets may require advanced techniques, such as feature selection or dimensionality reduction.
Future research should explore adaptive resampling and dynamic weighting strategies to further refine performance. Enhancing metric prioritization, improving validation techniques, and increasing dataset diversity will strengthen the generalizability and stability of the proposed algorithm.

6. Conclusions

The proposed algorithm introduces a hybrid framework to address challenges in binary imbalanced datasets. It begins with a data identification process that classifies instances into four types, forming five separate datasets. These datasets are resampled using SMOTE, Borderline-SMOTE, Safe-SMOTE, or Random Oversampling and subsequently used to train machine learning models, including Random Forest, Support Vector Classification, Bagging, and AdaBoost, with Random Forest and Support Vector Classification serving as base classifiers. In the final step, five weighting strategies (non-weighted and weighted by Recall, G-Mean, AUC, or their average) are applied to generate predictions, which are then evaluated against a baseline using Recall, G-Mean, and AUC.
The results show that the Recall-weighted strategy consistently delivers the highest Recall improvements across most datasets. Statistical tests confirm significant gains in Recall and G-Mean compared to the baseline, particularly in highly imbalanced datasets such as WineWhiteQ8 and Water. However, the weighted-by-AUC strategy fails to achieve statistical significance ($p > 0.05$ in many cases), suggesting limited effectiveness in certain datasets, such as Yeast 143 and Yeast 1225.
While the algorithm notably improves Recall, its impact on G-Mean and AUC remains moderate, highlighting areas for refinement. Future improvements could involve developing a refined weighting strategy that better balances Recall, G-Mean, and AUC. Alternatively, optimizing the data identification process or designing a tailored resampling strategy could further enhance overall performance. Adding techniques such as adaptive resampling and dynamic weighting could help refine the algorithm’s ability to improve Recall without sacrificing G-Mean or AUC.
Beyond enhancing classification effectiveness, future work should prioritize improving the computational efficiency of the proposed method. This includes reducing the complexity of distance calculations in data partitioning and resampling, as well as minimizing processing time for large datasets. Strategies such as optimizing feature selection, implementing parallel processing, and integrating more efficient sampling algorithms could significantly reduce computational overhead, making the algorithm more scalable for real-world applications.
These enhancements would lead to a more robust and generalizable model, capable of consistently outperforming the baseline across diverse datasets. Expanding the dataset diversity and exploring advanced validation techniques will further strengthen the proposed algorithm’s applicability in real-world scenarios.

Author Contributions

All authors contributed to the paper. K.T. was mainly responsible for the experimental work, while both K.T. and A.H. worked together on conceptualization. A.H., in the role of research advisor, oversaw the verification and final editing of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data and codes can be requested from the author.

Acknowledgments

This work was supported by King Mongkut’s Institute of Technology Ladkrabang.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284. [Google Scholar]
  2. Nasrollahpour, H.; Isildak, I.; Rashidi, M.-R.; Hashemi, E.A.; Naseri, A.; Khalilzadeh, B. Ultrasensitive bioassaying of HER-2 protein for diagnosis of breast cancer using reduced graphene oxide/chitosan as a nanobiocompatible platform. Cancer Nanotechnol. 2021, 12, 10. [Google Scholar] [CrossRef]
  3. Guo, K.; Wang, Y.; Kang, J.; Zhang, J.; Cao, R. Core dataset extraction from unlabeled medical big data for lesion localization. Big Data Res. 2021, 24, 100185. [Google Scholar] [CrossRef]
  4. Cheng, S.; Wu, Y.; Li, Y.; Yao, F.; Min, F. TWD-SFNN: Three-way decisions with a single hidden layer feedforward neural network. Inf. Sci. 2021, 579, 15–32. [Google Scholar] [CrossRef]
  5. Wu, C.; Luo, C.; Xiong, N.; Zhang, W.; Kim, T.-H. A greedy deep learning method for medical disease analysis. IEEE Access 2018, 6, 20021–20030. [Google Scholar] [CrossRef]
  6. Wei, W.; Li, J.; Cao, L.; Ou, Y.; Chen, J. Effective detection of sophisticated online banking fraud on extremely imbalanced data. World Wide Web-Internet Web Inf. Syst. 2013, 16, 449–475. [Google Scholar] [CrossRef]
  7. Niu, K.; Zhang, Z.; Liu, Y.; Li, R. Resampling ensemble model based on data distribution for imbalanced credit risk evaluation in P2P lending. Inf. Sci. 2020, 536, 120–134. [Google Scholar] [CrossRef]
  8. Daliri, S. Using harmony search algorithm in neural networks to improve fraud detection in the banking system. Comput. Intell. Neurosci. 2020, 2020, 6503459. [Google Scholar] [CrossRef]
  9. Cui, L.; Bai, L.; Wang, Y.; Jin, X.; Hancock, E.R. Internet financing credit risk evaluation using multiple structural interacting elastic net feature selection. Pattern Recognit. 2021, 114, 107835. [Google Scholar] [CrossRef]
  10. Yang, J.; Xiong, N.; Vasilakos, A.V.; Fang, Z.; Park, D.; Xu, X.; Yoon, S.; Xie, S.; Yang, Y. A fingerprint recognition scheme based on assembling invariant moments for cloud computing communications. IEEE Syst. J. 2011, 5, 574–583. [Google Scholar] [CrossRef]
  11. Xia, F.; Hao, R.; Li, J.; Xiong, N.; Yang, L.T.; Zhang, Y. Adaptive GTS allocation in IEEE 802.15.4 for real-time wireless sensor networks. J. Syst. Archit. 2013, 59 Pt D, 1231–1242. [Google Scholar] [CrossRef]
  12. Rezvani, S.; Wang, X. A broad review on class imbalance learning techniques. Appl. Soft Comput. 2023, 143, 110415. [Google Scholar] [CrossRef]
  13. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  14. Devi, D.; Biswas, S.K.; Purkayastha, B. Redundancy-driven modified Tomek-link based undersampling: A solution to class imbalance. Pattern Recognit. Lett. 2017, 93, 3–12.
  15. Freund, Y.; Schapire, R.E. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J. Comput. Syst. Sci. 1997, 55, 119–139.
  16. Chawla, N.V.; Lazarevic, A.; Hall, L.O.; Bowyer, K.W. SMOTEBoost: Improving Prediction of the Minority Class in Boosting. In Knowledge Discovery in Databases: PKDD 2003, Proceedings of the European Conference on Principles and Practice of Knowledge Discovery in Databases, Cavtat-Dubrovnik, Croatia, 22–26 September 2003; Springer: Berlin/Heidelberg, Germany, 2003; pp. 107–119.
  17. Batuwita, R.; Palade, V. Efficient resampling methods for training support vector machines with imbalanced datasets. In Proceedings of the International Joint Conference on Neural Networks 2010, Barcelona, Spain, 18–23 July 2010; pp. 1–8.
  18. Estabrooks, A.; Jo, T.; Japkowicz, N. A multiple resampling method for learning from imbalanced datasets. Comput. Intell. 2004, 20, 18–36.
  19. Lin, W.-C.; Tsai, C.-F.; Hu, Y.-H.; Jhang, J.-S. Clustering-based undersampling in class-imbalanced data. Inf. Sci. 2017, 409–410, 17–26.
  20. Fernandez, A.; Garcia, S.; del Jesus, M.J.; Herrera, F. A study of the behaviour of linguistic fuzzy rule-based classification systems in the framework of imbalanced datasets. Fuzzy Sets Syst. 2008, 159, 2378–2398.
  21. Fernandez, A.; del Jesus, M.J.; Herrera, F. On the 2-tuples based genetic tuning performance for fuzzy rule-based classification systems in imbalanced datasets. Inf. Sci. 2010, 180, 1268–1291.
  22. Qian, Y.; Liang, Y.; Li, M.; Feng, G.; Shi, X. A Resampling Ensemble Algorithm for Classification of Imbalance Problems. Neurocomputing 2014, 143, 57–67.
  23. Batista, G.; Bazzan, A.; Monard, M.C. Balancing Training Data for Automated Annotation of Keywords: A Case Study. In Proceedings of the II Brazilian Workshop on Bioinformatics, São Paulo, Brazil, 3–5 December 2003; pp. 10–18.
  24. Kumar, P.; Kumar, R.; Srivastava, G.; Gupta, G.P.; Tripathi, R.; Gadekallu, T.R.; Xiong, N.N. PPSF: A Privacy-Preserving and Secure Framework Using Blockchain-Based Machine Learning for IoT-Driven Smart Cities. IEEE Trans. Netw. Sci. Eng. 2021, 8, 2326–2341.
  25. Elkan, C. The Foundations of Cost-Sensitive Learning. In Proceedings of the IJCAI International Joint Conference on Artificial Intelligence, Seattle, WA, USA, 4–10 August 2001; pp. 973–978.
  26. Barandela, R.; Sánchez, J.S.; Valdovinos, R.M. New Applications of Ensembles of Classifiers. Pattern Anal. Appl. 2003, 6, 245–256.
  27. Chen, C.; Liaw, A.; Breiman, L. Using Random Forest to Learn Imbalanced Data; Technical Report; University of California: Berkeley, CA, USA, 2004.
  28. Yang, X.; Song, Q.; Cao, A. Weighted Support Vector Machine for Data Classification. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, Montreal, QC, Canada, 31 July–4 August 2005; pp. 859–864.
  29. Seiffert, C.; Khoshgoftaar, T.M.; Hulse, J.V.; Napolitano, A. RUSBoost: A Hybrid Approach to Alleviating Class Imbalance. IEEE Trans. Syst. Man Cybern.—Part A Syst. Hum. 2010, 40, 185–197.
  30. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
  31. Han, H.; Wang, W.-Y.; Mao, B.-H. Borderline-SMOTE: A new oversampling method in imbalanced datasets learning. In Advances in Intelligent Computing, Proceedings of the ICIC 2005, Hefei, China, 23–26 August 2005; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2005; Volume 3644, pp. 878–887.
  32. Bunkhumpornpat, C.; Sinapiromsaran, K.; Lursinsap, C. Safe-level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalance problem. In Advances in Knowledge Discovery and Data Mining, Proceedings of the PAKDD 2009, Bangkok, Thailand, 27–30 April 2009; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2009; Volume 5476, pp. 475–482.
  33. Johnson, J.M.; Khoshgoftaar, T.M. Survey on deep learning with class imbalance. J. Big Data 2019, 6, 27.
  34. Chen, W.; Yang, K.; Yu, Z.; Shi, Y.; Chen, C.L.P. A survey on imbalanced learning: Latest research, applications and future directions. Artif. Intell. Rev. 2024, 57, 137.
  35. Shi, S.; Li, J.; Zhu, D.; Yang, F.; Xu, Y. A Hybrid Imbalanced Classification Model Based on Data Density. Inf. Sci. 2023, 624, 50–67.
  36. Huang, Z.; Gao, X.; Chen, W.; Cheng, Y.; Xue, B.; Meng, Z.; Zhang, G.; Fu, S. An Imbalanced Binary Classification Method via Space Mapping Using Normalizing Flows with Class Discrepancy Constraints. Inf. Sci. 2023, 623, 493–523.
  37. Mayabadi, S.; Saadatfar, H. Two Density-Based Sampling Approaches for Imbalanced and Overlapping Data. Knowl.-Based Syst. 2022, 241, 108217.
  38. Tao, X.; Guo, X.; Zheng, Y.; Zhang, X.; Chen, Z. Self-Adaptive Oversampling Method Based on the Complexity of Minority Data in Imbalanced Datasets Classification. Knowl.-Based Syst. 2023, 277, 110795.
  39. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297.
  40. Nakai, K. Yeast. In UCI Machine Learning Repository; University of California, Irvine, School of Information and Computer Sciences: Irvine, CA, USA, 1991.
  41. Cortez, P.; Cerdeira, A.; Almeida, F.; Matos, T.; Reis, J. Modeling Wine Preferences by Data Mining from Physicochemical Properties. Decis. Support Syst. 2009, 47, 547–553.
  42. Alcalá-Fdez, J.; Fernandez, A.; Luengo, J.; Derrac, J.; García, S.; Sánchez, L.; Herrera, F. KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework. J. Mult.-Valued Log. Soft Comput. 2011, 17, 255–287.
  43. Fedesoriano. Stroke Prediction Dataset. Kaggle. Available online: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset/data (accessed on 15 November 2024).
  44. Mssmartypants. Water Quality Dataset. Kaggle. Available online: https://www.kaggle.com/datasets/mssmartypants/water-quality (accessed on 15 November 2024).
  45. Sudhanshu. Microcalcification Classification Dataset. Kaggle. Available online: https://www.kaggle.com/datasets/sudhanshu2198/microcalcification-classification/data (accessed on 15 November 2024).
  46. Mathew, J.; Pang, C.K.; Luo, M.; Leong, W.H. Classification of Imbalanced Data by Oversampling in Kernel Space of Support Vector Machines. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 4065–4076.
  47. Zhao, J.; Jin, J.; Chen, S.; Zhang, R.; Yu, B.; Liu, Q. A Weighted Hybrid Ensemble Method for Classifying Imbalanced Data. Knowl.-Based Syst. 2020, 203, 106087.
  48. Guo, J.; Wu, H.; Chen, X.; Lin, W. Adaptive SV-Borderline SMOTE-SVM Algorithm for Imbalanced Data Classification. Appl. Soft Comput. 2024, 150, 110986.
  49. Li, F.; Wang, B.; Shen, Y.; Wang, P.; Li, Y. An Overlapping Oriented Imbalanced Ensemble Learning Algorithm with Weighted Projection Clustering Grouping and Consistent Fuzzy Sample Transformation. Inf. Sci. 2023, 637, 118955.
Figure 1. Examples of imbalanced classification techniques [13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30].
Figure 2. Flow chart of the proposed algorithm.
Figure 3. The result of data partitioning: (a) the original dataset; (b) the dataset after instance types have been identified.
Figure 4. The process of data matching in each set: Set 0–Set 4 (a–e).
Figure 5. Examples of data matching in each set: Set 0–Set 4 (a–e).
Figure 6. Experimental methods: (a) baseline; (b) proposed algorithm.
Figure 7. Graphs comparing the three metrics: (a) Recall, (b) G-Mean, and (c) AUC for the Yeast datasets.
Figure 8. Graphs comparing the three metrics: (a) Recall, (b) G-Mean, and (c) AUC for the Wine Red datasets.
Figure 9. Graphs comparing the three metrics: (a) Recall, (b) G-Mean, and (c) AUC for the Wine White datasets.
Figure 10. Graphs comparing the three metrics: (a) Recall, (b) G-Mean, and (c) AUC for the Stroke, Microcal, and Water datasets.
Table 1. Advantages and disadvantages of each imbalanced classification technique.

Data-level techniques
Advantages:
  • Balance the class distribution, making it easier for the model to learn.
  • Can be applied to many classification algorithms without modification.
  • Oversampling provides the minority class with more examples, which improves detection.
Disadvantages:
  • Oversampling can lead to overfitting by duplicating minority class samples.
  • Undersampling can result in the loss of valuable information from the majority class.
  • Noise removal techniques might inadvertently remove important data.

Algorithm-level techniques
Advantages:
  • Produce classification models better suited to imbalanced data.
  • Ensemble methods can improve model performance.
  • Keep the original dataset, preserving all data points.
Disadvantages:
  • Can be computationally intensive and complex to implement.
  • May require extensive tuning and experimentation to achieve optimal performance.
  • May not generalize well if the underlying data distribution changes.

Hybrid techniques
Advantages:
  • Provide a balanced approach that mitigates the disadvantages of using either technique alone.
  • Often yield better overall performance and more robust models.
  • Flexible and adaptable to different types of data and problems.
Disadvantages:
  • Complexity increases due to the combination of multiple techniques.
  • May require significant computational resources and expertise to implement effectively.
  • Integrating different techniques can be challenging and may not always lead to the desired improvement.

Sources: Adapted from He and Garcia (2009) [1], Rezvani and Wang (2023) [12], Johnson and Khoshgoftaar (2019) [33], and Chen et al. (2024) [34].
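As a concrete illustration of the data-level techniques summarized in Table 1, the sketch below rebalances a synthetic dataset with random oversampling, SMOTE, Borderline-SMOTE, and random undersampling. It uses the imbalanced-learn package, which is an assumption made for illustration only; the paper does not prescribe any particular implementation.

```python
# Minimal sketch of data-level rebalancing (Table 1), assuming the
# imbalanced-learn package; not the authors' experimental code.
import numpy as np
from imblearn.over_sampling import RandomOverSampler, SMOTE, BorderlineSMOTE
from imblearn.under_sampling import RandomUnderSampler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = np.r_[np.zeros(460, dtype=int), np.ones(40, dtype=int)]  # ~11.5:1 imbalance

# Oversampling: duplicate (ROS) or synthesize (SMOTE / Borderline-SMOTE)
# minority instances until the classes are balanced.
for sampler in (RandomOverSampler(random_state=0),
                SMOTE(random_state=0),
                BorderlineSMOTE(random_state=0)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, np.bincount(y_res))  # balanced class counts

# Undersampling: discard majority instances (risks losing information).
X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("RandomUnderSampler", np.bincount(y_res))
```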
Table 2. Notations of all variables.

Notation | Description
$D_{maj}$ | Majority class instances.
$D_{min}$ | Minority class instances.
$r_{min}$ | Radius threshold for minority class instances (calculated via percentile distances).
$minnei_{min}$ | Minimum number of neighbors required for minority class instances.
$D_{maj}^{over}$ | Majority class instances overlapping with the minority class.
$D_{min}^{over}$ | Minority class instances overlapping with the majority class.
$D_{maj}^{non}$ | Majority class instances in non-overlapping regions.
$D_{min}^{non}$ | Minority class instances in non-overlapping regions.
Set 0–Set 4 | Five datasets constructed by pairing subsets (e.g., $D_{min}^{over} \cup D_{maj}^{non}$).
$distanceMatrix$ | Pairwise distance matrix between minority and majority instances.
$percentileDistances$ | Percentile distances used to compute $r_{min}$.
$neighborDistances$ | Distances of minority instances to neighbors within $r_{min}$.
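The sketch below shows how the quantities in Table 2 could be computed for the partition step: a minority-to-majority distance matrix, a percentile-based radius $r_{min}$, and neighbor counts that split each class into overlapping and non-overlapping subsets. The 10th-percentile radius and the single-neighbor threshold are illustrative assumptions rather than the paper's tuned settings, and counting majority-class neighbors within $r_{min}$ is one plausible reading of the notation, not a restatement of the authors' exact rule.

```python
# Illustrative sketch of the partition step using the notation of Table 2.
# The percentile and min_nei_min values are assumptions for this example.
import numpy as np
from scipy.spatial.distance import cdist

def partition(D_min, D_maj, percentile=10, min_nei_min=1):
    # distanceMatrix: pairwise distances between minority and majority instances.
    distance_matrix = cdist(D_min, D_maj)
    # percentileDistances -> r_min: a radius derived from the distance percentiles.
    r_min = np.percentile(distance_matrix, percentile)
    # neighborDistances: majority neighbors of each minority instance within r_min.
    neighbor_counts = (distance_matrix <= r_min).sum(axis=1)
    # Minority instances with enough nearby majority neighbors are "overlapping".
    min_over = neighbor_counts >= min_nei_min
    # A majority instance overlaps if it lies within r_min of any minority point.
    maj_over = (distance_matrix <= r_min).any(axis=0)
    return (D_min[min_over], D_min[~min_over],   # D_min^over, D_min^non
            D_maj[maj_over], D_maj[~maj_over])   # D_maj^over, D_maj^non

# Usage on synthetic 2-D data: the four subsets can then be paired into Set 0-Set 4.
rng = np.random.default_rng(0)
D_min = rng.normal(0.5, 1.0, size=(30, 2))
D_maj = rng.normal(0.0, 1.0, size=(300, 2))
min_over, min_non, maj_over, maj_non = partition(D_min, D_maj)
print(len(min_over), len(min_non), len(maj_over), len(maj_non))
```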
Table 3. All considered datasets.

Name | Dataset | Attributes | Instances | Class Distribution | Imbalance Ratio
Yeast 143 | Yeast | 8 | 459 | 429/30 | 14.3
Yeast 246 | Yeast | 8 | 1484 | 1055/429 | 2.46
Yeast 508 | Yeast | 8 | 1484 | 1240/244 | 5.08
Yeast 908 | Yeast | 8 | 514 | 463/51 | 9.08
Yeast 912 | Yeast | 8 | 506 | 456/50 | 9.12
Yeast 914 | Yeast | 8 | 1004 | 905/99 | 9.14
Yeast 935 | Yeast | 8 | 528 | 477/51 | 9.35
Yeast 1225 | Yeast | 8 | 464 | 429/35 | 12.25
Yeast 3057 | Yeast | 8 | 947 | 917/30 | 30.57
Yeast 3273 | Yeast | 8 | 1484 | 1440/44 | 32.73
WineRedQ4 | Wine Quality | 11 | 1599 | 53/1546 | 29.17
WineRedQ5 | Wine Quality | 11 | 1599 | 681/918 | 1.35
WineRedQ6 | Wine Quality | 11 | 1599 | 638/961 | 1.51
WineRedQ7 | Wine Quality | 11 | 1599 | 199/1400 | 7.04
WineWhiteQ4 | Wine Quality | 11 | 4898 | 163/4735 | 29.05
WineWhiteQ5 | Wine Quality | 11 | 4898 | 1457/3441 | 2.36
WineWhiteQ6 | Wine Quality | 11 | 4898 | 2198/2700 | 1.23
WineWhiteQ7 | Wine Quality | 11 | 4898 | 880/4018 | 4.57
WineWhiteQ8 | Wine Quality | 11 | 4898 | 175/4723 | 26.99
Stroke | Stroke | 10 | 4908 | 209/4699 | 22.48
Microcal | Microcalcification | 6 | 11,183 | 260/10,923 | 42.01
Water | Water Quality | 20 | 7996 | 912/7084 | 7.77
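The imbalance ratio column in Table 3 is the ratio of the majority-class count to the minority-class count; for example, Yeast 143 has 429 majority and 30 minority instances, giving 429/30 ≈ 14.3. A minimal check of that arithmetic:

```python
# Compute the imbalance ratio of a binary label vector (majority / minority),
# matching the values in Table 3; a trivial sketch, not the authors' code.
from collections import Counter

def imbalance_ratio(y):
    counts = Counter(y)
    return max(counts.values()) / min(counts.values())

# Example: Yeast 143 has 429 majority and 30 minority instances.
y = [0] * 429 + [1] * 30
print(round(imbalance_ratio(y), 2))  # 14.3
```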
Table 4. Summary of Best Recall, G-Mean, and AUC values for the Yeast datasets.

Datasets | Algorithm | Best Recall | Detail | Best G-Mean | Detail | Best AUC | Detail
Yeast 143 | Baseline | 0.86667 | S-SMOTE, Boosting SVC | 0.80803 | ROS, SVC | 0.80853 | ROS, SVC
Yeast 143 | Proposed | 0.83333 | S-SMOTE, Boosting SVC, W–R | 0.82228 | S-SMOTE, SVC, Non–W | 0.82248 | S-SMOTE, SVC, Non–W
Yeast 246 | Baseline | 0.81177 | B-SMOTE, SVC | 0.70388 | ROS, Bagging RF | 0.71012 | ROS, Bagging RF
Yeast 246 | Proposed | 0.96235 * (0.0042) | ROS, Boosting RF, W–R | 0.70828 | ROS, Bagging RF, Non–W | 0.71410 | ROS, Bagging RF, W–G
Yeast 508 | Baseline | 0.81923 | B-SMOTE, Boosting SVC | 0.79436 | ROS, Boosting SVC | 0.80102 | ROS, Boosting SVC
Yeast 508 | Proposed | 0.87692 | ROS, Bagging RF, W–R | 0.80338 | ROS, SVC, Non–W | 0.80361 | ROS, SVC, Non–W
Yeast 908 | Baseline | 0.88333 | B-SMOTE, Boosting SVC | 0.91206 | S-SMOTE, Bagging SVC | 0.91355 | S-SMOTE, SVC
Yeast 908 | Proposed | 0.93333 | ROS, SVC, Non–W | 0.94212 * (0.04187) | ROS, SVC, Non–W | 0.94249 * (0.04206) | ROS, SVC, Non–W
Yeast 912 | Baseline | 0.85714 | ROS, Bagging SVC | 0.78421 | ROS, Bagging SVC | 0.78752 | ROS, Bagging SVC
Yeast 912 | Proposed | 0.91429 | B-SMOTE, Boosting SVC, W–R | 0.78160 | B-SMOTE, SVC, W–G | 0.78541 | B-SMOTE, SVC, W–A
Yeast 914 | Baseline | 0.76191 | SMOTE, Bagging SVC | 0.81985 | SMOTE, SVC | 0.82206 | SMOTE, SVC
Yeast 914 | Proposed | 0.96190 * (0.00036) | ROS, Boosting RF, W–R | 0.81359 | SMOTE, Bagging RF, Non–W | 0.82429 | SMOTE, Bagging RF, Non–W
Yeast 935 | Baseline | 0.71429 | SMOTE, Bagging SVC | 0.76379 | S-SMOTE, Bagging RF | 0.77994 | S-SMOTE, Bagging RF
Yeast 935 | Proposed | 0.74286 | B-SMOTE, SVC, Non–W | 0.78540 | B-SMOTE, RF, W–G | 0.79928 | SMOTE, Boosting RF, W–A
Yeast 1225 | Baseline | 0.83333 | SMOTE, Boosting SVC | 0.81556 | B-SMOTE, RF | 0.83218 | B-SMOTE, RF
Yeast 1225 | Proposed | 0.83333 | B-SMOTE, Boosting SVC, Non–W | 0.85685 | ROS, SVC, W–G | 0.86379 | ROS, SVC, W–A
Yeast 3057 | Baseline | 0.84000 | ROS, Boosting SVC | 0.66553 | ROS, Bagging SVC | 0.67838 | ROS, Bagging SVC
Yeast 3057 | Proposed | 0.92000 | SMOTE, Boosting SVC, W–R | 0.69037 | ROS, SVC, W–G | 0.69730 | ROS, SVC, W–G
Yeast 3273 | Baseline | 1.00000 | ROS, Boosting SVC | 0.97441 | SMOTE, SVC | 0.97474 | SMOTE, SVC
Yeast 3273 | Proposed | 1.00000 | SMOTE, Boosting SVC, W–R | 0.98711 * (0.00025) | SMOTE, SVC, Non–W | 0.98720 * (0.00025) | SMOTE, SVC, Non–W

An asterisk (*) marks a value reported with a paired t-test p-value (shown in parentheses); the same convention applies in Tables 5–7.
Table 5. Summary of Best Recall, G-Mean, and AUC values for the Wine Red datasets.

Datasets | Algorithm | Best Recall | Detail | Best G-Mean | Detail | Best AUC | Detail
WineRedQ4 | Baseline | 0.90000 | SMOTE, SVC | 0.80205 | SMOTE, SVC | 0.80807 | SMOTE, SVC
WineRedQ4 | Proposed | 0.90000 | ROS, Bagging SVC, Non–W | 0.79468 | ROS, Bagging SVC, Non–W | 0.80097 | ROS, Bagging SVC, Non–W
WineRedQ5 | Baseline | 0.89692 | SMOTE, Boosting SVC | 0.77443 | SMOTE, Boosting RF | 0.77591 | SMOTE, Boosting RF
WineRedQ5 | Proposed | 0.97846 * (0.00374) | B-SMOTE, Bagging RF, W–R | 0.77267 | SMOTE, Boosting RF, Non–W | 0.77320 | SMOTE, Boosting RF, Non–W
WineRedQ6 | Baseline | 0.67879 | S-SMOTE, SVC | 0.70162 | B-SMOTE, Bagging RF | 0.70293 | B-SMOTE, Bagging RF
WineRedQ6 | Proposed | 0.96667 * (0.00003) | ROS, RF, W–R | 0.70804 | B-SMOTE, Bagging RF, W–G | 0.70835 | B-SMOTE, Bagging RF, W–G
WineRedQ7 | Baseline | 0.89524 | S-SMOTE, Boosting SVC | 0.81897 | B-SMOTE, Bagging SVC | 0.82141 | B-SMOTE, Bagging SVC
WineRedQ7 | Proposed | 0.98095 * (0.00605) | SMOTE, Boosting RF, W–R | 0.82385 | ROS, SVC, Non–W | 0.82653 | ROS, SVC, Non–W
Table 6. Summary of Best Recall, G-Mean, and AUC values for the Wine White datasets.

Datasets | Algorithm | Best Recall | Detail | Best G-Mean | Detail | Best AUC | Detail
WineWhiteQ4 | Baseline | 0.78400 | SMOTE, SVC | 0.78848 | S-SMOTE, SVC | 0.78886 | S-SMOTE, SVC
WineWhiteQ4 | Proposed | 0.80800 | S-SMOTE, Boosting SVC, W–R | 0.78799 | ROS, SVC, Non–W | 0.78831 | ROS, SVC, Non–W
WineWhiteQ5 | Baseline | 0.86804 | B-SMOTE, Boosting SVC | 0.76996 | B-SMOTE, Bagging RF | 0.77319 | B-SMOTE, Bagging RF
WineWhiteQ5 | Proposed | 0.97113 * (0.00005) | B-SMOTE, Boosting RF, W–R | 0.77350 | B-SMOTE, Bagging RF, W–G | 0.77548 | B-SMOTE, Bagging RF, W–G
WineWhiteQ6 | Baseline | 0.92500 | S-SMOTE, Boosting SVC | 0.71466 | ROS, Bagging RF | 0.71504 | ROS, Bagging RF
WineWhiteQ6 | Proposed | 0.99583 * (0.2256) | ROS, Boosting RF, W–R | 0.71982 | B-SMOTE, Bagging RF, W–G | 0.72000 | B-SMOTE, Bagging RF, W–G
WineWhiteQ7 | Baseline | 0.79375 | B-SMOTE, Bagging SVC | 0.74659 | ROS, Bagging RF | 0.76218 | ROS, Bagging RF
WineWhiteQ7 | Proposed | 0.96979 * (0.000003) | ROS, Bagging RF, W–R | 0.75287 * (0.02169) | B-SMOTE, Bagging RF, W–G | 0.76653 * (0.03713) | ROS, Bagging RF, W–G
WineWhiteQ8 | Baseline | 0.73714 | SMOTE, Boosting SVC | 0.70885 | ROS, Boosting SVC | 0.71270 | S-SMOTE, Bagging RF
WineWhiteQ8 | Proposed | 0.88571 * (0.00045) | S-SMOTE, Bagging RF, W–R | 0.78596 * (0.00735) | B-SMOTE, Boosting RF, W–R | 0.78815 * (0.00232) | B-SMOTE, Boosting RF, W–R
Table 7. Summary of Best Recall, G-Mean, and AUC values for the Stroke, Microcal, and Water datasets.

Datasets | Algorithm | Best Recall | Detail | Best G-Mean | Detail | Best AUC | Detail
Stroke | Baseline | 0.94717 | SMOTE, Boosting SVC | 0.73554 | ROS, Boosting SVC | 0.75056 | ROS, Boosting SVC
Stroke | Proposed | 0.93585 | ROS, SVC, W–R | 0.75936 * (0.00173) | S-SMOTE, Boosting SVC, Non–W | 0.76684 * (0.01101) | S-SMOTE, Boosting SVC, Non–W
Microcal | Baseline | 0.95200 | B-SMOTE, Boosting SVC | 0.91004 | ROS, Bagging SVC | 0.91038 | ROS, Bagging SVC
Microcal | Proposed | 0.96000 | B-SMOTE, Boosting SVC, Non–W | 0.91360 | SMOTE, SVC, Non–W | 0.91364 | SMOTE, SVC, Non–W
Water | Baseline | 0.70400 | B-SMOTE, Bagging SVC | 0.74155 | ROS, Bagging RF | 0.74486 | ROS, Bagging RF
Water | Proposed | 0.96400 * (0.00001) | SMOTE, RF, W–R | 0.79700 * (0.0023) | ROS, SVC, Non–W | 0.79764 * (0.0002) | ROS, SVC, Non–W
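Tables 4–7 report Recall, G-Mean, and AUC. The sketch below computes all three from predictions and scores using their standard definitions via scikit-learn; it is not the authors' evaluation code. In the Detail entries, we read W–R, W–G, and W–A as voting weighted by Recall, G-Mean, and AUC, respectively, and Non–W as unweighted voting; that reading is our assumption based on the paper's weighted-voting description.

```python
# Standard-definition sketch of the three reported metrics (scikit-learn),
# with small made-up label vectors; not the authors' evaluation code.
import numpy as np
from sklearn.metrics import recall_score, roc_auc_score, confusion_matrix

y_true  = np.array([1, 0, 1, 1, 0, 0, 0, 1, 0, 0])   # 1 = minority class
y_pred  = np.array([1, 0, 1, 0, 0, 1, 0, 1, 0, 0])   # hard predictions
y_score = np.array([0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.3, 0.7, 0.2, 0.1])

recall = recall_score(y_true, y_pred)                     # TPR on the minority class
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)                              # TNR on the majority class
g_mean = np.sqrt(recall * specificity)                    # G-Mean = sqrt(TPR * TNR)
auc = roc_auc_score(y_true, y_score)                      # ranking quality of scores

print(f"Recall={recall:.5f}  G-Mean={g_mean:.5f}  AUC={auc:.5f}")
```

G-Mean is the geometric mean of the minority-class and majority-class accuracies, which is why it penalizes a model that achieves high Recall by sacrificing the majority class.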
Table 8. Comparison of G-Mean Performance Across Different Methods.

Datasets | CW [35] | CW [36] | CW [37] | CW [38] | CW [46] | CW [47] | CW [48] | CW [49] | Proposed
Yeast 143 | 0.6555 | 0.7593 | - | - | 0.76 | - | 0.697 | 0.7573 | 0.82228
Yeast 246 | - | - | 0.718 | 0.743 | 0.72 | - | 0.683 | - | 0.70828
Yeast 908 | 0.8674 | 0.8494 | 0.954 | 0.937 | 0.9 | - | 0.863 | 0.9231 | 0.94212
Yeast 912 | - | - | - | - | 0.74 | - | 0.627 | 0.7263 | 0.7816
Yeast 914 | - | - | - | - | 0.8 | 0.7642 | 0.76 | - | 0.81359
Yeast 935 | - | 0.789 | - | - | 0.81 | - | 0.779 | - | 0.7854
Yeast 3057 | - | - | - | - | 0.73 | 0.6605 | 0.657 | - | 0.69037
Yeast 3273 | 0.9601 | - | - | 0.962 | 0.96 | 0.939 | 0.948 | - | 0.98711

CW [n] denotes the compared work cited as reference [n]; a dash (-) indicates that the corresponding study did not report a result for that dataset.
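The significance markers (*) in Tables 4–7 come from paired t-tests between the proposed algorithm and the baseline. A minimal sketch of such a test on per-run G-Mean scores follows; the five score pairs here are invented purely for illustration.

```python
# Sketch of the paired t-test behind the significance markers in Tables 4-7;
# the per-run G-Mean scores below are invented example values.
from scipy.stats import ttest_rel

baseline_gmean = [0.792, 0.801, 0.788, 0.795, 0.790]
proposed_gmean = [0.812, 0.820, 0.805, 0.818, 0.809]

t_stat, p_value = ttest_rel(proposed_gmean, baseline_gmean)
print(f"t={t_stat:.3f}, p={p_value:.5f}")  # e.g., p < 0.05 -> mark the result with *
```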