Article

A Partition-Based Hybrid Algorithm for Effective Imbalanced Classification

by
Kittipong Theephoowiang
and
Anantaporn Hanskunatai
*
Computer Science, School of Science, King Mongkut’s Institute of Technology Ladkrabang, Bangkok 10152, Thailand
*
Author to whom correspondence should be addressed.
Data 2025, 10(4), 54; https://doi.org/10.3390/data10040054
Submission received: 14 January 2025 / Revised: 19 March 2025 / Accepted: 13 April 2025 / Published: 18 April 2025
(This article belongs to the Section Information Systems and Data Management)

Abstract

Imbalanced classification presents a significant challenge in real-world datasets, requiring innovative solutions to enhance performance. This study introduces a hybrid binary classification algorithm designed to effectively address this challenge. The algorithm identifies different data types, pairs them, and trains multiple models, which then vote on predictions using weighted strategies to ensure stable performance and minimize overfitting. Unlike some methods, it is designed to work consistently with both noisy and noise-free datasets, prioritizing overall stability rather than specific noise adjustments. The algorithm’s effectiveness is evaluated using Recall, G-Mean, and AUC, measuring its ability to detect the minority class while maintaining balance. The results reveal notable improvements in minority class detection, with Recall outperforming other methods in 16 out of 22 datasets, supported by paired t-tests. The algorithm also shows promising improvements in G-Mean and AUC, ranking first in 17 and 18 datasets, respectively. To further evaluate its performance, the study compares the proposed algorithm with previous methods using G-Mean. The comparison confirms that the proposed algorithm also exhibits strong performance, further highlighting its potential. These findings emphasize the algorithm’s versatility in handling diverse datasets and its ability to balance minority class detection with overall accuracy.

1. Introduction

Classification is a fundamental problem in machine learning. Models perform well when trained on datasets whose class frequencies are balanced or approximately equal (e.g., a 1:1 ratio between the majority and minority classes in the binary case) [1]. However, this balance is uncommon in real-world datasets, which often exhibit large differences between classes, leading to imbalanced datasets. Examples include datasets related to medical diagnosis [2,3,4,5], fraud detection [6,7,8], or credit evaluation [9,10,11]. Therefore, improving accuracy in imbalanced classification is a crucial challenge.
Imbalanced datasets have several disadvantages for classification, such as bias towards the majority class, where models tend to overlook the minority class, leading to poor performance in critical applications such as medical diagnosis or fraud detection. Additionally, common evaluation metrics, such as accuracy, can be misleading: a model that predicts only the majority class may achieve high accuracy yet fail to correctly classify minority class instances. Because biased learning from the majority class leads to poor generalization, the model develops skewed predictions, such as failing to detect minority class instances, which reduces its effectiveness in real-world predictions. For example, in medical diagnosis, a model trained on imbalanced data might excel at identifying common diseases (majority class) but fail to recognize rare conditions (minority class). This limitation stems from the model’s decision boundaries being overly influenced by the majority class, causing it to misclassify or overlook underrepresented patterns in real-world scenarios. Furthermore, it is often challenging and costly to collect sufficient and representative data for the minority class, making it difficult to create balanced datasets.
Although many real-world datasets involve multiclass problems, binary classification is simpler to optimize due to reduced complexity in decision boundaries and class interactions. Decomposing multiclass tasks into binary subproblems, such as one-vs.-rest or one-vs.-one strategies, simplifies learning and improves interpretability. While multiclass models use a single framework, binary decomposition lowers algorithmic complexity and mitigates overfitting. Given these advantages, this study focuses on binary classification, particularly in imbalanced scenarios.
In a binary imbalanced dataset, there are two classes, known as the minority and majority classes. Typically, the goal of imbalanced classification is to detect the minority class, which is critical in applications such as fraud detection or rare disease diagnosis. Three main strategies address this challenge: data-level methods, algorithm-level methods, and hybrid methods [12].
Data-level methods adjust class distribution by oversampling the minority class (e.g., SMOTE [13]) or undersampling the majority class (e.g., Tomek Links [14]). Some advanced techniques are applied to remove noise or overlapped data (i.e., removing the ambiguous instances near class boundaries). These methods can enhance minority class detection but may lead to overfitting through synthetic data duplication, whereas the model might lose some information by undersampling the majority.
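As a concrete illustration (not drawn from this paper’s implementation), both kinds of resampling are available off the shelf in the imbalanced-learn library; the sketch below applies SMOTE and Tomek links to a synthetic dataset:

```python
# Illustrative sketch: data-level resampling with imbalanced-learn.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks

# Synthetic binary dataset with roughly a 9:1 class ratio.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))

# Oversampling: SMOTE interpolates synthetic minority instances.
X_os, y_os = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_os))  # classes are now balanced

# Undersampling: Tomek links remove ambiguous majority instances near the boundary.
X_us, y_us = TomekLinks().fit_resample(X, y)
print(Counter(y_us))
```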
Algorithm-level methods modify learning algorithms to prioritize minority instances, including cost-sensitive learning (assigning a higher cost to misclassifying minority instances) and ensemble methods such as AdaBoost [15], which iteratively adjusts weights for misclassified samples. These approaches can enhance model robustness, but they demand significant computational resources and parameter tuning.
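The sketch below illustrates these two algorithm-level adjustments with scikit-learn on synthetic data; the specific parameters are illustrative assumptions, not settings from this study:

```python
# Illustrative sketch: algorithm-level handling of imbalance with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Cost-sensitive learning: 'balanced' weights classes inversely to their
# frequencies, so errors on the minority class are penalized more heavily.
cost_sensitive = SVC(class_weight="balanced").fit(X, y)

# Boosting: AdaBoost reweights misclassified (often minority) samples each round.
boosted = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)
```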
Hybrid methods integrate data-level and algorithm-level methods, leveraging the strengths of both. For instance, SMOTEBoost [16] combines synthetic oversampling with boosting. While hybrid methods can provide balanced and robust solutions, they inherit complexity from both components and are resource intensive to implement.
To improve the accuracy of imbalanced classification, this work proposes a novel partition-based hybrid algorithm that strategically addresses class overlap and class imbalance. This approach operates in two phases: data partitioning, and dynamic training and voting.
Data partitioning divides the dataset to categorize the data into four distinct groups:
  • $D_{min}^{over}$: Minority instances overlapping with the majority class,
  • $D_{min}^{non}$: Minority instances in distinct regions,
  • $D_{maj}^{over}$: Majority instances overlapping with the minority class,
  • $D_{maj}^{non}$: Majority instances in distinct regions.
For example, in medical data, $D_{min}^{over}$ might represent patients with symptoms common to both rare and common diseases.
The dynamic training and voting phase constructs five different datasets by pairing the four subsets from the data-partitioning phase (e.g., $D_{min}^{over}$ vs. $D_{maj}^{non}$) to isolate specific learning challenges (overlap vs. separability). Each dataset is then balanced using adaptive oversampling (e.g., SMOTE for $D_{min}^{over}$) to avoid overfitting. Five diverse models (e.g., SVM, Random Forest) are trained, and predictions are aggregated via weighted voting, prioritizing metrics such as Recall for minority detection.
This approach reduces bias by separately addressing overlap and imbalance, while the ensemble structure enhances robustness. For instance, models trained on $D_{min}^{non}$ focus on pure minority patterns, improving rare-class detection without majority-class interference.
In model evaluation, this study uses Recall, G-Mean, and AUC. Recall prioritizes minority-class detection, G-Mean ensures a balance between sensitivity and specificity, and AUC measures overall class separability. Together, these metrics overcome the limitations of the F1-score, which is sensitive to false positives in imbalanced data, and the Matthews Correlation Coefficient (MCC), which is less interpretable in binary contexts. Our experimental results demonstrate that this combination provides a comprehensive assessment of model performance, particularly in imbalanced and overlapping datasets.
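For reference, the three metrics can be computed as in the following sketch, which uses scikit-learn and imbalanced-learn on placeholder predictions (the arrays are illustrative, not experimental data):

```python
# Illustrative sketch: computing Recall, G-Mean, and AUC on placeholder outputs.
import numpy as np
from sklearn.metrics import recall_score, roc_auc_score
from imblearn.metrics import geometric_mean_score

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])   # minority class = 1
y_pred = np.array([0, 0, 1, 0, 0, 0, 0, 1, 1, 0])
y_score = np.array([0.1, 0.2, 0.6, 0.3, 0.1, 0.2, 0.3, 0.9, 0.8, 0.4])

recall = recall_score(y_true, y_pred)           # minority-class sensitivity
g_mean = geometric_mean_score(y_true, y_pred)   # sqrt(sensitivity * specificity)
auc = roc_auc_score(y_true, y_score)            # threshold-independent separability
print(recall, g_mean, auc)
```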
The rest of this paper is arranged as follows: Section 2 reviews existing methods for handling imbalanced classification. Section 3 presents the motivation behind this work. Section 4 provides a detailed description of the proposed hybrid method. Section 5 shows the experimental design and the results, including baseline comparisons using resampling techniques and model training with both ensemble (bagging and AdaBoost) and non-ensemble approaches, utilizing Random Forest and Support Vector Classifier (SVC) with linear, RBF, and polynomial kernels as base estimators. Section 6 concludes the paper, summarizing the key findings, highlighting the advantages of the proposed method, and suggesting future research directions.

2. Related Works

The challenge of imbalanced classification has been extensively researched, leading to the development of various approaches. These can be classified into three techniques: data-level techniques, algorithm-level techniques, and hybrid techniques [12]. The examples are shown in Figure 1.
Data-level techniques balance class distribution by modifying the dataset and are divided into three types: oversampling, undersampling, and hybrid sampling. For example, random undersampling (RUS) removes instances from the majority class randomly, which may lead to the loss of important data [17,18]. This issue is mitigated by selectively removing majority-class instances using methods such as Tomek Link and clustering-based undersampling [14,19]. Conversely, oversampling techniques, such as random oversampling (ROS), add minority class examples randomly to prevent information loss, but they may cause overfitting [20,21]. The Synthetic Minority Oversampling Technique (SMOTE) creates synthetic minority instances, though it can distort the marginal data distribution [13]. Alternative techniques, such as Borderline-SMOTE and Safe-SMOTE, have been proposed to overcome these limitations [31,32]. Hybrid sampling approaches combine the strengths of oversampling and undersampling to create balanced datasets [22,23,24]. These techniques simplify class distribution, making it easier for models to learn, and can be applied to any classification algorithm without modification. However, oversampling can cause overfitting by duplicating minority class samples, while undersampling can lead to the loss of valuable majority class information. Noise removal techniques may also inadvertently remove important data.
Algorithm-level techniques improve imbalanced data handling by modifying the learning algorithm. The basic techniques are cost-sensitive learning and ensemble learning [25,26]. Cost-sensitive learning gives more importance to the minority class, making it a popular choice for imbalanced data. Ensemble learning combines multiple models to enhance classification performance. For example, AdaBoost adapts by sequentially training weak learners and adjusting weights based on misclassification errors to emphasize difficult instances, thereby improving the handling of imbalanced data [15]. Random forests construct multiple decision trees during training and aggregate their predictions to improve accuracy and robustness [27]. Weighted SVM modifies the SVM algorithm by assigning different weights to classes, prioritizing the minority class [28]. These techniques help models handle imbalanced data better, preserve all data points, and improve robustness and accuracy. However, they can be computationally intensive, complex to implement, require extensive tuning, and may not generalize well if the underlying data distribution changes.
Hybrid techniques combine the advantages of data-level and algorithm-level methods, resulting in high performance for handling imbalanced datasets, which makes them popular. These techniques integrate ensemble models with resampling methods to improve performance. For example, SMOTEBoost combines SMOTE with the AdaBoost algorithm [16]. RUSBoost combines random undersampling with boosting techniques [29]. XGBoost is a powerful method that uses gradient boosting with regularization to improve performance and manage class imbalance [30]. Balanced random forest combines resampling techniques with the random forest algorithm. Hybrid techniques offer a balanced approach, mitigating the disadvantages of individual methods, often resulting in better performance and more robust models. However, their complexity increases due to the combination of multiple techniques, requiring significant computational resources, and the integration of different methods can be challenging and may not always lead to improvement.
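As an illustration only, two of the hybrid methods mentioned above have off-the-shelf implementations in imbalanced-learn; the sketch below fits them on synthetic data (SMOTEBoost has no implementation there, so it is omitted):

```python
# Illustrative sketch: two hybrid methods available in imbalanced-learn.
from sklearn.datasets import make_classification
from imblearn.ensemble import BalancedRandomForestClassifier, RUSBoostClassifier

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# RUSBoost: random undersampling of the majority class inside each boosting round.
rusboost = RUSBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

# Balanced random forest: each tree is grown on a bootstrap sample balanced
# by undersampling the majority class.
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```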
In conclusion, various approaches have been developed to address imbalanced classification, each with its own benefits and drawbacks. The summary of three techniques is shown in Table 1. Hybrid techniques, combining data-level and algorithm-level methods, have significant potential to improve classification performance. They benefit from the strengths of both approaches, including improved class balance and enhanced model learning. However, challenges similar to those of individual techniques remain, such as overfitting, implementation complexity, and the need for careful tuning. Addressing these challenges is essential to optimize hybrid techniques in imbalanced classification. The next section will propose strategies to enhance the effectiveness of hybrid techniques, aiming to improve these drawbacks and further improve performance.

3. Motivation

To avoid biased predictions, numerous imbalanced classification techniques have been developed. However, many of these techniques fail to account for critical factors such as data density variations and overlapping class regions. These oversights can severely degrade classification performance. For example, traditional resampling methods may overgeneralize minority class features or inadequately resolve overlapping regions, leading to poor Recall for minority instances and increased misclassification rates.
Recently, density-based and latent space mapping techniques have emerged as promising directions for addressing this issue. For example, the hybrid imbalanced classification model based on data density (HICD) leverages density-aware partitioning to enhance model performance [35]. This approach segments the dataset into distinct density regions, allowing for more targeted resampling and better identification of minority-class instances. However, HICD does not fully mitigate class overlap and may struggle with noisy data, potentially introducing classification errors.
Similarly, techniques that normalize data points under class discrepancy constraints attempt to map data into latent spaces to reduce classification complexity and enhance separability [36]. By transforming the original feature space into a latent representation, these methods aim to form distinct subclusters, facilitating more effective classification. However, they may struggle to maintain the original data structure, leading to information loss and potential within-class imbalances. This limitation arises because the mapping process may not consistently preserve crucial feature relationships across different density regions.
Additionally, Mayabadi and Saadatfar proposed two density-based sampling algorithms: one that employed undersampling to remove high-density samples from the majority class and another that combined undersampling and oversampling [37]. While these methods aimed to balance class distributions and reduce noise, they lacked a robust mechanism to distinguish between noise and valuable minority instances, potentially leading to information loss and reduced generalization ability.
In 2023, Tao et al. introduced self-adaptive oversampling methods, which dynamically adjust the resampling process based on minority class complexity [38]. This approach generates synthetic minority instances within adaptive hyperspheres while avoiding majority class instances, thereby reducing class overlap and enhancing minority class recognition. However, while this technique effectively minimizes overlap and mitigates outliers, it may still struggle to generate sufficiently diverse and representative synthetic samples, potentially limiting generalization.
Although recent advances have improved imbalanced classification, a significant gap remains in developing an integrated approach that effectively handles both class imbalance and data overlap while preserving the dataset structure. Motivated by these limitations, this study proposes a novel algorithm that explicitly considers class overlap by incorporating data density insights. The proposed hybrid algorithm integrates density-based resampling, data partitioning, and adaptive oversampling strategies. This approach aims to enhance minority-class recognition while maintaining structural integrity and minimizing the impact of overlapping instances. The next section details the methodology of the proposed hybrid algorithm, outlining the specific steps and techniques employed in data partitioning and data matching.

4. Designed Algorithms

A hybrid algorithm has been developed to address imbalanced binary classification, with the majority class denoted by $D_{maj}$ and the minority class by $D_{min}$. This algorithm consists of two main components: data characterization and data matching. In the data characterization stage, the data are identified into four types based on both the radius and the number of neighboring points. These types are Majority Overlap ($D_{maj}^{over}$), Minority Overlap ($D_{min}^{over}$), Minority Non-Overlap ($D_{min}^{non}$), and Majority Non-Overlap ($D_{maj}^{non}$). In the data-matching stage, the algorithm combines these four types into five distinct sets:
  • Set 0: Original (all parts combined),
  • Set 1: Minority Overlap vs. Majority Non-Overlap,
  • Set 2: Majority Overlap vs. Minority Non-Overlap,
  • Set 3: Minority Overlap vs. Majority Overlap,
  • Set 4: Minority Non-Overlap vs. Majority Non-Overlap.
The overall process of the proposed algorithm is illustrated in Figure 2.

4.1. Data Characterization

In the data characterization stage, the dataset is categorized into four distinct groups based on the radius and the number of neighboring points. The details are described in Algorithms 1 through 4. The four types resulting from this stage are:
  • $D_{maj}^{non}$: Majority class data points that do not overlap with the minority class.
  • $D_{min}^{non}$: Minority class data points that do not overlap with the majority class.
  • $D_{maj}^{over}$: Majority class data points that overlap with the minority class.
  • $D_{min}^{over}$: Minority class data points that overlap with the majority class.
To facilitate readability, the notations of all variables are summarized in Table 2.
Algorithm 1 shows the overall process of the data characterization stage, which uses the radius and minimum neighborhood. The radius and minimum neighborhood are calculated by Algorithm 2 and Algorithm 3, respectively. These computed values are then used to determine the type of data, with the process described in Algorithm 4: Data Typing for Overlapping Instances.
Algorithm 1: Data characterization
Input: Original dataset
Output: $D_{maj}^{non}$, $D_{min}^{non}$, $D_{maj}^{over}$, $D_{min}^{over}$
Pseudo Code:
  //1. Separate the dataset by class into two separate sets:
     $D_{maj}$ = SeparateDatasetByClass(dataset, majority)
     $D_{min}$ = SeparateDatasetByClass(dataset, minority)
  //2. Calculate the radius and minimum neighbors for minority class instances:
     $r_{min}$ = CalculateRadius($D_{maj}$, $D_{min}$) //Compute the radius for minority instances
     $minnei_{min}$ = CalculateMinimumNeighbors($D_{maj}$, $D_{min}$, $r_{min}$) //Determine minimum neighbors
  //3. Execute the function to type overlapping instances:
     $D_{maj}^{non}$, $D_{min}^{non}$, $D_{maj}^{over}$, $D_{min}^{over}$ = TypeOverlappingInstances($D_{maj}$, $D_{min}$, $minnei_{min}$, $r_{min}$)
  //4. Return the updated feature matrices and overlapping instances:
     Return $D_{maj}^{non}$, $D_{min}^{non}$, $D_{maj}^{over}$, $D_{min}^{over}$
Return:
   $D_{maj}^{non}$: Non-overlapping instances for the majority class
   $D_{min}^{non}$: Non-overlapping instances for the minority class
   $D_{maj}^{over}$: Overlapping instances for the majority class
   $D_{min}^{over}$: Overlapping instances for the minority class
Algorithm 1 serves as the overall framework for classifying data, incorporating several subfunctions explained in Algorithms 2–4. The SeparateDatasetByClass function splits the dataset into majority ($D_{maj}$) and minority ($D_{min}$) classes based on their labels. The CalculateRadius function determines the radius threshold, as described in Algorithm 2. The CalculateMinimumNeighbors function computes the minimum number of neighbors required to classify each instance, as explained in Algorithm 3. Finally, the TypeOverlappingInstances function identifies overlapping instances using the radius and neighbor thresholds, following the approach outlined in Algorithm 4.
The radius threshold $r_{min}$ is computed using pairwise distances between minority and majority class instances. For each minority instance, the distances to all majority instances are calculated (e.g., Euclidean distance), and the $n$-th percentile (e.g., the 75th percentile) of these distances is derived. The final $r_{min}$ is defined as the average of these percentile values across all minority instances:
$$r_{min} = \frac{1}{n_{min}} \sum_{i=1}^{n_{min}} P_{75}(d_i),$$
where $n_{min}$ is the number of minority instances and $P_{75}(d_i)$ is the 75th percentile of the distances from minority instance $i$ to all majority instances. While this process, implemented in the ComputePairwiseDistances function, introduces a computational complexity of $O(n_{min} \cdot n_{maj})$, it ensures robust identification of overlapping regions. The complete procedure is formalized in Algorithm 2.
Algorithm 2: Radius calculation
Input: Minority and majority class instances ($D_{min}$, $D_{maj}$)
Output: Radius ($r_{min}$)
Pseudo Code:
  //1. Compute all pairwise distances between instances of the minority and majority classes using the distance metric (e.g., Euclidean distance):
     distanceMatrix = ComputePairwiseDistances($D_{min}$, $D_{maj}$)
  //2. Calculate the percentile distances for each instance in the distance matrix (e.g., 75th percentile):
     percentileDistances = CalculatePercentileDistances(distanceMatrix)
  //3. Determine the mean of these percentile distances to obtain the radius ($r_{min}$) for the minority class instances:
     $r_{min}$ = CalculateMean(percentileDistances)
Return:
   Radius ($r_{min}$) for the minority class instances.
To better understand the radius calculation, consider the following example.
Let $(1,2) \in D_{min}$ and the majority class set be
$$D_{maj} = \{(3,3), (2,5), (6,2), (5,4), (1,6), (4,1), (7,3), (3,5), (2,1)\}.$$
Since the Euclidean distance is defined as
$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2},$$
we compute the distances between $(1,2)$ and each point in $D_{maj}$ as follows:
$d((1,2),(3,3)) = \sqrt{(3-1)^2 + (3-2)^2} = \sqrt{4+1} = \sqrt{5} \approx 2.24$
$d((1,2),(2,5)) = \sqrt{(2-1)^2 + (5-2)^2} = \sqrt{1+9} = \sqrt{10} \approx 3.16$
$d((1,2),(6,2)) = \sqrt{(6-1)^2 + (2-2)^2} = \sqrt{25+0} = \sqrt{25} = 5.00$
$d((1,2),(5,4)) = \sqrt{(5-1)^2 + (4-2)^2} = \sqrt{16+4} = \sqrt{20} \approx 4.47$
$d((1,2),(1,6)) = \sqrt{(1-1)^2 + (6-2)^2} = \sqrt{0+16} = \sqrt{16} = 4.00$
$d((1,2),(4,1)) = \sqrt{(4-1)^2 + (1-2)^2} = \sqrt{9+1} = \sqrt{10} \approx 3.16$
$d((1,2),(7,3)) = \sqrt{(7-1)^2 + (3-2)^2} = \sqrt{36+1} = \sqrt{37} \approx 6.08$
$d((1,2),(3,5)) = \sqrt{(3-1)^2 + (5-2)^2} = \sqrt{4+9} = \sqrt{13} \approx 3.61$
$d((1,2),(2,1)) = \sqrt{(2-1)^2 + (1-2)^2} = \sqrt{1+1} = \sqrt{2} \approx 1.41$
Then, to find the 75th percentile, sort the distances in ascending order:
$$[1.41, 2.24, 3.16, 3.16, 3.61, 4.00, 4.47, 5.00, 6.08]$$
The percentile rank is
$$P = \frac{75}{100} \times 9 = 6.75.$$
Rounding up, the 7th smallest distance is 4.47. Therefore, the radius corresponding to the 75th percentile for the point $(1,2)$ is 4.47.
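A minimal sketch of this radius calculation, assuming Euclidean distance and the round-up (nearest-rank) percentile rule used above, reproduces the worked example:

```python
# A sketch of Algorithm 2, assuming Euclidean distance and the round-up
# (nearest-rank) 75th-percentile rule from the worked example above.
import numpy as np
from scipy.spatial.distance import cdist

D_min = np.array([[1, 2]])
D_maj = np.array([[3, 3], [2, 5], [6, 2], [5, 4], [1, 6],
                  [4, 1], [7, 3], [3, 5], [2, 1]])

dist_matrix = cdist(D_min, D_maj)               # shape (n_min, n_maj)
sorted_d = np.sort(dist_matrix, axis=1)         # each row in ascending order

rank = int(np.ceil(0.75 * sorted_d.shape[1]))   # 0.75 * 9 = 6.75 -> rank 7
percentiles = sorted_d[:, rank - 1]             # 7th smallest distance per row

r_min = percentiles.mean()                      # average over minority instances
print(round(r_min, 2))                          # 4.47, matching the example
```

For the single minority point $(1,2)$, the script prints 4.47, matching the hand computation; with more minority instances, $r_{min}$ becomes the mean of their per-instance percentile distances.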
Algorithm 3 describes the process of calculating the minimum number of neighbors, which utilizes the radius obtained from Algorithm 2. This process involves finding the minimum number of neighbors required based on the radius by counting the number of points within this radius.
Algorithm 3: Minimum Neighbor Calculation
Input: Minority class instances ($D_{min}$), Radius ($r_{min}$)
Output: Minimum number of neighbors ($minnei_{min}$)
Pseudo Code:
  //1. Compute the distance of each minority instance to its neighbors within the radius ($r_{min}$):
     neighborDistances = ComputeDistancesWithinRadius($D_{min}$, $r_{min}$)
  //2. Determine the minimum number of neighbors required for each minority instance based on the calculated radius ($r_{min}$):
     $minnei_{min}$ = CalculateMinimumNeighbors(neighborDistances)
Return:
   Minimum number of neighbors ($minnei_{min}$) for the minority class instances.
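Because Algorithm 3 leaves the exact aggregation of neighbor counts open, the following sketch shows one plausible reading: count each minority instance’s minority neighbors within $r_{min}$ and collapse the counts into a single threshold. Both the choice of minority-only neighbors and the floored-mean aggregation are assumptions, not details from the paper:

```python
# One plausible reading of Algorithm 3 (the aggregation rule is an assumption).
import numpy as np
from scipy.spatial.distance import cdist

def calculate_minimum_neighbors(D_min: np.ndarray, r_min: float) -> int:
    """Count each minority instance's minority neighbors within r_min,
    then collapse the counts into one threshold (assumed: floored mean)."""
    dist = cdist(D_min, D_min)             # minority-to-minority distances
    np.fill_diagonal(dist, np.inf)         # an instance is not its own neighbor
    counts = (dist <= r_min).sum(axis=1)   # neighbors within the radius
    return int(np.floor(counts.mean()))    # assumed aggregation rule
```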
To classify the dataset into four types, Algorithm 4 processes each minority class instance using the radius calculated by Algorithm 2 ($r_{min}$). It first counts the majority class neighbors of each minority instance within this radius. It then considers the majority neighbors that lie within half of the radius ($\frac{r_{min}}{2}$). If the number of these close majority neighbors is less than or equal to the minimum required neighborhood size ($minnei_{min}$), the minority instance is classified as overlapping, and the close majority neighbors are also considered overlapping. Thus, the algorithm identifies instances as either overlapping or non-overlapping based on these criteria.
Algorithm 4: Data Typing for Overlapping Instances
Input: Dataset, Minority class instances ($D_{min}$), Majority class instances ($D_{maj}$), Minimum number of neighbors for the minority class ($minnei_{min}$), Radius for the minority class ($r_{min}$)
Output: $D_{maj}^{non}$, $D_{min}^{non}$, $D_{maj}^{over}$, $D_{min}^{over}$
Pseudo Code:
  //Initialize sets for non-overlapping and overlapping instances
   $D_{min}^{non}$ = Set()
   $D_{maj}^{non}$ = Set()
   $D_{min}^{over}$ = Set()
   $D_{maj}^{over}$ = Set()
  //1. For each instance in the minority class ($D_{min}$):
   For each minorityInstance in $D_{min}$:
     //Calculate the distance to all instances in the majority class ($D_{maj}$) within the radius ($r_{min}$)
       distances = ComputeDistances(minorityInstance, $D_{maj}$, $r_{min}$)
     //Identify instances in the majority class that are within half of the radius from the minority instance
       closeNeighbors = IdentifyCloseNeighbors(distances, $\frac{r_{min}}{2}$)
     //If the number of close majority neighbors is less than or equal to the minimum required neighborhood size ($minnei_{min}$)
     If Count(closeNeighbors) ≤ $minnei_{min}$:
       //The minority instance is considered an overlap
         $D_{min}^{over}$.Add(minorityInstance)
       //Add the close majority neighbors to the majority overlap set
         $D_{maj}^{over}$.AddAll(closeNeighbors)
     Else:
       //Otherwise, add the minority instance to non-overlapping
         $D_{min}^{non}$.Add(minorityInstance)
       //Also add the close majority instances to non-overlapping if not in the overlap set
       For each majorityInstance in closeNeighbors:
         If majorityInstance not in $D_{maj}^{over}$:
           $D_{maj}^{non}$.Add(majorityInstance)
  //2. Return updated feature matrices for the minority and majority classes, and sets of overlapping instances
   Return $D_{min}^{non}$, $D_{maj}^{non}$, $D_{min}^{over}$, $D_{maj}^{over}$
Return:
   $D_{maj}^{non}$: Non-overlapping instances for the majority class
   $D_{min}^{non}$: Non-overlapping instances for the minority class
   $D_{maj}^{over}$: Overlapping instances for the majority class
   $D_{min}^{over}$: Overlapping instances for the minority class
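A minimal Python rendering of Algorithm 4 is sketched below. One detail is an assumption on top of the pseudocode: if a majority instance is typed into both sets across iterations, overlap membership takes priority after the loop; majority instances that never appear among any minority instance’s close neighbors remain untyped, as in the pseudocode:

```python
# A minimal sketch of Algorithm 4; indices refer to rows of D_min and D_maj.
import numpy as np
from scipy.spatial.distance import cdist

def type_overlapping_instances(D_min, D_maj, minnei_min, r_min):
    min_over, min_non = set(), set()
    maj_over, maj_non = set(), set()
    dist = cdist(D_min, D_maj)                      # minority-to-majority distances
    for i in range(len(D_min)):
        # Majority neighbors within half the radius of this minority instance.
        close = set(np.where(dist[i] <= r_min / 2)[0])
        if len(close) <= minnei_min:
            min_over.add(i)                         # minority instance overlaps,
            maj_over |= close                       # and so do its close neighbors
        else:
            min_non.add(i)
            maj_non |= close
    # Assumption: overlap membership takes priority when an instance was
    # typed both ways in different iterations.
    maj_non -= maj_over
    return min_non, maj_non, min_over, maj_over
```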
The data comparison is shown in Figure 3. In Figure 3a, the example dataset is presented, with the majority class represented by red stars and the minority class by blue dots. Figure 3b shows the result of identified data, where orange stars represent the majority class and purple dots indicate the minority class in overlapping regions. Understanding the distribution and overlap of the minority and majority classes helps the model differentiate between them more effectively. By categorizing the data into minority overlap, majority overlap, minority non-overlap, and majority non-overlap, the model gains valuable insights into the dataset’s structure. This enables the model to learn from specific patterns, improving its ability to handle class imbalances and enhancing classification accuracy. The next stage involves data matching, which prepares the data to train five distinct models. This process allows the model to focus on nuanced aspects, such as overlapping or distinctly separated regions, improving generalization and prediction accuracy.

4.2. Data Matching

In the data matching stage, the four defined types are combined to form five distinct sets (Set 0–Set 4). This step is crucial for understanding the similarities and differences between the minority and majority classes in each type. The process is illustrated in Figure 4.
Since Figure 4 illustrates the steps of data matching, Figure 5 provides a detailed example for better understanding. It presents the matched data, which correspond to the same data shown in Figure 3.
After matching the data, each set is resampled using various techniques to balance the classes. The resampling methods include SMOTE, Borderline-SMOTE (b-SMOTE), Safe-SMOTE (s-SMOTE), and Random Oversampling (ROS). This resampling ensures that the models learn more effectively without bias, allowing them to better generalize the characteristics of each dataset (Set 0–Set 4).
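A sketch of this resampling step is shown below on placeholder data. Safe-SMOTE is not shipped with imbalanced-learn, so SMOTE stands in for all sets here; the per-set choice of sampler is illustrative (and, per the model list that follows, Set 0 is in fact used unresampled):

```python
# Illustrative sketch: balancing each matched set independently before training.
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Placeholder stand-ins for Set 0-Set 4 from the data-matching stage.
matched_sets = [
    make_classification(n_samples=300, weights=[0.85, 0.15], random_state=s)
    for s in range(5)
]

balanced_sets = []
for X_set, y_set in matched_sets:
    # The resampler can be chosen per set (SMOTE, Borderline-SMOTE, ROS, ...).
    X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_set, y_set)
    balanced_sets.append((X_bal, y_bal))
```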
The proposed method was designed to train five machine learning models, each addressing a unique aspect of the data:
  • Baseline Model: The first model serves as the baseline, trained using the original dataset (Set 0) without any resampling.
  • Overlap Differentiation Models: The second and third models focus on distinguishing overlapping from non-overlapping subsets, specifically minority overlap versus majority non-overlap (Set 1) and majority overlap versus minority non-overlap (Set 2).
  • In-depth Overlap Analysis Model: The fourth model is dedicated to an in-depth analysis of the overlapping subsets, specifically minority overlap versus majority overlap (Set 3).
  • Non-overlapping Subset Model: The fifth model examines the dataset after excluding the overlapping elements, focusing on minority non-overlap versus majority non-overlap (Set 4).
This approach, which utilizes distinct training sets, effectively captures the dataset’s complex patterns. To reduce bias in predictions, this study employs a weighted voting strategy based on Recall, AUC, and G-Mean, ensuring that models with stronger performance have a greater influence on the final decision.
The weighting mechanism follows a weighted average method, calculated as follows:
$$w_i = \frac{M_i}{\sum_{j=1}^{n} M_j}$$
where $w_i$ is the weight assigned to model $i$, and $M_i$ represents Recall, AUC, or G-Mean, depending on the weighting strategy. When weighting by the average of all three metrics, the weight is computed as:
$$w_i = \frac{Recall_i + G\text{-}Mean_i + AUC_i}{\sum_{j=1}^{n} (Recall_j + G\text{-}Mean_j + AUC_j)}$$
The voting mechanism is validated in five ways:
  • Non-weighted (Non–W): A simple majority vote ensures fairness by treating all models equally, which is beneficial when each metric is non-prioritized and every model has a similar performance level.
  • Weighted by Recall (W–R): Models with higher Recall receive greater influence. This approach is effective for imbalanced datasets, where detecting rare cases is crucial.
  • Weighted by G-Mean (W–G): Models with higher G-Mean contribute more to the final decision. This method balances sensitivity and specificity, ensuring both classes are well represented in the final prediction.
  • Weighted by AUC (W–A): Models with higher AUC scores have stronger voting power. This enhances class distinction across various thresholds.
  • Weighted by Average (W–avg): The model’s influence is determined by the average of its Recall, AUC, and G-Mean scores. This method provides a balanced approach by considering multiple performance metrics.
This setup ensures a thorough evaluation of each model’s impact on overall performance.
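The voting step can be rendered as the following sketch, where the per-model metric scores and 0/1 predictions are placeholders and the 0.5 decision threshold for the weighted vote is an assumption:

```python
# A sketch of weighted voting; predictions and metric scores are placeholders.
import numpy as np

def weighted_vote(predictions: np.ndarray, metric_scores: np.ndarray) -> np.ndarray:
    """predictions: (n_models, n_samples) array of 0/1 votes;
    metric_scores: (n_models,) Recall, G-Mean, AUC, or their average.
    Returns the weighted-majority label per sample (threshold 0.5 assumed)."""
    weights = metric_scores / metric_scores.sum()   # w_i = M_i / sum_j M_j
    support = weights @ predictions                 # weighted fraction voting "1"
    return (support >= 0.5).astype(int)

# Five models, four samples; weighting by Recall.
preds = np.array([[1, 0, 1, 1],
                  [1, 0, 0, 1],
                  [0, 0, 1, 1],
                  [1, 1, 1, 0],
                  [0, 0, 1, 1]])
recalls = np.array([0.90, 0.70, 0.80, 0.60, 0.85])
print(weighted_vote(preds, recalls))
```

Passing uniform scores (e.g., all ones) recovers the non-weighted majority vote.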
The next section will present the results of the designed algorithm, comparing its performance to that of baseline models, which utilize resampling based on SMOTE and employ Random Forest and SVC as base classifiers.

5. Experimental Results and Discussion

This section presents the results and discussion of the proposed algorithm. The algorithm is compared with baseline models using a process that involves resampling and training machine learning models. A paired t-test is conducted to identify any significant differences between the results.

5.1. Experimental Design

This study compares the proposed algorithm with baseline models, which follow the same workflow: applying resampling techniques, training machine learning models, and then making predictions. Resampling methods include SMOTE, Borderline-SMOTE, Safe-SMOTE, and Random Over Sampling (ROS). Machine learning models are Random Forest (RF) and Support Vector Classifier (SVC) with a linear kernel.
To expand the comparison, ensemble methods are also considered, including Bagging and AdaBoost, using Random Forest and SVC as base classifiers. Models are evaluated with five-fold cross-validation using Recall, G-Mean, and AUC.
The proposed algorithm incorporates five voting mechanisms: non-weighted and weighted voting based on Recall, G-Mean, AUC, and their combined average. A paired t-test is used to determine significant differences between the best results (highest Recall, G-Mean, and AUC) of the baseline models and the proposed algorithm.
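A sketch of this comparison is given below; the per-fold Recall arrays are placeholders, not results from Section 5.3:

```python
# Sketch of the significance test: paired t-test over per-fold scores.
import numpy as np
from scipy.stats import ttest_rel

# Placeholder per-fold Recall values for baseline vs. proposed (five folds).
baseline_recall = np.array([0.70, 0.68, 0.72, 0.69, 0.71])
proposed_recall = np.array([0.80, 0.79, 0.83, 0.78, 0.82])

t_stat, p_value = ttest_rel(proposed_recall, baseline_recall)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")   # p < 0.05 -> significant difference
```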
The experimental methodology is illustrated in Figure 6.

5.2. Datasets

This study uses 22 datasets with varying imbalance ratios, including yeast, wine quality, stroke, microcalcification, and water quality from UCI, Keel, and Kaggle [39,40,41,42,43,44,45]. The multiclass Yeast and Wine datasets are converted to binary classification, and the imbalance ratio is labeled. For example, “Yeast 143” has an imbalance ratio of 14.3. Detailed dataset information is in Table 3.

5.3. Results of Proposed Algorithm

Table 4, Table 5, Table 6 and Table 7 show the best-performing models for both the baseline and proposed algorithms, including Recall, G-Mean, and AUC values. Bold values indicate the best result, while bold values marked with an asterisk (*) represent a statistically significant difference based on the paired t-test at a 0.05 significance level, with the corresponding p-value shown in parentheses.
For readability, Table 4, Table 5, Table 6 and Table 7 use abbreviated names for the resampling methods, machine learning models, and weighting strategies. The resampling methods include SMOTE, Safe-SMOTE (S-SMOTE), Borderline-SMOTE (B-SMOTE), and Random Oversampling (ROS). The machine learning models are Support Vector Classification (SVC) and Random Forest (RF). The weighting strategies are non-weighted (Non-W), weighted by Recall (W–R), weighted by G-Mean (W–G), weighted by AUC (W–A), and weighted by average (W–Avg).
The results show that the proposed algorithm improves performance in several metrics. The non-weighted voting mechanism often matches or outperforms others, suggesting weighted techniques have a minimal impact on performance.
Table 4, Table 5, Table 6 and Table 7 demonstrate the superior performance of the proposed algorithm across multiple datasets, with consistent gains in Recall, G-Mean, and AUC compared to the baseline, though statistical significance varies.
For the Yeast datasets (Table 4), the proposed algorithm outperforms the baseline in seven out of ten datasets for Recall, eight for G-Mean, and nine for AUC. Significant Recall improvements are seen in Yeast 246 and Yeast 914, while G-Mean shows notable gains in Yeast 908 and Yeast 3273. AUC trends align closely with G-Mean. Non-weighted voting yields the best G-Mean and AUC, whereas Recall-based weighting enhances Recall.
For the Wine Red datasets (Table 5), the proposed algorithm significantly improves Recall in WineRedQ5–Q7, with WineRedQ6 achieving the highest gains. G-Mean and AUC improvements are minor, with the baseline slightly outperforming in WineRedQ4 and WineRedQ5. Weighted voting by Recall enhances Recall, while non-weighted voting performs best for G-Mean and AUC.
For the Wine White datasets (Table 6), Recall improvements are consistent across all cases, with significant gains in WineWhiteQ5–Q8. G-Mean and AUC improvements vary, with notable increases in WineWhiteQ7 and WineWhiteQ8. Weighting by Recall enhances Recall, while G-Mean-based weighting proves effective for G-Mean and AUC, particularly in WineWhiteQ7 and WineWhiteQ8.
For the Stroke, Microcal, and Water datasets (Table 7), the proposed algorithm significantly outperforms the baseline in Water, increasing Recall from 0.70400 to 0.96400. G-Mean and AUC show strong improvements in Stroke and Water, while gains in Microcal are minimal. Non-weighted voting yields the best G-Mean and AUC results in Stroke and Microcal, while Recall-based weighting proves crucial for Recall improvements in Water.
The proposed algorithm consistently improves Recall across most datasets while achieving competitive performance in G-Mean and AUC. The impact of voting mechanisms varies as follows: non-weighted voting is optimal for G-Mean and AUC, while Recall-based weighting significantly enhances Recall. Performance gains are dataset-dependent, with particularly strong improvements observed in the Water, Stroke, and Yeast subsets. These results highlight the effectiveness of the proposed algorithm in handling imbalanced datasets while demonstrating adaptability across different data distributions.
From Table 4, Table 5, Table 6 and Table 7, where some results are equal, Figure 7, Figure 8, Figure 9 and Figure 10 expand the comparison by illustrating the best outcomes for each weighted technique. These results highlight the proposed algorithm’s effectiveness, particularly with the Recall-based weighting strategy, which outperforms the baseline and other methods in several instances. Notably, Recall achieves the highest performance in 16 out of 22 datasets, G-Mean in 17 datasets, and AUC in 18 datasets.
Figure 7 compares the proposed algorithm with the baseline model for the Yeast dataset in terms of three metrics: (a) Recall, (b) G-Mean, and (c) AUC. Seven out of ten datasets show that the suggested algorithm performs better than the baseline, with strong Recall performance shown in Figure 7a. On the other hand, Figure 7b,c demonstrate superior outcomes for methods that do not employ metric weighting, implying that weighting could introduce bias and cause overfitting or decreased generalization. Weighting may disrupt the balance between class-wise performance (as measured by G-Mean and AUC), whereas non-weighted approaches maintain a more comprehensive balance.
The Yeast dataset, a well-known benchmark for imbalanced classification, is used to compare the proposed algorithm with previous methods, with G-Mean as the primary evaluation metric. G-Mean is chosen because it effectively balances sensitivity and specificity. The comparison results are presented in Table 8. Note that “CW [X]” refers to the comparative work with reference [X], and the values in the table represent the best performance of the proposed algorithm in each respective study, averaged over five folds. The missing data indicate that the paper did not use that particular dataset. Bold values in the table indicate the best performance among all compared methods for each Yeast dataset.
From Table 8, the proposed method demonstrates robust and consistent performance across the Yeast datasets, frequently surpassing or matching the results of prior studies (e.g., CW [35], CW [36], and others). For example, on Yeast 143, the proposed method achieves 0.82228, significantly outperforming CW [35] (0.6555) and CW [36] (0.7593). It also exhibits competitive performance against CW [38], with results of 0.70828 for Yeast 246 compared to 0.743, 0.94212 for Yeast 908 compared to 0.937, and 0.98711 for Yeast 3273, exceeding CW [38]’s 0.962. While the proposed method occasionally performs lower than some comparative approaches on specific datasets, it consistently outperforms most others overall, showcasing its reliability and adaptability.
The absence of results in some comparative works highlights that certain datasets were not used in those studies. This emphasizes the broader applicability of the proposed method, as it provides valuable performance metrics even where prior works lack data. By consistently delivering high G-Mean values across diverse datasets and conditions, the proposed algorithm proves to be a more comprehensive and dependable solution for imbalanced classification challenges compared to previous methods.
Comparable patterns can be seen in the Wine datasets (Figure 8 and Figure 9), where G-Mean and AUC improve only marginally, while Recall, particularly in weighted models, significantly exceeds the baseline. In the Wine Red results, Recall wins in three of the four datasets, while G-Mean and AUC win in two. In the Wine White results, Recall wins in all five sets, while G-Mean and AUC do so in four of the five. However, some datasets, such as Wine White Q6 and Wine Red Q6, have lower AUC and G-Mean values, suggesting that although weighting by Recall improves true positive detection, it can sometimes upset the balance between classes.
The Stroke dataset (Figure 10) exhibits slightly lower Recall than the baseline, but higher G-Mean and AUC, suggesting better overall balance even with a few positives missed. The Microcal and Water datasets show similar results. The Microcal dataset demonstrates slight gains across all metrics. The Water dataset, on the other hand, shows notable gains in Recall, G-Mean, and AUC, indicating that the model successfully strikes a balance between positive and negative class detection, resulting in a strong predictive model.

5.4. Discussion

The results demonstrate that the proposed algorithm effectively enhances Recall, particularly in highly imbalanced datasets, such as Yeast 914 ($IR = 9.14$), WineWhiteQ8 ($IR = 26.99$), Water ($IR = 7.77$), and WineRedQ7 ($IR = 7.04$), where Recall-weighted strategies significantly outperform the baseline ($p < 0.05$; Table 4, Table 5, Table 6 and Table 7). However, in some datasets, such as Yeast 143 and Stroke, the improvements are negligible, highlighting that Recall-based weighting does not always yield substantial benefits. Additionally, statistical tests could not be performed for Yeast 1225 and Yeast 3273 due to identical values or low variance.
In terms of G-Mean, non-weighted methods generally maintain a better balance and avoid overfitting. However, improvements are observed in datasets such as Yeast 3273, Stroke, and Water, where G-Mean-weighted strategies significantly outperform the baseline ($p < 0.05$; Table 4 and Table 7). Conversely, datasets such as Yeast 914 and WineRedQ4 exhibit no significant differences, suggesting that G-Mean weighting does not always enhance model performance.
For AUC, weighting strategies improve performance in datasets such as Yeast 3273, WineWhiteQ8, Stroke, and Water ($p < 0.05$; Table 4, Table 5, Table 6 and Table 7). However, many datasets, including Yeast 143 and WineRedQ5, show high p-values, indicating that AUC-based weighting does not consistently lead to significant improvements. This highlights the importance of dataset characteristics when selecting weighting strategies.
The comparative analysis with previous works confirms the stability and effectiveness of the proposed algorithm. For example, in the Yeast datasets (Table 4), the proposed method achieves notable Recall improvements, outperforming CW [38] on Yeast 3273, with a G-Mean of 0.98711 compared to 0.962 in prior methods. These results demonstrate the algorithm’s adaptability across diverse datasets, addressing limitations observed in previous approaches.
Dataset characteristics strongly influence model performance. Highly imbalanced datasets (e.g., Yeast 3057 and Microcalcification) pose challenges for Recall improvement, while larger datasets (e.g., Water) exhibit more stable performance gains. Simpler datasets with fewer features, such as Microcalcification, tend to yield stronger performance across all metrics. Achieving optimal performance requires balancing the imbalance ratio, dataset size, and feature count.
For voting techniques, Recall-weighted strategies excel at improving Recall, whereas non-weighted methods better maintain the balance of G-Mean and AUC. However, in some cases, comparable results across methods (as seen in Figure 7, Figure 8, Figure 9 and Figure 10) suggest potential areas for optimization. Balancing Recall, G-Mean, and AUC remains a challenge, as excessive weighting can disrupt the overall metric balance. Smaller datasets benefit from robust validation techniques, such as stratified k-fold cross-validation, to enhance result stability. While resampling techniques such as ROS and Safe-SMOTE help moderate imbalances, feature-rich datasets may require advanced techniques, such as feature selection or dimensionality reduction.
Future research should explore adaptive resampling and dynamic weighting strategies to further refine performance. Enhancing metric prioritization, improving validation techniques, and increasing dataset diversity will strengthen the generalizability and stability of the proposed algorithm.

6. Conclusions

The proposed algorithm introduces a hybrid framework to address challenges in binary imbalanced datasets. It begins with a data identification process that classifies instances into four types, forming five separate datasets. These datasets are resampled using SMOTE, Borderline-SMOTE, Safe-SMOTE, or Random Oversampling and subsequently used to train machine learning models, including Random Forest, Support Vector Classification, Bagging, and AdaBoost, with Random Forest and Support Vector Classification serving as base classifiers. In the final step, five weighting strategies (non-weighted and weighted by Recall, G-Mean, AUC, or their average) are applied to generate predictions, which are then evaluated against a baseline using Recall, G-Mean, and AUC.
The results show that the Recall-weighted strategy consistently delivers the highest Recall improvements across most datasets. Statistical tests confirm significant gains in Recall and G-Mean compared to the baseline, particularly in highly imbalanced datasets such as WineWhiteQ8 and Water. However, the weighted-by-AUC strategy fails to achieve statistical significance ($p > 0.05$ in many cases), suggesting limited effectiveness in certain datasets, such as Yeast 143 and Yeast 1225.
While the algorithm notably improves Recall, its impact on G-Mean and AUC remains moderate, highlighting areas for refinement. Future improvements could involve developing a refined weighting strategy that better balances Recall, G-Mean, and AUC. Alternatively, optimizing the data identification process or designing a tailored resampling strategy could further enhance overall performance. Adding techniques such as adaptive resampling and dynamic weighting could help refine the algorithm’s ability to improve Recall without sacrificing G-Mean or AUC.
Beyond enhancing classification effectiveness, future work should prioritize improving the computational efficiency of the proposed method. This includes reducing the complexity of distance calculations in data partitioning and resampling, as well as minimizing processing time for large datasets. Strategies such as optimizing feature selection, implementing parallel processing, and integrating more efficient sampling algorithms could significantly reduce computational overhead, making the algorithm more scalable for real-world applications.
These enhancements would lead to a more robust and generalizable model, capable of consistently outperforming the baseline across diverse datasets. Expanding the dataset diversity and exploring advanced validation techniques will further strengthen the proposed algorithm’s applicability in real-world scenarios.

Author Contributions

All authors contributed to the paper. K.T. was mainly responsible for the experimental work, while both K.T. and A.H. worked together on conceptualization. A.H., in the role of research advisor, oversaw the verification and final editing of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data and codes can be requested from the author.

Acknowledgments

This work was supported by King Mongkut’s Institute of Technology Ladkrabang.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284. [Google Scholar]
  2. Nasrollahpour, H.; Isildak, I.; Rashidi, M.-R.; Hashemi, E.A.; Naseri, A.; Khalilzadeh, B. Ultrasensitive bioassaying of HER-2 protein for diagnosis of breast cancer using reduced graphene oxide/chitosan as a nanobiocompatible platform. Cancer Nanotechnol. 2021, 12, 10. [Google Scholar] [CrossRef]
  3. Guo, K.; Wang, Y.; Kang, J.; Zhang, J.; Cao, R. Core dataset extraction from unlabeled medical big data for lesion localization. Big Data Res. 2021, 24, 100185. [Google Scholar] [CrossRef]
  4. Cheng, S.; Wu, Y.; Li, Y.; Yao, F.; Min, F. TWD-SFNN: Three-way decisions with a single hidden layer feedforward neural network. Inf. Sci. 2021, 579, 15–32. [Google Scholar] [CrossRef]
  5. Wu, C.; Luo, C.; Xiong, N.; Zhang, W.; Kim, T.-H. A greedy deep learning method for medical disease analysis. IEEE Access 2018, 6, 20021–20030. [Google Scholar] [CrossRef]
  6. Wei, W.; Li, J.; Cao, L.; Ou, Y.; Chen, J. Effective detection of sophisticated online banking fraud on extremely imbalanced data. World Wide Web-Internet Web Inf. Syst. 2013, 16, 449–475. [Google Scholar] [CrossRef]
  7. Niu, K.; Zhang, Z.; Liu, Y.; Li, R. Resampling ensemble model based on data distribution for imbalanced credit risk evaluation in P2P lending. Inf. Sci. 2020, 536, 120–134. [Google Scholar] [CrossRef]
  8. Daliri, S. Using harmony search algorithm in neural networks to improve fraud detection in the banking system. Comput. Intell. Neurosci. 2020, 2020, 6503459. [Google Scholar] [CrossRef]
  9. Cui, L.; Bai, L.; Wang, Y.; Jin, X.; Hancock, E.R. Internet financing credit risk evaluation using multiple structural interacting elastic net feature selection. Pattern Recognit. 2021, 114, 107835. [Google Scholar] [CrossRef]
  10. Yang, J.; Xiong, N.; Vasilakos, A.V.; Fang, Z.; Park, D.; Xu, X.; Yoon, S.; Xie, S.; Yang, Y. A fingerprint recognition scheme based on assembling invariant moments for cloud computing communications. IEEE Syst. J. 2011, 5, 574–583. [Google Scholar] [CrossRef]
  11. Xia, F.; Hao, R.; Li, J.; Xiong, N.; Yang, L.T.; Zhang, Y. Adaptive GTS allocation in IEEE 802.15.4 for real-time wireless sensor networks. J. Syst. Archit. 2013, 59 Pt D, 1231–1242. [Google Scholar] [CrossRef]
  12. Rezvani, S.; Wang, X. A broad review on class imbalance learning techniques. Appl. Soft Comput. 2023, 143, 110415. [Google Scholar] [CrossRef]
  13. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  14. Devi, D.; Biswas, S.K.; Purkayastha, B. Redundancy-driven modified Tomek-link based undersampling: A solution to class imbalance. Pattern Recognit. Lett. 2017, 93, 3–12.
  15. Freund, Y.; Schapire, R.E. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J. Comput. Syst. Sci. 1997, 55, 119–139.
  16. Chawla, N.V.; Lazarevic, A.; Hall, L.O.; Bowyer, K.W. SMOTEBoost: Improving Prediction of the Minority Class in Boosting. In Knowledge Discovery in Databases: PKDD 2003, Proceedings of the European Conference on Principles and Practice of Knowledge Discovery in Databases, Cavtat-Dubrovnik, Croatia, 22–26 September 2003; Springer: Berlin/Heidelberg, Germany, 2003; pp. 107–119.
  17. Batuwita, R.; Palade, V. Efficient resampling methods for training support vector machines with imbalanced datasets. In Proceedings of the International Joint Conference on Neural Networks 2010, Barcelona, Spain, 18–23 July 2010; pp. 1–8.
  18. Estabrooks, A.; Jo, T.; Japkowicz, N. A multiple resampling method for learning from imbalanced datasets. Comput. Intell. 2004, 20, 18–36.
  19. Lin, W.-C.; Tsai, C.-F.; Hu, Y.-H.; Jhang, J.-S. Clustering-based undersampling in class-imbalanced data. Inf. Sci. 2017, 409–410, 17–26.
  20. Fernandez, A.; Garcia, S.; del Jesus, M.J.; Herrera, F. A study of the behaviour of linguistic fuzzy rule-based classification systems in the framework of imbalanced datasets. Fuzzy Sets Syst. 2008, 159, 2378–2398.
  21. Fernandez, A.; del Jesus, M.J.; Herrera, F. On the 2-tuples based genetic tuning performance for fuzzy rule-based classification systems in imbalanced datasets. Inf. Sci. 2010, 180, 1268–1291.
  22. Qian, Y.; Liang, Y.; Li, M.; Feng, G.; Shi, X. A Resampling Ensemble Algorithm for Classification of Imbalance Problems. Neurocomputing 2014, 143, 57–67.
  23. Batista, G.; Bazzan, A.; Monard, M.C. Balancing Training Data for Automated Annotation of Keywords: A Case Study. In Proceedings of the II Brazilian Workshop on Bioinformatics, São Paulo, Brazil, 3–5 December 2003; pp. 10–18.
  24. Kumar, P.; Kumar, R.; Srivastava, G.; Gupta, G.P.; Tripathi, R.; Gadekallu, T.R.; Xiong, N.N. PPSF: A Privacy-Preserving and Secure Framework Using Blockchain-Based Machine Learning for IoT-Driven Smart Cities. IEEE Trans. Netw. Sci. Eng. 2021, 8, 2326–2341.
  25. Elkan, C. The Foundations of Cost-Sensitive Learning. In Proceedings of the IJCAI International Joint Conference on Artificial Intelligence, Seattle, WA, USA, 4–10 August 2001; pp. 973–978.
  26. Barandela, R.; Sánchez, J.S.; Valdovinos, R.M. New Applications of Ensembles of Classifiers. Pattern Anal. Appl. 2003, 6, 245–256.
  27. Chen, C.; Liaw, A.; Breiman, L. Using Random Forest to Learn Imbalanced Data; Technical Report; University of California: Berkeley, CA, USA, 2004.
  28. Yang, X.; Song, Q.; Cao, A. Weighted Support Vector Machine for Data Classification. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, Montreal, QC, Canada, 31 July–4 August 2005; pp. 859–864.
  29. Seiffert, C.; Khoshgoftaar, T.M.; Hulse, J.V.; Napolitano, A. RUSBoost: A Hybrid Approach to Alleviating Class Imbalance. IEEE Trans. Syst. Man Cybern.—Part A Syst. Hum. 2010, 40, 185–197.
  30. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
  31. Han, H.; Wang, W.-Y.; Mao, B.-H. Borderline-SMOTE: A new oversampling method in imbalanced datasets learning. In Advances in Intelligent Computing, Proceedings of the ICIC 2005, Hefei, China, 23–26 August 2005; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2005; Volume 3644, pp. 878–887.
  32. Bunkhumpornpat, C.; Sinapiromsaran, K.; Lursinsap, C. Safe-level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalance problem. In Advances in Knowledge Discovery and Data Mining, Proceedings of the PAKDD 2009, Bangkok, Thailand, 27–30 April 2009; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2009; Volume 5476, pp. 475–482.
  33. Johnson, J.M.; Khoshgoftaar, T.M. Survey on deep learning with class imbalance. J. Big Data 2019, 6, 27.
  34. Chen, W.; Yang, K.; Yu, Z.; Shi, Y.; Chen, C.L.P. A survey on imbalanced learning: Latest research, applications and future directions. Artif. Intell. Rev. 2024, 57, 137.
  35. Shi, S.; Li, J.; Zhu, D.; Yang, F.; Xu, Y. A Hybrid Imbalanced Classification Model Based on Data Density. Inf. Sci. 2023, 624, 50–67.
  36. Huang, Z.; Gao, X.; Chen, W.; Cheng, Y.; Xue, B.; Meng, Z.; Zhang, G.; Fu, S. An Imbalanced Binary Classification Method via Space Mapping Using Normalizing Flows with Class Discrepancy Constraints. Inf. Sci. 2023, 623, 493–523.
  37. Mayabadi, S.; Saadatfar, H. Two Density-Based Sampling Approaches for Imbalanced and Overlapping Data. Knowl.-Based Syst. 2022, 241, 108217.
  38. Tao, X.; Guo, X.; Zheng, Y.; Zhang, X.; Chen, Z. Self-Adaptive Oversampling Method Based on the Complexity of Minority Data in Imbalanced Datasets Classification. Knowl.-Based Syst. 2023, 277, 110795.
  39. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297.
  40. Nakai, K. Yeast. In UCI Machine Learning Repository; University of California, Irvine, School of Information and Computer Sciences: Irvine, CA, USA, 1991.
  41. Cortez, P.; Cerdeira, A.; Almeida, F.; Matos, T.; Reis, J. Modeling Wine Preferences by Data Mining from Physicochemical Properties. Decis. Support Syst. 2009, 47, 547–553.
  42. Alcalá-Fdez, J.; Fernandez, A.; Luengo, J.; Derrac, J.; García, S.; Sánchez, L.; Herrera, F. KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework. J. Mult.-Valued Log. Soft Comput. 2011, 17, 255–287.
  43. Fedesoriano. Stroke Prediction Dataset. Kaggle. Available online: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset/data (accessed on 15 November 2024).
  44. Mssmartypants. Water Quality Dataset. Kaggle. Available online: https://www.kaggle.com/datasets/mssmartypants/water-quality (accessed on 15 November 2024).
  45. Sudhanshu. Microcalcification Classification Dataset. Kaggle. Available online: https://www.kaggle.com/datasets/sudhanshu2198/microcalcification-classification/data (accessed on 15 November 2024).
  46. Mathew, J.; Pang, C.K.; Luo, M.; Leong, W.H. Classification of Imbalanced Data by Oversampling in Kernel Space of Support Vector Machines. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 4065–4076.
  47. Zhao, J.; Jin, J.; Chen, S.; Zhang, R.; Yu, B.; Liu, Q. A Weighted Hybrid Ensemble Method for Classifying Imbalanced Data. Knowl.-Based Syst. 2020, 203, 106087.
  48. Guo, J.; Wu, H.; Chen, X.; Lin, W. Adaptive SV-Borderline SMOTE-SVM Algorithm for Imbalanced Data Classification. Appl. Soft Comput. 2024, 150, 110986.
  49. Li, F.; Wang, B.; Shen, Y.; Wang, P.; Li, Y. An Overlapping Oriented Imbalanced Ensemble Learning Algorithm with Weighted Projection Clustering Grouping and Consistent Fuzzy Sample Transformation. Inf. Sci. 2023, 637, 118955.
Figure 1. Examples of imbalanced classification techniques [13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30].
Figure 2. Flow chart of the proposed algorithm.
Figure 3. The result of data partitioning: (a) the original dataset; (b) the dataset after instance types have been identified.
Figure 4. The process of data matching in each set: Set 0–Set 4 (a–e).
Figure 5. Examples of data matching in each set: Set 0–Set 4 (a–e).
Figure 6. Experimental methods: (a) baseline; (b) proposed algorithm.
Figure 7. Graphs comparing the three metrics: (a) Recall, (b) G-Mean, and (c) AUC for the Yeast datasets.
Figure 8. Graphs comparing the three metrics: (a) Recall, (b) G-Mean, and (c) AUC for the Wine Red datasets.
Figure 9. Graphs comparing the three metrics: (a) Recall, (b) G-Mean, and (c) AUC for the Wine White datasets.
Figure 10. Graphs comparing the three metrics: (a) Recall, (b) G-Mean, and (c) AUC for the Stroke, Microcal, and Water datasets.
Table 1. Advantages and disadvantages of each imbalanced classification technique.

Data-level techniques
Advantages:
  • Balance the class distribution, making it easier for the model to learn.
  • Can be applied to many classification algorithms without modification.
  • Oversampling provides the minority class with more examples, which improves detection.
Disadvantages:
  • Oversampling can lead to overfitting by duplicating minority class samples.
  • Undersampling can result in the loss of valuable information from the majority class.
  • Noise removal techniques might inadvertently remove important data.

Algorithm-level techniques
Advantages:
  • Produce classification models better suited to imbalanced data.
  • Ensemble methods can improve model performance.
  • Keep the original dataset, preserving all data points.
Disadvantages:
  • Can be computationally intensive and complex to implement.
  • May require extensive tuning and experimentation to achieve optimal performance.
  • May not generalize well if the underlying data distribution changes.

Hybrid techniques
Advantages:
  • Provide a balanced approach that mitigates the disadvantages of using either technique alone.
  • Often yield better overall performance and more robust models.
  • Flexible and adaptable to different types of data and problems.
Disadvantages:
  • Complexity increases due to the combination of multiple techniques.
  • May require significant computational resources and expertise to implement effectively.
  • Integrating different techniques can be challenging and may not always lead to the desired improvement.

Sources: Adapted from He and Garcia (2009) [1], Rezvani and Wang (2023) [12], Johnson and Khoshgoftaar (2019) [33], and Chen et al. (2024) [34].
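As a concrete illustration of the data-level techniques summarized in Table 1, the sketch below rebalances a synthetic dataset with random oversampling, SMOTE, Borderline-SMOTE, and random undersampling. It uses the imbalanced-learn package, which is an assumption made for illustration only; the paper does not prescribe any particular implementation.

```python
# Minimal sketch of data-level rebalancing (Table 1), assuming the
# imbalanced-learn package; not the authors' experimental code.
import numpy as np
from imblearn.over_sampling import RandomOverSampler, SMOTE, BorderlineSMOTE
from imblearn.under_sampling import RandomUnderSampler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = np.r_[np.zeros(460, dtype=int), np.ones(40, dtype=int)]  # ~11.5:1 imbalance

# Oversampling: duplicate (ROS) or synthesize (SMOTE / Borderline-SMOTE)
# minority instances until the classes are balanced.
for sampler in (RandomOverSampler(random_state=0),
                SMOTE(random_state=0),
                BorderlineSMOTE(random_state=0)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, np.bincount(y_res))  # balanced class counts

# Undersampling: discard majority instances (risks losing information).
X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("RandomUnderSampler", np.bincount(y_res))
```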
Table 2. Notations of all variables.

Notation | Description
$D_{maj}$ | Majority class instances.
$D_{min}$ | Minority class instances.
$r_{min}$ | Radius threshold for minority class instances (calculated via percentile distances).
$minnei_{min}$ | Minimum number of neighbors required for minority class instances.
$D_{maj}^{over}$ | Majority class instances overlapping with the minority class.
$D_{min}^{over}$ | Minority class instances overlapping with the majority class.
$D_{maj}^{non}$ | Majority class instances in non-overlapping regions.
$D_{min}^{non}$ | Minority class instances in non-overlapping regions.
Set 0–Set 4 | Five datasets constructed by pairing subsets (e.g., $D_{min}^{over} \cup D_{maj}^{non}$).
$distanceMatrix$ | Pairwise distance matrix between minority and majority instances.
$percentileDistances$ | Percentile distances used to compute $r_{min}$.
$neighborDistances$ | Distances of minority instances to neighbors within $r_{min}$.
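The sketch below shows how the quantities in Table 2 could be computed for the partition step: a minority-to-majority distance matrix, a percentile-based radius $r_{min}$, and neighbor counts that split each class into overlapping and non-overlapping subsets. The 10th-percentile radius and the single-neighbor threshold are illustrative assumptions rather than the paper's tuned settings, and counting majority-class neighbors within $r_{min}$ is one plausible reading of the notation, not a restatement of the authors' exact rule.

```python
# Illustrative sketch of the partition step using the notation of Table 2.
# The percentile and min_nei_min values are assumptions for this example.
import numpy as np
from scipy.spatial.distance import cdist

def partition(D_min, D_maj, percentile=10, min_nei_min=1):
    # distanceMatrix: pairwise distances between minority and majority instances.
    distance_matrix = cdist(D_min, D_maj)
    # percentileDistances -> r_min: a radius derived from the distance percentiles.
    r_min = np.percentile(distance_matrix, percentile)
    # neighborDistances: majority neighbors of each minority instance within r_min.
    neighbor_counts = (distance_matrix <= r_min).sum(axis=1)
    # Minority instances with enough nearby majority neighbors are "overlapping".
    min_over = neighbor_counts >= min_nei_min
    # A majority instance overlaps if it lies within r_min of any minority point.
    maj_over = (distance_matrix <= r_min).any(axis=0)
    return (D_min[min_over], D_min[~min_over],   # D_min^over, D_min^non
            D_maj[maj_over], D_maj[~maj_over])   # D_maj^over, D_maj^non

# Usage on synthetic 2-D data: the four subsets can then be paired into Set 0-Set 4.
rng = np.random.default_rng(0)
D_min = rng.normal(0.5, 1.0, size=(30, 2))
D_maj = rng.normal(0.0, 1.0, size=(300, 2))
min_over, min_non, maj_over, maj_non = partition(D_min, D_maj)
print(len(min_over), len(min_non), len(maj_over), len(maj_non))
```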
Table 3. All considered datasets.

Name | Dataset | Attributes | Instances | Class Distribution | Imbalance Ratio
Yeast 143 | Yeast | 8 | 459 | 429/30 | 14.3
Yeast 246 | Yeast | 8 | 1484 | 1055/429 | 2.46
Yeast 508 | Yeast | 8 | 1484 | 1240/244 | 5.08
Yeast 908 | Yeast | 8 | 514 | 463/51 | 9.08
Yeast 912 | Yeast | 8 | 506 | 456/50 | 9.12
Yeast 914 | Yeast | 8 | 1004 | 905/99 | 9.14
Yeast 935 | Yeast | 8 | 528 | 477/51 | 9.35
Yeast 1225 | Yeast | 8 | 464 | 429/35 | 12.25
Yeast 3057 | Yeast | 8 | 947 | 917/30 | 30.57
Yeast 3273 | Yeast | 8 | 1484 | 1440/44 | 32.73
WineRedQ4 | Wine Quality | 11 | 1599 | 53/1546 | 29.17
WineRedQ5 | Wine Quality | 11 | 1599 | 681/918 | 1.35
WineRedQ6 | Wine Quality | 11 | 1599 | 638/961 | 1.51
WineRedQ7 | Wine Quality | 11 | 1599 | 199/1400 | 7.04
WineWhiteQ4 | Wine Quality | 11 | 4898 | 163/4735 | 29.05
WineWhiteQ5 | Wine Quality | 11 | 4898 | 1457/3441 | 2.36
WineWhiteQ6 | Wine Quality | 11 | 4898 | 2198/2700 | 1.23
WineWhiteQ7 | Wine Quality | 11 | 4898 | 880/4018 | 4.57
WineWhiteQ8 | Wine Quality | 11 | 4898 | 175/4723 | 26.99
Stroke | Stroke | 10 | 4908 | 209/4699 | 22.48
Microcal | Microcalcification | 6 | 11,183 | 260/10,923 | 42.01
Water | Water Quality | 20 | 7996 | 912/7084 | 7.77
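The imbalance ratio column in Table 3 is the ratio of the majority-class count to the minority-class count; for example, Yeast 143 has 429 majority and 30 minority instances, giving 429/30 ≈ 14.3. A minimal check of that arithmetic:

```python
# Compute the imbalance ratio of a binary label vector (majority / minority),
# matching the values in Table 3; a trivial sketch, not the authors' code.
from collections import Counter

def imbalance_ratio(y):
    counts = Counter(y)
    return max(counts.values()) / min(counts.values())

# Example: Yeast 143 has 429 majority and 30 minority instances.
y = [0] * 429 + [1] * 30
print(round(imbalance_ratio(y), 2))  # 14.3
```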
Table 4. Summary of Best Recall, G-Mean, and AUC values for the Yeast datasets.

Datasets | Algorithm | Best Recall | Detail | Best G-Mean | Detail | Best AUC | Detail
Yeast 143 | Baseline | 0.86667 | S-SMOTE, Boosting SVC | 0.80803 | ROS, SVC | 0.80853 | ROS, SVC
Yeast 143 | Proposed | 0.83333 | S-SMOTE, Boosting SVC, W–R | 0.82228 | S-SMOTE, SVC, Non–W | 0.82248 | S-SMOTE, SVC, Non–W
Yeast 246 | Baseline | 0.81177 | B-SMOTE, SVC | 0.70388 | ROS, Bagging RF | 0.71012 | ROS, Bagging RF
Yeast 246 | Proposed | 0.96235 * (0.0042) | ROS, Boosting RF, W–R | 0.70828 | ROS, Bagging RF, Non–W | 0.71410 | ROS, Bagging RF, W–G
Yeast 508 | Baseline | 0.81923 | B-SMOTE, Boosting SVC | 0.79436 | ROS, Boosting SVC | 0.80102 | ROS, Boosting SVC
Yeast 508 | Proposed | 0.87692 | ROS, Bagging RF, W–R | 0.80338 | ROS, SVC, Non–W | 0.80361 | ROS, SVC, Non–W
Yeast 908 | Baseline | 0.88333 | B-SMOTE, Boosting SVC | 0.91206 | S-SMOTE, Bagging SVC | 0.91355 | S-SMOTE, SVC
Yeast 908 | Proposed | 0.93333 | ROS, SVC, Non–W | 0.94212 * (0.04187) | ROS, SVC, Non–W | 0.94249 * (0.04206) | ROS, SVC, Non–W
Yeast 912 | Baseline | 0.85714 | ROS, Bagging SVC | 0.78421 | ROS, Bagging SVC | 0.78752 | ROS, Bagging SVC
Yeast 912 | Proposed | 0.91429 | B-SMOTE, Boosting SVC, W–R | 0.78160 | B-SMOTE, SVC, W–G | 0.78541 | B-SMOTE, SVC, W–A
Yeast 914 | Baseline | 0.76191 | SMOTE, Bagging SVC | 0.81985 | SMOTE, SVC | 0.82206 | SMOTE, SVC
Yeast 914 | Proposed | 0.96190 * (0.00036) | ROS, Boosting RF, W–R | 0.81359 | SMOTE, Bagging RF, Non–W | 0.82429 | SMOTE, Bagging RF, Non–W
Yeast 935 | Baseline | 0.71429 | SMOTE, Bagging SVC | 0.76379 | S-SMOTE, Bagging RF | 0.77994 | S-SMOTE, Bagging RF
Yeast 935 | Proposed | 0.74286 | B-SMOTE, SVC, Non–W | 0.78540 | B-SMOTE, RF, W–G | 0.79928 | SMOTE, Boosting RF, W–A
Yeast 1225 | Baseline | 0.83333 | SMOTE, Boosting SVC | 0.81556 | B-SMOTE, RF | 0.83218 | B-SMOTE, RF
Yeast 1225 | Proposed | 0.83333 | B-SMOTE, Boosting SVC, Non–W | 0.85685 | ROS, SVC, W–G | 0.86379 | ROS, SVC, W–A
Yeast 3057 | Baseline | 0.84000 | ROS, Boosting SVC | 0.66553 | ROS, Bagging SVC | 0.67838 | ROS, Bagging SVC
Yeast 3057 | Proposed | 0.92000 | SMOTE, Boosting SVC, W–R | 0.69037 | ROS, SVC, W–G | 0.69730 | ROS, SVC, W–G
Yeast 3273 | Baseline | 1.00000 | ROS, Boosting SVC | 0.97441 | SMOTE, SVC | 0.97474 | SMOTE, SVC
Yeast 3273 | Proposed | 1.00000 | SMOTE, Boosting SVC, W–R | 0.98711 * (0.00025) | SMOTE, SVC, Non–W | 0.98720 * (0.00025) | SMOTE, SVC, Non–W

An asterisk (*) marks a value reported with a paired t-test p-value (shown in parentheses); the same convention applies in Tables 5–7.
Table 5. Summary of Best Recall, G-Mean, and AUC values for the Wine Red datasets.

Datasets | Algorithm | Best Recall | Detail | Best G-Mean | Detail | Best AUC | Detail
WineRedQ4 | Baseline | 0.90000 | SMOTE, SVC | 0.80205 | SMOTE, SVC | 0.80807 | SMOTE, SVC
WineRedQ4 | Proposed | 0.90000 | ROS, Bagging SVC, Non–W | 0.79468 | ROS, Bagging SVC, Non–W | 0.80097 | ROS, Bagging SVC, Non–W
WineRedQ5 | Baseline | 0.89692 | SMOTE, Boosting SVC | 0.77443 | SMOTE, Boosting RF | 0.77591 | SMOTE, Boosting RF
WineRedQ5 | Proposed | 0.97846 * (0.00374) | B-SMOTE, Bagging RF, W–R | 0.77267 | SMOTE, Boosting RF, Non–W | 0.77320 | SMOTE, Boosting RF, Non–W
WineRedQ6 | Baseline | 0.67879 | S-SMOTE, SVC | 0.70162 | B-SMOTE, Bagging RF | 0.70293 | B-SMOTE, Bagging RF
WineRedQ6 | Proposed | 0.96667 * (0.00003) | ROS, RF, W–R | 0.70804 | B-SMOTE, Bagging RF, W–G | 0.70835 | B-SMOTE, Bagging RF, W–G
WineRedQ7 | Baseline | 0.89524 | S-SMOTE, Boosting SVC | 0.81897 | B-SMOTE, Bagging SVC | 0.82141 | B-SMOTE, Bagging SVC
WineRedQ7 | Proposed | 0.98095 * (0.00605) | SMOTE, Boosting RF, W–R | 0.82385 | ROS, SVC, Non–W | 0.82653 | ROS, SVC, Non–W
Table 6. Summary of Best Recall, G-Mean, and AUC values for the Wine White datasets.

Datasets | Algorithm | Best Recall | Detail | Best G-Mean | Detail | Best AUC | Detail
WineWhiteQ4 | Baseline | 0.78400 | SMOTE, SVC | 0.78848 | S-SMOTE, SVC | 0.78886 | S-SMOTE, SVC
WineWhiteQ4 | Proposed | 0.80800 | S-SMOTE, Boosting SVC, W–R | 0.78799 | ROS, SVC, Non–W | 0.78831 | ROS, SVC, Non–W
WineWhiteQ5 | Baseline | 0.86804 | B-SMOTE, Boosting SVC | 0.76996 | B-SMOTE, Bagging RF | 0.77319 | B-SMOTE, Bagging RF
WineWhiteQ5 | Proposed | 0.97113 * (0.00005) | B-SMOTE, Boosting RF, W–R | 0.77350 | B-SMOTE, Bagging RF, W–G | 0.77548 | B-SMOTE, Bagging RF, W–G
WineWhiteQ6 | Baseline | 0.92500 | S-SMOTE, Boosting SVC | 0.71466 | ROS, Bagging RF | 0.71504 | ROS, Bagging RF
WineWhiteQ6 | Proposed | 0.99583 * (0.2256) | ROS, Boosting RF, W–R | 0.71982 | B-SMOTE, Bagging RF, W–G | 0.72000 | B-SMOTE, Bagging RF, W–G
WineWhiteQ7 | Baseline | 0.79375 | B-SMOTE, Bagging SVC | 0.74659 | ROS, Bagging RF | 0.76218 | ROS, Bagging RF
WineWhiteQ7 | Proposed | 0.96979 * (0.000003) | ROS, Bagging RF, W–R | 0.75287 * (0.02169) | B-SMOTE, Bagging RF, W–G | 0.76653 * (0.03713) | ROS, Bagging RF, W–G
WineWhiteQ8 | Baseline | 0.73714 | SMOTE, Boosting SVC | 0.70885 | ROS, Boosting SVC | 0.71270 | S-SMOTE, Bagging RF
WineWhiteQ8 | Proposed | 0.88571 * (0.00045) | S-SMOTE, Bagging RF, W–R | 0.78596 * (0.00735) | B-SMOTE, Boosting RF, W–R | 0.78815 * (0.00232) | B-SMOTE, Boosting RF, W–R
Table 7. Summary of Best Recall, G-Mean, and AUC values for the Stroke, Microcal, and Water datasets.

Datasets | Algorithm | Best Recall | Detail | Best G-Mean | Detail | Best AUC | Detail
Stroke | Baseline | 0.94717 | SMOTE, Boosting SVC | 0.73554 | ROS, Boosting SVC | 0.75056 | ROS, Boosting SVC
Stroke | Proposed | 0.93585 | ROS, SVC, W–R | 0.75936 * (0.00173) | S-SMOTE, Boosting SVC, Non–W | 0.76684 * (0.01101) | S-SMOTE, Boosting SVC, Non–W
Microcal | Baseline | 0.95200 | B-SMOTE, Boosting SVC | 0.91004 | ROS, Bagging SVC | 0.91038 | ROS, Bagging SVC
Microcal | Proposed | 0.96000 | B-SMOTE, Boosting SVC, Non–W | 0.91360 | SMOTE, SVC, Non–W | 0.91364 | SMOTE, SVC, Non–W
Water | Baseline | 0.70400 | B-SMOTE, Bagging SVC | 0.74155 | ROS, Bagging RF | 0.74486 | ROS, Bagging RF
Water | Proposed | 0.96400 * (0.00001) | SMOTE, RF, W–R | 0.79700 * (0.0023) | ROS, SVC, Non–W | 0.79764 * (0.0002) | ROS, SVC, Non–W
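Tables 4–7 report Recall, G-Mean, and AUC. The sketch below computes all three from predictions and scores using their standard definitions via scikit-learn; it is not the authors' evaluation code. In the Detail entries, we read W–R, W–G, and W–A as voting weighted by Recall, G-Mean, and AUC, respectively, and Non–W as unweighted voting; that reading is our assumption based on the paper's weighted-voting description.

```python
# Standard-definition sketch of the three reported metrics (scikit-learn),
# with small made-up label vectors; not the authors' evaluation code.
import numpy as np
from sklearn.metrics import recall_score, roc_auc_score, confusion_matrix

y_true  = np.array([1, 0, 1, 1, 0, 0, 0, 1, 0, 0])   # 1 = minority class
y_pred  = np.array([1, 0, 1, 0, 0, 1, 0, 1, 0, 0])   # hard predictions
y_score = np.array([0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.3, 0.7, 0.2, 0.1])

recall = recall_score(y_true, y_pred)                     # TPR on the minority class
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)                              # TNR on the majority class
g_mean = np.sqrt(recall * specificity)                    # G-Mean = sqrt(TPR * TNR)
auc = roc_auc_score(y_true, y_score)                      # ranking quality of scores

print(f"Recall={recall:.5f}  G-Mean={g_mean:.5f}  AUC={auc:.5f}")
```

G-Mean is the geometric mean of the minority-class and majority-class accuracies, which is why it penalizes a model that achieves high Recall by sacrificing the majority class.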
Table 8. Comparison of G-Mean Performance Across Different Methods.

Datasets | CW [35] | CW [36] | CW [37] | CW [38] | CW [46] | CW [47] | CW [48] | CW [49] | Proposed
Yeast 143 | 0.6555 | 0.7593 | - | - | 0.76 | - | 0.697 | 0.7573 | 0.82228
Yeast 246 | - | - | 0.718 | 0.743 | 0.72 | - | 0.683 | - | 0.70828
Yeast 908 | 0.8674 | 0.8494 | 0.954 | 0.937 | 0.9 | - | 0.863 | 0.9231 | 0.94212
Yeast 912 | - | - | - | - | 0.74 | - | 0.627 | 0.7263 | 0.7816
Yeast 914 | - | - | - | - | 0.8 | 0.7642 | 0.76 | - | 0.81359
Yeast 935 | - | 0.789 | - | - | 0.81 | - | 0.779 | - | 0.7854
Yeast 3057 | - | - | - | - | 0.73 | 0.6605 | 0.657 | - | 0.69037
Yeast 3273 | 0.9601 | - | - | 0.962 | 0.96 | 0.939 | 0.948 | - | 0.98711

CW [n] denotes the compared work cited as reference [n]; a dash (-) indicates that the corresponding study did not report a result for that dataset.
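The significance markers (*) in Tables 4–7 come from paired t-tests between the proposed algorithm and the baseline. A minimal sketch of such a test on per-run G-Mean scores follows; the five score pairs here are invented purely for illustration.

```python
# Sketch of the paired t-test behind the significance markers in Tables 4-7;
# the per-run G-Mean scores below are invented example values.
from scipy.stats import ttest_rel

baseline_gmean = [0.792, 0.801, 0.788, 0.795, 0.790]
proposed_gmean = [0.812, 0.820, 0.805, 0.818, 0.809]

t_stat, p_value = ttest_rel(proposed_gmean, baseline_gmean)
print(f"t={t_stat:.3f}, p={p_value:.5f}")  # e.g., p < 0.05 -> mark the result with *
```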