Article

Oversampling Algorithm Based on Improved K-Means and Gaussian Distribution

1 School of Science, Xi’an Shiyou University, Xi’an 710065, China
2 School of Computer Science, Xi’an Shiyou University, Xi’an 710065, China
* Authors to whom correspondence should be addressed.
Information 2026, 17(1), 28; https://doi.org/10.3390/info17010028
Submission received: 9 November 2025 / Revised: 13 December 2025 / Accepted: 19 December 2025 / Published: 31 December 2025

Abstract

Oversampling is a common and effective approach to the classification problem of imbalanced data. However, traditional oversampling methods are prone to generating overlapping or noisy samples. Clustering can alleviate these problems to a certain extent, but the quality of the clustering results has a significant impact on the final classification performance. To address this problem, this paper proposes an oversampling algorithm (CSKGO) based on the Gaussian distribution oversampling algorithm and a K-means clustering algorithm combining compactness and separateness. The algorithm first clusters the minority samples with the compactness-and-separateness K-means algorithm, constructing a cluster compactness index and an inter-cluster separateness index to obtain the optimal number of clusters and the clustering results, thereby capturing the local distribution characteristics of the minority samples. Secondly, a sampling ratio is assigned to each cluster based on the compactness of the clustering results to determine the number of samples to generate for each minority-class cluster. Then, the mean vector and covariance matrix of each cluster are calculated, and the Gaussian distribution oversampling algorithm is used to generate new samples that match the distribution characteristics of the real minority samples; these are combined with the majority samples to form balanced data. To verify the effectiveness of the proposed algorithm, 24 datasets were selected from the University of California Irvine (UCI) Repository and oversampled using the proposed CSKGO algorithm and other oversampling algorithms, respectively. Finally, these datasets were classified using Random Forest, Support Vector Machine, and K-Nearest Neighbor classifiers.
The results indicate that the algorithm proposed in this paper achieves higher accuracy, F-measure, G-mean, and AUC values and can effectively improve the classification performance on imbalanced datasets.

1. Introduction

Imbalanced data refers to the situation where the number of samples in one class is much larger than the number of samples in other classes. The classification problem of imbalanced data is prevalent in many fields, including medical diagnosis [1], financial fraud detection [2], fault detection [3], and social network analysis. In such cases, the minority samples usually carry more important information, such as positive cases in early disease prediction, and fraud information in financial fraud detection, among others. This makes the classification performance on the minority class particularly important. Traditional classification algorithms (such as Logistic Regression [4], Support Vector Machines [5], and Decision Trees [6]) perform well when dealing with balanced data, but for imbalanced data, the models often tend to improve the prediction accuracy for the majority class, resulting in a significant reduction in the recognition rate of the minority class. Focusing solely on overall accuracy while neglecting minority class performance can lead to severe consequences. For example, in the medical field, the misdiagnosis rate may rise due to the neglect of positive cases (the minority class), thus affecting the diagnosis and treatment of patients. Moreover, in the financial domain, ignoring fraudulent transactions (the minority class) can lead to huge financial losses. Therefore, improving the classification performance of imbalanced data and the recognition accuracy of minority class samples is of great significance for imbalanced data classification.
The solution to the classification problem of imbalanced datasets is mainly divided into the data level and algorithm level. At the data level, the common methods include oversampling the minority class samples and undersampling the majority class samples. Undersampling methods generally achieve a data balance by either randomly deleting the majority samples or applying clustering algorithms to select representative ones. Zhang et al. [7] propose the fuzzy rough set-based undersampling method (USFRS), which selects more representative samples from the majority class by considering the fuzzy relationship between the K-nearest neighbors of the majority class samples and the minority class samples. Since USFRS relies on sample relationships, noise in the majority class may distort fuzzy relation assessments. This interference can lead to the incorrect selection of majority class samples as representative samples, thereby reducing classification efficiency. Bhattacharya et al. [8] propose an algorithm combining clustering and undersampling. First, the majority class is clustered, and the cluster centers are selected as representative samples to train the classifier alongside minority samples. However, this algorithm also has its shortcomings. If the clustering algorithm is improperly selected or the clustering parameters are not set reasonably, the cluster center points may not represent the majority class samples well, thus affecting the performance of the classifier.
In addition, oversampling algorithms are widely adopted to address the classification of imbalanced data by generating synthetic samples for the minority class to balance the dataset. These methods have received significant attention as a powerful solution for imbalanced data classification. The Synthetic Minority Oversampling Technique (SMOTE), proposed by Chawla et al. [9], is one of the most widely cited and applied oversampling methods. SMOTE employs stochastic linear interpolation to generate synthetic samples between minority class instances and their K-nearest neighbors, thereby balancing the dataset. However, in the process of synthesizing samples using the SMOTE algorithm, samples may be significantly influenced by noise points or outliers. Consequently, the newly generated samples may diverge from the true distribution of the minority class, thereby potentially degrading the performance of the classifier. Subsequent studies have proposed various SMOTE-based enhancements. Dong Yanjie et al. [10] propose Random-SMOTE to address SMOTE’s limitations in handling sparse imbalanced data. This method selects two minority class samples and generates synthetic instances within the triangular region formed by them. Although the Random-SMOTE algorithm enhances the processing capability of sparsely distributed samples to some extent, its sampling region is relatively constrained and exhibits significant randomness. Consequently, the generated samples may lack representativeness and still cannot adequately reflect the overall distribution characteristics of the minority class samples. Han et al. [11] propose the Borderline-SMOTE algorithm, which synthesizes samples near the class boundary of the minority class to enhance classifier recognition. However, Borderline-SMOTE primarily focuses on boundary samples, which can lead to the misidentification of noise samples as boundary samples during synthesis, thereby introducing potential noise. He et al. 
[12] propose the Adaptive Synthetic Sampling (ADASYN) algorithm, which enhances the synthesis of minority class samples through an adaptive approach, thus improving the model’s classification performance for imbalanced data. However, the ADASYN algorithm tends to generate an excessive number of synthetic samples in noise-sensitive areas, resulting in an increase in noise points and overlapping samples, which in turn compromises the diversity of the synthetic samples. Beyond data-level methods (e.g., undersampling and oversampling), algorithmic-level approaches (such as cost-sensitive learning [13], ensemble learning [3], and transfer learning [14]) are also critical for imbalanced data classification.
In conclusion, while the existing oversampling techniques have partially addressed the classification problems of imbalanced datasets, they still face several limitations. Some oversampling algorithms are particularly vulnerable to noise and outliers. If noisy samples are selected as parent samples during oversampling, additional noise may be generated, degrading classifier performance. Additionally, the sample overlap problem may be exacerbated by oversampling algorithms, especially in datasets where class overlap exists. Some oversampling algorithms generate new samples by interpolation, which may introduce more overlapping areas between the majority and minority classes, reducing classification effectiveness. In addition, the lack of diversity in samples generated by oversampling algorithms also affects the classification efficiency of classifiers.
To address these common issues in both undersampling and oversampling—such as high sensitivity to noise and outliers, failure to generate representative samples, persistence of inter-class overlap after sampling, and loss of informative instances—an effective improvement approach is to introduce cluster analysis. Through reasonable clustering of majority class samples in the undersampling algorithm or minority class samples in the oversampling algorithm, the data distribution characteristics of the original datasets can be effectively retained, and the sensitivity to noise points or outliers can be reduced. In addition, some studies [15,16] have also proved that the combination of clustering algorithms and sampling algorithms can reduce the overlapping regions at the boundary of the majority class and minority class samples.
While combining sampling with clustering algorithms can address these issues, clustering itself faces challenges such as high sensitivity to initial parameters (leading to unstable results) and difficulty in achieving global optima due to improper parameter settings.
Therefore, it is of great significance to improve the sampling effect by improving the clustering effect and by generating new samples that keep the original data distribution characteristics, are not easily affected by noise points, and are less likely to create additional overlapping regions. To improve the clustering effect, new clustering indices should be proposed, and the clustering algorithm and the oversampling algorithm should then be integrated effectively to obtain better classification performance on imbalanced datasets.
To achieve the above objectives, this paper proposes an oversampling algorithm (CSKGO), which is based on the Gaussian distribution oversampling algorithm (GOS) and the K-means clustering algorithm combining compactness and separateness (CSK-means). Firstly, the K-means algorithm is optimized by using the cluster compactness index and inter-cluster separateness index to improve the accuracy and stability of clustering. Then, minority class samples are clustered using the improved CSK-means algorithm so as to maintain the local distribution characteristics of the minority class samples as far as possible. The sampling ratio is allocated for each cluster according to the compactness of the clustering results. Finally, the mean vectors and covariance matrix of each cluster are calculated, and the standard Gaussian distribution samples are transformed into new samples that conform to the distribution characteristics of the minority class samples by using linear transformation. The balanced datasets are obtained by the above data processing, then classification is performed on this basis, and a high classification accuracy is obtained. The main contributions of this paper are as follows:
  • It improves the K-means algorithm by introducing a new inter-cluster separateness index and calculates the clustering effectiveness index by combining cluster compactness to obtain the optimal number of clusters k and better clustering results.
  • In order to make the newly generated samples more consistent with the distribution characteristics of the original minority class samples, the improved K-means algorithm is used to cluster the minority class samples in order to maintain the local distribution characteristics of the minority class samples as much as possible.
  • In order to improve the quality of the samples and reduce the influence of noise points, the cluster compactness index is utilized to allocate the sampling ratio for each cluster so that the sampling ratio is more inclined to the clusters with a high cluster compactness index, thus generating more representative samples.
  • The experiments are conducted on 24 public datasets from the University of California Irvine (UCI) Repository, and the experimental results show that the algorithm proposed in this paper effectively improves the classification performance of imbalanced datasets.

2. Related Work

In recent years, a variety of oversampling techniques have been proposed to address the challenges of imbalanced learning, particularly in the presence of noise, class overlap, and high dimensionality. Early efforts, such as SMOTE [9] and its variants (e.g., Borderline-SMOTE [11], ADASYN [12]), rely on linear interpolation within local neighborhoods of minority samples. However, these methods often generate synthetic instances in overlapping or noisy regions, degrading classifier performance.
To mitigate these issues, recent studies have explored more sophisticated synthesis strategies that go beyond simple interpolation. For instance, Yan et al. [17] propose a method that leverages the Mahalanobis distance and majority class density contours to guide sample generation in highly overlapped scenarios with extremely scarce minority samples (MLOS). By constraining synthesis using probability density similarity, MLOS effectively reduces boundary ambiguity. Similarly, Tang et al. [18] introduce an instance gravity model (MOSIG) inspired by physical laws, where instance “attraction” is used to weight seed selection—demonstrating strong robustness in software defect prediction tasks. In high-dimensional settings, Yang et al. [19] combine manifold learning, entropy-based weighting, and Beta-distribution SMOTE to adaptively generate fewer but more informative samples, significantly alleviating class overlap. Meanwhile, for multi-label imbalance, Liu et al. [20] develop MLONC, an oversampling method for multi-label data based on natural neighbors and label correlation, which uses natural neighbors to adapt the neighborhood size per label and incorporates label correlation during synthetic label assignment, improving semantic consistency.
Although these methods perform well in specific scenarios, they usually treat the minority class as an internally consistent whole and generate new samples within its neighborhood. However, the minority class is often not uniformly distributed in the feature space but rather presents multiple separate or partially overlapping local aggregation regions. When oversampling is performed, ignoring this internal structural difference and directly interpolating between different aggregation regions can easily result in samples that do not conform to the actual data distribution. Meanwhile, they lack the ability to distinguish between areas with blurred category boundaries or those affected by noise, which may result in the generation of harmful samples and thereby undermine the classification performance. To overcome the above limitations, some research has tended to introduce clustering mechanisms. Before oversampling, clustering is conducted to obtain the distribution structure of the minority class, and sample generation is only carried out within the sub-clusters. Wang et al. [15] propose an oversampling algorithm based on adaptive density difference peak clustering and spatial distribution entropy (ADDPC-SMOTE). It obtains the spatial distribution of the two classes of samples and uses the local density difference to perform peak clustering of the minority class to avoid class overlap. Next, boundary and sparse samples are identified based on local density differences, and sampling probabilities are assigned to minority samples accordingly. Finally, the spatial distribution entropy is utilized to evaluate the synthetic samples, to ensure the balance of the distribution of the two classes of samples. Zhang et al. [21] propose the FSDR-SMOTE algorithm to address the effect of noise and fuzzy boundaries on classification. Firstly, the noisy points are removed using the Tukey criterion. 
Then, the standard deviation of the features is calculated as the degree of discretization to detect the sample locations. Finally, the minority class samples are divided into sub-clusters using the K-means clustering algorithm, and for each sub-cluster, new samples are generated based on random samples, boundary samples, and corresponding sub-cluster centers. Lv et al. [22] propose the ISODATA-SMOTE algorithm, which iteratively self-organizes the dynamic clustering of a minority class into different sub-clusters and then oversamples each sub-cluster to balance the data according to the sampling ratio. Zhao et al. [23] propose a hybrid sampling algorithm based on the Conditional Tabular generative adversarial network (CTGAN) and the KNN algorithm. The algorithm utilizes the density-based spatial clustering algorithm DBSCAN to remove the noise samples and boundary samples and then uses CTGAN to generate new samples that match the original data distribution. Qin et al. [24] propose an oversampling algorithm based on hierarchical clustering and improved SMOTE. It first performs hierarchical clustering on the samples of each class separately; sampling weights are then determined by the number of samples in each minority-class cluster, while the probability distribution of each minority-class cluster is computed from the distances of the minority samples to their neighboring majority samples, and the two are combined to generate samples. Chen et al. [25] propose an imbalanced data oversampling algorithm combining Weighted K-means (WK-Means) and SMOTE, which adds weight factors to the features and improves the clustering results by calculating the weight values of different features to identify and remove unimportant variables.
It also combines the clustering consistency coefficient to find the minority class samples that are in close proximity to the cluster boundaries for SMOTE oversampling, thus improving the classification accuracy of the classifier. Dong et al. [26] propose an algorithm for classifying imbalanced data based on Density Peak Clustering (DPC) resampling combined with Extreme Learning Machine (ELM). The representative samples of the majority class are selected, and synthetic samples belonging to the minority class are created. The influence of noise is taken into account when selecting samples of the majority class, which can effectively solve the problem of noise misjudgment.
The above algorithms are used to conduct a more reasonable division of minority class samples through clustering (such as DPC, K-means, ISODATA, etc.) so that synthetic samples that better conform to the local distribution can be generated in each sub-cluster. Moreover, techniques such as spatial entropy, feature standard deviation, and GAN are introduced to improve the rationality and diversity of the generated samples. However, some of these algorithms may be quite sensitive to the selection of certain parameters. For example, the parameters in the DPC algorithm, the parameter k in the K-means algorithm, and the hyperparameters of the GAN significantly affect performance. Inappropriate parameters may lead to poor clustering results or a decline in the quality of the generated samples, resulting in unsatisfactory classification results.
In summary, although some progress has been made in the research of clustering-based oversampling techniques, unstable clustering results may lead to the generation of samples that do not conform to the distribution of the original data, resulting in a lower sample quality. To solve that problem, this paper proposes an oversampling algorithm based on CSK-means clustering and Gaussian distribution. The algorithm optimizes K-means by incorporating both a cluster compactness index and an inter-cluster separateness index during minority-class clustering, and it dynamically allocates the sampling ratio based on cluster compactness to generate new samples that better conform to the underlying distribution of the minority class.

3. Proposed Algorithm

In imbalanced classification, the number of samples for the minority class is usually relatively small. These samples often exhibit complex distribution characteristics and may also be affected by noise and outliers. To alleviate the imbalance of data, researchers often employ the oversampling strategy to generate synthetic samples in order to enhance the recognition ability of the minority class. Among them, the methods based on probability distribution modeling (such as Gaussian distribution) have been widely explored due to their ability to capture the overall statistical characteristics of the data: by estimating the mean and covariance matrix of the minority class samples, a multivariate Gaussian distribution is fitted, and new samples are generated by sampling from it. However, when the minority class contains noise or outliers, it will distort the estimated mean and covariance, causing the generated synthetic samples to deviate from the true underlying data distribution, ultimately reducing the classification performance. Using algorithms such as K-means clustering to subdivide the minority class samples can help isolate the noise samples, as clustering can identify the local distribution characteristics of the minority class samples. Then, oversampling each sub-cluster with the Gaussian distribution can generate new samples that are more consistent with the distribution characteristics of the minority class samples. However, too many or too few clusters will lead to unsatisfactory clustering results and finally result in an irrational distribution of the generated data samples, so it is necessary to choose the number of clusters reasonably. However, selecting an optimal number of clusters remains a key challenge for clustering algorithms.
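As a concrete illustration of the distribution-based strategy described above, the following Python sketch fits a multivariate Gaussian to a set of minority samples and draws synthetic points from it. The function name and the small covariance ridge are illustrative assumptions added for numerical stability, not part of the paper's exact procedure:

```python
import numpy as np

def gaussian_oversample(X_min, n_new, seed=0):
    """Estimate the mean vector and covariance matrix of the minority
    samples, then draw n_new synthetic samples from N(mu, Sigma)."""
    rng = np.random.default_rng(seed)
    mu = X_min.mean(axis=0)
    # A tiny ridge keeps Sigma positive definite for very small clusters
    # (an added assumption, not from the paper).
    sigma = np.cov(X_min, rowvar=False) + 1e-6 * np.eye(X_min.shape[1])
    return rng.multivariate_normal(mu, sigma, size=n_new)

# Toy minority class: four nearby points in a 2-D feature space.
X_min = np.array([[1.0, 2.0], [1.2, 1.9], [0.9, 2.2], [1.1, 2.1]])
X_new = gaussian_oversample(X_min, n_new=10)
print(X_new.shape)  # (10, 2)
```

Because the synthetic points follow the estimated distribution rather than line segments between existing points, they spread over the whole region occupied by the cluster; this is exactly why a distorted mean or covariance (caused by noise) propagates directly into the generated samples.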
Based on the above problems, this paper proposes an oversampling algorithm based on the CSK-means algorithm and Gaussian distribution oversampling, which is called the CSKGO algorithm. Its core lies in the synergistic use of CSK-means clustering and Gaussian distribution oversampling to sequentially address the challenges of noise, complex distribution, and cluster number determination. Specifically, CSK-means first adaptively partitions the minority class into homogeneous sub-clusters, effectively isolating noise and identifying local patterns while determining the optimal number of clusters. Then, Gaussian oversampling is applied locally within each sub-cluster, ensuring the generated synthetic samples accurately reflect the true underlying data distribution of each region. Therefore, the implementation of the CSKGO algorithm includes two stages: adaptive clustering and analysis, and localized sample generation.
Firstly, the CSK-means algorithm is used to cluster the minority samples; the optimal number of clusters and the clustering results are obtained by evaluating the clusters with the combined compactness and separateness indices, which yields the local distribution characteristics of the minority class samples. After that, the sampling ratio of each cluster is assigned according to the compactness of the clustering results. Then, the Gaussian distribution oversampling algorithm is utilized to convert generated standard Gaussian samples into new samples that conform to the true minority class distribution. Finally, balanced data are obtained by merging with the majority class samples.
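The two-stage workflow above can be sketched in Python as follows. This is only an illustrative outline: plain Lloyd's K-means with a fixed k stands in for CSK-means, and the compactness-based ratio allocation is simplified to inverse-mean-distance weighting, so both are assumptions rather than the paper's exact procedure:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal Lloyd's algorithm (illustrative stand-in for CSK-means)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == c].mean(0) if (labels == c).any()
                            else centers[c] for c in range(k)])
    return labels

def cskgo_sketch(X_min, X_maj, k=2, seed=0):
    """Cluster the minority class, allocate per-cluster sampling ratios by
    compactness, and oversample each sub-cluster with a local Gaussian."""
    rng = np.random.default_rng(seed)
    n_total = len(X_maj) - len(X_min)          # synthetic samples needed
    labels = kmeans(X_min, k, seed=seed)
    clusters = [X_min[labels == c] for c in range(k)]
    # Keep only clusters large enough to estimate a covariance matrix.
    eligible = [C for C in clusters if len(C) >= 2]
    if not eligible:
        return X_min
    # More compact clusters (smaller mean distance to the center) receive a
    # larger share of the synthetic samples (simplified weighting).
    inv = np.array([1.0 / (np.linalg.norm(C - C.mean(0), axis=1).mean() + 1e-12)
                    for C in eligible])
    weights = inv / inv.sum()
    synth = []
    for C, w in zip(eligible, weights):
        n_c = int(round(w * n_total))
        if n_c == 0:
            continue
        mu = C.mean(0)
        sigma = np.cov(C, rowvar=False) + 1e-6 * np.eye(C.shape[1])
        synth.append(rng.multivariate_normal(mu, sigma, size=n_c))
    X_syn = np.vstack(synth) if synth else np.empty((0, X_min.shape[1]))
    return np.vstack([X_min, X_syn])
```

The returned array (original plus synthetic minority samples) would be stacked with `X_maj` to form the balanced training set.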

3.1. K-Means Clustering Algorithm Combining Compactness and Separateness

K-means [27] is a classical clustering algorithm that divides the dataset into k clusters by iteration, making the samples in each cluster as close as possible to the cluster center. It is simple to implement, converges quickly, and can effectively handle large datasets. However, in practice, the K-means algorithm also has some clear shortcomings. First, the number of clusters k must be set in advance, and choosing the right value of k is crucial for obtaining good clustering results. Second, the randomly selected k initial cluster centroids can lead to problems such as susceptibility to local optima, instability of the clustering results, and sensitivity to noise points. Therefore, this paper improves the cluster compactness index and the inter-cluster separateness index and uses the improved indices to evaluate the effect of K-means clustering, so as to determine the optimal number of clusters k. The cluster compactness index measures the tightness within a cluster via the distances of its samples to the cluster center, while the inter-cluster separateness index measures the degree of separation between two different clusters. Through these two indices, the clustering effect can be better evaluated to select the optimal value of k.
When clustering, if the number of clusters k is too small, each cluster will contain more samples and the variability within a cluster increases; the cluster center may be pulled by samples with different characteristics, which increases the average distance from the samples to the cluster center and thus increases the cluster compactness index, while the inter-cluster separateness index is affected much less. As a result, the compactness index of the clustering result is high, the samples are divided unevenly, and the differences within clusters are large, leading to unsatisfactory clustering results. Conversely, if the number of clusters k is too large, the distance between the divided clusters may shrink, meaning that a given region of radius r may contain sample points from multiple clusters; the inter-cluster separateness index then becomes small while the cluster compactness index does not change significantly. The clustering result therefore exhibits overlap between different clusters because of the low inter-cluster separateness index, yielding poor-quality clustering results.

3.1.1. Related Definitions of Clustering

Given a dataset $X = \{x_1, x_2, \dots, x_n\}$, the clustering result is $CS = \{C_1, C_2, \dots, C_k\}$, in which $C_i$ is cluster $i$; the cluster centers are $v = \{v_1, v_2, \dots, v_k\}$, and $v_i$ is the cluster center of $C_i$.
(1)
Cluster compactness index
Definition 1.
The cluster compactness index reflects the tightness within a cluster through the distances between samples in the same cluster and plays a key role in evaluating the quality of clustering results. The smaller the compactness index, the closer the samples in the same cluster are to each other and the more compact the cluster is. Assuming that a cluster contains $n$ samples $x_1, x_2, \dots, x_n$, the cluster compactness index can be expressed as [28]
$$\mathrm{Com}(C_i) = \frac{1}{n} \sum_{x_j \in C_i} \mathrm{dist}(x_j, v_i)$$
where $v_i$ is the cluster center of cluster $C_i$, and $\mathrm{dist}(x_j, v_i)$ is the distance from sample $x_j$ to the cluster center $v_i$.
The compactness of the clustering result can be defined as a measure of the compactness of the samples in the whole clustering result and is calculated as follows [28]:
$$\mathrm{comcs}(CS) = \frac{\max_{i = 1, 2, \dots, k} \mathrm{Com}(C_i)}{\sum_{i=1}^{k} \mathrm{Com}(C_i)}$$
This approach does not simply take the compactness index of a single arbitrary cluster to represent the global cluster compactness. First, the cluster compactness index of each cluster $C_i$ is calculated, reflecting how tightly its samples are grouped. Then, the maximum of these indices over all clusters is selected as the global cluster compactness index of the clustering result. Using this maximum value (the worst case of compactness) as the global measure is a relatively conservative evaluation criterion: it represents the cluster in which dispersion is most severe and which is least conducive to maintaining tightly grouped clusters. If this global cluster compactness index is small, then even the least compact cluster remains tightly packed, indicating that the whole clustering result is relatively compact. Consequently, this metric serves as a global and comprehensive indicator of cluster cohesion quality, prioritizing the most critical weak point rather than relying on average or best-case performance.
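Under the reading of the formulas above (the maximum per-cluster compactness, normalized by the sum over all clusters), the two indices can be computed as in the following sketch; the function names are illustrative:

```python
import numpy as np

def com(cluster, center):
    """Per-cluster compactness: mean distance of the samples to the center."""
    return np.linalg.norm(cluster - center, axis=1).mean()

def comcs(clusters, centers):
    """Global compactness: the worst-case (maximum) per-cluster compactness,
    normalized by the sum over all clusters."""
    vals = [com(C, v) for C, v in zip(clusters, centers)]
    return max(vals) / sum(vals)

tight = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])
loose = np.array([[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
centers = [tight.mean(0), loose.mean(0)]
score = comcs([tight, loose], centers)
print(0.0 < score <= 1.0)  # True
```

Note that because the loose cluster dominates the maximum, `score` is close to 1; a clustering whose clusters are all equally compact would drive the index toward 1/k.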
(2)
Inter-cluster separateness index
The traditional cluster compactness index focuses only on the compactness of the samples within a cluster and ignores the separateness between clusters, which makes it difficult to obtain optimal clustering results. To solve this problem, a separateness index based on radius-$r$ neighbors is proposed in this paper. This index assesses the separateness between different clusters by analyzing how closely samples from different clusters are connected: if the radius-$r$ neighborhood of a cluster's center contains no sample points from another cluster, the two clusters are not closely connected and have a high degree of inter-cluster separateness. Conversely, if the radius-$r$ neighborhood of a cluster's center contains samples belonging to another cluster, the two clusters are closely related and the degree of separateness between them is small. To facilitate the description of the method, the following definitions are made:
Definition 2.
Radius-$r$ neighbors form the set of $m$ samples whose distance from the cluster center $v$ does not exceed the radius $r$, denoted $\mathrm{Neig}_r(v)$. The radius $r$ is obtained by multiplying the distance between the centers of the two classes by a coefficient $\beta$, for which 0.7 was found to be optimal after many experiments.
Definition 3.
The inter-cluster separateness index indicates the degree of separation between two different clusters and is determined by counting the sample points of other clusters that fall within the radius-$r$ neighborhood of a cluster center. The inter-cluster separateness index between two clusters $C_i$ and $C_j$ is denoted $\mathrm{sep}(C_i, C_j)$, with the formula
$$\mathrm{sep}(C_i, C_j) = \frac{\mathrm{dist}(v_i, v_j)}{\sum_{i=1}^{m} f(v_j, x_i)}$$
where $\mathrm{dist}(v_i, v_j)$ is the distance between the cluster centers of $C_i$ and $C_j$, $v_j$ is the cluster center of $C_j$, and $f(v_j, x_i)$ is calculated as follows:
$$f(v_j, x_i) = \begin{cases} 1, & x_i \in C_i \text{ and } x_i \in Neig_r(v_j) \\ 0, & x_i \in C_i \text{ and } x_i \notin Neig_r(v_j) \end{cases}$$
The larger the inter-cluster separateness index, the higher the inter-cluster separation of the two clusters, and the farther the sample points are from each other in the feature space.
Definition 4.
Global cluster separateness index: for each cluster, the average of its inter-cluster separateness indices with respect to all other clusters (i.e., their sum divided by $k-1$) is calculated first; all clusters are then traversed and the minimum of these averages is taken. This index reflects the overall inter-cluster separateness. The formula for calculating the separateness index of the clustering results is as follows:
$$sep_{cs}(CS) = \min_{i=1,2,\dots,k} \left\{ \frac{\sum_{j=1, j \ne i}^{k} sep(C_i, C_j)}{k-1} \right\}$$
In the above formula, the minimum value (corresponding to the worst case of separability) is taken as the global cluster separateness measure, which is a relatively conservative evaluation criterion: it reflects the pair of clusters most prone to confusion or overlap in the whole clustering result. If this global cluster separateness is large, then even in the most unfavorable local region (where clusters are hardest to separate) the clusters are well separated from each other, indicating that the whole clustering result is well separated. Consequently, this metric serves as a global and comprehensive indicator of inter-cluster separation quality, prioritizing the most critical weak point rather than relying on average or best-case performance.
The larger the global cluster separateness index, the higher the separateness of the clusters, meaning that different clusters are farther apart and overlap less and the cluster structure is clear.
Definition 5.
Cluster validity index (CVIN). A smaller cluster validity index means that the cluster compactness index is smaller or the cluster separateness index is larger; in that case, the clustering result has high compactness and high separateness and is therefore better. The formula is as follows:
$$CVIN(CS) = \frac{com_{cs}(CS)}{sep_{cs}(CS)}$$

3.1.2. Implementation Steps of the K-Means Clustering Algorithm Combining Compactness and Separateness

The algorithm first clusters the minority class samples using the K-means clustering algorithm with an initial number of clusters. Based on the clustering results, the cluster compactness index and the inter-cluster separateness index are calculated according to Equations (2) and (5), and the current cluster validity index is then obtained from Equation (6). While $k \le \sqrt{N}$, the above steps are repeated and the number of clusters $k$ is increased step by step. After the iteration is completed, the cluster validity indices of the clustering results under all values of $k$ are compared to determine the optimal number of clusters, as shown in Algorithm 1.
Algorithm 1 CSK-means clustering algorithm
Input: The minority class samples $X_{min}$ of dataset $X$
Output: Optimal number of clusters $k$ and clustering results
  • Initialize $k = 2$, $N =$ number of minority class samples
  • while $k \le \sqrt{N}$ do
  •          $CS = kmeans(X_{min}, k)$ // Perform K-means clustering to get the clustering result
  •          $Coms = com_{cs}(CS)$ // Calculate the cluster compactness index by Equations (1) and (2)
  •          $Seps = sep_{cs}(CS)$ // Calculate the cluster separateness index by Equations (3)–(5)
  •          $CVIN = Coms / Seps$ // Calculate the cluster validity index by Equation (6)
  •          $k = k + 1$
  • end while
  • $k = \arg\min(CVIN)$
  • $clustering\ result = CS(\arg\min(CVIN))$
  • return $k$, clustering results
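As a concrete illustration, the search loop of Algorithm 1 can be sketched in Python as follows. The compactness index is approximated here as the largest average sample-to-centroid distance (Equations (1) and (2) are not reproduced in this excerpt), the separateness follows Definitions 2–4 with $\beta = 0.7$, and all function and variable names are illustrative rather than the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

BETA = 0.7  # neighbourhood radius factor from Definition 2

def com_cs(X, labels, centers):
    # Approximate compactness: the largest average sample-to-centroid
    # distance over all clusters (the paper's Eqs. (1)-(2) are not shown here).
    avg = [np.linalg.norm(X[labels == c] - centers[c], axis=1).mean()
           for c in range(len(centers))]
    return max(avg)

def sep_cs(X, labels, centers):
    # Global separateness (Definitions 3-4): average each cluster's pairwise
    # separateness to all other clusters, then take the minimum.
    k = len(centers)
    per_cluster = []
    for i in range(k):
        vals = []
        for j in range(k):
            if j == i:
                continue
            d = np.linalg.norm(centers[i] - centers[j])
            r = BETA * d  # radius r neighbourhood of v_j
            Xi = X[labels == i]
            count = int((np.linalg.norm(Xi - centers[j], axis=1) <= r).sum())
            vals.append(d / max(count, 1))  # guard against an empty neighbourhood
        per_cluster.append(np.mean(vals))
    return min(per_cluster)

def csk_means(X_min, random_state=0):
    # Search k in [2, int(sqrt(n))] and keep the clustering minimizing CVIN.
    n = len(X_min)
    best = None
    for k in range(2, max(3, int(np.sqrt(n)) + 1)):
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X_min)
        cvin = (com_cs(X_min, km.labels_, km.cluster_centers_)
                / sep_cs(X_min, km.labels_, km.cluster_centers_))
        if best is None or cvin < best[0]:
            best = (cvin, k, km)
    return best[1], best[2]
```

Calling `csk_means` on the minority class samples returns the number of clusters minimizing CVIN together with the corresponding fitted K-means model.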

3.2. Gaussian Distribution Oversampling Algorithm

The Gaussian distribution is an important continuous probability distribution that is widely used in statistics, natural science, social science, and other fields. Its curve is bell-shaped and symmetrical: the variable takes values near the center with the highest probability, and the probability decreases gradually as values move away from the center. If a random variable obeys a normal distribution with mathematical expectation $\mu$ and variance $\sigma^2$, it can be written as $X \sim N(\mu, \sigma^2)$, and its probability density function is
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
where $x$ is the random variable, $\mu$ and $\sigma^2$ are the mean and variance parameters of the distribution, and $\sigma$ is the standard deviation. The sample mean ($\mu$) and standard deviation ($\sigma$) are defined as follows:
$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$$
$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2}$$
The multivariate Gaussian distribution is a generalization of the Gaussian distribution to multidimensional space; it is the joint probability distribution of two or more correlated variables. Its formula is as follows:
$$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})}$$
where $\mathbf{x}$ is the vector of random variables, $\boldsymbol{\mu}$ is the mean vector, $\Sigma$ is the $d \times d$ positive definite covariance matrix, $\Sigma^{-1}$ is its inverse, $|\Sigma|$ is the determinant of the covariance matrix, $d$ is the dimension, and $T$ denotes the transpose.
Covariance is a statistical measure used to express the relationship between two random variables. It reflects the degree of association between changes in one variable and changes in another variable and is an important tool for understanding the relationship between variables in a dataset. The formula for the covariance between any two variables x and y is as follows:
$$Cov(x, y) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
where x ¯ is the mean value of feature x , and y ¯ is the mean value of feature y .
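As a quick sanity check, the biased (divide-by-$n$) covariance of Equation (11) coincides with NumPy's `np.cov` when `bias=True`; the numbers below are arbitrary illustrative data.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 9.0])

# Eq. (11): sum of centered products divided by n (biased estimator)
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / len(x)

# np.cov with bias=True uses the same divide-by-n convention
assert np.isclose(cov_xy, np.cov(x, y, bias=True)[0, 1])
```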
The Gaussian distribution oversampling (GOS) [29] algorithm is a method that uses the Gaussian distribution to generate new samples, which is especially suitable for dealing with imbalanced datasets. New samples are generated by simulating the distribution of the initial data.
Taking multidimensional data as an example, the parent samples of the minority class, which are used in the oversampling, are first obtained, and their mean vector and covariance matrix are calculated according to Equations (8) and (11). Then, samples drawn from the standard multivariate normal distribution are transformed by a linear transformation into multivariate Gaussian samples that conform to the distribution of the initial minority class samples, yielding the final sample data. The formula for generating new samples from the multivariate Gaussian distribution is as follows:
$$X_{new} = \mu + Az$$
where $A$ satisfies $AA^T = \Sigma$ (the covariance matrix), and $z$ is a random sample drawn from the standard multivariate normal distribution.
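A minimal sketch of this generation step, assuming the Cholesky factorization is used to obtain $A$ with $AA^T = \Sigma$ (any matrix square root of $\Sigma$ would do); the small diagonal ridge is our addition to keep the estimated covariance positive definite.

```python
import numpy as np

def gaussian_oversample(X_parent, n_new, random_state=0):
    """Generate n_new samples matching the mean/covariance of X_parent."""
    rng = np.random.default_rng(random_state)
    mu = X_parent.mean(axis=0)
    # bias=True divides by n, matching the covariance definition in Eq. (11)
    sigma = np.cov(X_parent, rowvar=False, bias=True)
    d = X_parent.shape[1]
    # ridge (assumption) keeps sigma positive definite for Cholesky
    A = np.linalg.cholesky(sigma + 1e-9 * np.eye(d))
    z = rng.standard_normal((n_new, d))  # standard multivariate normal draws
    return mu + z @ A.T                  # X_new = mu + A z, applied row-wise
```

Because each new sample is an affine transform of a standard normal draw, the generated set reproduces the parent samples' mean and covariance structure rather than interpolating between individual points as SMOTE-style methods do.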

3.3. Introduction of the Proposed Algorithm

Firstly, the CSK-means algorithm is used to cluster the minority class; the clustering results are evaluated by calculating the cluster compactness index and the inter-cluster separateness index, and the optimal clustering result with $k$ clusters is obtained. Then, the sampling ratio of each sub-cluster is allocated according to the compactness index of the clustering result: the smaller the compactness index, the larger the sampling ratio allocated to the sub-cluster. The number of samples generated for each minority class sub-cluster is determined by multiplying the difference between the number of majority class samples and the number of minority class samples by the sampling proportion. The means and covariances of the features in each cluster are calculated, and the mean vector ($\mu$) and covariance matrix ($\Sigma$) are constructed. Standard Gaussian distribution samples are then transformed by a linear transformation into new samples that conform to the distribution of the minority class samples. Finally, a balanced dataset is constructed to improve the classification accuracy. Algorithm 2 works as follows:
Algorithm 2 Oversampling algorithm based on CSK-means algorithm and Gaussian distribution (CSKGO)
Input: Imbalanced dataset X
Output: balanced dataset X
$X = X_{min} + X_{maj}$ // Divide the dataset into minority class and majority class
Initialize $k = 2$; $N_{maj}$, $N_{min}$ = number of majority and minority class samples
while $k \le \sqrt{N_{min}}$ do
         $CS = kmeans(X_{min}, k)$ // Perform K-means clustering to get the clustering result
         $Coms = com_{cs}(CS)$ // Calculate the cluster compactness index by Equations (1) and (2)
         $Seps = sep_{cs}(CS)$ // Calculate the cluster separateness index by Equations (3)–(5)
         $CVIN = Coms / Seps$ // Calculate the cluster validity index by Equation (6)
         $k = k + 1$
end while
$k = \arg\min(CVIN)$
$CR = CS(\arg\min(CVIN))$ // Final clustering result
$Ratios = SRA(Coms(\arg\min(CVIN)))$ // Assign sampling ratios by the compactness index
for $i = 1$ to $k$ do
         $Samples = cluster_i$ // Get the sample points in the cluster
         $X_{new} = GOS(Samples, Ratios_i)$ // Gaussian distribution oversampling
         $X_{min} = X_{min} + X_{new}$
end for
$X = X_{min} + X_{maj}$
return X
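The sampling ratio assignment (SRA) step can be sketched as follows. The paper states only that clusters with a smaller compactness index receive a larger sampling ratio; the inverse-compactness weighting below is one plausible reading of that rule, and the function names are illustrative.

```python
import numpy as np

def allocate_ratios(compactness):
    # Assumption: weight each cluster by the inverse of its compactness
    # index, so tighter clusters (smaller index) receive larger ratios.
    inv = 1.0 / np.asarray(compactness, dtype=float)
    return inv / inv.sum()

def per_cluster_counts(compactness, n_maj, n_min):
    # Total number of new samples is N_maj - N_min, split by the ratios.
    ratios = allocate_ratios(compactness)
    total = n_maj - n_min
    counts = np.floor(ratios * total).astype(int)
    counts[np.argmax(ratios)] += total - counts.sum()  # absorb rounding drift
    return counts
```

Each count is then passed to the Gaussian oversampling step for the corresponding cluster, so the union of generated samples exactly balances the two classes.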

4. Experiments

4.1. Dataset

To verify the effectiveness of the algorithm proposed in this paper, 24 commonly used imbalanced datasets from the UCI Machine Learning Repository are used to conduct experiments. The categorical features of some datasets are transformed into continuous features through One-Hot Encoding. As shown in Table 1, the datasets differ in imbalance ratio (IR), number of attributes, number of samples, number of minority class samples, and noise and outlier ratio (NOR).
The IR measures the severity of class imbalance and is computed as
$$IR = \frac{N_{maj}}{N_{min}}$$
where N m a j is the number of majority class samples, and N m i n is the number of minority class samples.
The LOF (Local Outlier Factor) algorithm is employed in this paper to calculate the noise and outlier point ratio (NOR). Based on the principle of density difference, the algorithm computes the local outlier factors of points. It first separates outliers via an LOF threshold, then identifies noise points from the remaining points by a local reachability density threshold, and finally, generates statistics for the proportions of both.
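A hedged sketch of such an NOR computation with scikit-learn's `LocalOutlierFactor`; the paper's exact LOF and local reachability density thresholds are not given here, so the single cutoff of 1.5 below is an illustrative simplification that flags noise and outlier points together.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def noise_outlier_ratio(X, lof_threshold=1.5, n_neighbors=20):
    """Fraction of points whose LOF score exceeds the (assumed) threshold."""
    lof = LocalOutlierFactor(n_neighbors=min(n_neighbors, len(X) - 1))
    lof.fit(X)
    # scikit-learn stores the negated LOF; scores > 1 mean the point is
    # less dense than its neighbours
    scores = -lof.negative_outlier_factor_
    return float((scores > lof_threshold).mean())
```

For example, a tight cluster with one far-away point yields a small but nonzero ratio, since only the isolated point has a density far below that of its neighbors.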

4.2. Evaluation Measures

For the classification problem of imbalanced datasets, misclassifying the majority class and the minority class often incurs different costs. In many cases, the impact of minority class misclassification is more serious than that of majority class misclassification. If the model tends to predict all samples as the majority class, it may still obtain a high overall accuracy even though the minority class is never correctly predicted, but this does not mean that the model handles imbalanced data effectively. Therefore, the traditional evaluation metrics are no longer suitable for assessing the classification performance on imbalanced data.
To evaluate the classification performance on imbalanced data more reasonably, researchers have proposed evaluation methods such as the F-measure and G-mean based on the confusion matrix, as shown in Table 2.
Accuracy represents the proportion of samples that are correctly classified in the total sample, and the formula is as follows:
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
Precision represents the proportion of true minority samples among the samples predicted to be in the minority class, and the formula is as follows:
$$Precision = \frac{TP}{TP + FP}$$
Recall represents the proportion of true minority class samples that are correctly classified, and the formula is as follows:
$$Recall = \frac{TP}{TP + FN}$$
The F-measure, also known as the F1-Score, comprehensively considers the recall rate and precision rate of minority samples and reflects the classification situation of minority samples. Its formula is
$$F\text{-}measure = \frac{2 \times Recall \times Precision}{Recall + Precision}$$
The G-mean takes into account the success of the model in classifying both minority and majority class samples, and its formula is as follows:
$$G\text{-}mean = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}}$$
The ROC curve plots the true positive rate (TPR) on the vertical axis against the false positive rate (FPR) on the horizontal axis, showing how the two rates trade off at different thresholds. The better the model, the closer the ROC curve is to the top-left corner. TPR and FPR are calculated as follows:
$$TPR = \frac{TP}{TP + FN}$$
$$FPR = \frac{FP}{FP + TN}$$
AUC is the area under the ROC curve and reflects the performance of the classification model; a higher AUC value indicates better classification results.
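The confusion-matrix metrics above can be computed directly from the four cell counts, taking the minority class as the positive class:

```python
def imbalance_metrics(tp, tn, fp, fn):
    """Precision, recall (TPR), F-measure, and G-mean from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # also the TPR
    f_measure = 2 * recall * precision / (recall + precision)
    specificity = tn / (tn + fp)                 # TN rate, used by G-mean
    g_mean = (recall * specificity) ** 0.5
    return precision, recall, f_measure, g_mean
```

For instance, with TP = 40, FN = 10, TN = 90, FP = 10, precision and recall are both 0.8, so the F-measure is 0.8, while the G-mean is the square root of 0.8 × 0.9.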

4.3. Experiments and Results

4.3.1. Validating the Effectiveness and Efficiency of the CSK-Means Clustering Module

Before applying the proposed CSKGO oversampling algorithm, it is crucial to ensure that the underlying clustering module—CSK-means—can reliably capture the intrinsic structure of minority samples. To this end, we conduct a series of experiments to (1) determine the optimal hyperparameter β, (2) validate the superiority of the proposed CVIN index in identifying the true cluster number, (3) analyze the computational efficiency of CSK-means, and (4) visually demonstrate its ability to reveal local distribution characteristics. These validations collectively justify the use of CSK-means as the foundation for subsequent Gaussian distribution oversampling.
(1)
Threshold Selection of β
In the CSK-means algorithm, the inter-cluster separateness index needs to be calculated. This calculation requires counting the samples of other clusters within the radius $r$ neighborhood, and the value of $r$ is obtained by multiplying the distance between the two cluster centers by $\beta$. Therefore, the selection of $\beta$ also affects the final clustering result. To determine the optimal value of $\beta$, a comprehensive evaluation was conducted on 24 publicly available datasets, which vary in class imbalance severity, feature dimensionality, and sample size. A grid search over $\beta \in \{0.5, 0.6, 0.7, 0.8, 0.9\}$ combined with cross-validation was employed to identify the optimal $\beta$ value. For each combination of $\beta$ and dataset, 5-fold stratified cross-validation was used, resulting in a total of 24 datasets × 5 $\beta$ values × 5 folds = 600 experimental runs. The average performance of each key metric (accuracy, F-measure, G-mean, and AUC) across all datasets is shown in Figure 1a. It can be observed that $\beta = 0.7$ yields the highest average F-measure and G-mean while also maintaining a high accuracy and AUC, indicating an effective balance between precision and recall.
To further assess the stability and generalization capability of β = 0.7 —and to minimize the risk of overfitting as much as possible—the distribution of the F-measure for each β value across all 24 datasets is presented in Figure 1b. The box plot shows that β = 0.7 not only achieves the highest median F-measure but also exhibits the narrowest interquartile range (IQR), indicating that its performance is consistent and has minimal fluctuations on diverse datasets. In contrast, other β values display wider boxes or lower medians, suggesting greater sensitivity to dataset variations and a less stable performance. Based on the above result, considering its high average performance, optimal trade-off among evaluation metrics, and excellent stability across all datasets, β = 0.7 has been selected as the default threshold for the algorithm.
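The grid search protocol described above can be sketched as follows; `oversample` stands in for the CSKGO pipeline with a given $\beta$, and the classifier and scoring choices below are illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

def grid_search_beta(X, y, oversample, betas=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Return the beta with the highest 5-fold mean F-measure, plus all scores."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = {}
    for beta in betas:
        fold_scores = []
        for tr, te in skf.split(X, y):
            # oversample only the training folds; the test fold stays untouched
            X_bal, y_bal = oversample(X[tr], y[tr], beta)
            clf = KNeighborsClassifier().fit(X_bal, y_bal)
            fold_scores.append(f1_score(y[te], clf.predict(X[te])))
        scores[beta] = float(np.mean(fold_scores))
    best = max(scores, key=scores.get)
    return best, scores
```

Keeping the test folds free of synthetic samples, as in the paper's protocol, is what allows the selected $\beta$ to be judged on the original data distribution.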
(2)
Comparison of clustering results of each dataset before oversampling
To validate the effectiveness of the cluster validity index (CVIN) in the CSK-means algorithm, the following experiment is conducted, in which 24 datasets with different imbalance ratios are selected. As described in Section 3.1.2 of this paper, the minority classes of these datasets are clustered before oversampling them. The optimal number of clusters for the minority class is determined using the proposed cluster validity index (CVIN) and is compared against commonly used clustering validity indices: the Dunn index [30] (DI), silhouette coefficient [31] (SC), Davies–Bouldin index [32] (DB), and Calinski–Harabasz index [33] (CH). For each of the above indices, the optimal number of clusters is denoted $k_{opt}$, and the iterative search range is $[2, k_{max}]$, where $k_{max} = \mathrm{int}(\sqrt{n})$. For all the datasets, the optimal number of clusters $k_{opt}$ obtained by each index is shown in Table 3.
(3)
Computational Complexity Analysis
To evaluate the practicality of the CSK-means algorithm, its computational complexity was systematically analyzed and compared with that of K-means variants guided by mainstream clustering validity indices. Additionally, running time measurements on real datasets were provided.
(a)
Analysis of Computational Complexity
Assume the dataset contains $N$ samples with feature dimension $d$. CSK-means searches for the optimal number of clusters within the range $k \in \{2, \dots, \sqrt{N}\}$, performing K-means clustering once for each $k$ and computing the cluster validity index (CVIN). This index consists of two components: compactness and separateness.
Compactness index computational complexity:
After K-means converges, each sample is assigned to a cluster. Computing the distance from each sample to its cluster centroid requires O ( d ) operations, so the total cost is O ( N d ) . Then, the maximum value among the average compactness of k clusters is selected, requiring k 1 comparisons, which can be ignored. Therefore, the computational complexity of compactness is O ( N d ) .
Separateness index computational complexity:
To compute a single $sep(C_i, C_j)$, the process is divided into three steps: ① for any cluster $C_i$, first calculate the distance between its centroid and the centroid of another cluster $C_j$, denoted $dist(v_i, v_j)$, with complexity $O(d)$. ② Set a radius $r_{ij}$ and count the number of samples in $C_i$ whose distance to $v_j$ is less than $r_{ij}$; this requires $O(N_i d)$ operations, where $N_i$ is the number of samples in cluster $C_i$. ③ Computing the separateness between one cluster $C_i$ and all other $k-1$ clusters requires traversing $k-1$ times, resulting in a total complexity of $O(\sum_{j \ne i} N_i d) = O(k N_i d)$. Summing the separateness values across all $k$ clusters and taking the minimum results in a total complexity of $O(\sum_{i=1}^{k} k N_i d) = O(k N d)$.
Total complexity of CSK-means:
The time complexity of computing CVIN is $O(Nd + kNd) = O(kNd)$. The time complexity of the K-means algorithm is $O(kNT)$, where $T$ is the number of iterations. To find the optimal number of clusters, $k$ increases from 2 to $\sqrt{N}$, and the CSK-means computational complexity becomes $O(2NT + 2N + 3NT + 3N + \cdots + kNT + kN + \cdots + \sqrt{N}NT + \sqrt{N}N)$, which is $O(N^2 T)$.
Table 4 shows the comparison of the complexity of CSK-means (CVIN) with several methods guided by classic clustering validity indices. The DI and SC indices require calculating the distances between all sample pairs, and traversing all candidate values of $k$ drives their complexity as high as $O(N^2 \sqrt{N})$. In contrast, the complexities of CSK-means, DB, and CH are all of the order $O(N^2 T)$, the same as K-means and much lower than that of DI and SC.
(b)
Actual Operation Time Measurement and Analysis
In order to verify the above theoretical analysis, the running times of each method were measured on five real datasets with increasing sizes. As shown in Table 5, the running time of all methods was found to increase as the sample size increased. On the first four datasets of this experiment, the running times of the CSK-means algorithm ranked second among all the compared algorithms, demonstrating a very high computational efficiency. It is worth noting that on the fifth dataset used in this experiment (the largest in this experiment, with a sample size > 5000), the CSK-means algorithm achieved the shortest running time compared to all other compared methods (CSK-means required 171.05 s, while the Dunn index (DI), which is based on pairwise distance computations, required 284.89 s and exhibited significantly higher computational costs).
This result preliminarily indicates that this algorithm demonstrates distinct advantages when dealing with large-scale data, and its time complexity characteristics may exhibit superior scalability when processing more samples.
(4)
Visualizing the Effect of Clustering on Minority Samples in CSK-means
To visually demonstrate the ability of the CSK-means algorithm in capturing the structural features of minority class samples, we applied Principal Component Analysis (PCA) to reduce the dimensionality of the Haberman dataset, clustered the minority class samples after dimensionality reduction, and presented the results before and after clustering, as shown in Figure 2. It can be seen that the original minority class samples exhibit a distinct multi-peak distribution, with multiple local dense regions (Figure 2a). In contrast, CSK-means effectively identified these dense regions using the CVIN metric and classified them into clusters that are closely separated and have compact internal structures (Figure 2b). This result indicates that CSK-means can accurately capture the local distribution characteristics of minority class samples, providing reliable structural guidance for subsequent clustering-based oversampling and helping to generate synthetic samples that better match the real data distribution.

4.3.2. Comparison Analysis of Recall Before and After Oversampling

Specifically, the clusters obtained from the CSK-means clustering are used for (i) allocating the sampling rate; and (ii) generating synthetic samples using the Gaussian distribution. To verify the overall impact of this design and the effectiveness of the oversampling algorithm in improving the classification performance for imbalanced data, we compared the minority class recall rates of three classifiers before and after applying the CSKGO oversampling algorithm. In this experiment, three different classifiers were used: Random Forest, K-Nearest Neighbor (KNN), and SVM, to classify the datasets that were either unprocessed or processed by the oversampling algorithm proposed in this paper. Figure 3, Figure 4 and Figure 5, respectively, show the recall rates of the three classifiers for each dataset before and after oversampling; Table 6 presents the average recall rates (recall) of the three classifiers for all datasets before and after oversampling.
As shown in Table 6, oversampling yields a marked improvement in the average recall of the minority class across all three classifiers. Specifically, the average recall rises from 62.45% to 71.92% for RF, from 57.02% to 76.31% for KNN, and from 60.91% to 73.23% for SVM. These results demonstrate that oversampling effectively mitigates class imbalance by supplying more representative training samples for the minority class, thereby improving the models' ability to correctly identify minority class samples. Figure 3, Figure 4 and Figure 5 further show that recall rates for most minority classes improve substantially after oversampling, with the most pronounced gains observed for KNN. Although minor fluctuations occur in a few cases, the overall trend clearly validates the effectiveness of oversampling in mitigating class imbalance.

4.3.3. Comparison of Classification Results of Each Dataset After Oversampling

(1)
Visualization comparison of oversampling results of CSKGO algorithm and other algorithms on low-dimensional datasets
In order to visually display the data distribution effect of the CSKGO algorithm and other algorithms on the minority samples, the PCA method is used to reduce the dimensionality of the high-dimensional datasets to two-dimensional datasets and visualize them.
As examples of visualizing oversampling results, the Haberman dataset is selected in this experiment, the PCA method is used to reduce the dimensionality of the high-dimensional datasets to two-dimensional datasets, and the eight oversampling algorithms—SMOTE [9], Random-SMOTE (RSMOTE) [10], Borderline-SMOTE (BSMOTE) [11], Adaptive Synthetic Sampling (ASY) [12], Gaussian Distribution Oversampling (GOS) [29], Weighted Kernel SMOTE (WKS) [34], Fuzzy C-Means SMOTE (FCMS) [35], and the CSKGO proposed—are used to oversample the above datasets.
Figure 6 shows the data distribution after oversampling for the above dataset; the eight subfigures show, respectively, the visualization results after oversampling the minority class samples with the eight algorithms on the same dataset. In Figure 6, the blue sample points are the initial minority class samples, and the yellow sample points are the minority class samples newly generated by the oversampling algorithms.
As shown in Figure 6, the distribution of new sample points generated by the CSKGO algorithm is closer to the core area of the original minority class in the feature space, while some new sample points generated by the other algorithms deviate from the original minority class, which may introduce noise. This issue is particularly evident for the GOS algorithm: when noise points or outliers are present, new samples generated from the overall mean and variance will also contain unnecessary noise. In contrast, the CSKGO algorithm can effectively identify and eliminate noise points through clustering. By dividing the samples into small clusters, the local characteristics of the minority class can be captured more effectively, so the generated samples conform better to the distribution of the initial minority class sample points. The new sample points cover a wide range but do not diverge excessively, balancing sample diversity and safety. This shows that the sample distribution generated by the proposed CSKGO algorithm is more reasonable.
(2)
Comparison of classification results of CSKGO algorithm and other algorithms on high-dimensional datasets after oversampling
Next, experiments on high-dimensional datasets are conducted. To further validate the performance of the proposed CSKGO algorithm, CSKGO is compared with the SMOTE, RSMOTE, BSMOTE, ASY, GOS, WKS, FCMS, MOSIG, MLONC, and MLOS algorithms. Experiments are conducted on 24 datasets. The performance of the algorithms is validated using four evaluation metrics: accuracy, F-measure, G-mean, and AUC. In the experiments, Random Forest (RF), Support Vector Machine (SVM), and K-Nearest Neighbors (KNN) are used as classification models. The datasets are divided into an 80% training set and a 20% testing set for 5-fold cross-validation. For each dataset and each algorithm, the whole process is repeated five times, and the average result and standard deviation are calculated. In this experiment, the testing data are original data that have not been oversampled.
Table 7, Table 8 and Table 9 summarize the average performance of CSKGO and other oversampling algorithms on RF, KNN, and SVM classifiers. The covered metrics include accuracy, F-measure, G-mean, and AUC. Figure 7, Figure 8, Figure 9, Figure 10, Figure 11, Figure 12, Figure 13, Figure 14, Figure 15, Figure 16, Figure 17 and Figure 18 present the heatmap visualization results of each evaluation metric across all 24 datasets. Specifically, Figure 7, Figure 8, Figure 9 and Figure 10 correspond to the RF classifier, Figure 11, Figure 12, Figure 13 and Figure 14 correspond to the KNN classifier, and Figure 15, Figure 16, Figure 17 and Figure 18 correspond to the SVM. In each heatmap, the rows represent each dataset, and the columns represent different oversampling algorithms (including CSKGO), with the color intensity reflecting the values of the metrics (such as G-mean)—the darker the color, the better the performance. This visualization method helps to intuitively compare the performance differences in different algorithms on various imbalanced datasets.
Table 7 and Figure 7, Figure 8, Figure 9 and Figure 10 show the comprehensive performance comparison on the RF classifier, which indicates that CSKGO achieves the best results in all four key indicators: accuracy (93.75%), F-measure (72.54%), G-mean (72.54%), and AUC (93.70%). Particularly in the F-measure and G-mean, CSKGO significantly outperforms the second place (SMOTE), demonstrating its greater advantage in balancing the performance of the majority and minority classes. Additionally, its lower standard deviation (such as AUC ± 1.71) reflects good stability across different datasets.
Table 8 and Figure 11, Figure 12, Figure 13 and Figure 14 show the comprehensive performance comparison on the KNN classifier; although the accuracy of CSKGO (90.19%) is slightly lower than that of the original KNN model, its F-measure (66.11%) and G-mean (67.53%) are the highest among all methods, and the AUC (88.83%) is also significantly superior. This indicates that CSKGO successfully shifted the optimization goal from overall accuracy to sensitivity to the minority class, which better aligns with the core requirements of imbalanced learning.
For SVM, which is particularly sensitive to the imbalance issue (see Table 9 and Figure 15, Figure 16, Figure 17 and Figure 18), although CSKGO did not achieve the highest accuracy (MLOS scored highest), it still ranked highly on robustness indicators such as AUC (84.54%) and G-mean (61.48%). Overall, its performance was balanced, with no significant weaknesses.
In conclusion, CSKGO demonstrates excellent generalization ability and stability across various classifiers and diverse datasets. In particular, it consistently leads in the key metrics for evaluating imbalanced classification performance (F-measure, G-mean, AUC), fully demonstrating its effectiveness in generating high-quality synthetic samples and improving decision boundaries. Moreover, even on datasets with high imbalance ratios (such as kdd and kr, with IR exceeding 50), it still maintains a high classification performance.
From the experimental results, it can be seen that for certain datasets the performance of all algorithms is unsatisfactory: not only is the accuracy lower than 0.5, but the G-mean and F-measure are also significantly low. This indicates that these datasets have inherent flaws with respect to class separability. Such datasets (like wine, pok9, yea4) may be characterized by highly overlapping decision boundaries between classes. Moreover, for datasets (such as flare, zoo) that contain both numerical and categorical features, directly using standard distance metrics (such as the Euclidean distance) often fails to reasonably describe the semantic similarity between samples, because traditional distance functions do not account for the inherent scale and semantic differences between feature types. Even if the categorical variables are one-hot encoded, the dimension expansion, sparsity, and scale mismatch between numerical and binary features may distort distance calculations, thereby weakening the discriminative ability of distance-based classifiers. For these reasons, these datasets perform poorly under all 11 oversampling algorithms considered in this paper.

4.3.4. Non-Parametric Statistical Test

To verify that the performance differences between the proposed algorithm and other algorithms are not due to chance and are statistically significant, non-parametric statistical tests are performed. The Wilcoxon signed-rank test [36] is a non-parametric statistical test that is mainly used to compare the difference between two groups of paired samples. The test statistic is determined by calculating the sum of the positive and negative ranks, and the significance is determined according to the test statistic (one-tailed hypothesis with significance level α = 0.05). In this paper, the Wilcoxon signed-rank test is used to complete the non-parametric statistical test.
Table 10, Table 11 and Table 12 show the approximate p-values for pairwise comparisons of the classification results obtained by the different oversampling algorithms on the RF classifier. The symbol S+/S− indicates that the algorithm in the row performs significantly better/worse than the algorithm in the column; similarly, NS+/NS− indicates that the algorithm in the row performs better/worse than the algorithm in the column, but not significantly.
Table 13, Table 14 and Table 15 show the approximate p-values for pairwise comparisons of the classification results obtained by the different oversampling algorithms on the KNN classifier, and Table 16, Table 17 and Table 18 show the corresponding comparisons on the SVM classifier. The Wilcoxon signed-rank test makes it evident that CSKGO outperforms the other oversampling algorithms on the RF and KNN classifiers. This is particularly notable for the AUC, indicating that CSKGO ranks positive and negative samples better on imbalanced datasets.

4.3.5. Ablation Experiment

To validate the contribution of each component of the proposed CSKGO algorithm, an ablation experiment is conducted. The experiment verifies the effect of the K-means clustering module, the CSK-means clustering module, and the sampling ratio assignment (SRA) module on the final results. The original Gaussian distribution oversampling algorithm (GOS) serves as the baseline; adding K-means clustering to the baseline yields Ablation-I, replacing plain K-means with the CSK-means algorithm yields Ablation-II, and further adding the SRA module yields Ablation-III. The ablation experiment is performed on the 24 datasets with the KNN classifier, using the F-measure and G-mean as evaluation metrics. The results are as follows.
As can be seen in Table 19, Table 20 and Table 21, the F-measure and G-mean of Ablation-I are on average 0.85% and 0.37% higher, respectively, than those of the GOS algorithm. The F-measure and G-mean of Ablation-II are in turn on average 0.54% and 0.29% higher than those of Ablation-I, and those of Ablation-III are on average 1.73% and 1.58% higher than those of Ablation-II.
The results show that Ablation-I combines plain K-means clustering of the minority class with GOS; when the clustering result is poor, the local distribution characteristics of the minority class cannot be captured, so the generated samples fail to reflect the distribution of the real data and the classification performance ultimately degrades. In contrast, Ablation-II obtains the optimal number of clusters and better clustering results by adding the clustering validity index based on compactness and separateness, and the final classification performance improves markedly. This indicates that the CSK-means algorithm plays an important role in improving the GOS algorithm. Ablation-III further adds the compactness-based sampling ratio assignment on top of Ablation-II, generating more representative samples and improving the classification performance again. This shows that the compactness-based sampling ratio assignment also contributes to the performance of the GOS algorithm.
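The pipeline that these ablations build up (cluster the minority class, then fit and sample a Gaussian per cluster) can be sketched as follows. This is a simplified illustration only: plain scikit-learn K-means stands in for the paper's CSK-means, and the cluster count and per-cluster quota are fixed by hand rather than chosen by the CVIN index and the SRA module.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical minority-class samples (2-D for illustration).
X_min = rng.normal(size=(40, 2)) + np.array([[2.0, 2.0]])

# Step 1: cluster the minority class (plain K-means stands in for
# CSK-means; the optimal k would come from the CVIN index).
k = 2
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_min)

# Step 2: per cluster, estimate a Gaussian and draw synthetic samples,
# following the GOS idea of matching the local distribution.
n_new_per_cluster = 10  # fixed quota here; the paper assigns it by compactness
synthetic = []
for c in range(k):
    cluster = X_min[labels == c]
    mu = cluster.mean(axis=0)            # mean vector of the cluster
    cov = np.cov(cluster, rowvar=False)  # covariance matrix of the cluster
    synthetic.append(rng.multivariate_normal(mu, cov, size=n_new_per_cluster))
synthetic = np.vstack(synthetic)

print(synthetic.shape)  # (20, 2)
```

The synthetic samples would then be appended to the minority class until the two classes are balanced.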

5. Conclusions

To solve the problem of imbalanced data classification, the CSKGO algorithm is proposed in this paper. It iteratively searches for the optimal number of clusters of the minority class and performs clustering; the local distribution characteristics of the samples are obtained from the clusters, and the GOS algorithm is then used to generate new samples that conform to the distribution of the original minority class. In the search for the optimal number of clusters, the proposed CSK-means algorithm combines the compactness within clusters and the separateness between clusters, obtaining the optimal number of clusters by iteratively evaluating the clustering validity. Finally, new samples are generated from the clustered samples by the GOS algorithm. Experimental results on multiple benchmark datasets demonstrate that CSKGO generally achieves superior classification performance on imbalanced datasets compared with existing oversampling methods.
However, the algorithm exhibits limitations on certain datasets (such as fla, har, pok9, win, yea4, and zoo), where the F-measure and G-mean fall below 0.5. This degradation is likely due to extreme class imbalance, high dimensionality with limited samples, intrinsic class overlap, or the presence of categorical features, all of which hinder reliable clustering and meaningful sample generation. In addition, the current implementation involves repeated distance computations when evaluating inter-cluster separateness, which increases the computational overhead.
Future research will proceed in three directions. (1) To improve the computational efficiency and scalability of the algorithm in high-dimensional and large-scale scenarios, sampling-based approximation and efficient parallel computing will be integrated. On the one hand, instead of exhaustive neighborhood evaluation, approximate strategies such as random subsampling, stratified sampling, importance sampling, or locality-sensitive hashing (LSH) can markedly improve scalability at a controlled loss of accuracy. On the other hand, since inter-cluster separateness metrics usually rely on global or semi-global statistics, a multi-level parallel architecture will be designed, including distributed computation based on data partitioning, GPU/TPU acceleration of the core matrix and distance operations, and assigning each clustering configuration (such as different k values or initialization schemes) to an independent worker process for efficient parallel exploration of the hyperparameter space. (2) Extend the CSKGO algorithm to mixed-type or purely categorical data by introducing distance metrics or encoding strategies suited to non-numeric features, thereby enhancing its applicability to complex data. (3) Integrate complementary techniques such as cost-sensitive learning and ensemble learning to construct a more robust imbalanced classification framework for challenging scenarios such as extreme imbalance, high-dimensional sparsity, or highly overlapping classes.

Author Contributions

Conceptualization, W.X. and X.H.; methodology, W.X.; software, X.H.; validation, W.X. and X.H.; formal analysis, W.X.; investigation, X.H.; resources, W.X.; data curation, X.H.; writing—original draft preparation, W.X.; writing—review and editing, W.X. and X.H.; visualization, X.H.; supervision, W.X.; project administration, W.X.; funding acquisition, W.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Shaanxi Provincial Natural Science Foundation Project, grant number 2023-JC-YB-597, which also funded the APC.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in the UCI Machine Learning Repository at https://archive.ics.uci.edu/ (accessed on 12 December 2024).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhang, J.; Chen, L.; Tian, J.-X.; Abid, F.; Yang, W.; Tang, X.-F. Breast Cancer Diagnosis Using Cluster-based Undersampling and Boosted C5.0 Algorithm. Int. J. Control. Autom. Syst. 2021, 19, 1998–2008. [Google Scholar] [CrossRef]
  2. Sulaiman, S.; Ibraheem, I.; Hameed, S. Credit Card Fraud Detection Using Improved Deep Learning Models. Comput. Mater. Contin. 2024, 78, 1049–1069. [Google Scholar] [CrossRef]
  3. Jian, C.; Ao, Y.H. Imbalanced fault diagnosis based on semi-supervised ensemble learning. J. Intell. Manuf. 2022, 34, 3143–3158. [Google Scholar] [CrossRef]
  4. Wang, X.L.; Jin, Y.C.; Liu, W.W.; Wang, X.Y. KPCA-Based Under-Sampling Algorithm for Imbalanced Data. Adv. Appl. Math. 2024, 13, 4108–4118. [Google Scholar] [CrossRef]
  5. Zheng, H.Y. A New Cost-sensitive SVM Algorithm for Imbalanced Dataset. In Proceedings of the 2021 IEEE International Conference on Consumer Electronics and Computer Engineering (ICCECE), Guangzhou, China, 15–17 January 2021; pp. 402–407. [Google Scholar] [CrossRef]
  6. Boonchuay, K.; Sinapiromsaran, K.; Lursinsap, C. Decision tree induction based on minority entropy for the class imbalance problem. Pattern Anal. Appl. 2016, 20, 769–782. [Google Scholar] [CrossRef]
  7. Zhang, X.; He, Z.Q.; Yang, Y.Y. A fuzzy rough set-based undersampling approach for imbalanced data. Int. J. Mach. Learn. Cybern. 2024, 15, 2799–2810. [Google Scholar] [CrossRef]
  8. Bhattacharya, R.; De, R.; Chakraborty, A.; Sarkar, R. Clustering Based Undersampling for Effective Learning from Imbalanced Data: An Iterative Approach. SN Comput. Sci. 2024, 5, 386. [Google Scholar] [CrossRef]
  9. Chawla, N.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  10. Dong, Y.J. The Study on Random-SMOTE for the Classification of Imbalanced Data Sets. MA Thesis, Dalian University of Technology, Dalian, China, 2009. [Google Scholar]
  11. Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. Adv. Intell. Comput. 2005, 3644, 878–887. [Google Scholar] [CrossRef]
  12. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008; pp. 1322–1328. [Google Scholar] [CrossRef]
  13. López, V.; Fernández, A.; Moreno-Torres, J.G.; Herrera, F. Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics. Expert Syst. Appl. 2012, 39, 6585–6608. [Google Scholar] [CrossRef]
  14. Sun, Y.; Pahlavan, H.A.; Chattopadhyay, A.; Hassanzadeh, P.; Lubis, S.W.; Alexander, M.J.; Gerber, E.P.; Sheshadri, A.; Guan, Y. Data Imbalance, Uncertainty Quantification, and Transfer Learning in Data-Driven Parameterizations: Lessons from the Emulation of Gravity Wave Momentum Transport in WACCM. J. Adv. Model. Earth Syst. 2024, 16, e2023MS004145. [Google Scholar] [CrossRef]
  15. Wang, W.; Liu, F. ADDPC-SMOTE: An Oversampling Algorithm Based on Density Difference Peak Clustering and Spatial Distribution Entropy. IEEE Access 2023, 11, 108152–108166. [Google Scholar] [CrossRef]
  16. Bunkhumpornpat, C.; Sinapiromsaran, K.; Lursinsap, C. DBSMOTE: Density-based synthetic minority over-sampling technique. Appl. Intell.—APIN 2011, 36, 664–684. [Google Scholar] [CrossRef]
  17. Yan, Y.; Zheng, L.; Han, S.; Yu, C.; Zhou, P. Synthetic oversampling with Mahalanobis distance and local information for highly imbalanced class-overlapped data. Expert Syst. Appl. 2025, 260, 125422. [Google Scholar] [CrossRef]
  18. Tang, Y.; Zhou, Y.; Yang, C.; Du, Y.; Yang, M. Instance gravity oversampling method for software defect prediction. Inf. Softw. Technol. 2025, 179, 107657. [Google Scholar] [CrossRef]
  19. Yang, X.; Xue, Z.; Zhang, L.; Wu, J. An oversampling algorithm for high-dimensional imbalanced learning with class overlapping. Knowl. Inf. Syst. 2024, 67, 1915–1943. [Google Scholar] [CrossRef]
  20. Liu, B.; Zhou, A.; Wei, B.; Wang, J.; Tsoumakas, G. Oversampling multi-label data based on natural neighbor and label correlation. Expert Syst. Appl. 2025, 259, 125257. [Google Scholar] [CrossRef]
  21. Zhang, Y.; Deng, L.; Wei, B. Imbalanced Data Classification Based on Improved Random-SMOTE and Feature Standard Deviation. Mathematics 2024, 12, 1709. [Google Scholar] [CrossRef]
  22. Lv, Z.Z.; Liu, Q.C. Imbalanced Data Over-Sampling Method Based on ISODATA Clustering. IEICE Trans. Inf. Syst. 2023, E106.D, 1528–1536. [Google Scholar] [CrossRef]
  23. Zhao, X.Y.; Guan, S.; Xue, Y.; Pan, H. HS-CGK: A Hybrid Sampling Method for Imbalance Data Based on Conditional Tabular Generative Adversarial Network and K-Nearest Neighbor Algorithm. Comput. Inform. 2024, 43, 213–239. [Google Scholar] [CrossRef]
  24. Qin, Q.; Yang, Y.; Chen, M.; Wang, X. Improved SMOTE for Oversampling. J. Guilin Univ. Electron. Technol. 2022, 42, 53–59. [Google Scholar] [CrossRef]
  25. Chen, J.F.; Zheng, Z.T. Over-Sampling Method on Imbalanced Data Based on WKMeans and SMOTE. Comput. Eng. Appl. 2021, 57, 106–112. [Google Scholar] [CrossRef]
  26. Dong, H.C.; Wen, Z.; Wan, Y.; Yan, F. An imbalanced data classification algorithm based on DPC clustering resampling combined with ELM. Comput. Eng. Sci. 2021, 43, 1856–1863. [Google Scholar]
  27. Steinley, D. K-Means Clustering: A Half-Century Synthesis. Br. J. Math. Stat. Psychol. 2006, 59, 1–34. [Google Scholar] [CrossRef]
  28. Yu, H.; Mao, C.K. Automatic Three-way Decision Clustering Approach Based on K-means. J. Comput. Appl. 2016, 36, 2061–2065+2091. [Google Scholar]
  29. Hassan, M.M.; Eesa, A.S.; Mohammed, A.J.; Arabo, W.K. Oversampling Method Based on Gaussian Distribution and K-Means Clustering. Comput. Mater. Contin. 2021, 69, 451–469. [Google Scholar] [CrossRef]
  30. Bezdek, J.C.; Pal, N.R. Some new indexes of cluster validity. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 1998, 28, 301–315. [Google Scholar] [CrossRef] [PubMed]
  31. Rousseeuw, P. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65. [Google Scholar] [CrossRef]
  32. Davies, D.L.; Bouldin, D.W. A Cluster Separation Measure. IEEE Trans. Pattern Anal. Mach. Intell. 1979, PAMI-1, 224–227. [Google Scholar] [CrossRef]
  33. Caliński, T.; Harabasz, J. A Dendrite Method for Cluster Analysis. Commun. Stat.—Theory Methods 1974, 3, 1–27. [Google Scholar] [CrossRef]
  34. Choudhary, R.; Shukla, S. SMOTE Based Weighted Kernel Extreme Learning Machine for Imbalanced Classification Problems. In Internet of Things and Connected Technologies; Springer International Publishing: Cham, Switzerland, 2021; pp. 193–200. [Google Scholar]
  35. Zhou, H.; Tong, J.; Liu, Y.; Zheng, K.; Cao, C. An oversampling FCM-KSMOTE algorithm for imbalanced data classification. J. King Saud Univ.—Comput. Inf. Sci. 2024, 36, 102248. [Google Scholar] [CrossRef]
  36. Dutta, D.; Sil, J.; Dutta, P. A bi-phased multi-objective genetic algorithm based classifier. Expert Syst. Appl. 2020, 146, 113163. [Google Scholar] [CrossRef]
Figure 1. Comparison of performance when β takes different values on the RF classifier. (a) The average performance of each evaluation measure under different β values; (b) the distribution of the F-measure corresponding to different β values across all datasets.
Figure 2. Visualization results of the distribution of minority class samples before and after clustering. (a) Before clustering; (b) after clustering.
Figure 3. Comparison of the recall before and after oversampling on the RF classifier.
Figure 4. Comparison of the recall before and after oversampling on the KNN classifier.
Figure 5. Comparison of the recall before and after oversampling on the SVM classifier.
Figure 6. Visualization results of all algorithms for the dimensionality reduced Haberman dataset after oversampling.
Figure 7. The accuracy of various oversampling algorithms (including CSKGO) was compared using RF classifier on 24 datasets.
Figure 8. The F-measure of various oversampling algorithms (including CSKGO) was compared using RF classifier on 24 datasets.
Figure 9. The G-mean of various oversampling algorithms (including CSKGO) was compared using RF classifier on 24 datasets.
Figure 10. The AUC of various oversampling algorithms (including CSKGO) was compared using RF classifier on 24 datasets.
Figure 11. The accuracy of various oversampling algorithms (including CSKGO) was compared using KNN classifier on 24 datasets.
Figure 12. The F-measure of various oversampling algorithms (including CSKGO) was compared using KNN classifier on 24 datasets.
Figure 13. The G-mean of various oversampling algorithms (including CSKGO) was compared using KNN classifier on 24 datasets.
Figure 14. The AUC of various oversampling algorithms (including CSKGO) was compared using KNN classifier on 24 datasets.
Figure 15. The accuracy of various oversampling algorithms (including CSKGO) was compared using SVM classifier on 24 datasets.
Figure 16. The F-measure of various oversampling algorithms (including CSKGO) was compared using SVM classifier on 24 datasets.
Figure 17. The G-mean of various oversampling algorithms (including CSKGO) was compared using SVM classifier on 24 datasets.
Figure 18. The AUC of various oversampling algorithms (including CSKGO) was compared using SVM classifier on 24 datasets.
Table 1. Datasets.
| ID | Dataset | Attributes (R/I/N) | Examples | Minority | IR | NOR |
| fla | Flare-F | 11 (0/0/11) | 1066 | 43 | 23.79 | 19.23 |
| eco2 | ecoli2 | 7 (7/0/0) | 336 | 52 | 5.46 | 6.85 |
| eco1 | ecoli1 | 7 (7/0/0) | 336 | 77 | 3.36 | 6.85 |
| car | car-good | 6 (0/0/6) | 1728 | 69 | 24.04 | 0 |
| gla0 | glass0 | 9 (0/9/0) | 214 | 70 | 2.06 | 21.96 |
| har | harberman | 3 (0/3/0) | 306 | 81 | 2.78 | 1.96 |
| ion | ionosphere | 34 (34/0/0) | 351 | 126 | 1.79 | 25.36 |
| led | led7digit-0-2-4-5-6-7-8-9_vs_1 | 7 (7/0/0) | 443 | 37 | 10.97 | 19.64 |
| new | new-thyroid2 | 5 (4/1/0) | 215 | 35 | 5.14 | 26.51 |
| pag0 | page-block0 | 10 (4/6/0) | 5472 | 559 | 8.79 | 4.84 |
| pim | pima | 8 (8/0/0) | 768 | 268 | 1.87 | 4.17 |
| pok9 | poker-9_vs_7 | 10 (0/10/0) | 244 | 8 | 29.5 | 0 |
| seg0 | segment0 | 19 (19/0/0) | 2308 | 329 | 6.02 | 5.11 |
| veh | vehicle2 | 18 (0/18/0) | 846 | 218 | 2.88 | 1.42 |
| vow0 | vowel0 | 13 (10/3/0) | 988 | 90 | 9.98 | 0 |
| win | winequality-red-4 | 11 (11/0/0) | 1599 | 53 | 29.17 | 3.75 |
| wis | wisconsin | 9 (0/9/0) | 683 | 239 | 1.86 | 16.54 |
| yea1 | yeast1 | 8 (8/0/0) | 1484 | 429 | 2.46 | 5.46 |
| yea3 | yeast3 | 8 (8/0/0) | 1484 | 163 | 8.1 | 5.46 |
| yea4 | yeast4 | 8 (8/0/0) | 1484 | 51 | 28.1 | 5.46 |
| zoo | zoo-3 | 16 (0/0/16) | 101 | 5 | 19.2 | 0 |
| kdd | kddcup-rootkit-imap_vs_back | 41 (26/0/15) | 2225 | 22 | 100.14 | 20.9 |
| kr | kr-vs-k-zero_vs_fifteen | 6 (0/0/6) | 2193 | 27 | 80.22 | 1.23 |
| pok8 | poker-8-9_vs_6 | 10 (0/10/0) | 1485 | 25 | 58.4 | 0 |
Table 2. Confusion matrix.
| | Predict Positive | Predict Negative |
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |
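From the confusion-matrix entries in Table 2, the evaluation metrics used in this paper follow directly. The counts below are hypothetical and serve only to show the computation.

```python
import numpy as np

# Hypothetical confusion-matrix counts (TP, FN, FP, TN).
TP, FN, FP, TN = 30, 10, 5, 155

precision = TP / (TP + FP)
recall = TP / (TP + FN)          # true positive rate (sensitivity)
specificity = TN / (TN + FP)     # true negative rate

# F-measure: harmonic mean of precision and recall.
f_measure = 2 * precision * recall / (precision + recall)
# G-mean: geometric mean of sensitivity and specificity.
g_mean = np.sqrt(recall * specificity)

print(round(f_measure, 4), round(g_mean, 4))  # 0.8 0.8524
```

Both metrics penalize classifiers that sacrifice the minority class, which is why they are preferred over plain accuracy throughout the experiments.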
Table 3. The optimal number of clusters obtained by each clustering index for the minority classes in the datasets.
IDDISCCHDBCVIN
fla54434
eco232266
eco133376
car74447
gla022223
har53337
ion102292
led22332
new52424
pag0222035
pim123342
pok922222
seg013611715
veh923213
vow028787
win33454
wis5221313
yea16231618
yea3532119
yea464343
zoo22222
kdd22442
kr42444
pok822224
Table 4. Comparison of computational complexity of different clustering validity indices.
| Algorithm | Single-k Calculation Complexity | Total Complexity of the Complete Algorithm (Traversal of N) |
| (CVIN)CSK-means | O(kNT) | O(N²T) |
| (DI)K-means | O(N²) | O(N²·N) |
| (SC)K-means | O(N²) | O(N²·N) |
| (CH)K-means | O(kNT) | O(N²T) |
| (DB)K-means | O(kNT) | O(N²T) |
Table 5. Execution time comparison of cluster validity indices–guided K-means methods across datasets of varying sizes.
| Dataset | Number | (DI)K-means | (SC)K-means | (CH)K-means | (DB)K-means | CSK-means |
| new | 215 | 0.0621 | 0.1108 | 0.0387 | 0.0682 | 0.0519 |
| wis | 683 | 0.7653 | 0.4219 | 0.247 | 0.3087 | 0.29 |
| vow0 | 988 | 2.4929 | 0.9647 | 0.5788 | 0.6638 | 0.6322 |
| seg | 2308 | 29.5484 | 9.7842 | 6.2487 | 7.6397 | 6.3592 |
| pag0 | 5472 | 284.8947 | 188.7142 | 171.197 | 176.0794 | 171.0506 |
Table 6. Comparison of the average recall of minority classes before and after oversampling for all datasets in different classifiers.
| Classifier | Before Oversampling | After Oversampling |
| RF | 62.45 (±9.37) | 71.92 (±10.06) |
| KNN | 57.02 (±8.88) | 76.31 (±10.53) |
| SVM | 60.91 (±13.61) | 73.23 (±11.10) |
Table 7. Comparison of the average performance of CSKGO and oversampling algorithms on RF classifier.
| Method | Accuracy | F1 | G-mean | AUC |
| RF | 93.49 (±1.41) | 64.96 (±8.44) | 65.65 (±8.66) | 92.78 (±2.65) |
| SMOTE | 93.07 (±1.45) | 71.71 (±7.72) | 72.09 (±7.74) | 92.43 (±2.95) |
| RSMOTE | 93.27 (±1.68) | 70.42 (±9.27) | 70.88 (±9.07) | 92.83 (±2.26) |
| BSMOTE | 93.07 (±1.59) | 68.9 (±8.65) | 69.34 (±8.86) | 92.28 (±2.96) |
| ASY | 92.58 (±1.51) | 70.92 (±7.48) | 71.33 (±7.52) | 91.95 (±3.09) |
| GOS | 93.12 (±1.70) | 70.35 (±8.74) | 70.86 (±8.64) | 93.30 (±2.48) |
| WKS | 93.39 (±1.63) | 70.53 (±8.53) | 70.91 (±7.52) | 92.94 (±2.13) |
| FCMS | 93.55 (±1.47) | 69.11 (±8.62) | 69.8 (±7.52) | 92.99 (±2.31) |
| MOSIG | 93.85 (±1.49) | 66.88 (±9.42) | 67.95 (±9.81) | 93.06 (±2.11) |
| MLONC | 93.44 (±1.54) | 70.09 (±9.46) | 70.64 (±9.24) | 93.11 (±2.46) |
| MLOS | 93.65 (±1.31) | 64.82 (±8.40) | 65.58 (±8.59) | 92.93 (±2.54) |
| CSKGO | 93.75 (±1.43) | 68.36 (±8.53) | 72.54 (±8.44) | 93.70 (±1.71) |
Table 8. Comparison of the average performance of CSKGO and oversampling algorithms on KNN classifier.
| Method | Accuracy | F1 | G-mean | AUC |
| KNN | 92.03 (±1.73) | 59.99 (±8.47) | 60.77 (±8.39) | 86.15 (±5.15) |
| SMOTE | 88.91 (±2.07) | 62.91 (±7.14) | 64.43 (±7.38) | 86.32 (±4.85) |
| RSMOTE | 88.38 (±2.05) | 63.94 (±7.27) | 65.91 (±7.65) | 87.87 (±5.03) |
| BSMOTE | 89.60 (±2.18) | 62.90 (±8.27) | 64.05 (±8.40) | 85.57 (±4.90) |
| ASY | 88.73 (±2.09) | 63.00 (±7.31) | 64.51 (±7.53) | 86.69 (±4.82) |
| GOS | 89.35 (±2.01) | 64.34 (±7.02) | 65.93 (±7.10) | 88.55 (±3.98) |
| WKS | 89.21 (±1.75) | 63.69 (±6.37) | 65.51 (±6.59) | 87.72 (±4.57) |
| FCMS | 90.21 (±1.78) | 64.32 (±7.66) | 64.80 (±7.80) | 85.22 (±5.61) |
| MOSIG | 91.19 (±1.67) | 61.63 (±7.24) | 61.94 (±7.36) | 86.38 (±5.06) |
| MLONC | 90.70 (±1.68) | 65.50 (±7.72) | 66.37 (±7.91) | 86.75 (±5.10) |
| MLOS | 92.08 (±1.65) | 58.50 (±8.31) | 59.60 (±8.18) | 86.18 (±5.09) |
| CSKGO | 90.19 (±1.81) | 66.11 (±6.31) | 67.53 (±6.42) | 88.83 (±3.12) |
Table 9. Comparison of the average performance of CSKGO and oversampling algorithms on SVM classifier.
| Method | Accuracy | F1 | G-mean | AUC |
| SVM | 82.01 (±6.67) | 49.49 (±9.40) | 51.12 (±9.59) | 77.19 (±8.78) |
| SMOTE | 86.85 (±4.60) | 59.45 (±5.65) | 61.97 (±6.01) | 83.34 (±6.01) |
| RSMOTE | 85.03 (±3.85) | 58.28 (±5.67) | 61.27 (±5.93) | 83.96 (±5.45) |
| BSMOTE | 86.58 (±4.26) | 59.76 (±6.44) | 62.22 (±6.67) | 83.02 (±6.38) |
| ASY | 85.79 (±5.03) | 58.57 (±5.94) | 61.52 (±6.35) | 83.91 (±6.11) |
| GOS | 83.49 (±4.48) | 56.18 (±5.22) | 59.33 (±5.69) | 81.62 (±5.24) |
| WKS | 84.71 (±3.56) | 57.00 (±6.33) | 60.01 (±5.82) | 83.78 (±6.08) |
| FCMS | 90.08 (±3.03) | 60.54 (±5.41) | 61.72 (±7.91) | 82.40 (±6.74) |
| MOSIG | 86.29 (±4.33) | 57.30 (±5.92) | 59.39 (±6.41) | 84.85 (±5.42) |
| MLONC | 90.66 (±1.68) | 57.87 (±5.12) | 58.66 (±5.14) | 84.26 (±5.66) |
| MLOS | 91.92 (±1.18) | 51.36 (±5.15) | 51.25 (±5.17) | 82.60 (±5.53) |
| CSKGO | 88.28 (±4.11) | 59.35 (±6.39) | 61.48 (±6.70) | 84.54 (±6.21) |
Table 10. Pairwise comparison of the F-measure of oversampling algorithm by means of a Wilcoxon signed-rank test on RF classifier.
| | SMOTE | RSMOTE | BSMOTE | ASY | GOS | WKS | FCMS | MOSIG | MLONC | MLOS | CSKGO |
| SMOTE | - | 0.1379 | 0.2776 | 0.0035 | 0.0764 | 0.1446 | 0.1131 | 0.1003 | 0.0722 | 0.0026 | 0.1094 |
| RSMOTE | NS− | - | 0.3409 | 0.3483 | 0.3409 | 0.4247 | 0.4801 | 0.4091 | 0.3192 | 0.0392 | 0.0025 |
| BSMOTE | NS− | NS+ | - | 0.0681 | 0.2514 | 0.4091 | 0.3745 | 0.1020 | 0.3336 | 0.0048 | 0.0427 |
| ASY | S− | NS− | NS− | - | 0.3372 | 0.4721 | 0.4841 | 0.2266 | 0.3446 | 0.0166 | 0.0294 |
| GOS | NS− | NS− | NS− | NS− | - | 0.3192 | 0.4286 | 0.4364 | 0.4364 | 0.0262 | 0.0212 |
| WKS | NS− | NS− | NS− | NS+ | NS+ | - | 0.4364 | 0.1685 | 0.4052 | 0.0139 | 0.0202 |
| FCMS | NS− | NS+ | NS− | NS− | NS+ | NS− | - | 0.1867 | 0.3783 | 0.0294 | 0.1170 |
| MOSIG | NS− | NS− | NS− | NS− | NS− | NS− | NS− | - | 0.2483 | 0.0183 | 0.0268 |
| MLONC | NS− | NS− | NS− | NS− | NS+ | NS− | NS− | NS+ | - | 0.0026 | 0.0143 |
| MLOS | S− | S− | S− | S− | S− | S− | S− | S− | S− | - | 0.00307 |
| CSKGO | NS+ | S+ | S+ | S+ | S+ | S+ | NS+ | S+ | S+ | S+ | - |
Table 11. Pairwise comparison of the G-mean of oversampling algorithm by means of a Wilcoxon signed-rank test on RF classifier.
| | SMOTE | RSMOTE | BSMOTE | ASY | GOS | WKS | FCMS | MOSIG | MLONC | MLOS | CSKGO |
| SMOTE | - | 0.1292 | 0.3483 | 0.0042 | 0.0571 | 0.1611 | 0.1660 | 0.119 | 0.07636 | 0.0028 | 0.0301 |
| RSMOTE | NS− | - | 0.3557 | 0.3192 | 0.2946 | 0.3897 | 0.4801 | 0.46414 | 0.32997 | 0.04272 | 0.0002 |
| BSMOTE | NS− | NS+ | - | 0.0778 | 0.2514 | 0.4247 | 0.3745 | 0.12924 | 0.30153 | 0.0048 | 0.0076 |
| ASY | S− | NS− | NS− | - | 0.2843 | 0.4920 | 0.4681 | 0.25143 | 0.3707 | 0.01539 | 0.0076 |
| GOS | NS− | NS− | NS− | NS− | - | 0.3409 | 0.3336 | 0.40905 | 0.43644 | 0.07493 | 0.0054 |
| WKS | NS− | NS− | NS− | NS+ | NS+ | - | 0.4364 | 0.2177 | 0.36393 | 0.02938 | 0.0026 |
| FCMS | NS− | NS+ | NS− | NS− | NS+ | NS− | - | 0.20327 | 0.44038 | 0.02938 | 0.0918 |
| MOSIG | NS− | NS− | NS− | NS− | NS+ | NS− | NS− | - | 0.32636 | 0.01659 | 0.00639 |
| MLONC | NS− | NS− | NS− | NS− | NS+ | NS− | NS− | NS+ | - | 0.00391 | 0.00272 |
| MLOS | S− | S− | S− | S− | NS− | S− | S− | S− | S− | - | 0.00045 |
| CSKGO | S+ | S+ | S+ | S+ | S+ | S+ | NS+ | S+ | S+ | S+ | - |
Table 12. Pairwise comparison of the AUC of oversampling algorithm by means of a Wilcoxon signed-rank test on RF classifier.
| | SMOTE | RSMOTE | BSMOTE | ASY | GOS | WKS | FCMS | MOSIG | MLONC | MLOS | CSKGO |
| SMOTE | - | 0.2946 | 0.0681 | 0.0015 | 0.2946 | 0.4404 | 0.2389 | 0.40517 | 0.19766 | 0.48006 | 0.0032 |
| RSMOTE | NS− | - | 0.1379 | 0.0571 | 0.0233 | 0.3707 | 0.3821 | 0.22065 | 0.38209 | 0.46414 | 0.0005 |
| BSMOTE | NS− | NS− | - | 0.2946 | 0.0250 | 0.0951 | 0.1446 | 0.07215 | 0.2327 | 0.22663 | 0.0015 |
| ASY | S− | NS− | NS− | - | 0.0113 | 0.0571 | 0.0495 | 0.02118 | 0.15625 | 0.17361 | 0.0006 |
| GOS | NS+ | S+ | S+ | S+ | - | 0.0793 | 0.3015 | 0.22363 | 0.26109 | 0.25463 | 0.0104 |
| WKS | NS+ | NS+ | NS+ | NS+ | NS− | - | 0.3821 | 0.26763 | 0.34458 | 0.2946 | 0.0010 |
| FCMS | NS+ | NS+ | NS+ | S+ | NS− | NS− | - | 0.39743 | 0.44828 | 0.37828 | 0.0099 |
| MOSIG | NS+ | NS+ | NS+ | S+ | NS− | NS− | NS− | - | 0.09012 | 0.21186 | 0.00402 |
| MLONC | NS− | NS+ | NS+ | NS+ | NS− | NS− | NS− | NS− | - | 0.3409 | 0.00695 |
| MLOS | NS− | NS+ | NS+ | NS+ | NS− | NS− | NS− | NS− | NS− | - | 0.00695 |
| CSKGO | S+ | S+ | S+ | S+ | S+ | S+ | S+ | S+ | S+ | S+ | - |
Table 13. Pairwise comparison of the F-measure of oversampling algorithm by means of a Wilcoxon signed-rank test on KNN classifier.
| | SMOTE | RSMOTE | BSMOTE | ASY | GOS | WKS | FCMS | MOSIG | MLONC | MLOS | CSKGO |
| SMOTE | - | 0.4801 | 0.4920 | 0.4841 | 0.2389 | 0.4562 | 0.2061 | 0.3336 | 0.02938 | 0.01426 | 0.0016 |
| RSMOTE | NS− | - | 0.1057 | 0.1867 | 0.2119 | 0.4641 | 0.4052 | 0.18141 | 0.06178 | 0.00539 | 0.0023 |
| BSMOTE | NS+ | NS− | - | 0.3639 | 0.1736 | 0.1736 | 0.3707 | 0.4562 | 0.01321 | 0.01618 | 0.0008 |
| ASY | NS+ | NS− | NS+ | - | 0.1057 | 0.3783 | 0.3264 | 0.38209 | 0.02169 | 0.01426 | 0.0054 |
| GOS | NS+ | NS+ | NS+ | NS+ | - | 0.0951 | 0.3446 | 0.17361 | 0.15151 | 0.00466 | 0.0154 |
| WKS | NS+ | NS+ | NS+ | NS+ | NS− | - | 0.3015 | 0.37828 | 0.04006 | 0.00139 | 0.0014 |
| FCMS | NS+ | NS+ | NS+ | NS+ | NS− | NS+ | - | 0.14457 | 0.12302 | 0.00368 | 0.0409 |
| MOSIG | NS− | NS− | NS− | NS− | NS− | NS− | NS− | - | 0.00714 | 0.00714 | 0.00494 |
| MLONC | S+ | NS+ | S+ | S+ | NS+ | S+ | NS+ | S+ | - | 0.00062 | 0.16109 |
| MLOS | S− | S− | S− | S− | S− | S− | S− | S− | S− | - | 0.00014 |
| CSKGO | S+ | S+ | S+ | S+ | S+ | S+ | S+ | S+ | NS+ | S+ | - |
Table 14. Pairwise comparison of the G-mean of oversampling algorithm by means of a Wilcoxon signed-rank test on KNN classifier.
| | SMOTE | RSMOTE | BSMOTE | ASY | GOS | WKS | FCMS | MOSIG | MLONC | MLOS | CSKGO |
| SMOTE | - | 0.2090 | 0.4522 | 0.4325 | 0.3409 | 0.2676 | 0.3821 | 0.08534 | 0.05705 | 0.00842 | 0.0029 |
| RSMOTE | NS+ | - | 0.0475 | 0.0749 | 0.4404 | 0.4522 | 0.1762 | 0.01659 | 0.19489 | 0.00368 | 0.0132 |
| BSMOTE | NS+ | S− | - | 0.2514 | 0.2546 | 0.1660 | 0.4801 | 0.22363 | 0.0268 | 0.01786 | 0.0011 |
| ASY | NS+ | NS− | NS+ | - | 0.2810 | 0.2743 | 0.3015 | 0.119 | 0.02938 | 0.00939 | 0.0037 |
| GOS | NS+ | NS+ | NS+ | NS+ | - | 0.2389 | 0.3228 | 0.06057 | 0.32276 | 0.00695 | 0.0239 |
| WKS | NS+ | NS− | NS+ | NS+ | NS− | - | 0.4286 | 0.10565 | 0.13136 | 0.00159 | 0.0034 |
| FCMS | NS− | NS− | NS+ | NS− | NS− | NS− | - | 0.14457 | 0.05262 | 0.00539 | 0.0212 |
| MOSIG | NS− | S− | NS− | NS− | NS− | NS− | NS− | - | 0.00317 | 0.01321 | 0.00164 |
| MLONC | NS+ | NS+ | S+ | S+ | NS+ | NS+ | NS+ | S+ | - | 0.00039 | 0.05155 |
| MLOS | S− | S− | S− | S− | S− | S− | S− | S− | S− | - | 0.00019 |
| CSKGO | S+ | S+ | S+ | S+ | S+ | S+ | S+ | S+ | NS+ | S+ | - |
Table 15. Pairwise comparison of the AUC of oversampling algorithm by means of a Wilcoxon signed-rank test on KNN classifier.
| | SMOTE | RSMOTE | BSMOTE | ASY | GOS | WKS | FCMS | MOSIG | MLONC | MLOS | CSKGO |
| SMOTE | - | 0.0427 | 0.0150 | 0.4247 | 0.0427 | 0.0183 | 0.0023 | 0.18673 | 0.12924 | 0.22663 | 0.0029 |
| RSMOTE | S+ | - | 0.0021 | 0.0071 | 0.1230 | 0.4841 | 0.0001 | 0.00657 | 0.03288 | 0.017 | 0.0314 |
| BSMOTE | S− | S− | - | 0.0287 | 0.0055 | 0.0010 | 0.0183 | 0.26109 | 0.07927 | 0.11507 | 0.0006 |
| ASY | NS− | S− | S+ | - | 0.0202 | 0.0062 | 0.0016 | 0.16853 | 0.2177 | 0.26109 | 0.0048 |
| GOS | S+ | NS+ | S+ | S+ | - | 0.1867 | 0.0001 | 0.00866 | 0.02872 | 0.02169 | 0.4920 |
| WKS | S+ | NS− | S+ | S+ | NS− | - | 0.0000 | 0.0113 | 0.015 | 0.04457 | 0.0136 |
| FCMS | S− | S− | S− | S− | S− | S− | - | 0.00587 | 0.00052 | 0.01923 | 0.0001 |
| MOSIG | NS− | S− | NS+ | NS− | S− | S− | S+ | - | 0.05821 | 0.38974 | 0.00031 |
| MLONC | NS− | S− | NS− | NS− | S− | S− | S+ | NS+ | - | 0.23885 | 0.00233 |
| MLOS | NS− | S− | NS+ | NS− | S− | S− | S+ | NS− | NS− | - | 0.00131 |
| CSKGO | S+ | S+ | S+ | S+ | NS− | S+ | S+ | S+ | S+ | S+ | - |
Table 16. Pairwise comparison of the F-measure of oversampling algorithm by means of a Wilcoxon signed-rank test on SVM classifier.
| | SMOTE | RSMOTE | BSMOTE | ASY | GOS | WKS | FCMS | MOSIG | MLONC | MLOS | CSKGO |
| SMOTE | - | 0.1314 | 0.3192 | 0.0202 | 0.0016 | 0.0122 | 0.3974 | 0.2327 | 0.40905 | 0.00181 | 0.4168 |
| RSMOTE | NS− | - | 0.2709 | 0.2743 | 0.1335 | 0.0217 | 0.1762 | 0.48006 | 0.4562 | 0.00695 | 0.2644 |
| BSMOTE | NS+ | NS+ | - | 0.0336 | 0.0143 | 0.0104 | 0.2207 | 0.36393 | 0.46414 | 0.00539 | 0.1762 |
| ASY | S− | NS− | S− | - | 0.0475 | 0.1210 | 0.0722 | 0.42858 | 0.26109 | 0.01539 | 0.1539 |
| GOS | S− | NS− | S− | S− | - | 0.4168 | 0.0268 | 0.08851 | 0.17619 | 0.04093 | 0.0036 |
| WKS | S− | S− | S− | NS− | NS+ | - | 0.0162 | 0.05821 | 0.26109 | 0.00939 | 0.0048 |
| FCMS | NS+ | NS+ | NS+ | NS+ | S+ | S+ | - | 0.10935 | 0.32636 | 0.00097 | 0.1539 |
| MOSIG | NS− | NS− | NS− | NS+ | NS+ | NS+ | NS− | - | 0.40905 | 0.01578 | 0.19489 |
| MLONC | NS− | NS− | NS+ | NS+ | NS+ | NS+ | NS− | NS+ | - | 0.01743 | 0.46812 |
| MLOS | S− | S− | S− | S− | S− | S− | S− | S− | S− | - | 0.00164 |
| CSKGO | NS− | NS+ | NS− | NS+ | S+ | S+ | NS− | NS+ | NS+ | S+ | - |
Table 17. Pairwise comparison of the G-mean of the oversampling algorithms by means of the Wilcoxon signed-rank test on the SVM classifier.
SMOTE  RSMOTE  BSMOTE  ASY  GOS  WKS  FCMS  MOSIG  MLONC  MLOS  CSKGO
SMOTE  -  0.3897  0.2743  0.1379  0.0126  0.0162  0.4052  0.119  0.27425  0.00149  0.3085
RSMOTE  NS−  -  0.5000  0.3557  0.0793  0.0192  0.4562  0.14007  0.18673  0.0044  0.3594
BSMOTE  NS+  NS−  -  0.0722  0.0183  0.0146  0.4247  0.21186  0.22663  0.00368  0.1949
ASY  NS−  NS−  NS−  -  0.0446  0.0853  0.3974  0.3409  0.43644  0.00842  0.3783
GOS  S−  NS−  S−  S−  -  0.4286  0.1251  0.17619  0.36393  0.01539  0.0329
WKS  S−  S−  S−  NS−  NS+  -  0.1057  0.18673  0.5  0.00554  0.0505
FCMS  NS−  NS−  NS−  NS+  NS+  NS+  -  0.25143  0.27425  0.00131  0.3557
MOSIG  NS−  NS−  NS−  NS−  NS+  NS+  NS−  -  0.40517  0.00289  0.25143
MLONC  NS−  NS−  NS−  NS−  NS+  NS+  NS−  NS−  -  0.00776  0.35569
MLOS  S−  S−  S−  S−  S−  S−  S−  S−  S−  -  0.00149
CSKGO  NS−  NS−  NS−  NS+  S+  NS+  NS−  NS+  NS+  S+  -
Table 18. Pairwise comparison of the AUC of the oversampling algorithms by means of the Wilcoxon signed-rank test on the SVM classifier.
SMOTE  RSMOTE  BSMOTE  ASY  GOS  WKS  FCMS  MOSIG  MLONC  MLOS  CSKGO
SMOTE  -  0.1660  0.0174  0.3520  0.3409  0.3974  0.2946  0.38209  0.14007  0.1563  0.1515
RSMOTE  NS+  -  0.0031  0.0630  0.2776  0.2420  0.1401  0.13786  0.02938  0.12302  0.2358
BSMOTE  S−  S−  -  0.1094  0.0537  0.0244  0.4052  0.18406  0.2177  0.48405  0.0044
ASY  NS−  NS−  NS+  -  0.2119  0.2877  0.2327  0.41294  0.42465  0.32636  0.0455
GOS  NS+  NS−  NS−  NS+  -  0.2451  0.2514  0.46017  0.18141  0.28434  0.1446
WKS  NS+  NS−  S+  NS+  NS−  -  0.2546  0.31207  0.20045  0.24825  0.0735
FCMS  NS−  NS−  NS−  NS−  NS−  NS−  -  0.28434  0.49202  0.46017  0.1131
MOSIG  NS−  NS−  NS+  NS+  NS+  NS−  NS+  -  0.23885  0.26763  0.16853
MLONC  NS−  S−  NS+  NS+  NS−  NS−  NS+  NS+  -  0.48405  0.0778
MLOS  NS−  NS−  NS−  NS−  NS−  NS−  NS+  NS−  NS−  -  0.2177
CSKGO  NS+  NS+  S+  S+  NS+  NS+  NS+  NS+  NS+  NS+  -
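The S+/S−/NS+/NS− entries in the tables above can be read as combining significance at the 0.05 level (S vs. NS) with the direction of the performance difference (+ = row algorithm better, − = worse); the numeric upper-triangle entries are the Wilcoxon signed-rank p-values over the per-dataset scores. The sketch below shows the idea with a pure-Python exact test; it is an illustration only (it assumes no zero differences and no ties in the absolute differences, and uses hypothetical score differences), not the authors' evaluation code. Library routines such as scipy.stats.wilcoxon handle ties and large samples properly.

```python
from itertools import product

def wilcoxon_exact_p(diffs):
    """Exact two-sided Wilcoxon signed-rank p-value for paired differences.

    Simplifying assumptions: no zero differences, no ties in |diffs|.
    Feasible for small n (enumerates all 2**n sign assignments).
    """
    n = len(diffs)
    # rank the differences by absolute value (ranks 1..n)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    rank = {i: r + 1 for r, i in enumerate(order)}
    w = sum(rank[i] for i in range(n) if diffs[i] > 0)  # observed statistic
    mu = n * (n + 1) / 4                                # mean of W under H0
    dev = abs(w - mu)
    # exact null distribution: every sign pattern is equally likely under H0
    hits = sum(
        1
        for signs in product((0, 1), repeat=n)
        if abs(sum(r for r, s in zip(range(1, n + 1), signs) if s) - mu) >= dev - 1e-9
    )
    return hits / 2 ** n

def significance_label(p, median_diff, alpha=0.05):
    """Map a p-value and difference direction onto the table notation
    (ASCII +/- here): S+/S- significant win/loss, NS+/NS- otherwise."""
    return ("S" if p < alpha else "NS") + ("+" if median_diff > 0 else "-")

# hypothetical per-dataset AUC differences (CSKGO minus a competitor)
diffs = [0.03, 0.05, 0.01, 0.04, 0.02, 0.06, 0.07, 0.08]
p = wilcoxon_exact_p(diffs)
print(significance_label(p, sum(diffs) / len(diffs)))  # prints "S+"
```

For all eight differences positive, the observed statistic is the maximum possible, so the exact two-sided p-value is 2/2^8 = 0.0078125, well below 0.05.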
Table 19. Comparison of F-measure results of three algorithms on imbalanced datasets.
ID  Baseline  Ablation-I  Ablation-II  Ablation-III
fla  19.43  20.66  18.40  21.37
eco2  80.58  81.68  82.67  83.75
eco1  79.26  80.85  78.67  81.87
car  34.64  33.97  50.91  54.38
gla0  69.77  69.36  71.34  71.65
har  42.10  42.51  36.21  40.09
ion  73.18  73.82  75.19  75.69
led  72.51  71.93  73.03  74.94
new  89.77  91.10  91.62  91.21
pag0  80.94  80.09  80.07  80.39
pim  57.07  56.33  57.78  58.45
pok9  33.33  34.67  36.00  37.33
seg0  91.65  92.57  95.47  95.74
veh  84.62  84.14  81.39  82.96
vow0  100.00  100.00  100.00  100.00
win  7.08  6.39  12.11  14.37
wis  96.27  95.83  95.86  96.48
yea1  57.97  55.20  58.06  59.50
yea2  72.50  72.97  74.54  75.16
yea3  29.58  33.86  33.81  34.12
zoo  38.00  40.00  30.00  43.33
kdd  94.29  97.14  100.00  97.14
kr  100.00  100.00  100.00  100.00
pok8  9.36  19.31  14.09  18.89
Ave.  63.08  63.93  64.47  66.20
Table 20. Comparison of G-mean results of three algorithms on imbalanced datasets.
ID  Baseline  Ablation-I  Ablation-II  Ablation-III
fla  20.73  21.71  19.12  22.27
eco2  81.70  82.53  83.45  84.56
eco1  79.73  81.18  79.01  82.36
car  45.76  45.23  58.16  60.57
gla0  70.70  70.27  72.16  72.01
har  42.39  42.94  36.49  40.72
ion  75.37  75.91  77.06  77.50
led  73.44  72.95  73.71  75.53
new  90.30  91.60  91.70  91.32
pag0  80.96  80.15  80.11  80.43
pim  57.13  56.35  57.81  58.50
pok9  35.69  37.07  39.42  40.47
seg0  91.94  92.82  95.54  95.80
veh  84.72  84.27  81.47  83.13
vow0  100.00  100.00  100.00  100.00
win  9.30  8.46  16.92  19.43
wis  96.28  95.85  95.88  96.48
yea1  58.09  55.27  58.28  59.73
yea2  73.65  73.81  75.18  75.95
yea3  34.41  39.17  39.15  37.19
zoo  41.55  43.09  31.55  45.69
kdd  94.64  97.32  100.00  97.32
kr  100.00  100.00  100.00  100.00
pok8  28.92  28.25  21.22  24.35
Ave.  65.31  65.68  65.97  67.55
Table 21. Comparison of the average results of the three algorithms on the evaluation metrics.
  F-Measure  G-Mean
Baseline  63.08  65.31
+K-means  63.93  65.68
+CSK-means  64.47  65.97
+SRA  66.20  67.55
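The "Ave." rows in Tables 19 and 20, and hence the entries of Table 21, are plain arithmetic means over the 24 datasets. As a quick check, averaging the Ablation-III (+SRA) F-measure column of Table 19 reproduces the reported 66.20:

```python
# F-measure values from the Ablation-III column of Table 19, in row order
ablation_iii = [
    21.37, 83.75, 81.87, 54.38, 71.65, 40.09, 75.69, 74.94,
    91.21, 80.39, 58.45, 37.33, 95.74, 82.96, 100.00, 14.37,
    96.48, 59.50, 75.16, 34.12, 43.33, 97.14, 100.00, 18.89,
]
average = sum(ablation_iii) / len(ablation_iii)
print(round(average, 2))  # 66.2, matching the "Ave." row of Table 19
```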
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Xie, W.; Huang, X. Oversampling Algorithm Based on Improved K-Means and Gaussian Distribution. Information 2026, 17, 28. https://doi.org/10.3390/info17010028



