This is an early access version, the complete PDF, HTML, and XML versions will be available soon.
Article

Oversampling Algorithm Based on Improved K-Means and Gaussian Distribution

1
School of Science, Xi’an Shiyou University, Xi’an 710065, China
2
School of Computer Science, Xi’an Shiyou University, Xi’an 710065, China
*
Authors to whom correspondence should be addressed.
Information 2026, 17(1), 28; https://doi.org/10.3390/info17010028
Submission received: 9 November 2025 / Revised: 13 December 2025 / Accepted: 19 December 2025 / Published: 31 December 2025

Abstract

Oversampling is a common and effective way to address the classification of imbalanced data. Traditional oversampling methods, however, tend to generate overlapping or noisy samples. Clustering alleviates these problems to some extent, but the quality of the clustering results strongly affects the final classification performance. To address this, this paper proposes CSKGO, an oversampling algorithm that combines Gaussian distribution oversampling with a K-means clustering algorithm based on compactness and separateness. The algorithm first clusters the minority samples with this K-means variant: it constructs a cluster compactness index and an inter-cluster separateness index to determine the optimal number of clusters and the clustering result, which captures the local distribution characteristics of the minority samples. Second, a sampling ratio is assigned to each cluster according to its compactness, which determines how many samples to generate for each minority cluster. Third, the mean vector and covariance matrix of each cluster are calculated, and Gaussian distribution oversampling generates new samples that match the feature distribution of the real minority samples; these are combined with the majority samples to form balanced data. To verify the effectiveness of the proposed algorithm, 24 datasets were selected from the University of California Irvine (UCI) Repository and oversampled with CSKGO and with other oversampling algorithms. The resulting datasets were then classified using Random Forest, Support Vector Machine, and K-Nearest Neighbor classifiers. The results indicate that the proposed algorithm achieves higher accuracy, F-measure, G-mean, and AUC values, and can effectively improve classification performance on imbalanced datasets.
Keywords: imbalanced data; oversampling algorithm; Gaussian distribution; K-means
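
The abstract names the steps of the pipeline but not the exact formulas behind them, so the Python sketch below is only an illustration under stated assumptions, not the authors' CSKGO implementation. Everything it defines is hypothetical: the compactness index (mean distance to the cluster mean), the separateness index (smallest distance between cluster means), the rule for picking the number of clusters (maximize separateness over mean compactness), and the allocation rule (weights proportional to 1 / (1 + compactness), so tighter clusters receive more samples). The function names `kmeans`, `compactness`, `separateness`, `choose_clustering`, and `gaussian_oversample` are likewise invented for this example.

```python
import numpy as np

def kmeans(X, k, rng, n_iter=100):
    """Plain Lloyd's k-means; returns the cluster label of each row of X."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Recompute each center; an empty cluster keeps its previous center.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def compactness(X, labels):
    """Assumed index: mean distance of a cluster's points to its mean."""
    ids = np.unique(labels)
    comp = np.array([
        np.linalg.norm(X[labels == j] - X[labels == j].mean(axis=0), axis=1).mean()
        for j in ids
    ])
    return ids, comp

def separateness(X, labels):
    """Assumed index: smallest distance between two cluster means."""
    means = np.array([X[labels == j].mean(axis=0) for j in np.unique(labels)])
    if len(means) < 2:
        return 0.0
    d = np.linalg.norm(means[:, None, :] - means[None, :, :], axis=2)
    return d[np.triu_indices(len(means), k=1)].min()

def choose_clustering(X, k_max, rng, eps=1e-12):
    """Try k = 2..k_max and keep the labels with the best assumed score."""
    best_score, best_labels = -np.inf, None
    for k in range(2, k_max + 1):
        labels = kmeans(X, k, rng)
        _, comp = compactness(X, labels)
        score = separateness(X, labels) / (comp.mean() + eps)
        if score > best_score:
            best_score, best_labels = score, labels
    return best_labels

def gaussian_oversample(X_min, n_new, k_max=5, seed=0, ridge=1e-6):
    """Draw n_new synthetic minority samples from per-cluster Gaussians."""
    rng = np.random.default_rng(seed)
    labels = choose_clustering(X_min, k_max, rng)
    ids, comp = compactness(X_min, labels)
    weights = 1.0 / (1.0 + comp)          # assumption: tighter clusters get more
    weights /= weights.sum()
    counts = np.floor(weights * n_new).astype(int)
    # Give the leftover samples to the largest fractional parts.
    leftover = int(n_new - counts.sum())
    order = np.argsort(weights * n_new - counts)[::-1]
    counts[order[:leftover]] += 1
    d = X_min.shape[1]
    parts = []
    for j, n_j in zip(ids, counts):
        if n_j == 0:
            continue
        Xc = X_min[labels == j]
        cov = np.atleast_2d(np.cov(Xc, rowvar=False)) if len(Xc) > 1 else np.zeros((d, d))
        # A small ridge keeps the covariance positive definite.
        parts.append(rng.multivariate_normal(Xc.mean(axis=0), cov + ridge * np.eye(d), size=n_j))
    return np.vstack(parts) if parts else np.empty((0, d))

# Usage: 25 minority points in two blobs, balanced against 100 majority points.
demo_rng = np.random.default_rng(1)
X_min = np.vstack([demo_rng.normal(0, 0.3, (15, 2)), demo_rng.normal(5, 0.3, (10, 2))])
X_new = gaussian_oversample(X_min, n_new=75)
X_balanced_minority = np.vstack([X_min, X_new])
```

Here `n_new` is set to the gap between the majority and minority class sizes, so stacking the synthetic samples onto the original minority samples yields a class as large as the majority. `k_max` must not exceed the number of minority samples, since the initial centers are drawn without replacement.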

Share and Cite

MDPI and ACS Style

Xie, W.; Huang, X. Oversampling Algorithm Based on Improved K-Means and Gaussian Distribution. Information 2026, 17, 28. https://doi.org/10.3390/info17010028

Chicago/Turabian Style

Xie, Wenhao, and Xiao Huang. 2026. "Oversampling Algorithm Based on Improved K-Means and Gaussian Distribution" Information 17, no. 1: 28. https://doi.org/10.3390/info17010028
