Abstract
For classification problems, an imbalanced dataset can severely degrade learning performance. To address this problem, researchers have proposed a range of methods, chiefly at the data and algorithm levels. At the data level, SMOTE is among the most effective methods: it creates new minority samples by linearly interpolating between existing minority samples. This paper proposes CS-SMOTE, an improved SMOTE-based data-level oversampling method built on a symmetric-cube scoring mechanism. The algorithm first exploits the symmetry properties of cubes to construct a new scoring rule over symmetric neighboring cubes, and uses the resulting scores to dynamically select sample points. It then maps the selected points back to the original feature space and generates new samples through multiple linear interpolations. This is equivalent to reducing the data to three dimensions, selecting points in that three-dimensional space, and synthesizing new samples by mapping those points back to the original high-dimensional space. Compared with existing SMOTE variants, the proposed method behaves in a more targeted way in regions of varying density and near class boundaries. In the experiments, samples are synthesized on several datasets with different oversampling methods, and the methods are compared using standard evaluation metrics. In addition, to avoid incidental results caused by relying on a single classifier, each oversampling method is evaluated with three commonly used classifiers (SVM, ELM, and MLP). The experimental results show that CS-SMOTE ranks first on average among the compared oversampling methods: over 33 datasets, 3 classifiers, and 3 performance metrics (297 rankings in total), CS-SMOTE ranked first in 179 cases (60.27%), which clearly demonstrates its strength in handling class-imbalanced problems.
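The abstract does not give implementation details, so the following Python sketch is only a rough illustration of the described pipeline: reduce the minority class to three dimensions, score points via their own grid cell and its symmetric neighboring cells, preferentially select points from sparse (e.g. boundary) regions, and interpolate in the original feature space as in SMOTE. All names (cs_smote_sketch, n_bins, etc.) are hypothetical, and the variance-based 3-D reduction and face-neighbor counting are simplified stand-ins for the paper's actual symmetric-cube scoring rule.

```python
import numpy as np

def cs_smote_sketch(X_min, n_new, n_bins=8, k=5, seed=None):
    """Illustrative sketch (not the authors' code) of the abstract's pipeline:
    1) project minority samples to 3-D, 2) score each sample by the occupancy
    of its cubic grid cell plus the symmetric neighboring cells, 3) sample
    preferentially from low-density regions, 4) interpolate in the original
    feature space, as in SMOTE."""
    rng = np.random.default_rng(seed)
    # 1) crude 3-D reduction: keep the three highest-variance features
    #    (the paper's actual mapping is not specified in the abstract)
    dims = np.argsort(X_min.var(axis=0))[-3:]
    P = X_min[:, dims]
    # 2) assign each point to a cell of a cubic grid and count occupancy
    lo, hi = P.min(axis=0), P.max(axis=0)
    cells = np.floor((P - lo) / (hi - lo + 1e-12) * n_bins).astype(int)
    cells = np.clip(cells, 0, n_bins - 1)
    counts = {}
    for c in map(tuple, cells):
        counts[c] = counts.get(c, 0) + 1
    # score = occupancy of a point's own cell plus its 6 face-symmetric neighbors
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    score = np.array([
        counts[tuple(c)] + sum(counts.get((c[0] + dx, c[1] + dy, c[2] + dz), 0)
                               for dx, dy, dz in offsets)
        for c in cells
    ], dtype=float)
    # 3) sparser regions get a higher sampling probability
    prob = 1.0 / score
    prob /= prob.sum()
    # 4) SMOTE-style linear interpolation in the ORIGINAL feature space
    new = []
    for i in rng.choice(len(X_min), size=n_new, p=prob):
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbr = rng.choice(np.argsort(d)[1:k + 1])  # one of the k nearest minority neighbors
        new.append(X_min[i] + rng.random() * (X_min[nbr] - X_min[i]))
    return np.asarray(new)
```

For example, given a minority-class matrix `X_min` of shape `(m, d)` with `d >= 3`, calling `cs_smote_sketch(X_min, n_new=100)` would return 100 synthetic samples in the original d-dimensional space, drawn more often from sparsely populated cube cells.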