Article

An Effective Hybrid Sampling Strategy for Single-Split Evaluation of Classifiers †

Department of Computer Science and Information Engineering, Ming Chuan University, Gwei Shan District, Taoyuan 333, Taiwan
* Author to whom correspondence should be addressed.
This article is a revised and expanded version of our conference paper listed below. Lee, Y.S.; Yen, S.J.; Tang, Y.J. Improved Sampling Methods for Evaluation of Classification Performance. In Proceedings of the 7th International Conference on Artificial Intelligence in Information and Communication, Fukuoka, Japan, 18–21 February 2025; pp. 378–382.
Electronics 2025, 14(14), 2876; https://doi.org/10.3390/electronics14142876
Submission received: 31 May 2025 / Revised: 8 July 2025 / Accepted: 15 July 2025 / Published: 18 July 2025
(This article belongs to the Special Issue Data Retrieval and Data Mining)

Abstract

Evaluating the classification accuracy of machine learning models typically involves multiple rounds of random training/test splits, model retraining, and performance averaging. However, this conventional approach is computationally expensive and time-consuming, especially for large datasets or complex models. To address this issue, we propose an effective sampling approach that selects a single training/test split whose evaluation closely approximates the results obtained from repeated random sampling. Our approach ensures that the sampled data closely reflect the classification behavior of the original dataset. Our methods integrate advanced distribution distance metrics and feature weighting techniques tailored to numerical, categorical, and mixed-type datasets. The experimental results demonstrate that our method achieves over 95% agreement with multi-run average accuracy while reducing computational overhead by more than 90%. This approach offers a scalable, resource-efficient alternative for reliable model evaluation, particularly valuable in time-critical or resource-constrained applications.

1. Introduction

In the process of constructing a classification model [1,2,3], the data sampling method plays a crucial role. The model’s learning capability is highly dependent on the quality of the sampled training set, while its accuracy is determined by the test set. If the sampling method results in a poor-quality training set for model construction and an overly simple test set, the model may achieve artificially high accuracy, leading to a misjudgment of its actual performance.
Therefore, during the data sampling phase, it is essential to carefully select samples to ensure their representativeness. This requires adopting appropriate sampling strategies and ensuring diversity within the sample to accurately reflect the overall data distribution. One of the most commonly used sampling methods is random sampling, which is both simple and efficient. In this approach, each data point has an equal probability of being selected, making the sampled dataset a reasonable representation of the entire population. Due to its ease of implementation, random sampling is frequently used in practical applications.
Random sampling has the drawback of generating different training and test sets in each iteration, leading to variations in accuracy results. A single instance of random sampling therefore cannot fully reflect the model's performance, making it unreliable to evaluate a classification model with only one training/test split. To achieve a more accurate assessment, multiple random samplings are typically required, and the average accuracy across all experiments is used as a measure of overall performance. However, this approach necessitates training multiple models, which is computationally expensive and impractical for real-world applications. Cross-validation mitigates the limitations of single random sampling: by systematically rotating different subsets of data into the training and test sets and averaging the results, it provides a more comprehensive evaluation of model performance. Nevertheless, like repeated random sampling, it still requires building multiple models. Given these limitations, improving sampling strategies and efficiently selecting appropriate training and test sets have become crucial research topics in the field [4,5,6,7].
The random sampling method presents several challenges in constructing classification models, particularly in terms of efficiency, time consumption, and computing resource demands for model evaluation. These factors significantly limit its practicality in real-world applications. To address these issues, Shin and Oh [8] proposed an improved sampling method, FWS, based on the approach introduced by Kang and Oh [9]. Their method performs multiple random samplings to generate multiple candidate training/test sets. For each candidate set, the feature-weighted distance between it and the original dataset is calculated to assess its suitability, and a training/test set is then selected from these candidates to support reliable model evaluation. However, this method still faces difficulties in selecting the most appropriate training and test sets, requiring further refinement.
To improve the accuracy and reliability of model assessment, our earlier work [10] employed different techniques for measuring data distribution similarity and calculating feature importance; however, those techniques can only handle datasets with numerical attributes. This article proposes a sampling approach that also handles categorical attributes as well as mixed-type datasets. Additionally, we compare the training/test sets chosen by various sampling methods and evaluate their ability to approximate the average accuracy obtained from multiple random samplings.
The main contributions of this article are summarized as follows:
1. Efficient Single-Split Evaluation Framework: We propose a novel sampling method that eliminates the need for repeated training/test splits and model retraining by selecting a single representative data split that closely approximates the average classification accuracy of multiple evaluations, significantly reducing time and computational cost.
2. Distribution-Aware Sampling with Feature Weighting: Our method introduces advanced techniques for measuring data distribution similarity and incorporates feature importance into the similarity calculation, ensuring that selected subsets accurately reflect the statistical structure of the original dataset.
3. Support for Mixed-Type Data and Comparative Analysis: Unlike existing methods limited to numerical attributes, our approach effectively handles both categorical and mixed-type datasets. We also conduct a comparative evaluation of different sampling strategies, demonstrating the proposed method's ability to approximate multi-run accuracy outcomes more reliably.

2. Related Work

D. Kang and S. Oh introduced the R-value-based sampling (RBS) method [11] to enhance the evaluation of classification models. For each data point p, the R-value is calculated by identifying its k nearest neighbors and counting how many of them belong to a different class than p; the R-value therefore ranges from 0 to k and indicates the degree of classification difficulty associated with that point. The RBS method groups data points by their R-values and applies stratified sampling [12] within each group. The sampled training and test subsets from all groups are then combined to form the final training/test set. By using the R-value as a quantitative indicator of classification difficulty, this method ensures that the training and test sets are constructed with comparable levels of classification complexity. Experimental results [11] demonstrate that, compared to random sampling [13], D-optimal sampling [14], and MDC methods [15], the RBS method generates training/test sets that better align with the actual performance of classification models, leading to more reliable model evaluation.
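As a concrete illustration of the R-value computation described above, the following is a minimal sketch using scikit-learn's NearestNeighbors; the function name and the default choice of k are ours, not taken from [11], and X and y are assumed to be a feature matrix and a NumPy array of class labels.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def r_values(X, y, k=7):
    """R-value of each point: how many of its k nearest neighbors carry a different class label."""
    # Query k + 1 neighbors because each point is returned as its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neighbor_labels = y[idx[:, 1:]]              # drop the self-neighbor in column 0
    return (neighbor_labels != y[:, None]).sum(axis=1)

# RBS then groups points by R-value (0..k) and samples within each group, e.g.:
# groups = {r: np.where(r_values(X, y) == r)[0] for r in range(8)}
```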
Although the R-value-based sampling (RBS) method accounts for class overlap among data points, it does not consider the overall distribution distance or the importance of individual features. To address these limitations, Shin and Oh proposed the feature-weighted sampling (FWS) method [8], an improved version of RBS designed to generate multiple candidate training/test sets and select the most suitable one. One limitation of the original RBS method [11] is its reliance on stratified sampling, which often results in the repeated selection of the same training/test sets. To introduce more diversity, FWS replaces stratified sampling with random sampling, ensuring greater variability in the generated datasets. Additionally, FWS transforms the original dataset into histograms, applying the same transformation to each candidate training/test set. To evaluate which candidate set most accurately represents the overall population distribution, Earth Mover’s Distance (EMD) [16] is used as a similarity measure. This approach allows for a more precise selection of training/test sets, addressing the shortcomings of RBS and improving the robustness of model evaluation.
Earth Mover’s Distance (EMD) [16] is a metric used to measure the similarity between distributions by calculating the minimum amount of effort required to transform one distribution into another. This transformation is conceptualized as moving units of “soil” from one distribution to match the target distribution. Given two distributions—P (representing the training/test dataset) and Q (representing the original dataset)—the EMD between P and Q is determined by the optimal transport distance and the corresponding amount of data that needs to be moved. During model training, the contribution of different features significantly impacts overall performance.
To account for this, feature-weighted sampling (FWS) incorporates feature importance into the distance calculation. Specifically, it utilizes Shapley values [17] to quantify each feature’s contribution to the model. Shapley values, originating from cooperative game theory, are widely applied in machine learning to fairly assess the contribution of each feature to a model’s predictions [18]. In the context of game theory, the Shapley value provides a principled approach for distributing value among multiple players. When adapted to machine learning, it allocates the predictive influence among input features in an equitable manner. Despite its effectiveness, a notable limitation of the Shapley value is its reliance on numerical inputs, which restricts its direct applicability to categorical variables. This presents challenges when analyzing datasets that contain both numerical and categorical features.
After computing the Shapley value to determine the weight w(f_i) for each numerical feature f_i, and calculating the Earth Mover's Distance (EMD) of each feature for each candidate training/test set, the overall feature-weighted distance d is derived using Equation (1). In this equation, d_train(f_i) denotes the EMD of feature f_i between the training set and the original dataset, while d_test(f_i) denotes the EMD of feature f_i between the test set and the original dataset. Once the feature-weighted distances of all candidate training/test sets are computed, the FWS method selects the training/test set with the smallest feature-weighted distance, as it best preserves the overall data distribution and feature importance, ensuring a proper model evaluation.
d = \sum_{i=1}^{n} w(f_i) \times \left( d_{\mathrm{train}}(f_i) + d_{\mathrm{test}}(f_i) \right)    (1)
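Equation (1) reduces to a simple weighted sum once the per-feature distances and weights are available; the helper below is a minimal sketch of our own (not code from [8]), where all three arguments map a feature name to a value.

```python
def feature_weighted_distance(weights, d_train, d_test):
    """Equation (1): sum of w(f_i) * (d_train(f_i) + d_test(f_i)) over all features."""
    return sum(weights[f] * (d_train[f] + d_test[f]) for f in weights)
```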
In summary, existing methods either ignore the overall distribution distance and feature importance, as in RBS [11], or are limited to numerical data, as in FWS [8]. To address these limitations, this article introduces a set of feature-weighted sampling methods that incorporate various distribution distance metrics and feature weighting techniques to overcome the shortcomings of previous approaches.

3. Our Approach

In this section, we introduce our proposed sampling framework designed to improve the accuracy and reliability of classification performance evaluation on mixed-type datasets that contain both numerical and categorical attributes. We begin by enhancing the FWS method [8], which was limited to datasets composed solely of numerical attributes. To address this limitation, we introduce a series of enhancements that enable the framework to effectively process categorical attributes, ensuring broader applicability in real-world scenarios. Based on experimental findings, we ultimately propose an integrated sampling strategy that accommodates both numerical and categorical features. The subsequent subsections detail the sampling procedures for different attribute types.

3.1. The Sampling Methods on the Dataset with Numerical Attributes

In this subsection, we describe how our method conducts sampling on the dataset with only numerical attributes [10], which generates the most suitable training and test sets for accurately evaluating classification performance. Our approach enhances the FWS method [8] by refining data distribution distance measurements and feature importance calculations.
Our method consists of two phases. In the first phase, we apply a modified RBS [8] to generate 1000 training/test candidate sets. The choice of 1000 candidate sets was made empirically as a trade-off between computational feasibility and performance stability. In the second phase, we evaluate these candidate sets and select the one with the smallest feature-weighted distribution distance, as defined in Equation (1). To ensure the most suitable training and test set selection, our approach improves the FWS method by refining the feature-weight calculations and the data distribution distance assessment. The overall architecture of our approach is illustrated in Figure 1 [10].
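The two-phase procedure can be summarized as the selection loop below. Here, `generate_candidate_split` stands in for the modified RBS generator and `per_feature_distance` for whichever distribution distance is plugged in; both names, and the assumption that the dataset is column-indexable (e.g., a pandas DataFrame), are ours rather than part of the original method description.

```python
def select_best_split(dataset, weights, generate_candidate_split, per_feature_distance,
                      n_candidates=1000):
    """Phase 1: generate candidate train/test splits.
    Phase 2: keep the split whose feature-weighted distance to the full dataset,
    computed as in Equation (1), is smallest."""
    best_split, best_dist = None, float("inf")
    for _ in range(n_candidates):
        train, test = generate_candidate_split(dataset)
        dist = sum(weights[f] * (per_feature_distance(dataset[f], train[f]) +
                                 per_feature_distance(dataset[f], test[f]))
                   for f in weights)
        if dist < best_dist:
            best_split, best_dist = (train, test), dist
    return best_split
```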
In calculating distribution similarity, FWS employs the Earth Mover’s Distance (EMD) [16] to measure the distance between the population data and the training/test datasets. However, the computation of EMD requires data discretization, which can lead to information loss and potentially impact the accuracy of the results. To address this limitation, our method utilizes three alternative approaches for measuring distributional distance: Energy Distance [19], distance correlation hypothesis testing (dcor) [20], and the Kolmogorov–Smirnov (K-S) test [21].
Energy Distance, inspired by Newton’s concept of gravitational energy, serves as a versatile statistical tool. It can be applied in various contexts such as testing statistical independence through distance covariance, assessing goodness-of-fit, performing non-parametric tests for distribution equality, extending Analysis of Variance (ANOVA), identifying change points, conducting feature selection, and more [19]. This metric is particularly useful for detecting similarities between two probability distributions. The formal definition of Energy Distance is provided in Equation (2), where X and Y represent two distributions, and NX and NY denote the number of samples from each, respectively. One of the key strengths of Energy Distance lies in its sensitivity to both the shape and location of distributions, making it especially effective for testing distributional homogeneity. However, a notable drawback is that it requires computing pairwise distances between all sample values, which can become computationally expensive for large datasets. Additionally, the Energy Distance does not have a standardized scale.
\epsilon_{N_X, N_Y}(X, Y) = \frac{2}{N_X N_Y} \sum_{i=1}^{N_X} \sum_{j=1}^{N_Y} \| x_i - y_j \| - \frac{1}{N_X^2} \sum_{i=1}^{N_X} \sum_{j=1}^{N_X} \| x_i - x_j \| - \frac{1}{N_Y^2} \sum_{i=1}^{N_Y} \sum_{j=1}^{N_Y} \| y_i - y_j \|    (2)
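For one-dimensional features, Equation (2) can be evaluated directly with NumPy; the sketch below is our own implementation and also makes the quadratic pairwise cost mentioned above explicit.

```python
import numpy as np

def energy_distance(x, y):
    """Equation (2) for two one-dimensional samples x and y."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    a = np.abs(x[:, None] - y[None, :]).mean()   # average cross-sample distance
    b = np.abs(x[:, None] - x[None, :]).mean()   # average within-sample distance of x
    c = np.abs(y[:, None] - y[None, :]).mean()   # average within-sample distance of y
    return 2 * a - b - c
```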
The value derived from the Energy Distance is not confined to a fixed range, which can result in it disproportionately influencing the overall weighted distance calculation during sample selection. To address this, Carreño and Torrecilla [20] introduced an implementation of energy-based metrics in the Python dcor package (version 0.6). This package supports computations such as distance covariance, distance correlation, partial distance correlation, and hypothesis testing. In our approach, we also utilize dcor to assess the similarity between different data distributions.
The dcor-Hypothesis test accounts for differences in sample sizes to maintain comparability across datasets. It estimates the p-value through permutation testing by repeatedly shuffling the samples and computing the energy statistic, which measures how often the permuted statistic exceeds the original, indicating the degree of similarity or dependence. This test is suitable for multi-dimensional data and is sensitive to variations in distribution shape, location, and scale. The dcor-Hypothesis test is defined in Equation (3), in which N_X and N_Y are the sample sizes of datasets X and Y, and \epsilon_{N_X, N_Y}(X, Y) is a measure of distance-based dependence between X and Y; in practice, this is often a form of energy statistic or distance correlation value. The scaling factor N_X N_Y / (N_X + N_Y) adjusts for unequal sample sizes, ensuring the test statistic T is comparable across datasets. Therefore, the statistic T captures how strongly the datasets X and Y are related based on their mutual spatial (distance) structure. Unlike Energy Distance, the dcor-Hypothesis test outputs values within a fixed range of 0 to 1, providing a normalized measure of similarity. However, this method is computationally intensive, especially for large datasets, due to the repeated permutations involved in the testing process.
T = \frac{N_X N_Y}{N_X + N_Y} \, \epsilon_{N_X, N_Y}(X, Y)    (3)
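A minimal sketch of the statistic in Equation (3), built on the dcor package's energy_distance function; the commented energy_test call reflects our reading of the dcor 0.6 interface for the permutation test and should be checked against the package documentation.

```python
import dcor  # pip install dcor

def dcor_t_statistic(x, y):
    """Equation (3): two-sample energy statistic scaled by N_X * N_Y / (N_X + N_Y)."""
    n_x, n_y = len(x), len(y)
    return (n_x * n_y / (n_x + n_y)) * dcor.energy_distance(x, y)

# A permutation p-value, as described above, is provided by the package via
# dcor.homogeneity.energy_test(x, y, num_resamples=200)
```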
The third distribution similarity calculation method employed to measure the distribution distance between the population dataset and the training/test sets is the Kolmogorov–Smirnov (K-S) test [21], which is a non-parametric statistical method used to compare two probability distributions (two-sample test), or a sample with a reference distribution (one-sample test). In the context of distribution similarity, it helps determine whether two samples, such as a population dataset and a training or test dataset, are drawn from the same distribution.
In multivariate statistics, the K-S test is a widely adopted method for assessing differences in data distributions [21]. It quantifies the similarity between two distributions by evaluating the maximum difference between their cumulative distribution functions, as illustrated in Equation (4). This maximum deviation is taken as the measure of distance between the distributions. Here, F(P) and F(Q) represent the cumulative distribution functions of distributions P and Q, respectively.
D = \max \left| F(P) - F(Q) \right|    (4)
Compared to the Earth Mover’s Distance [16], the K-S test is more intuitive and easier to apply, especially for one-dimensional data, where it is both simple and computationally efficient. Its output ranges consistently between 0 and 1. However, when applied to multi-dimensional data, it requires careful consideration and validation to ensure accuracy.
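For a single numerical feature, the statistic D of Equation (4) is directly available from SciPy's two-sample K-S test; the wrapper below is a sketch of how we would apply it per feature.

```python
from scipy.stats import ks_2samp

def ks_distance(population_values, sample_values):
    """Equation (4): maximum gap between the two empirical CDFs (between 0 and 1)."""
    return ks_2samp(population_values, sample_values).statistic
```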
For feature importance estimation, the Shapley value used in the FWS method [8] considers all possible feature orderings, which leads to excessive computation time and makes it impractical for real-world applications. Additionally, Shapley values can be negative, which may reduce the calculated distance and lead to inaccurate results. To address these issues, our approach adopts Analysis of Variance (ANOVA) to assess the relationship between each feature and the target variable, and uses the resulting p-value as a measure of importance. In statistical analysis, the p-value indicates whether the null hypothesis can be rejected; a p-value below a predefined significance level (commonly 0.05) suggests that the feature has a statistically significant effect on the target variable. Therefore, we treat 1 − p-value as an indicator of feature importance.
F = \frac{MSB}{MSW}    (5)
MSB = \frac{\sum_{i=1}^{k} n_i (\bar{X}_i - \bar{X})^2}{k - 1}    (6)
MSW = \frac{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2}{N - k}    (7)
The F-statistic and p-value are commonly used as the basis for hypothesis testing in ANOVA, as shown in Equation (5). The F-statistic is the ratio of the mean square between groups (MSB), which measures the variation between the group (sample) means, to the mean square within groups (MSW), which measures the variation within each group. MSB and MSW are given in Equations (6) and (7), respectively, in which N and k denote the total number of observations and the total number of groups, n_i denotes the number of observations in the i-th group, \bar{X}_i and \bar{X} represent the mean of the i-th group and the grand mean over all observations, respectively, and X_{ij} is the j-th observation in the i-th group. Therefore, (\bar{X}_i - \bar{X})^2 is the squared difference between a group mean and the grand mean, which reflects how much group i deviates from the overall mean, and (X_{ij} - \bar{X}_i)^2 is the squared difference between an individual observation and its group mean, which reflects how much individual values deviate from their own group mean.
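A minimal sketch of the ANOVA-based feature weight, using SciPy's one-way ANOVA; `feature` and `labels` are assumed to be NumPy arrays of equal length, and the 1 − p weighting follows the description above.

```python
import numpy as np
from scipy.stats import f_oneway

def anova_feature_weight(feature, labels):
    """Weight a numerical feature by 1 - p, where p is the one-way ANOVA p-value
    of the feature values grouped by class label (Equations (5)-(7))."""
    groups = [feature[labels == c] for c in np.unique(labels)]
    _, p_value = f_oneway(*groups)
    return 1.0 - p_value
```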

3.2. The Sampling Methods on the Dataset with Categorical Attributes

In this subsection, we describe our proposed feature-weighted sampling methods for a categorical dataset. The framework of our sampling method for a categorical dataset is illustrated in Figure 2. The process is also divided into two phases. In the first phase, the modified RBS method [8] is used to generate 1000 training/test sets, which serve as candidate sets. In the second phase, each candidate set is evaluated by calculating the feature-weighted distribution distance between the population dataset and the candidate training/test sets using Equation (1), and the one with the smallest feature-weighted distribution distance is selected. The distribution distance between the population dataset and the training/test dataset is measured using Kullback–Leibler divergence (KLD) [22,23,24] and Jensen–Shannon divergence (JSD) [24,25,26]. Feature importance is assessed using the Chi-square (χ²) test.
We adopt Kullback–Leibler divergence (KLD) and Jensen–Shannon divergence (JSD) as methods for calculating distribution distance. The KL divergence is a non-symmetric measure of how one probability distribution Q diverges from a second, reference probability distribution P. It is often used to measure the difference between two probability distributions, especially in the context of machine learning and statistical modeling. The KL divergence is defined in Equation (8), where P and Q represent two different distributions. Specifically, p(x) is the probability of event x under distribution P, and q(x) is the probability of the same event x under distribution Q. The term p(x) serves as a weighting factor, emphasizing events with higher occurrence probability and greater influence on the overall distribution. KL divergence quantifies how much “extra information” is needed to represent events from P using the distribution Q. A higher value indicates greater divergence. The Kullback–Leibler divergence (KLD) can be used to measure the degree of dissimilarity between two probability distributions. When the two distributions are identical, the KLD is zero. It is also asymmetric, as indicated in Equation (9), and can be regarded as a form of information measure [27].
D_{KL}(P \| Q) = \sum_{x} p(x) \ln \frac{p(x)}{q(x)}    (8)
D_{KL}(P \| Q) \neq D_{KL}(Q \| P)    (9)
Due to the asymmetry of KL divergence, it is necessary to compute the divergence in both directions: one using the population data (P) as the reference distribution, and the other using the training data (Q) as the reference distribution. This double computation gives a more complete understanding of the dissimilarity between two distributions. If the difference between the two distributions is substantial, the KL divergence can grow without bound and approach infinity.
To address KL divergence’s asymmetry and unboundedness, our method also uses Jensen–Shannon divergence, a symmetrized and smoothed version of KL divergence. JS divergence [25] can be used for goodness-of-fit testing and resolves the asymmetry problem of KL divergence. Equation (10) gives the definition of JS divergence, where P and Q represent two different distributions; its value ranges from 0 to 1. JS divergence offers a more robust and interpretable measure of similarity between distributions. Both JS divergence and KL divergence require estimating the probability distributions associated with the data.
JSD(P \| Q) = \frac{1}{2} D_{KL}\!\left(P \,\Big\|\, \frac{P+Q}{2}\right) + \frac{1}{2} D_{KL}\!\left(Q \,\Big\|\, \frac{P+Q}{2}\right)    (10)
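For a categorical feature, both divergences operate on the per-category relative frequencies; the sketch below uses SciPy's entropy function (which computes the KL divergence when given two distributions) and adds a small smoothing constant of our choosing to keep Equation (8) finite when a category is missing from one sample.

```python
import numpy as np
from scipy.stats import entropy

def category_probs(values, categories, eps=1e-9):
    """Smoothed relative frequency of each category in a sample."""
    counts = np.array([(values == c).sum() for c in categories], dtype=float) + eps
    return counts / counts.sum()

def kld(p, q):
    """Equation (8): Kullback-Leibler divergence D_KL(P || Q)."""
    return entropy(p, q)

def jsd(p, q):
    """Equation (10): Jensen-Shannon divergence of P and Q."""
    m = 0.5 * (p + q)
    return 0.5 * entropy(p, m) + 0.5 * entropy(q, m)
```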
For the feature importance, we adopt the Chi-square test [28] as the basis for calculating the feature importance of a categorical dataset. The Chi-square test is a commonly used statistical method for significance testing, which assesses the relationship between the feature variable and the target variable. The Chi-square statistic value for each feature reflects its relevance or importance in predicting the target. Higher Chi-square scores indicate a stronger association and hence greater feature importance. These importance scores are then used as feature weights in the divergence-based sampling method to ensure that important features have more influence on divergence calculations, improving the quality of training/test data selection.
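A minimal sketch of the Chi-square feature weight, computed from the feature-by-class contingency table with SciPy; the helper name and the use of pandas crosstab are our own choices, not prescribed by the paper.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_feature_weight(feature, labels):
    """Chi-square statistic of a categorical feature against the class label;
    a higher score means a stronger association and thus a larger weight."""
    table = pd.crosstab(feature, labels)
    statistic, _, _, _ = chi2_contingency(table)
    return statistic
```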

4. Experimental Results

In this section, we first describe the datasets used to perform the experiments and introduce the evaluation metric; we then apply four classification algorithms, Decision Tree (DT), Random Forest (RF), K-Nearest Neighbor (KNN), and Support Vector Machine (SVM), to build the classifiers. Finally, we assess the reliability of the various sampling methods by analyzing the classification accuracy on the test datasets, using the classifiers trained on the training sets derived from each sampling strategy.

4.1. Datasets and Classifiers

The datasets used in our experiments can be obtained from the UCI repository (https://archive.ics.uci.edu/; accessed on 1 July 2025) and Kaggle (https://www.kaggle.com/; accessed on 1 July 2025), and all are used for classification tasks. We use five datasets with only numerical attributes, five datasets with only categorical attributes, and five datasets with hybrid attribute types. Table 1 describes the datasets used in the experiments; for each dataset, three fields are reported: the number (#) of instances, the number (#) of attributes, and the number (#) of classes. In the # of attributes field, N denotes numerical attributes and C denotes categorical attributes; for example, N:6, C:9 means the dataset has six numerical attributes and nine categorical attributes.
Table 2 shows the parameters of the four algorithms used to build the classifiers. For the Decision Tree (DT) classifier, we set the splitting criterion to ‘entropy’, enabling the use of information gain for more informative splits, and specify min_samples_leaf = 2 to prevent overfitting by ensuring each leaf node contains at least two samples. The Random Forest (RF) classifier is configured with n_estimators = 500, providing robustness through ensemble learning with a large number of trees, while max_features = ‘sqrt’ limits the number of features considered at each split to the square root of the total, enhancing diversity and reducing variance. The K-Nearest Neighbor (KNN) classifier uses n_neighbors = 5, meaning classification is based on the majority vote among the five closest training samples, balancing bias and variance. Lastly, the Support Vector Machine (SVM) utilizes the ‘rbf’ kernel, allowing it to model nonlinear decision boundaries by projecting data into a higher-dimensional space. These parameter choices are selected based on common best practices and prior studies, aiming to ensure fair and effective comparisons.
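The parameter names in Table 2 correspond to scikit-learn's estimators, so the four classifiers can be configured as follows; the use of scikit-learn is our assumption, as the paper does not name the library explicitly.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Classifier settings from Table 2.
classifiers = {
    "DT":  DecisionTreeClassifier(criterion="entropy", min_samples_leaf=2),
    "RF":  RandomForestClassifier(n_estimators=500, max_features="sqrt"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf"),
}
```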

4.2. Evaluation Metric MAI

To assess whether the training/test set generated by a sampling method meets the expected outcomes, a reliable evaluation metric is required. During the sampling process, it is challenging to determine in advance whether the selected training/test sets will produce the desired results. To address this, one common strategy is to use the accuracy obtained from random sampling as a baseline for comparison. However, relying on the accuracy of a single random sample can be misleading due to high variability across different samples. A more robust approach involves computing the average accuracy over multiple random samplings, providing a more stable and reliable reference metric. This average is referred to as the Absolute Evaluation Value (AEV) [9], as defined in Equation (11). In this context, test_acci denotes the accuracy of the i-th random sampling, m represents the total number of random samplings, and AEV is the mean accuracy computed across all m samplings.
AEV = \frac{\sum_{i=1}^{m} \mathrm{test\_acc}_i}{m}    (11)
Once the Absolute Evaluation Value (AEV) is determined, the Mean Accuracy Indicator (MAI) is used to evaluate whether a training/test set is appropriate [9]. As shown in Equation (12), MAI is the absolute difference between the classification accuracy (ACC) obtained by a given sampling method and the average accuracy of multiple random samplings (AEV), divided by the standard deviation (SD) of the accuracies over those samplings. Intuitively, a lower MAI value indicates a smaller discrepancy between ACC and AEV, suggesting that the sampling method produces results closely aligned with the average performance of multiple random samplings, thereby demonstrating its effectiveness.
MAI = \frac{\left| ACC - AEV \right|}{SD}    (12)
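Equations (11) and (12) translate directly into code; a minimal sketch, where `accuracies` is the list of test accuracies from the m random samplings and `acc` is the accuracy of the split chosen by a sampling method.

```python
import numpy as np

def aev_and_sd(accuracies):
    """Equation (11): mean test accuracy over m random samplings, plus its standard deviation."""
    accuracies = np.asarray(accuracies, dtype=float)
    return accuracies.mean(), accuracies.std()

def mai(acc, aev, sd):
    """Equation (12): deviation of a single split's accuracy from the multi-run mean,
    measured in units of the multi-run standard deviation."""
    return abs(acc - aev) / sd
```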

4.3. Experimental Results on Numerical Datasets

We first conducted 1000 random samplings on the five datasets with only numerical attributes in Table 1 to generate 1000 training/test sets. Using four different classification algorithms, we built classification models and recorded the test-set accuracy for each sampling. We then calculated the average accuracy and standard deviation of the test accuracies across the 1000 classifiers built from the random samplings, as shown in Table 3.
We employ the modified RBS sampling method [8] to generate 1000 candidate training/test sets for each of the five numerical datasets. For each candidate set, we use the four distribution distance calculation methods, EMD, Energy Distance, dcor-Hypothesis testing, and the K-S test, and apply the Shapley value and ANOVA to calculate feature weights, respectively; the combination of EMD and the Shapley value corresponds to the FWS method [8]. Subsequently, we calculate the feature-weighted distance between each candidate training/test set and the original dataset using Equation (1), and the training/test set exhibiting the smallest feature-weighted distance is selected. The classification accuracy (ACC) on the test set for the four classifiers built from the training set of the selected split is presented in Table 4 and Table 5, in which the Shapley value and ANOVA are used to calculate the feature weights, respectively. Based on Table 3, Table 4, and Table 5, the evaluation metric MAI for the sampling methods based on the four distribution distance calculation methods, with the Shapley value and ANOVA as the feature-weight computation, is shown in Table 6 and Table 7, respectively.
According to the results in Table 6 and Table 7, the dcor-Hypothesis testing method achieves the lowest average MAI across all the sampling methods. This indicates that it is the most effective approach for selecting training and test sets that closely match the original data distribution. The reason for its superior performance lies in its two key components: (1) distance correlation, which captures both linear and nonlinear dependencies between variables, and (2) hypothesis testing, which filters out statistically insignificant relationships, reducing the influence of noise or irrelevant features. When combined with ANOVA-based feature weighting, which assigns higher importance to features with greater between-class variance, the method ensures that the feature-weighted distance reflects the most discriminative aspects of the data. As a result, the training and test sets selected using this method are more representative of the original dataset in terms of both overall structure and feature relevance. Therefore, dcor-Hypothesis testing alongside ANOVA for feature-weighted calculation yields the smallest average MAI values.
Table 8 and Table 9 report the execution times of each sampling method for different classifiers on the numerical datasets, with the feature weights calculated by the Shapley value and ANOVA, respectively. From these experiments, we can see that the execution time required for randomly sampling 1000 training/test sets and computing their test accuracies is significantly longer than that for generating 1000 candidate sets and selecting the one with the smallest feature-weighted distance. In contrast, the execution times of the different feature-weighted sampling methods are relatively similar, except for dcor-Hypothesis testing.

4.4. Experimental Results on Categorical Datasets

We also performed 1000 random samplings on five datasets with only categorical attributes in Table 1, resulting in 1000 distinct training/test set pairs. For each of these pairs, we applied four different classification algorithms to construct corresponding classification models and recorded the accuracy on the test dataset. Subsequently, we calculated the average test accuracy and its standard deviation across the 1000 classification models generated from the random samplings. The results are presented in Table 10.
Table 11 shows the classification accuracy on the test dataset obtained from the four classifiers on the five categorical datasets. For each dataset, candidate training/test sets were evaluated by computing their distribution distances from the original dataset using Kullback–Leibler divergence (KLD), Jensen–Shannon divergence (JSD), and Earth Mover's Distance, incorporating feature importance calculated by the Chi-square test [28] to obtain feature-weighted distances. The training/test set with the smallest distribution distance to the original set was selected. For KLD, two variants were considered: one treating the original dataset as the reference distribution (P∥Q) and the other treating the candidate training/test set as the reference distribution (Q∥P). In addition, in order to run the FWS method on these datasets, we flattened the categorical attribute values into numerical attributes by one-hot encoding and used Earth Mover's Distance (EMD) and Shapley values to calculate the feature-weighted distance.
According to Table 10 and Table 11, Table 12 shows the evaluation results in terms of MAI for the various sampling methods that utilize different feature-weighted distribution distance measures. In these methods, feature importance is determined using the Chi-square test [28], and comparisons are made against the feature-weighted sampling (FWS) method [8], which applies one-hot encoding to transform categorical attributes into numerical values. From the results in Table 12, it is evident that both KL divergence and JS divergence, when combined with Chi-square-based feature weighting, outperform the FWS method. The FWS method's reliance on one-hot encoding may introduce high-dimensional sparsity and fail to capture statistical dependencies effectively, especially when categorical attributes have a large number of levels. In contrast, KL divergence and JS divergence explicitly model the divergence between the probability distributions of features and incorporate feature importance to better reflect the relative significance of each attribute.
Among all evaluated methods, JS divergence with feature importance achieves the lowest average MAI, demonstrating the most stable and reliable performance. This is because JS divergence is a symmetric and smoothed version of KL divergence and more robustly handles discrepancies between distributions. By integrating JS divergence with Chi-square feature importance, the method effectively prioritizes informative attributes while ensuring the sampled training and test sets maintain high distributional fidelity to the original dataset. This alignment leads to more accurate model evaluation, ultimately resulting in the smallest MAI values observed across all scenarios.
The execution times of random sampling and of each feature-weighted sampling method for different classifiers on the categorical datasets are shown in Table 13. The experimental results are similar to those on the numerical datasets; that is, randomly sampling 1000 training/test sets and calculating their test accuracies takes significantly more time than generating 1000 candidate sets and selecting the one with the smallest feature-weighted distance. Meanwhile, the execution times of the various feature-weighted sampling methods are generally comparable.

4.5. Experimental Results on Mixed-Type Datasets

We also performed 1000 random samplings on the five mixed-type datasets containing both numerical and categorical attributes in Table 1, resulting in 1000 distinct training/test set pairs. For each sampling, classification models were constructed using four different algorithms, and the corresponding test accuracies were recorded. Subsequently, we computed the mean accuracy and standard deviation of the test results across the 1000 classifiers derived from these random samples, as shown in Table 14.
Since the combination of the dcor-Hypothesis testing distribution distance with ANOVA-based feature weighting yielded the best performance on numerical datasets, and the combination of JS divergence with the Chi-square test performed best on categorical datasets, we propose a hybrid sampling approach tailored to mixed-type datasets. Specifically, we compute feature-weighted distances separately for numerical and categorical attributes. For numerical attributes, we apply dcor-Hypothesis testing, which captures both linear and nonlinear dependencies, together with ANOVA, which emphasizes features with high between-class variance; this ensures that the most discriminative numerical features are appropriately weighted during distance computation. For categorical attributes, we use JS divergence, a symmetric and smoothed divergence measure, along with the Chi-square test for feature weighting. The Chi-square test highlights attributes with strong class association, and JS divergence robustly compares feature distributions in the presence of sparsity or imbalance, which are common in categorical data.
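A minimal sketch of the hybrid feature-weighted distance for one candidate subset (train or test), reusing the helper sketches from Sections 3.1 and 3.2 above (dcor_t_statistic, anova_feature_weight, category_probs, jsd, chi2_feature_weight); it assumes pandas DataFrames `population` and `subset` and a NumPy label array `y` aligned with `population`, and it simply sums the numerical and categorical parts, which is our reading of Equation (1) applied per attribute type.

```python
def hybrid_weighted_distance(population, subset, y, numeric_cols, categorical_cols):
    """Feature-weighted distance of a candidate subset from the population,
    combining the numerical and categorical measures described above."""
    total = 0.0
    for col in numeric_cols:        # dcor T statistic weighted by 1 - ANOVA p-value
        w = anova_feature_weight(population[col].to_numpy(), y)
        total += w * dcor_t_statistic(population[col].to_numpy(), subset[col].to_numpy())
    for col in categorical_cols:    # Jensen-Shannon divergence weighted by the chi-square score
        cats = population[col].unique()
        p = category_probs(population[col].to_numpy(), cats)
        q = category_probs(subset[col].to_numpy(), cats)
        total += chi2_feature_weight(population[col], y) * jsd(p, q)
    return total
```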
To further validate the effectiveness of the hybrid approach, we conducted an additional experiment where all categorical features were flattened using one-hot encoding and treated as numerical data. After transforming all categorical features in the mixed-type dataset into numerical form, we then applied the best-performing numerical method, dcor-Hypothesis testing with ANOVA, to calculate the feature-weighted distance between the original dataset and candidate sets, and the training/test set with the smallest feature-weighted distance was selected. The corresponding test accuracy and MAI values are presented in Table 15.
As shown in Table 15, our hybrid approach outperforms this alternative method (dcor-Hypothesis testing with ANOVA). This improvement occurs because converting categorical attributes into sparse binary vectors via one-hot encoding can dilute feature relevance and introduce high dimensionality, potentially distorting the true distributional structure. In contrast, our hybrid approach preserves the intrinsic characteristics of each attribute type by applying the most appropriate distance measure and feature importance method to each. As a result, it more accurately selects training/test sets that maintain the original dataset’s structure and feature relationships, thereby minimizing distributional deviation and achieving the smallest MAI across mixed-type datasets.

5. Conclusions

In the past, it was common to repeatedly split the dataset into training and test sets and build multiple classification models to evaluate classification accuracy. This requires substantial computational time for repeated modeling and evaluation and becomes impractical for large-scale datasets. To address this, some researchers have proposed sampling methods that obtain a single training/test dataset approximating the results of repeated modeling. Among the existing methods, the feature-weighted sampling (FWS) method is currently regarded as the most effective. It calculates the distribution distance between the original dataset and the training/test dataset using Earth Mover's Distance (EMD) and employs Shapley values to compute feature importance. However, EMD requires data discretization prior to the distribution similarity calculation, which may compromise data fidelity, and Shapley values are applicable only to numerical attributes.
To overcome these limitations, this study proposes improvements to the FWS method. Specifically, we introduce a sampling approach that does not require discretization when computing distribution similarity, and we also propose a sampling strategy capable of handling categorical datasets. Experimental results show that our proposed method achieves a lower MAI (Mean Accuracy Indicator) and performs better than the original FWS method. Among all the methods, dcor-Hypothesis testing combined with ANOVA-based feature weighting and JS divergence combined with Chi-square-based feature weighting obtain the lowest average MAI on numerical and categorical datasets, respectively. The execution time of all the feature-weighted sampling methods is reduced by over 90% compared to that of random sampling. Furthermore, in our sampling framework, the best-performing sampling strategies for numerical and categorical datasets are applied, respectively, to the corresponding attributes of mixed-type datasets. The results also demonstrate that our hybrid method achieves strong performance in sampling datasets with hybrid attribute types. As future work, we plan to develop adaptive strategies to dynamically adjust the number of candidate sets based on computational resources and to explore how to apply our framework to unstructured and high-dimensional datasets.

Author Contributions

Conceptualization, Y.-S.L. and S.-J.Y.; methodology, Y.-J.T.; software, Y.-J.T.; validation, Y.-S.L. and S.-J.Y.; formal analysis, S.-J.Y.; resources, Y.-S.L.; writing—original draft preparation, Y.-J.T.; writing—review and editing, Y.-S.L. and S.-J.Y.; supervision, Y.-S.L. and S.-J.Y.; project administration, Y.-S.L. and S.-J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets analyzed in this study are publicly available from the UCI Machine Learning Repository (https://archive.ics.uci.edu/; accessed on 1 July 2025) and Kaggle (https://www.kaggle.com/; accessed on 1 July 2025).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Sen, P.C.; Hajra, M.; Ghosh, M. Supervised Classification Algorithms in Machine Learning: A Survey and Review. In Proceedings of the IEM Graph, Kolkata, India, 6–8 September 2018; pp. 99–111. [Google Scholar]
  2. Rauschert, S.; Raubenheimer, K.; Melton, P.; Huang, R. Machine Learning and Clinical Epigenetics: A Review of Challenges for Diagnosis and Classification. Clin. Epigenetics 2020, 12, 51. [Google Scholar] [CrossRef] [PubMed]
  3. Henrique, B.M.; Sobreiro, V.A.; Kimura, H. Literature review: Machine Learning Techniques Applied to Financial Market Prediction. Expert Syst. Appl. 2019, 124, 226–251. [Google Scholar] [CrossRef]
  4. Sharma, G. Pros and Cons of Different Sampling Techniques. Int. J. Appl. Res. 2017, 3, 749–752. [Google Scholar]
  5. Stratton, S.J. Population Research: Convenience Sampling Strategies. Prehospital Disaster Med. 2021, 36, 373–374. [Google Scholar] [CrossRef] [PubMed]
  6. Taherdoost, H. Sampling Methods in Research Methodology; How to Choose a Sampling Technique for Research. Int. J. Acad. Res. Manag. 2016, 5, 18–27. [Google Scholar] [CrossRef]
  7. Bellhouse, D. Systematic Sampling Methods. In Encyclopedia of Biostatistics; John Wiley & Sons: Chichester, UK, 2005; pp. 4478–4482. [Google Scholar]
  8. Shin, H.; Oh, S. Feature-Weighted Sampling for Proper Evaluation of Classification Models. Appl. Sci. 2021, 11, 2039. [Google Scholar] [CrossRef]
  9. Kang, D.; Oh, S. Balanced training/test set Sampling for Proper Evaluation of Classification Models. Intell. Data Anal. 2020, 24, 5–18. [Google Scholar] [CrossRef]
  10. Lee, Y.S.; Yen, S.J.; Tang, Y.J. Improved Sampling Methods for Evaluation of Classification Performance. In Proceedings of the 7th International Conference on Artificial Intelligence in Information and Communication, Fukuoka, Japan, 18–21 February 2025; pp. 378–382. [Google Scholar]
  11. Oh, S. A New Dataset Evaluation Method Based on Category Overlap. Comput. Biol. Med. 2011, 41, 115–122. [Google Scholar] [CrossRef] [PubMed]
  12. Parsons, V.L. Stratified Sampling. In Wiley StatsRef: Statistics Reference Online; John Wiley and Sons: Hoboken, NJ, USA, 2014; pp. 102–144. [Google Scholar]
  13. Berndt, A.E. Sampling methods. J. Hum. Lact. 2020, 36, 224–226. [Google Scholar] [CrossRef] [PubMed]
  14. Martin, E.J.; Critchlow, R.E. Beyond Mere Diversity: Tailoring Combinatorial Libraries for Drug Discovery. J. Comb. Chem. 1999, 1, 32–45. [Google Scholar] [CrossRef] [PubMed]
  15. Hudson, B.D.; Hyde, R.M.; Rahr, E.; Wood, J.; Osman, J. Parameter Based Methods for Compound Selection from Chemical Databases. In Quantitative Structure-Activity Relationships; CRC Press: Boca Raton, FL, USA, 1996; pp. 285–289. [Google Scholar]
  16. Rubner, Y.; Tomasi, C.; Guibas, L.J. The Earth Mover’s Distance as a Metric for Image Retrieval. Int. J. Comput. Vis. 2000, 40, 99–121. [Google Scholar] [CrossRef]
  17. Covert, I.; Lundberg, S.M.; Lee, S.I. Understanding Global Feature Contributions with Additive Importance Measures. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Curran Associates Inc.: Red Hook, NY, USA; pp. 17212–17223. [Google Scholar]
  18. Fryer, D.; Strümke, I.; Nguyen, H. Shapley Values for Feature Selection: The Good, the Bad, and the Axioms. IEEE Access 2021, 9, 144352–144360. [Google Scholar] [CrossRef]
  19. Rizzo, M.L.; Székely, G.J. Energy Distance. Wiley Interdiscip. Rev. Comput. Stat. 2016, 8, 27–38. [Google Scholar] [CrossRef]
  20. Ramos-Carreño, C.; Torrecilla, J.L. Dcor: Distance Correlation and Energy Statistics in Python. SoftwareX 2023, 22, 101326. [Google Scholar] [CrossRef]
  21. Justel, A.; Peña, D.; Zamar, R. A multivariate Kolmogorov-Smirnov Test of Goodness of Fit. Stat. Probab. Lett. 1997, 35, 251–259. [Google Scholar] [CrossRef]
  22. Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  23. Sugiyama, M.; Suzuki, T.; Kanamori, T. Density Ratio Estimation in Machine Learning; Cambridge University Press: Cambridge, UK, 2012. [Google Scholar]
  24. Murphy, K. Machine Learning: A Probabilistic Perspective; MIT Press: Cambridge, MA, USA, 2012. [Google Scholar]
  25. Menéndez, M.; Pardo, J.; Pardo, L.; Pardo, M. The Jensen-Shannon Divergence. J. Frankl. Inst. 1997, 334, 307–318. [Google Scholar] [CrossRef]
  26. Fuglede, B.; Topsoe, F. Jensen-Shannon Divergence and Hilbert Space Embedding. In Proceedings of the International Symposium on Information Theory, Chicago, IL, USA, 27 June–2 July 2004; p. 31. [Google Scholar]
  27. Belov, D.I.; Armstrong, R.D. Distributions of the Kullback–Leibler Divergence with Applications. Br. J. Math. Stat. Psychol. 2011, 64, 291–309. [Google Scholar] [CrossRef] [PubMed]
  28. Peker, N.; Kubat, C. Application of Chi-Square Discretization Algorithms to Ensemble Classification Methods. Expert Syst. Appl. 2021, 185, 115540. [Google Scholar] [CrossRef]
Figure 1. The process of the proposed sampling method on a numerical dataset.
Figure 2. The process of the proposed sampling method for a categorical dataset.
Table 1. Dataset description.

Dataset | # of Instances | # of Attributes | # of Classes
breastcancer | 569 | N:30 | 2
breastTissue | 106 | N:9 | 6
ecoli | 336 | N:7 | 8
pima_diabetes | 768 | N:8 | 2
seed | 218 | N:8 | 3
balance-scale | 625 | C:4 | 3
congressional_voting_records | 435 | C:16 | 2
Qualitative_Bankruptcy | 250 | C:6 | 2
SPEC_Heart | 267 | C:22 | 2
Vector_Borne_Disease | 263 | C:64 | 11
credit_approval | 690 | N:6, C:9 | 2
Differentiated_Thyroid_Cancer_Recurrence | 383 | N:1, C:15 | 2
Fertility | 100 | N:3, C:6 | 2
Heart_Disease | 303 | N:5, C:8 | 2
Wholesale_customers_data | 440 | N:6, C:1 | 3
Table 2. The parameters for each classifier.

Classifier | Parameter
Decision Tree (DT) | criterion = 'entropy', min_samples_leaf = 2
Random Forest (RF) | n_estimators = 500, max_features = 'sqrt'
K-Nearest Neighbor (KNN) | n_neighbors = 5
Support Vector Machine (SVM) | kernel = 'rbf'
Table 3. Average accuracy (AEV) and standard deviation (SD) for each classifier on numerical datasets.

Dataset | Classifier | AEV | SD
breastcancer | DT | 0.929 | 0.020
 | KNN | 0.969 | 0.013
 | RF | 0.959 | 0.016
 | SVM | 0.976 | 0.012
breastTissue | DT | 0.655 | 0.086
 | KNN | 0.639 | 0.085
 | RF | 0.684 | 0.082
 | SVM | 0.549 | 0.075
ecoli | DT | 0.797 | 0.043
 | KNN | 0.855 | 0.034
 | RF | 0.862 | 0.033
 | SVM | 0.864 | 0.035
pima_diabetes | DT | 0.701 | 0.033
 | KNN | 0.735 | 0.028
 | RF | 0.762 | 0.026
 | SVM | 0.767 | 0.027
seed | DT | 0.907 | 0.039
 | KNN | 0.929 | 0.033
 | RF | 0.924 | 0.036
 | SVM | 0.931 | 0.031
Table 4. The accuracy of different feature-weighted sampling methods (Shapley value).

Dataset | Classifier | FWS (EMD) | Energy Distance | dcor-Hypothesis Testing | K-S Test
breastcancer | DT | 0.935 | 0.945 | 0.930 | 0.935
 | KNN | 0.979 | 0.969 | 0.969 | 0.969
 | RF | 0.972 | 0.979 | 0.981 | 0.974
 | SVM | 0.979 | 0.979 | 0.979 | 0.979
breastTissue | DT | 0.739 | 0.703 | 0.703 | 0.714
 | KNN | 0.679 | 0.679 | 0.643 | 0.714
 | RF | 0.760 | 0.725 | 0.689 | 0.714
 | SVM | 0.669 | 0.561 | 0.633 | 0.597
ecoli | DT | 0.812 | 0.765 | 0.800 | 0.812
 | KNN | 0.824 | 0.847 | 0.835 | 0.824
 | RF | 0.847 | 0.882 | 0.835 | 0.847
 | SVM | 0.835 | 0.882 | 0.871 | 0.835
pima_diabetes | DT | 0.705 | 0.731 | 0.694 | 0.705
 | KNN | 0.741 | 0.705 | 0.679 | 0.736
 | RF | 0.762 | 0.756 | 0.798 | 0.777
 | SVM | 0.772 | 0.782 | 0.808 | 0.803
seed | DT | 0.907 | 0.926 | 0.907 | 0.926
 | KNN | 0.963 | 0.907 | 0.963 | 0.907
 | RF | 0.926 | 0.926 | 0.907 | 0.926
 | SVM | 0.963 | 0.907 | 0.926 | 0.944
Table 5. The accuracy of different feature-weighted sampling methods (ANOVA).

Dataset | Classifier | EMD | Energy Distance | dcor-Hypothesis Testing | K-S Test
breastcancer | DT | 0.951 | 0.902 | 0.937 | 0.923
 | KNN | 0.972 | 0.965 | 0.951 | 0.965
 | RF | 0.951 | 0.958 | 0.965 | 0.958
 | SVM | 0.972 | 0.972 | 0.972 | 0.972
breastTissue | DT | 0.500 | 0.714 | 0.643 | 0.750
 | KNN | 0.607 | 0.643 | 0.607 | 0.714
 | RF | 0.643 | 0.750 | 0.679 | 0.679
 | SVM | 0.536 | 0.536 | 0.500 | 0.571
ecoli | DT | 0.812 | 0.824 | 0.765 | 0.835
 | KNN | 0.823 | 0.894 | 0.882 | 0.847
 | RF | 0.859 | 0.894 | 0.847 | 0.871
 | SVM | 0.835 | 0.894 | 0.859 | 0.824
pima_diabetes | DT | 0.731 | 0.710 | 0.710 | 0.710
 | KNN | 0.756 | 0.782 | 0.725 | 0.720
 | RF | 0.808 | 0.751 | 0.798 | 0.756
 | SVM | 0.803 | 0.767 | 0.782 | 0.762
seed | DT | 0.889 | 0.926 | 0.907 | 0.944
 | KNN | 0.944 | 0.889 | 0.944 | 0.907
 | RF | 0.963 | 0.907 | 0.963 | 0.907
 | SVM | 0.926 | 0.907 | 0.926 | 0.907
Table 6. The MAI of different feature-weighted sampling methods (Shapley value).

Dataset | Classifier | FWS (EMD) | Energy Distance | dcor-Hypothesis Testing | K-S Test
breastcancer | DT | 0.286 | 0.748 | 0.059 | 0.286
 | KNN | 0.918 | 0.139 | 0.139 | 0.139
 | RF | 0.818 | 1.253 | 1.356 | 0.921
 | SVM | 0.289 | 0.289 | 0.282 | 0.282
breastTissue | DT | 0.977 | 0.560 | 0.560 | 0.691
 | KNN | 0.462 | 0.462 | 0.044 | 0.880
 | RF | 0.935 | 0.499 | 0.064 | 0.372
 | SVM | 1.592 | 0.171 | 1.118 | 0.302
ecoli | DT | 0.347 | 0.756 | 0.071 | 0.347
 | KNN | 0.946 | 0.250 | 0.598 | 0.946
 | RF | 0.460 | 0.606 | 0.598 | 0.460
 | SVM | 0.825 | 0.519 | 0.183 | 0.825
pima_diabetes | DT | 0.732 | 1.205 | 0.102 | 0.260
 | KNN | 0.228 | 0.978 | 0.710 | 0.603
 | RF | 0.369 | 0.369 | 1.170 | 1.033
 | SVM | 0.962 | 0.013 | 0.013 | 0.182
seed | DT | 0.020 | 0.490 | 0.020 | 0.490
 | KNN | 1.039 | 0.655 | 1.039 | 0.655
 | RF | 0.047 | 0.047 | 0.463 | 0.047
 | SVM | 1.038 | 0.740 | 0.147 | 0.445
Average MAI | | 0.665 | 0.537 | 0.437 | 0.508
Table 7. The MAI of different feature-weighted sampling methods (ANOVA).

Dataset | Classifier | EMD | Energy Distance | dcor-Hypothesis Testing | K-S Test
breastcancer | DT | 0.286 | 0.631 | 0.403 | 0.286
 | KNN | 0.668 | 0.668 | 1.196 | 0.918
 | RF | 0.487 | 0.052 | 0.383 | 0.487
 | SVM | 0.289 | 0.289 | 0.289 | 0.282
breastTissue | DT | 0.977 | 1.525 | 0.143 | 0.977
 | KNN | 0.374 | 0.044 | 0.374 | 0.462
 | RF | 0.935 | 0.808 | 0.372 | 0.499
 | SVM | 0.171 | 0.171 | 0.645 | 0.645
ecoli | DT | 0.071 | 0.347 | 0.071 | 0.899
 | KNN | 0.946 | 1.144 | 0.796 | 0.250
 | RF | 0.460 | 0.961 | 0.460 | 0.251
 | SVM | 0.825 | 0.855 | 0.153 | 1.161
pima_diabetes | DT | 1.473 | 0.260 | 0.056 | 1.205
 | KNN | 0.148 | 1.728 | 0.603 | 0.148
 | RF | 1.233 | 0.032 | 0.169 | 0.833
 | SVM | 0.989 | 0.013 | 0.767 | 0.989
seed | DT | 0.450 | 0.490 | 0.020 | 0.020
 | KNN | 0.474 | 1.220 | 0.474 | 1.039
 | RF | 1.067 | 0.463 | 1.067 | 0.557
 | SVM | 0.147 | 0.740 | 0.147 | 1.038
Average MAI | | 0.626 | 0.622 | 0.429 | 0.647
Table 8. The execution time (s) of the feature-weighted sampling methods (Shapley value).

Dataset | Classifier | Random Sampling | FWS (EMD) | Energy Distance | dcor-Hypothesis Testing | K-S Test
breastcancer | DT | 1117.94 | 58.32 | 58.07 | 1060.6 | 278.09
 | KNN | 1117.93 | 58.33 | 58.07 | 1060.6 | 278.1
 | RF | 1117.94 | 60.07 | 52.92 | 1062.41 | 278.94
 | SVM | 1117.96 | 60.07 | 59.93 | 1062.43 | 279.85
breastTissue | DT | 10.92 | 6.75 | 360.1 | 17.42 | 10.92
 | KNN | 11.02 | 6.52 | 359.5 | 17.39 | 11.02
 | RF | 10.85 | 6.54 | 359.7 | 17.4 | 10.85
 | SVM | 10.84 | 6.56 | 359.6 | 17.39 | 10.84
ecoli | DT | 9.52 | 8.02 | 166.39 | 17.52 | 9.52
 | KNN | 9.55 | 8.03 | 166.29 | 17.54 | 9.55
 | RF | 9.53 | 8.02 | 166.25 | 17.5 | 9.53
 | SVM | 9.58 | 8.04 | 166.26 | 17.5 | 9.58
pima_diabetes | DT | 529.11 | 21.26 | 21.41 | 147.87 | 103.34
 | KNN | 529.13 | 21.22 | 21.39 | 147.85 | 103.3
 | RF | 529.11 | 21.25 | 21.4 | 147.87 | 103.31
 | SVM | 529.12 | 21.24 | 21.39 | 147.86 | 103.3
seed | DT | 932.83 | 10.67 | 8.52 | 276.44 | 33.33
 | KNN | 932.85 | 10.57 | 8.51 | 276.42 | 33.31
 | RF | 932.85 | 10.64 | 8.52 | 276.44 | 33.3
 | SVM | 932.8 | 10.65 | 8.55 | 276.45 | 33.33
Average Execution Time | | 751.65 | 22.31 | 20.71 | 402.37 | 90.10
Table 9. The execution time (s) of the feature-weighted sampling methods (ANOVA).

Dataset | Classifier | Random Sampling | FWS (EMD) | Energy Distance | dcor-Hypothesis Testing | K-S Test
breastcancer | DT | 1117.94 | 23.72 | 23.46 | 1025.99 | 243.46
 | KNN | 1117.93 | 23.69 | 23.45 | 1025.98 | 243.48
 | RF | 1117.94 | 24.46 | 25.18 | 1027.73 | 245.18
 | SVM | 1117.96 | 23.69 | 23.45 | 1025.96 | 243.48
breastTissue | DT | 641.98 | 6.08 | 1.93 | 345.7 | 12.58
 | KNN | 641.97 | 6.1 | 1.95 | 345.6 | 12.48
 | RF | 641.98 | 6.15 | 1.94 | 345.72 | 12.48
 | SVM | 641.96 | 6.13 | 1.95 | 345.8 | 12.52
ecoli | DT | 536.14 | 6.97 | 5.47 | 163.83 | 14.97
 | KNN | 536.12 | 6.94 | 5.46 | 163.85 | 14.94
 | RF | 536.14 | 6.95 | 5.44 | 163.88 | 14.92
 | SVM | 536.13 | 6.94 | 5.48 | 163.84 | 14.97
pima_diabetes | DT | 529.11 | 12.84 | 12.99 | 139.45 | 94.92
 | KNN | 529.13 | 12.83 | 12.96 | 139.4 | 94.95
 | RF | 529.11 | 12.84 | 12.97 | 139.42 | 94.93
 | SVM | 529.12 | 12.83 | 12.96 | 139.48 | 94.95
seed | DT | 932.83 | 6.47 | 4.31 | 272.2 | 29.35
 | KNN | 932.85 | 6.45 | 4.3 | 272.3 | 29.31
 | RF | 932.85 | 6.44 | 4.32 | 272.23 | 29.33
 | SVM | 932.8 | 6.47 | 4.31 | 272.25 | 29.32
Average Execution Time | | 751.65 | 11.25 | 9.71 | 389.53 | 79.13
Table 10. Average accuracy (AEV) and standard deviation (SD) of each classifier on categorical datasets.

Dataset | Classifier | AEV | SD
balance-scale | DT | 0.745 | 0.034
 | KNN | 0.744 | 0.030
 | RF | 0.845 | 0.026
 | SVM | 0.862 | 0.025
congressional_voting_records | DT | 0.951 | 0.025
 | KNN | 0.922 | 0.032
 | RF | 0.963 | 0.021
 | SVM | 0.963 | 0.022
Qualitative_Bankruptcy | DT | 0.995 | 0.012
 | KNN | 0.996 | 0.008
 | RF | 0.9995 | 0.005
 | SVM | 0.996 | 0.009
SPEC_Heart | DT | 0.747 | 0.049
 | KNN | 0.796 | 0.046
 | RF | 0.824 | 0.040
 | SVM | 0.828 | 0.041
Vector_Borne_Disease | DT | 0.709 | 0.064
 | KNN | 0.673 | 0.057
 | RF | 0.934 | 0.030
 | SVM | 0.914 | 0.037
Table 11. The accuracy of different feature-weighted sampling methods on categorical datasets.

Dataset | Classifier | KLD (P∥Q) | KLD (Q∥P) | JSD | FWS (EMD, Shapley)
balance-scale | DT | 0.741 | 0.722 | 0.734 | 0.747
 | KNN | 0.747 | 0.747 | 0.747 | 0.747
 | RF | 0.823 | 0.848 | 0.829 | 0.816
 | SVM | 0.816 | 0.816 | 0.816 | 0.816
congressional_voting_records | DT | 0.949 | 0.949 | 0.949 | 0.898
 | KNN | 0.915 | 0.915 | 0.915 | 0.949
 | RF | 0.983 | 0.983 | 0.983 | 0.966
 | SVM | 0.983 | 0.983 | 0.983 | 0.966
Qualitative_Bankruptcy | DT | 1.000 | 1.000 | 1.000 | 1.000
 | KNN | 1.000 | 1.000 | 1.000 | 0.984
 | RF | 1.000 | 1.000 | 1.000 | 1.000
 | SVM | 0.984 | 0.984 | 0.984 | 1.000
SPEC_Heart | DT | 0.716 | 0.716 | 0.761 | 0.761
 | KNN | 0.776 | 0.776 | 0.776 | 0.821
 | RF | 0.776 | 0.791 | 0.791 | 0.791
 | SVM | 0.821 | 0.821 | 0.821 | 0.791
Vector_Borne_Disease | DT | 0.761 | 0.761 | 0.776 | 0.761
 | KNN | 0.687 | 0.687 | 0.687 | 0.687
 | RF | 0.940 | 0.955 | 0.940 | 0.955
 | SVM | 0.955 | 0.955 | 0.955 | 0.955
Table 12. The MAI of different feature-weighted sampling methods on categorical datasets.

Dataset | Classifier | KLD (P∥Q) | KLD (Q∥P) | JSD | FWS (EMD, Shapley)
balance-scale | DT | 0.881 | 0.881 | 0.478 | 0.064
 | KNN | 0.795 | 0.795 | 0.795 | 0.098
 | RF | 0.272 | 0.818 | 0.272 | 1.078
 | SVM | 0.673 | 0.673 | 0.673 | 1.825
congressional_voting_records | DT | 0.088 | 0.088 | 0.088 | 2.141
 | KNN | 0.193 | 0.193 | 0.193 | 0.853
 | RF | 0.974 | 0.974 | 0.974 | 0.165
 | SVM | 0.921 | 0.921 | 0.921 | 0.154
Qualitative_Bankruptcy | DT | 0.474 | 0.474 | 0.474 | 0.474
 | KNN | 0.548 | 0.548 | 0.548 | 1.503
 | RF | 0.087 | 0.087 | 0.087 | 0.087
 | SVM | 1.335 | 1.335 | 1.335 | 0.467
SPEC_Heart | DT | 0.626 | 0.626 | 0.286 | 0.286
 | KNN | 0.429 | 0.429 | 0.429 | 0.536
 | RF | 1.193 | 0.817 | 0.817 | 0.817
 | SVM | 0.174 | 0.174 | 0.174 | 0.902
Vector_Borne_Disease | DT | 0.806 | 0.806 | 1.039 | 0.806
 | KNN | 0.234 | 0.234 | 0.234 | 0.234
 | RF | 0.205 | 0.704 | 0.205 | 0.704
 | SVM | 1.121 | 1.121 | 1.121 | 1.121
Average MAI | | 0.601 | 0.635 | 0.557 | 0.716
Table 13. The execution time (s) of the feature-weighted sampling methods on categorical datasets.

Dataset | Classifier | Random Sampling | KLD (P∥Q) | KLD (Q∥P) | JSD | FWS (EMD, Shapley)
balance-scale | DT | 640.33 | 31.8 | 15.38 | 32.49 | 83.9
 | KNN | 640.31 | 31.85 | 15.36 | 32.5 | 83.87
 | RF | 640.3 | 31.87 | 15.36 | 32.48 | 83.8
 | SVM | 640.3 | 31.82 | 15.37 | 32.5 | 83.88
congressional_voting_records | DT | 528.14 | 20.81 | 4.72 | 21.5 | 13.28
 | KNN | 528.13 | 20.8 | 4.7 | 21.53 | 13.26
 | RF | 528 | 20.79 | 4.72 | 21.54 | 13.28
 | SVM | 528.2 | 20.79 | 4.74 | 21.51 | 13.25
Qualitative_Bankruptcy | DT | 799.2 | 19.17 | 3.33 | 17.48 | 11.28
 | KNN | 799.27 | 19.19 | 3.35 | 17.45 | 11.27
 | RF | 799.25 | 19.15 | 3.33 | 17.46 | 11.28
 | SVM | 799.24 | 19.18 | 3.34 | 17.48 | 11.25
SPEC_Heart | DT | 649.3 | 38.03 | 6.01 | 45.12 | 87.5
 | KNN | 649.35 | 38.05 | 6.05 | 45.15 | 87.47
 | RF | 649.32 | 38.04 | 6.04 | 45.12 | 87.51
 | SVM | 649.36 | 38.07 | 6.05 | 45.15 | 87.52
Vector_Borne_Disease | DT | 1072.6 | 106.5 | 9.44 | 96.01 | 208.7
 | KNN | 1072.5 | 106.52 | 9.43 | 96 | 208.5
 | RF | 1072.63 | 106.53 | 9.45 | 96.01 | 208.68
 | SVM | 1072.66 | 106.51 | 9.44 | 96.05 | 208.5
Average Execution Time | | 737.92 | 43.27 | 7.78 | 42.53 | 80.90
Table 14. Average accuracy (AEV) and standard deviation (SD) of each classifier on mixed-type datasets.

Dataset | Classifier | AEV | SD
credit_approval | DT | 0.817 | 0.029
 | KNN | 0.862 | 0.024
 | RF | 0.873 | 0.023
 | SVM | 0.862 | 0.023
Differentiated_Thyroid_Cancer_Recurrence | DT | 0.942 | 0.022
 | KNN | 0.922 | 0.027
 | RF | 0.960 | 0.017
 | SVM | 0.955 | 0.019
Fertility | DT | 0.846 | 0.064
 | KNN | 0.853 | 0.055
 | RF | 0.869 | 0.057
 | SVM | 0.881 | 0.057
Heart_Disease | DT | 0.749 | 0.048
 | KNN | 0.829 | 0.038
 | RF | 0.822 | 0.038
 | SVM | 0.839 | 0.037
Wholesale_customers_data | DT | 0.524 | 0.045
 | KNN | 0.635 | 0.037
 | RF | 0.709 | 0.036
 | SVM | 0.718 | 0.037
Table 15. MAI and test accuracy for mixed-type datasets.

Dataset | Classifier | Our Hybrid Method (Accuracy) | dcor-Hypothesis Testing + ANOVA (Accuracy) | Our Hybrid Method (MAI) | dcor-Hypothesis Testing + ANOVA (MAI)
credit_approval | DT | 0.793 | 0.768 | 0.828 | 1.669
 | KNN | 0.896 | 0.896 | 1.414 | 1.414
 | RF | 0.854 | 0.866 | 0.867 | 0.329
 | SVM | 0.854 | 0.854 | 0.376 | 0.376
Differentiated_Thyroid_Cancer_Recurrence | DT | 0.938 | 0.938 | 0.189 | 0.189
 | KNN | 0.938 | 0.938 | 0.596 | 0.596
 | RF | 0.948 | 0.948 | 0.699 | 0.699
 | SVM | 0.938 | 0.938 | 0.910 | 0.910
Fertility | DT | 0.885 | 0.885 | 0.610 | 0.610
 | KNN | 0.808 | 0.808 | 0.820 | 0.820
 | RF | 0.846 | 0.846 | 0.398 | 0.398
 | SVM | 0.885 | 0.885 | 0.059 | 0.059
Heart_Disease | DT | 0.792 | 0.805 | 0.903 | 1.172
 | KNN | 0.870 | 0.870 | 1.085 | 1.085
 | RF | 0.857 | 0.857 | 0.918 | 0.918
 | SVM | 0.857 | 0.857 | 0.486 | 0.486
Wholesale_customers_data | DT | 0.495 | 0.514 | 0.633 | 0.228
 | KNN | 0.676 | 0.676 | 1.105 | 1.105
 | RF | 0.712 | 0.703 | 0.096 | 0.152
 | SVM | 0.739 | 0.739 | 0.558 | 0.558
Average MAI | | | | 0.678 | 0.689