Article

An Effective Hybrid Sampling Strategy for Single-Split Evaluation of Classifiers †

Department of Computer Science and Information Engineering, Ming Chuan University, Gwei Shan District, Taoyuan 333, Taiwan
* Author to whom correspondence should be addressed.
This article is a revised and expanded version of our conference paper listed below. Lee, Y.S.; Yen, S.J.; Tang, Y.J. Improved Sampling Methods for Evaluation of Classification Performance. In Proceedings of the 7th International Conference on Artificial Intelligence in Information and Communication, Fukuoka, Japan, 18–21 February 2025; pp. 378–382.
Electronics 2025, 14(14), 2876; https://doi.org/10.3390/electronics14142876
Submission received: 31 May 2025 / Revised: 8 July 2025 / Accepted: 15 July 2025 / Published: 18 July 2025
(This article belongs to the Special Issue Data Retrieval and Data Mining)

Abstract

Evaluating the classification accuracy of machine learning models typically involves multiple rounds of random training/test splits, model retraining, and performance averaging. However, this conventional approach is computationally expensive and time-consuming, especially for large datasets or complex models. To address this issue, we propose an effective sampling approach that selects a single training/test split whose evaluation closely approximates the results obtained from repeated random sampling. Our approach ensures that the sampled data closely reflect the classification behavior of the original dataset. Our methods integrate advanced distribution distance metrics and feature weighting techniques tailored to numerical, categorical, and mixed-type datasets. The experimental results demonstrate that our method achieves over 95% agreement with multi-run average accuracy while reducing computational overhead by more than 90%. This approach offers a scalable, resource-efficient alternative for reliable model evaluation, particularly valuable in time-critical or resource-constrained applications.

1. Introduction

In the process of constructing a classification model [1,2,3], the data sampling method plays a crucial role. The model’s learning capability is highly dependent on the quality of the sampled training set, while its accuracy is determined by the test set. If the sampling method results in a poor-quality training set for model construction and an overly simple test set, the model may achieve artificially high accuracy, leading to a misjudgment of its actual performance.
Therefore, during the data sampling phase, it is essential to carefully select samples to ensure their representativeness. This requires adopting appropriate sampling strategies and ensuring diversity within the sample to accurately reflect the overall data distribution. One of the most commonly used sampling methods is random sampling, which is both simple and efficient. In this approach, each data point has an equal probability of being selected, making the sampled dataset a reasonable representation of the entire population. Due to its ease of implementation, random sampling is frequently used in practical applications.
Random sampling has the drawback of generating different training and test sets in each iteration, leading to variations in accuracy results. A single instance of random sampling therefore cannot fully reflect the model's performance, making it unreliable to evaluate a classification model with only one training/test split. To achieve a more accurate assessment, multiple random samplings are typically required, and the average accuracy across all experiments is used as a measure of overall performance. However, this approach necessitates training multiple models, which is computationally expensive and impractical for real-world applications. Cross-validation mitigates the limitations of single random sampling: by systematically rotating different subsets of data into the training and test sets and averaging the results, it provides a more comprehensive evaluation of model performance. Nevertheless, like repeated random sampling, it still requires building multiple models. Given these limitations, improving sampling strategies and efficiently selecting appropriate training and test sets have become crucial research topics in the field [4,5,6,7].
The random sampling method presents several challenges in constructing classification models, particularly in terms of efficiency, time consumption, and computing resource demands for model evaluation. These factors significantly limit its practicality in real-world applications. To address these issues, Shin and Oh [8] proposed an improved sampling method, FWS, based on the approach introduced by Kang and Oh [9]. Their method performs multiple random samplings to generate multiple candidate training/test sets. For each candidate set, the feature-weighted distance between it and the original dataset is calculated to assess its suitability, and a training/test set is then selected from these candidates to support reliable model evaluation. However, this method still faces difficulties in selecting the most appropriate training and test sets, requiring further refinement.
To improve the accuracy and reliability of model assessment, our earlier work [10] employed different techniques for measuring data distribution similarity and calculating feature importance; however, those techniques can only handle datasets with numerical attributes. This article proposes a sampling approach that also handles categorical attributes as well as mixed-type datasets. Additionally, we compare the training/test sets chosen by various sampling methods and evaluate their ability to approximate the average accuracy obtained from multiple random samplings.
The main contributions of this article are summarized as follows:
1. Efficient Single-Split Evaluation Framework: We propose a novel sampling method that eliminates the need for repeated training/test splits and model retraining by selecting a single representative data split that closely approximates the average classification accuracy of multiple evaluations, significantly reducing time and computational cost.
2. Distribution-Aware Sampling with Feature Weighting: Our method introduces advanced techniques for measuring data distribution similarity and incorporates feature importance into the similarity calculation, ensuring that selected subsets accurately reflect the statistical structure of the original dataset.
3. Support for Mixed-Type Data and Comparative Analysis: Unlike existing methods limited to numerical attributes, our approach effectively handles both categorical and mixed-type datasets. We also conduct a comparative evaluation of different sampling strategies, demonstrating the proposed method's ability to approximate multi-run accuracy outcomes more reliably.

2. Related Work

D. Kang and S. Oh introduced the R-value-based sampling (RBS) method [11] to enhance the evaluation of classification models. For each data point p, the R-value is calculated by identifying its k nearest neighbors and counting how many of them belong to a different class than p; the R-value therefore ranges from 0 to k and indicates the degree of classification difficulty associated with that point. The RBS method groups data points by their R-values and applies stratified sampling [12] within each group. The sampled training and test subsets from all groups are then combined to form the final training/test set. By using the R-value as a quantitative indicator of classification difficulty, this method ensures that the training and test sets are constructed with comparable levels of classification complexity. Experimental results [11] demonstrate that, compared to random sampling [13], D-optimal sampling [14], and MDC methods [15], the RBS method generates training/test sets that better align with the actual performance of classification models, leading to more reliable model evaluation.
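As a concrete illustration of the R-value computation described above, the following is a minimal sketch using scikit-learn's NearestNeighbors; the function name and the default choice of k are ours, not taken from [11], and X and y are assumed to be a feature matrix and a NumPy array of class labels.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def r_values(X, y, k=7):
    """R-value of each point: how many of its k nearest neighbors carry a different class label."""
    # Query k + 1 neighbors because each point is returned as its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neighbor_labels = y[idx[:, 1:]]              # drop the self-neighbor in column 0
    return (neighbor_labels != y[:, None]).sum(axis=1)

# RBS then groups points by R-value (0..k) and samples within each group, e.g.:
# groups = {r: np.where(r_values(X, y) == r)[0] for r in range(8)}
```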
Although the R-value-based sampling (RBS) method accounts for class overlap among data points, it does not consider the overall distribution distance or the importance of individual features. To address these limitations, Shin and Oh proposed the feature-weighted sampling (FWS) method [8], an improved version of RBS designed to generate multiple candidate training/test sets and select the most suitable one. One limitation of the original RBS method [11] is its reliance on stratified sampling, which often results in the repeated selection of the same training/test sets. To introduce more diversity, FWS replaces stratified sampling with random sampling, ensuring greater variability in the generated datasets. Additionally, FWS transforms the original dataset into histograms, applying the same transformation to each candidate training/test set. To evaluate which candidate set most accurately represents the overall population distribution, Earth Mover’s Distance (EMD) [16] is used as a similarity measure. This approach allows for a more precise selection of training/test sets, addressing the shortcomings of RBS and improving the robustness of model evaluation.
Earth Mover’s Distance (EMD) [16] is a metric used to measure the similarity between distributions by calculating the minimum amount of effort required to transform one distribution into another. This transformation is conceptualized as moving units of “soil” from one distribution to match the target distribution. Given two distributions—P (representing the training/test dataset) and Q (representing the original dataset)—the EMD between P and Q is determined by the optimal transport distance and the corresponding amount of data that needs to be moved. During model training, the contribution of different features significantly impacts overall performance.
To account for this, feature-weighted sampling (FWS) incorporates feature importance into the distance calculation. Specifically, it utilizes Shapley values [17] to quantify each feature’s contribution to the model. Shapley values, originating from cooperative game theory, are widely applied in machine learning to fairly assess the contribution of each feature to a model’s predictions [18]. In the context of game theory, the Shapley value provides a principled approach for distributing value among multiple players. When adapted to machine learning, it allocates the predictive influence among input features in an equitable manner. Despite its effectiveness, a notable limitation of the Shapley value is its reliance on numerical inputs, which restricts its direct applicability to categorical variables. This presents challenges when analyzing datasets that contain both numerical and categorical features.
After computing the Shapley value to determine the weight w(f_i) for each numerical feature f_i, and calculating the Earth Mover's Distance (EMD) of each feature for each candidate training/test set, the overall feature-weighted distance d is derived using Equation (1). In this equation, d_train(f_i) denotes the EMD of feature f_i between the training set and the original dataset, while d_test(f_i) denotes the EMD of feature f_i between the test set and the original dataset. Once the feature-weighted distances of all candidate training/test sets are computed, the FWS method selects the training/test set with the smallest feature-weighted distance, as it best preserves the overall data distribution and feature importance, ensuring a proper model evaluation.
d = \sum_{i=1}^{n} w(f_i) \times \left( d_{\mathrm{train}}(f_i) + d_{\mathrm{test}}(f_i) \right)    (1)
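Equation (1) reduces to a simple weighted sum once the per-feature distances and weights are available; the helper below is a minimal sketch of our own (not code from [8]), where all three arguments map a feature name to a value.

```python
def feature_weighted_distance(weights, d_train, d_test):
    """Equation (1): sum of w(f_i) * (d_train(f_i) + d_test(f_i)) over all features."""
    return sum(weights[f] * (d_train[f] + d_test[f]) for f in weights)
```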
In summary, existing methods either ignore the overall distribution distance and feature importance, as in RBS [11], or are limited to numerical data, as in FWS [8]. To address these limitations, this article introduces a set of feature-weighted sampling methods that incorporate various distribution distance metrics and feature weighting techniques to overcome the shortcomings of previous approaches.

3. Our Approach

In this section, we introduce our proposed sampling framework designed to improve the accuracy and reliability of classification performance evaluation on mixed-type datasets that contain both numerical and categorical attributes. We begin by enhancing the FWS method [8], which was limited to datasets composed solely of numerical attributes. To address this limitation, we introduce a series of enhancements that enable the framework to effectively process categorical attributes, ensuring broader applicability in real-world scenarios. Based on experimental findings, we ultimately propose an integrated sampling strategy that accommodates both numerical and categorical features. The subsequent subsections detail the sampling procedures for different attribute types.

3.1. The Sampling Methods on the Dataset with Numerical Attributes

In this subsection, we describe how our method conducts sampling on the dataset with only numerical attributes [10], which generates the most suitable training and test sets for accurately evaluating classification performance. Our approach enhances the FWS method [8] by refining data distribution distance measurements and feature importance calculations.
Our method consists of two phases. In the first phase, we apply a modified RBS [8] to generate 1000 training/test candidate sets. The choice of 1000 candidate sets was made empirically as a trade-off between computational feasibility and performance stability. In the second phase, we evaluate these candidate sets and select the one with the smallest feature-weighted distribution distance, as defined in Equation (1). To ensure the most suitable training and test set selection, our approach improves the FWS method by refining the feature-weight calculations and the data distribution distance assessment. The overall architecture of our approach is illustrated in Figure 1 [10].
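The two-phase procedure can be summarized as the selection loop below. Here, `generate_candidate_split` stands in for the modified RBS generator and `per_feature_distance` for whichever distribution distance is plugged in; both names, and the assumption that the dataset is column-indexable (e.g., a pandas DataFrame), are ours rather than part of the original method description.

```python
def select_best_split(dataset, weights, generate_candidate_split, per_feature_distance,
                      n_candidates=1000):
    """Phase 1: generate candidate train/test splits.
    Phase 2: keep the split whose feature-weighted distance to the full dataset,
    computed as in Equation (1), is smallest."""
    best_split, best_dist = None, float("inf")
    for _ in range(n_candidates):
        train, test = generate_candidate_split(dataset)
        dist = sum(weights[f] * (per_feature_distance(dataset[f], train[f]) +
                                 per_feature_distance(dataset[f], test[f]))
                   for f in weights)
        if dist < best_dist:
            best_split, best_dist = (train, test), dist
    return best_split
```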
In calculating distribution similarity, FWS employs the Earth Mover’s Distance (EMD) [16] to measure the distance between the population data and the training/test datasets. However, the computation of EMD requires data discretization, which can lead to information loss and potentially impact the accuracy of the results. To address this limitation, our method utilizes three alternative approaches for measuring distributional distance: Energy Distance [19], distance correlation hypothesis testing (dcor) [20], and the Kolmogorov–Smirnov (K-S) test [21].
Energy Distance, inspired by Newton’s concept of gravitational energy, serves as a versatile statistical tool. It can be applied in various contexts such as testing statistical independence through distance covariance, assessing goodness-of-fit, performing non-parametric tests for distribution equality, extending Analysis of Variance (ANOVA), identifying change points, conducting feature selection, and more [19]. This metric is particularly useful for detecting similarities between two probability distributions. The formal definition of Energy Distance is provided in Equation (2), where X and Y represent two distributions, and NX and NY denote the number of samples from each, respectively. One of the key strengths of Energy Distance lies in its sensitivity to both the shape and location of distributions, making it especially effective for testing distributional homogeneity. However, a notable drawback is that it requires computing pairwise distances between all sample values, which can become computationally expensive for large datasets. Additionally, the Energy Distance does not have a standardized scale.
\epsilon_{N_X, N_Y}(X, Y) = \frac{2}{N_X N_Y} \sum_{i=1}^{N_X} \sum_{j=1}^{N_Y} \| x_i - y_j \| - \frac{1}{N_X^2} \sum_{i=1}^{N_X} \sum_{j=1}^{N_X} \| x_i - x_j \| - \frac{1}{N_Y^2} \sum_{i=1}^{N_Y} \sum_{j=1}^{N_Y} \| y_i - y_j \|    (2)
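For one-dimensional features, Equation (2) can be evaluated directly with NumPy; the sketch below is our own implementation and also makes the quadratic pairwise cost mentioned above explicit.

```python
import numpy as np

def energy_distance(x, y):
    """Equation (2) for two one-dimensional samples x and y."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    a = np.abs(x[:, None] - y[None, :]).mean()   # average cross-sample distance
    b = np.abs(x[:, None] - x[None, :]).mean()   # average within-sample distance of x
    c = np.abs(y[:, None] - y[None, :]).mean()   # average within-sample distance of y
    return 2 * a - b - c
```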
The value derived from the Energy Distance is not confined to a fixed range, which can result in it disproportionately influencing the overall weighted distance calculation during sample selection. To address this, Carreño and Torrecilla [20] introduced an implementation of energy-based metrics in the Python dcor package (version 0.6). This package supports computations such as distance covariance, distance correlation, partial distance correlation, and hypothesis testing. In our approach, we also utilize dcor to assess the similarity between different data distributions.
The dcor-Hypothesis test accounts for differences in sample sizes to maintain comparability across datasets. It estimates the p-value through permutation testing by repeatedly shuffling the samples and computing the energy statistic, which measures how often the permuted statistic exceeds the original, indicating the degree of similarity or dependence. This test is suitable for multi-dimensional data and is sensitive to variations in distribution shape, location, and scale. The dcor-Hypothesis test is defined in Equation (3), in which N_X and N_Y are the sample sizes of datasets X and Y, and \epsilon_{N_X, N_Y}(X, Y) is a measure of distance-based dependence between X and Y; in practice, this is often a form of energy statistic or distance correlation value. The scaling factor N_X N_Y / (N_X + N_Y) adjusts for unequal sample sizes, ensuring the test statistic T is comparable across datasets. Therefore, the statistic T captures how strongly the datasets X and Y are related based on their mutual spatial (distance) structure. Unlike Energy Distance, the dcor-Hypothesis test outputs values within a fixed range of 0 to 1, providing a normalized measure of similarity. However, this method is computationally intensive, especially for large datasets, due to the repeated permutations involved in the testing process.
T = \frac{N_X N_Y}{N_X + N_Y} \, \epsilon_{N_X, N_Y}(X, Y)    (3)
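A minimal sketch of the statistic in Equation (3), built on the dcor package's energy_distance function; the commented energy_test call reflects our reading of the dcor 0.6 interface for the permutation test and should be checked against the package documentation.

```python
import dcor  # pip install dcor

def dcor_t_statistic(x, y):
    """Equation (3): two-sample energy statistic scaled by N_X * N_Y / (N_X + N_Y)."""
    n_x, n_y = len(x), len(y)
    return (n_x * n_y / (n_x + n_y)) * dcor.energy_distance(x, y)

# A permutation p-value, as described above, is provided by the package via
# dcor.homogeneity.energy_test(x, y, num_resamples=200)
```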
The third distribution similarity calculation method employed to measure the distribution distance between the population dataset and the training/test sets is the Kolmogorov–Smirnov (K-S) test [21], which is a non-parametric statistical method used to compare two probability distributions (two-sample test), or a sample with a reference distribution (one-sample test). In the context of distribution similarity, it helps determine whether two samples, such as a population dataset and a training or test dataset, are drawn from the same distribution.
In multivariate statistics, the K-S test is a widely adopted method for assessing differences in data distributions [21]. It quantifies the similarity between two distributions by evaluating the maximum difference between their cumulative distribution functions, as illustrated in Equation (4). This maximum deviation is taken as the measure of distance between the distributions. Here, F(P) and F(Q) represent the cumulative distribution functions of distributions P and Q, respectively.
D = \max \left| F(P) - F(Q) \right|    (4)
Compared to the Earth Mover’s Distance [16], the K-S test is more intuitive and easier to apply, especially for one-dimensional data, where it is both simple and computationally efficient. Its output ranges consistently between 0 and 1. However, when applied to multi-dimensional data, it requires careful consideration and validation to ensure accuracy.
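For a single numerical feature, the statistic D of Equation (4) is directly available from SciPy's two-sample K-S test; the wrapper below is a sketch of how we would apply it per feature.

```python
from scipy.stats import ks_2samp

def ks_distance(population_values, sample_values):
    """Equation (4): maximum gap between the two empirical CDFs (between 0 and 1)."""
    return ks_2samp(population_values, sample_values).statistic
```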
For feature importance estimation, the Shapley value used in the FWS method [8] considers all possible feature orderings, which leads to excessive computation time and makes it impractical for real-world applications. Additionally, Shapley values can be negative, which may reduce the calculated distance and lead to inaccurate results. To address these issues, our approach adopts Analysis of Variance (ANOVA) to assess the relationship between each feature and the target variable, and uses the resulting p-value as a measure of importance. In statistical analysis, the p-value indicates whether the null hypothesis can be rejected; a p-value below a predefined significance level (commonly 0.05) suggests that the feature has a statistically significant effect on the target variable. Therefore, we treat 1 − p-value as an indicator of feature importance.
F = \frac{MSB}{MSW}    (5)
MSB = \frac{\sum_{i=1}^{k} n_i (\bar{X}_i - \bar{X})^2}{k - 1}    (6)
MSW = \frac{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2}{N - k}    (7)
The F-statistic and p-value are commonly used as the basis for hypothesis testing in ANOVA, as shown in Equation (5). The F-statistic is the ratio of the mean square between groups (MSB), which measures the variation between the group (sample) means, to the mean square within groups (MSW), which measures the variation within each group. MSB and MSW are given in Equations (6) and (7), respectively, in which N and k denote the total number of observations and the total number of groups, n_i denotes the number of observations in the i-th group, \bar{X}_i and \bar{X} represent the mean of the i-th group and the grand mean over all observations, respectively, and X_{ij} is the j-th observation in the i-th group. Therefore, (\bar{X}_i - \bar{X})^2 is the squared difference between a group mean and the grand mean, which reflects how much group i deviates from the overall mean, and (X_{ij} - \bar{X}_i)^2 is the squared difference between an individual observation and its group mean, which reflects how much individual values deviate from their own group mean.
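A minimal sketch of the ANOVA-based feature weight, using SciPy's one-way ANOVA; `feature` and `labels` are assumed to be NumPy arrays of equal length, and the 1 − p weighting follows the description above.

```python
import numpy as np
from scipy.stats import f_oneway

def anova_feature_weight(feature, labels):
    """Weight a numerical feature by 1 - p, where p is the one-way ANOVA p-value
    of the feature values grouped by class label (Equations (5)-(7))."""
    groups = [feature[labels == c] for c in np.unique(labels)]
    _, p_value = f_oneway(*groups)
    return 1.0 - p_value
```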

3.2. The Sampling Methods on the Dataset with Categorical Attributes

In this subsection, we describe our proposed feature-weighted sampling methods for a categorical dataset. The framework of our sampling method for a categorical dataset is illustrated in Figure 2. The process is also divided into two phases. In the first phase, the modified RBS method [8] is used to generate 1000 training/test sets, which serve as candidate sets. In the second phase, each candidate set is evaluated by calculating the feature-weighted distribution distance between the population dataset and the candidate training/test sets using Equation (1), and the one with the smallest feature-weighted distribution distance is selected. The distribution distance between the population dataset and the training/test dataset is measured using Kullback–Leibler divergence (KLD) [22,23,24] and Jensen–Shannon divergence (JSD) [24,25,26]. Feature importance is assessed using the Chi-square (χ²) test.
We adopt Kullback–Leibler divergence (KLD) and Jensen–Shannon divergence (JSD) as methods for calculating distribution distance. The KL divergence is a non-symmetric measure of how one probability distribution Q diverges from a second, reference probability distribution P. It is often used to measure the difference between two probability distributions, especially in the context of machine learning and statistical modeling. The KL divergence is defined in Equation (8), where P and Q represent two different distributions. Specifically, p(x) is the probability of event x under distribution P, and q(x) is the probability of the same event x under distribution Q. The term p(x) serves as a weighting factor, emphasizing events with higher occurrence probability and greater influence on the overall distribution. KL divergence quantifies how much “extra information” is needed to represent events from P using the distribution Q. A higher value indicates greater divergence. The Kullback–Leibler divergence (KLD) can be used to measure the degree of dissimilarity between two probability distributions. When the two distributions are identical, the KLD is zero. It is also asymmetric, as indicated in Equation (9), and can be regarded as a form of information measure [27].
D_{KL}(P \| Q) = \sum_{x} p(x) \ln \frac{p(x)}{q(x)}    (8)
D_{KL}(P \| Q) \neq D_{KL}(Q \| P)    (9)
Due to the asymmetry of KL divergence, it is necessary to compute the divergence in both directions: one using the population data (P) as the reference distribution, and the other using the training data (Q) as the reference distribution. This double computation gives a more complete understanding of the dissimilarity between two distributions. If the difference between the two distributions is substantial, the KL divergence can grow without bound and approach infinity.
To address KL divergence’s asymmetry and unboundedness, our method also uses Jensen–Shannon divergence, a symmetrized and smoothed version of KL divergence. JS divergence [25] can be used for goodness-of-fit testing and resolves the asymmetry problem of KL divergence. Equation (10) gives the definition of JS divergence, where P and Q represent two different distributions; its value ranges from 0 to 1. JS divergence offers a more robust and interpretable measure of similarity between distributions. Both JS divergence and KL divergence require estimating the probability distributions associated with the data.
JSD(P \| Q) = \frac{1}{2} D_{KL}\!\left(P \,\Big\|\, \frac{P+Q}{2}\right) + \frac{1}{2} D_{KL}\!\left(Q \,\Big\|\, \frac{P+Q}{2}\right)    (10)
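For a categorical feature, both divergences operate on the per-category relative frequencies; the sketch below uses SciPy's entropy function (which computes the KL divergence when given two distributions) and adds a small smoothing constant of our choosing to keep Equation (8) finite when a category is missing from one sample.

```python
import numpy as np
from scipy.stats import entropy

def category_probs(values, categories, eps=1e-9):
    """Smoothed relative frequency of each category in a sample."""
    counts = np.array([(values == c).sum() for c in categories], dtype=float) + eps
    return counts / counts.sum()

def kld(p, q):
    """Equation (8): Kullback-Leibler divergence D_KL(P || Q)."""
    return entropy(p, q)

def jsd(p, q):
    """Equation (10): Jensen-Shannon divergence of P and Q."""
    m = 0.5 * (p + q)
    return 0.5 * entropy(p, m) + 0.5 * entropy(q, m)
```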
For the feature importance, we adopt the Chi-square test [28] as the basis for calculating the feature importance of a categorical dataset. The Chi-square test is a commonly used statistical method for significance testing, which assesses the relationship between the feature variable and the target variable. The Chi-square statistic value for each feature reflects its relevance or importance in predicting the target. Higher Chi-square scores indicate a stronger association and hence greater feature importance. These importance scores are then used as feature weights in the divergence-based sampling method to ensure that important features have more influence on divergence calculations, improving the quality of training/test data selection.
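A minimal sketch of the Chi-square feature weight, computed from the feature-by-class contingency table with SciPy; the helper name and the use of pandas crosstab are our own choices, not prescribed by the paper.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_feature_weight(feature, labels):
    """Chi-square statistic of a categorical feature against the class label;
    a higher score means a stronger association and thus a larger weight."""
    table = pd.crosstab(feature, labels)
    statistic, _, _, _ = chi2_contingency(table)
    return statistic
```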

4. Experimental Results

In this section, we first describe the datasets used to perform the experiments and introduce the evaluation metric; we then apply four classification algorithms, Decision Tree (DT), Random Forest (RF), K-Nearest Neighbor (KNN), and Support Vector Machine (SVM), to build the classifiers. Finally, we assess the reliability of the various sampling methods by analyzing the classification accuracy on the test datasets, using the classifiers trained on the training sets derived from each sampling strategy.

4.1. Datasets and Classifiers

The datasets used in our experiments can be obtained from the UCI repository (https://archive.ics.uci.edu/; accessed on 1 July 2025) and Kaggle (https://www.kaggle.com/; accessed on 1 July 2025), and all are used for classification tasks. We use five datasets with only numerical attributes, five datasets with only categorical attributes, and five datasets with hybrid attribute types. Table 1 describes the datasets used in the experiments; for each dataset, three fields are reported: the number (#) of instances, the number (#) of attributes, and the number (#) of classes. In the # of attributes field, N denotes numerical attributes and C denotes categorical attributes; for example, N:6, C:9 means the dataset has six numerical attributes and nine categorical attributes.
Table 2 shows the parameters of the four algorithms used to build the classifiers. For the Decision Tree (DT) classifier, we set the splitting criterion to ‘entropy’, enabling the use of information gain for more informative splits, and specify min_samples_leaf = 2 to prevent overfitting by ensuring each leaf node contains at least two samples. The Random Forest (RF) classifier is configured with n_estimators = 500, providing robustness through ensemble learning with a large number of trees, while max_features = ‘sqrt’ limits the number of features considered at each split to the square root of the total, enhancing diversity and reducing variance. The K-Nearest Neighbor (KNN) classifier uses n_neighbors = 5, meaning classification is based on the majority vote among the five closest training samples, balancing bias and variance. Lastly, the Support Vector Machine (SVM) utilizes the ‘rbf’ kernel, allowing it to model nonlinear decision boundaries by projecting data into a higher-dimensional space. These parameter choices are selected based on common best practices and prior studies, aiming to ensure fair and effective comparisons.
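The parameter names in Table 2 correspond to scikit-learn's estimators, so the four classifiers can be configured as follows; the use of scikit-learn is our assumption, as the paper does not name the library explicitly.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Classifier settings from Table 2.
classifiers = {
    "DT":  DecisionTreeClassifier(criterion="entropy", min_samples_leaf=2),
    "RF":  RandomForestClassifier(n_estimators=500, max_features="sqrt"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf"),
}
```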

4.2. Evaluation Metric MAI

To assess whether the training/test set generated by a sampling method meets the expected outcomes, a reliable evaluation metric is required. During the sampling process, it is challenging to determine in advance whether the selected training/test sets will produce the desired results. To address this, one common strategy is to use the accuracy obtained from random sampling as a baseline for comparison. However, relying on the accuracy of a single random sample can be misleading due to high variability across different samples. A more robust approach involves computing the average accuracy over multiple random samplings, providing a more stable and reliable reference metric. This average is referred to as the Absolute Evaluation Value (AEV) [9], as defined in Equation (11). In this context, test_acci denotes the accuracy of the i-th random sampling, m represents the total number of random samplings, and AEV is the mean accuracy computed across all m samplings.
AEV = \frac{\sum_{i=1}^{m} \mathrm{test\_acc}_i}{m}    (11)
Once the Absolute Evaluation Value (AEV) is determined, the Mean Accuracy Indicator (MAI) is used to evaluate whether a training/test set is appropriate [9]. As shown in Equation (12), MAI is the absolute difference between the classification accuracy (ACC) obtained by a given sampling method and the average accuracy of multiple random samplings (AEV), divided by the standard deviation (SD) of the accuracies over those samplings. Intuitively, a lower MAI value indicates a smaller discrepancy between ACC and AEV, suggesting that the sampling method produces results closely aligned with the average performance of multiple random samplings, thereby demonstrating its effectiveness.
MAI = \frac{\left| ACC - AEV \right|}{SD}    (12)
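Equations (11) and (12) translate directly into code; a minimal sketch, where `accuracies` is the list of test accuracies from the m random samplings and `acc` is the accuracy of the split chosen by a sampling method.

```python
import numpy as np

def aev_and_sd(accuracies):
    """Equation (11): mean test accuracy over m random samplings, plus its standard deviation."""
    accuracies = np.asarray(accuracies, dtype=float)
    return accuracies.mean(), accuracies.std()

def mai(acc, aev, sd):
    """Equation (12): deviation of a single split's accuracy from the multi-run mean,
    measured in units of the multi-run standard deviation."""
    return abs(acc - aev) / sd
```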

4.3. Experimental Results on Numerical Datasets

We first conducted 1000 random samplings on the five datasets with only numerical attributes in Table 1 to generate 1000 training/test sets. Using four different classification algorithms, we built classification models and recorded the test-set accuracy for each sampling. We then calculated the average accuracy and standard deviation of the test accuracies across the 1000 classifiers built from the random samplings, as shown in Table 3.
We employ the modified RBS sampling method [8] to generate 1000 candidate training/test sets for each of the five numerical datasets. For each candidate set, we use the four distribution distance calculation methods, EMD, Energy Distance, dcor-Hypothesis testing, and the K-S test, and apply the Shapley value and ANOVA to calculate feature weights, respectively; the combination of EMD and the Shapley value corresponds to the FWS method [8]. Subsequently, we calculate the feature-weighted distance between each candidate training/test set and the original dataset using Equation (1), and the training/test set exhibiting the smallest feature-weighted distance is selected. The classification accuracy (ACC) on the test set for the four classifiers built from the training set of the selected split is presented in Table 4 and Table 5, in which the Shapley value and ANOVA are used to calculate the feature weights, respectively. Based on Table 3, Table 4, and Table 5, the evaluation metric MAI for the sampling methods based on the four distribution distance calculation methods, with the Shapley value and ANOVA as the feature-weight computation, is shown in Table 6 and Table 7, respectively.
According to the results in Table 6 and Table 7, the dcor-Hypothesis testing method achieves the lowest average MAI across all the sampling methods. This indicates that it is the most effective approach for selecting training and test sets that closely match the original data distribution. The reason for its superior performance lies in its two key components: (1) distance correlation, which captures both linear and nonlinear dependencies between variables, and (2) hypothesis testing, which filters out statistically insignificant relationships, reducing the influence of noise or irrelevant features. When combined with ANOVA-based feature weighting, which assigns higher importance to features with greater between-class variance, the method ensures that the feature-weighted distance reflects the most discriminative aspects of the data. As a result, the training and test sets selected using this method are more representative of the original dataset in terms of both overall structure and feature relevance. Therefore, dcor-Hypothesis testing alongside ANOVA for feature-weighted calculation yields the smallest average MAI values.
Table 8 and Table 9 report the execution times of each sampling method for different classifiers on the numerical datasets, with the feature weights calculated by the Shapley value and ANOVA, respectively. From these experiments, we can see that the execution time required for randomly sampling 1000 training/test sets and computing their test accuracies is significantly longer than that for generating 1000 candidate sets and selecting the one with the smallest feature-weighted distance. In contrast, the execution times of the different feature-weighted sampling methods are relatively similar, except for dcor-Hypothesis testing.

4.4. Experimental Results on Categorical Datasets

We also performed 1000 random samplings on five datasets with only categorical attributes in Table 1, resulting in 1000 distinct training/test set pairs. For each of these pairs, we applied four different classification algorithms to construct corresponding classification models and recorded the accuracy on the test dataset. Subsequently, we calculated the average test accuracy and its standard deviation across the 1000 classification models generated from the random samplings. The results are presented in Table 10.
Table 11 shows the classification accuracy on the test dataset obtained from the four classifiers on the five categorical datasets. For each dataset, candidate training/test sets were evaluated by computing their distribution distances from the original dataset using Kullback–Leibler divergence (KLD), Jensen–Shannon divergence (JSD), and Earth Mover's Distance, incorporating feature importance calculated by the Chi-square test [28] to obtain feature-weighted distances. The training/test set with the smallest distribution distance to the original set was selected. For KLD, two variants were considered: one treating the original dataset as the reference distribution (P∥Q) and the other treating the candidate training/test set as the reference distribution (Q∥P). In addition, in order to run the FWS method on these datasets, we flattened the categorical attribute values into numerical attributes by one-hot encoding and used Earth Mover's Distance (EMD) and Shapley values to calculate the feature-weighted distance.
According to Table 10 and Table 11, Table 12 shows the evaluation results in terms of MAI for the various sampling methods that utilize different feature-weighted distribution distance measures. In these methods, feature importance is determined using the Chi-square test [28], and comparisons are made against the feature-weighted sampling (FWS) method [8], which applies one-hot encoding to transform categorical attributes into numerical values. From the results in Table 12, it is evident that both KL divergence and JS divergence, when combined with Chi-square-based feature weighting, outperform the FWS method. The FWS method's reliance on one-hot encoding may introduce high-dimensional sparsity and fail to capture statistical dependencies effectively, especially when categorical attributes have a large number of levels. In contrast, KL divergence and JS divergence explicitly model the divergence between the probability distributions of features and incorporate feature importance to better reflect the relative significance of each attribute.
Among all evaluated methods, JS divergence with feature importance achieves the lowest average MAI, demonstrating the most stable and reliable performance. This is because JS divergence is a symmetric and smoothed version of KL divergence and more robustly handles discrepancies between distributions. By integrating JS divergence with Chi-square feature importance, the method effectively prioritizes informative attributes while ensuring the sampled training and test sets maintain high distributional fidelity to the original dataset. This alignment leads to more accurate model evaluation, ultimately resulting in the smallest MAI values observed across all scenarios.
The execution times of random sampling and of each feature-weighted sampling method for different classifiers on the categorical datasets are shown in Table 13. The experimental results are similar to those on the numerical datasets; that is, randomly sampling 1000 training/test sets and calculating their test accuracies takes significantly more time than generating 1000 candidate sets and selecting the one with the smallest feature-weighted distance. Meanwhile, the execution times of the various feature-weighted sampling methods are generally comparable.

4.5. Experimental Results on Mixed-Type Datasets

We also performed 1000 random samplings on the five mixed-type datasets containing both numerical and categorical attributes in Table 1, resulting in 1000 distinct training/test set pairs. For each sampling, classification models were constructed using four different algorithms, and the corresponding test accuracies were recorded. Subsequently, we computed the mean accuracy and standard deviation of the test results across the 1000 classifiers derived from these random samples, as shown in Table 14.
Since the combination of the dcor-Hypothesis testing distribution distance with ANOVA-based feature weighting yielded the best performance on numerical datasets, and the combination of JS divergence with the Chi-square test performed best on categorical datasets, we propose a hybrid sampling approach tailored to mixed-type datasets. Specifically, we compute feature-weighted distances separately for numerical and categorical attributes. For numerical attributes, we apply dcor-Hypothesis testing, which captures both linear and nonlinear dependencies, together with ANOVA, which emphasizes features with high between-class variance; this ensures that the most discriminative numerical features are appropriately weighted during distance computation. For categorical attributes, we use JS divergence, a symmetric and smoothed divergence measure, along with the Chi-square test for feature weighting. The Chi-square test highlights attributes with strong class association, and JS divergence robustly compares feature distributions in the presence of sparsity or imbalance, which are common in categorical data.
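A minimal sketch of the hybrid feature-weighted distance for one candidate subset (train or test), reusing the helper sketches from Sections 3.1 and 3.2 above (dcor_t_statistic, anova_feature_weight, category_probs, jsd, chi2_feature_weight); it assumes pandas DataFrames `population` and `subset` and a NumPy label array `y` aligned with `population`, and it simply sums the numerical and categorical parts, which is our reading of Equation (1) applied per attribute type.

```python
def hybrid_weighted_distance(population, subset, y, numeric_cols, categorical_cols):
    """Feature-weighted distance of a candidate subset from the population,
    combining the numerical and categorical measures described above."""
    total = 0.0
    for col in numeric_cols:        # dcor T statistic weighted by 1 - ANOVA p-value
        w = anova_feature_weight(population[col].to_numpy(), y)
        total += w * dcor_t_statistic(population[col].to_numpy(), subset[col].to_numpy())
    for col in categorical_cols:    # Jensen-Shannon divergence weighted by the chi-square score
        cats = population[col].unique()
        p = category_probs(population[col].to_numpy(), cats)
        q = category_probs(subset[col].to_numpy(), cats)
        total += chi2_feature_weight(population[col], y) * jsd(p, q)
    return total
```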
To further validate the effectiveness of the hybrid approach, we conducted an additional experiment where all categorical features were flattened using one-hot encoding and treated as numerical data. After transforming all categorical features in the mixed-type dataset into numerical form, we then applied the best-performing numerical method, dcor-Hypothesis testing with ANOVA, to calculate the feature-weighted distance between the original dataset and candidate sets, and the training/test set with the smallest feature-weighted distance was selected. The corresponding test accuracy and MAI values are presented in Table 15.
As shown in Table 15, our hybrid approach outperforms this alternative method (dcor-Hypothesis testing with ANOVA). This improvement occurs because converting categorical attributes into sparse binary vectors via one-hot encoding can dilute feature relevance and introduce high dimensionality, potentially distorting the true distributional structure. In contrast, our hybrid approach preserves the intrinsic characteristics of each attribute type by applying the most appropriate distance measure and feature importance method to each. As a result, it more accurately selects training/test sets that maintain the original dataset’s structure and feature relationships, thereby minimizing distributional deviation and achieving the smallest MAI across mixed-type datasets.

5. Conclusions

In the past, it was common to repeatedly split the dataset into training and test sets and build multiple classification models to evaluate classification accuracy. This requires substantial computational time for repeated modeling and evaluation and becomes impractical for large-scale datasets. To address this, some researchers have proposed sampling methods that obtain a single training/test dataset approximating the results of repeated modeling. Among the existing methods, the feature-weighted sampling (FWS) method is currently regarded as the most effective. It calculates the distribution distance between the original dataset and the training/test dataset using Earth Mover's Distance (EMD) and employs Shapley values to compute feature importance. However, EMD requires data discretization prior to the distribution similarity calculation, which may compromise data fidelity, and Shapley values are applicable only to numerical attributes.
To overcome these limitations, this study proposes improvements to the FWS method. Specifically, we introduce a sampling approach that does not require discretization when computing distribution similarity, and we also propose a sampling strategy capable of handling categorical datasets. Experimental results show that our proposed method achieves a lower MAI (Mean Accuracy Indicator) and performs better than the original FWS method. Among all the methods, dcor-Hypothesis testing combined with ANOVA-based feature weighting and JS divergence combined with Chi-square-based feature weighting obtain the lowest average MAI on numerical and categorical datasets, respectively. The execution time of all the feature-weighted sampling methods is reduced by over 90% compared to that of random sampling. Furthermore, in our sampling framework, the best-performing sampling strategies for numerical and categorical datasets are applied, respectively, to the corresponding attributes of mixed-type datasets. The results also demonstrate that our hybrid method achieves strong performance in sampling datasets with hybrid attribute types. As future work, we plan to develop adaptive strategies to dynamically adjust the number of candidate sets based on computational resources and to explore how to apply our framework to unstructured and high-dimensional datasets.

Author Contributions

Conceptualization, Y.-S.L. and S.-J.Y.; methodology, Y.-J.T.; software, Y.-J.T.; validation, Y.-S.L. and S.-J.Y.; formal analysis, S.-J.Y.; resources, Y.-S.L.; writing—original draft preparation, Y.-J.T.; writing—review and editing, Y.-S.L. and S.-J.Y.; supervision, Y.-S.L. and S.-J.Y.; project administration, Y.-S.L. and S.-J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets analyzed in this study are publicly available from the UCI Machine Learning Repository (https://archive.ics.uci.edu/; accessed on 1 July 2025) and Kaggle (https://www.kaggle.com/; accessed on 1 July 2025).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Sen, P.C.; Hajra, M.; Ghosh, M. Supervised Classification Algorithms in Machine Learning: A Survey and Review. In Proceedings of the IEM Graph, Kolkata, India, 6–8 September 2018; pp. 99–111. [Google Scholar]
  2. Rauschert, S.; Raubenheimer, K.; Melton, P.; Huang, R. Machine Learning and Clinical Epigenetics: A Review of Challenges for Diagnosis and Classification. Clin. Epigenetics 2020, 12, 51. [Google Scholar] [CrossRef] [PubMed]
  3. Henrique, B.M.; Sobreiro, V.A.; Kimura, H. Literature review: Machine Learning Techniques Applied to Financial Market Prediction. Expert Syst. Appl. 2019, 124, 226–251. [Google Scholar] [CrossRef]
  4. Sharma, G. Pros and Cons of Different Sampling Techniques. Int. J. Appl. Res. 2017, 3, 749–752. [Google Scholar]
  5. Stratton, S.J. Population Research: Convenience Sampling Strategies. Prehospital Disaster Med. 2021, 36, 373–374. [Google Scholar] [CrossRef] [PubMed]
  6. Taherdoost, H. Sampling Methods in Research Methodology; How to Choose a Sampling Technique for Research. Int. J. Acad. Res. Manag. 2016, 5, 18–27. [Google Scholar] [CrossRef]
  7. Bellhouse, D. Systematic Sampling Methods. In Encyclopedia of Biostatistics; John Wiley & Sons: Chichester, UK, 2005; pp. 4478–4482. [Google Scholar]
  8. Shin, H.; Oh, S. Feature-Weighted Sampling for Proper Evaluation of Classification Models. Appl. Sci. 2021, 11, 2039. [Google Scholar] [CrossRef]
  9. Kang, D.; Oh, S. Balanced training/test set Sampling for Proper Evaluation of Classification Models. Intell. Data Anal. 2020, 24, 5–18. [Google Scholar] [CrossRef]
  10. Lee, Y.S.; Yen, S.J.; Tang, Y.J. Improved Sampling Methods for Evaluation of Classification Performance. In Proceedings of the 7th International Conference on Artificial Intelligence in Information and Communication, Fukuoka, Japan, 18–21 February 2025; pp. 378–382. [Google Scholar]
  11. Oh, S. A New Dataset Evaluation Method Based on Category Overlap. Comput. Biol. Med. 2011, 41, 115–122. [Google Scholar] [CrossRef] [PubMed]
  12. Parsons, V.L. Stratified Sampling. In Wiley StatsRef: Statistics Reference Online; John Wiley and Sons: Hoboken, NJ, USA, 2014; pp. 102–144. [Google Scholar]
  13. Berndt, A.E. Sampling methods. J. Hum. Lact. 2020, 36, 224–226. [Google Scholar] [CrossRef] [PubMed]
  14. Martin, E.J.; Critchlow, R.E. Beyond Mere Diversity: Tailoring Combinatorial Libraries for Drug Discovery. J. Comb. Chem. 1999, 1, 32–45. [Google Scholar] [CrossRef] [PubMed]
  15. Hudson, B.D.; Hyde, R.M.; Rahr, E.; Wood, J.; Osman, J. Parameter Based Methods for Compound Selection from Chemical Databases. In Quantitative Structure-Activity Relationships; CRC Press: Boca Raton, FL, USA, 1996; pp. 285–289. [Google Scholar]
  16. Rubner, Y.; Tomasi, C.; Guibas, L.J. The Earth Mover’s Distance as a Metric for Image Retrieval. Int. J. Comput. Vis. 2000, 40, 99–121. [Google Scholar] [CrossRef]
  17. Covert, I.; Lundberg, S.M.; Lee, S.I. Understanding Global Feature Contributions with Additive Importance Measures. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Curran Associates Inc.: Red Hook, NY, USA; pp. 17212–17223. [Google Scholar]
  18. Fryer, D.; Strümke, I.; Nguyen, H. Shapley Values for Feature Selection: The Good, the Bad, and the Axioms. IEEE Access 2021, 9, 144352–144360. [Google Scholar] [CrossRef]
  19. Rizzo, M.L.; Székely, G.J. Energy Distance. Wiley Interdiscip. Rev. Comput. Stat. 2016, 8, 27–38. [Google Scholar] [CrossRef]
  20. Ramos-Carreño, C.; Torrecilla, J.L. Dcor: Distance Correlation and Energy Statistics in Python. SoftwareX 2023, 22, 101326. [Google Scholar] [CrossRef]
  21. Justel, A.; Peña, D.; Zamar, R. A multivariate Kolmogorov-Smirnov Test of Goodness of Fit. Stat. Probab. Lett. 1997, 35, 251–259. [Google Scholar] [CrossRef]
  22. Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  23. Sugiyama, M.; Suzuki, T.; Kanamori, T. Density Ratio Estimation in Machine Learning; Cambridge University Press: Cambridge, UK, 2012. [Google Scholar]
  24. Murphy, K. Machine Learning: A Probabilistic Perspective; MIT Press: Cambridge, MA, USA, 2012. [Google Scholar]
  25. Menéndez, M.; Pardo, J.; Pardo, L.; Pardo, M. The Jensen-Shannon Divergence. J. Frankl. Inst. 1997, 334, 307–318. [Google Scholar] [CrossRef]
  26. Fuglede, B.; Topsoe, F. Jensen-Shannon Divergence and Hilbert Space Embedding. In Proceedings of the International Symposium on Information Theory, Chicago, IL, USA, 27 June–2 July 2004; p. 31. [Google Scholar]
  27. Belov, D.I.; Armstrong, R.D. Distributions of the Kullback–Leibler Divergence with Applications. Br. J. Math. Stat. Psychol. 2011, 64, 291–309. [Google Scholar] [CrossRef] [PubMed]
  28. Peker, N.; Kubat, C. Application of Chi-Square Discretization Algorithms to Ensemble Classification Methods. Expert Syst. Appl. 2021, 185, 115540. [Google Scholar] [CrossRef]
Figure 1. The process of the proposed sampling method on a numerical dataset.
Figure 2. The process of the proposed sampling method for a categorical dataset.
Table 1. Dataset description.

Dataset | # of Instances | # of Attributes | # of Classes
breastcancer | 569 | N:30 | 2
breastTissue | 106 | N:9 | 6
ecoli | 336 | N:7 | 8
pima_diabetes | 768 | N:8 | 2
seed | 218 | N:8 | 3
balance-scale | 625 | C:4 | 3
congressional_voting_records | 435 | C:16 | 2
Qualitative_Bankruptcy | 250 | C:6 | 2
SPEC_Heart | 267 | C:22 | 2
Vector_Borne_Disease | 263 | C:64 | 11
credit_approval | 690 | N:6, C:9 | 2
Differentiated_Thyroid_Cancer_Recurrence | 383 | N:1, C:15 | 2
Fertility | 100 | N:3, C:6 | 2
Heart_Disease | 303 | N:5, C:8 | 2
Wholesale_customers_data | 440 | N:6, C:1 | 3
Table 2. The parameters for each classifier.

Classifier | Parameter
Decision Tree (DT) | criterion = 'entropy', min_samples_leaf = 2
Random Forest (RF) | n_estimators = 500, max_features = 'sqrt'
K-Nearest Neighbor (KNN) | n_neighbors = 5
Support Vector Machine (SVM) | kernel = 'rbf'
Table 3. Average accuracy (AEV) and standard deviation (SD) for each classifier on numerical datasets.

Dataset | Classifier | AEV | SD
breastcancer | DT | 0.929 | 0.020
 | KNN | 0.969 | 0.013
 | RF | 0.959 | 0.016
 | SVM | 0.976 | 0.012
breastTissue | DT | 0.655 | 0.086
 | KNN | 0.639 | 0.085
 | RF | 0.684 | 0.082
 | SVM | 0.549 | 0.075
ecoli | DT | 0.797 | 0.043
 | KNN | 0.855 | 0.034
 | RF | 0.862 | 0.033
 | SVM | 0.864 | 0.035
pima_diabetes | DT | 0.701 | 0.033
 | KNN | 0.735 | 0.028
 | RF | 0.762 | 0.026
 | SVM | 0.767 | 0.027
seed | DT | 0.907 | 0.039
 | KNN | 0.929 | 0.033
 | RF | 0.924 | 0.036
 | SVM | 0.931 | 0.031
Table 4. The accuracy of different feature-weighted sampling methods (Shapley value).

Dataset | Classifier | FWS (EMD) | Energy Distance | dcor-Hypothesis Testing | K-S Test
breastcancer | DT | 0.935 | 0.945 | 0.930 | 0.935
 | KNN | 0.979 | 0.969 | 0.969 | 0.969
 | RF | 0.972 | 0.979 | 0.981 | 0.974
 | SVM | 0.979 | 0.979 | 0.979 | 0.979
breastTissue | DT | 0.739 | 0.703 | 0.703 | 0.714
 | KNN | 0.679 | 0.679 | 0.643 | 0.714
 | RF | 0.760 | 0.725 | 0.689 | 0.714
 | SVM | 0.669 | 0.561 | 0.633 | 0.597
ecoli | DT | 0.812 | 0.765 | 0.800 | 0.812
 | KNN | 0.824 | 0.847 | 0.835 | 0.824
 | RF | 0.847 | 0.882 | 0.835 | 0.847
 | SVM | 0.835 | 0.882 | 0.871 | 0.835
pima_diabetes | DT | 0.705 | 0.731 | 0.694 | 0.705
 | KNN | 0.741 | 0.705 | 0.679 | 0.736
 | RF | 0.762 | 0.756 | 0.798 | 0.777
 | SVM | 0.772 | 0.782 | 0.808 | 0.803
seed | DT | 0.907 | 0.926 | 0.907 | 0.926
 | KNN | 0.963 | 0.907 | 0.963 | 0.907
 | RF | 0.926 | 0.926 | 0.907 | 0.926
 | SVM | 0.963 | 0.907 | 0.926 | 0.944
Table 5. The accuracy of different feature-weighted sampling methods (ANOVA).

Dataset | Classifier | EMD | Energy Distance | dcor-Hypothesis Testing | K-S Test
breastcancer | DT | 0.951 | 0.902 | 0.937 | 0.923
 | KNN | 0.972 | 0.965 | 0.951 | 0.965
 | RF | 0.951 | 0.958 | 0.965 | 0.958
 | SVM | 0.972 | 0.972 | 0.972 | 0.972
breastTissue | DT | 0.500 | 0.714 | 0.643 | 0.750
 | KNN | 0.607 | 0.643 | 0.607 | 0.714
 | RF | 0.643 | 0.750 | 0.679 | 0.679
 | SVM | 0.536 | 0.536 | 0.500 | 0.571
ecoli | DT | 0.812 | 0.824 | 0.765 | 0.835
 | KNN | 0.823 | 0.894 | 0.882 | 0.847
 | RF | 0.859 | 0.894 | 0.847 | 0.871
 | SVM | 0.835 | 0.894 | 0.859 | 0.824
pima_diabetes | DT | 0.731 | 0.710 | 0.710 | 0.710
 | KNN | 0.756 | 0.782 | 0.725 | 0.720
 | RF | 0.808 | 0.751 | 0.798 | 0.756
 | SVM | 0.803 | 0.767 | 0.782 | 0.762
seed | DT | 0.889 | 0.926 | 0.907 | 0.944
 | KNN | 0.944 | 0.889 | 0.944 | 0.907
 | RF | 0.963 | 0.907 | 0.963 | 0.907
 | SVM | 0.926 | 0.907 | 0.926 | 0.907
Table 6. The MAI of different feature-weighted sampling methods (Shapley value).

Dataset | Classifier | FWS (EMD) | Energy Distance | dcor-Hypothesis Testing | K-S Test
breastcancer | DT | 0.286 | 0.748 | 0.059 | 0.286
 | KNN | 0.918 | 0.139 | 0.139 | 0.139
 | RF | 0.818 | 1.253 | 1.356 | 0.921
 | SVM | 0.289 | 0.289 | 0.282 | 0.282
breastTissue | DT | 0.977 | 0.560 | 0.560 | 0.691
 | KNN | 0.462 | 0.462 | 0.044 | 0.880
 | RF | 0.935 | 0.499 | 0.064 | 0.372
 | SVM | 1.592 | 0.171 | 1.118 | 0.302
ecoli | DT | 0.347 | 0.756 | 0.071 | 0.347
 | KNN | 0.946 | 0.250 | 0.598 | 0.946
 | RF | 0.460 | 0.606 | 0.598 | 0.460
 | SVM | 0.825 | 0.519 | 0.183 | 0.825
pima_diabetes | DT | 0.732 | 1.205 | 0.102 | 0.260
 | KNN | 0.228 | 0.978 | 0.710 | 0.603
 | RF | 0.369 | 0.369 | 1.170 | 1.033
 | SVM | 0.962 | 0.013 | 0.013 | 0.182
seed | DT | 0.020 | 0.490 | 0.020 | 0.490
 | KNN | 1.039 | 0.655 | 1.039 | 0.655
 | RF | 0.047 | 0.047 | 0.463 | 0.047
 | SVM | 1.038 | 0.740 | 0.147 | 0.445
Average MAI | | 0.665 | 0.537 | 0.437 | 0.508
Table 7. The MAI of different feature-weighted sampling methods (ANOVA).

Dataset | Classifier | EMD | Energy Distance | dcor-Hypothesis Testing | K-S Test
breastcancer | DT | 0.286 | 0.631 | 0.403 | 0.286
 | KNN | 0.668 | 0.668 | 1.196 | 0.918
 | RF | 0.487 | 0.052 | 0.383 | 0.487
 | SVM | 0.289 | 0.289 | 0.289 | 0.282
breastTissue | DT | 0.977 | 1.525 | 0.143 | 0.977
 | KNN | 0.374 | 0.044 | 0.374 | 0.462
 | RF | 0.935 | 0.808 | 0.372 | 0.499
 | SVM | 0.171 | 0.171 | 0.645 | 0.645
ecoli | DT | 0.071 | 0.347 | 0.071 | 0.899
 | KNN | 0.946 | 1.144 | 0.796 | 0.250
 | RF | 0.460 | 0.961 | 0.460 | 0.251
 | SVM | 0.825 | 0.855 | 0.153 | 1.161
pima_diabetes | DT | 1.473 | 0.260 | 0.056 | 1.205
 | KNN | 0.148 | 1.728 | 0.603 | 0.148
 | RF | 1.233 | 0.032 | 0.169 | 0.833
 | SVM | 0.989 | 0.013 | 0.767 | 0.989
seed | DT | 0.450 | 0.490 | 0.020 | 0.020
 | KNN | 0.474 | 1.220 | 0.474 | 1.039
 | RF | 1.067 | 0.463 | 1.067 | 0.557
 | SVM | 0.147 | 0.740 | 0.147 | 1.038
Average MAI | | 0.626 | 0.622 | 0.429 | 0.647
Table 8. The execution time (s) of the feature-weighted sampling methods (Shapley value).

Dataset | Classifier | Random Sampling | FWS (EMD) | Energy Distance | dcor-Hypothesis Testing | K-S Test
breastcancer | DT | 1117.94 | 58.32 | 58.07 | 1060.6 | 278.09
 | KNN | 1117.93 | 58.33 | 58.07 | 1060.6 | 278.1
 | RF | 1117.94 | 60.07 | 52.92 | 1062.41 | 278.94
 | SVM | 1117.96 | 60.07 | 59.93 | 1062.43 | 279.85
breastTissue | DT | 10.92 | 6.75 | 360.1 | 17.42 | 10.92
 | KNN | 11.02 | 6.52 | 359.5 | 17.39 | 11.02
 | RF | 10.85 | 6.54 | 359.7 | 17.4 | 10.85
 | SVM | 10.84 | 6.56 | 359.6 | 17.39 | 10.84
ecoli | DT | 9.52 | 8.02 | 166.39 | 17.52 | 9.52
 | KNN | 9.55 | 8.03 | 166.29 | 17.54 | 9.55
 | RF | 9.53 | 8.02 | 166.25 | 17.5 | 9.53
 | SVM | 9.58 | 8.04 | 166.26 | 17.5 | 9.58
pima_diabetes | DT | 529.11 | 21.26 | 21.41 | 147.87 | 103.34
 | KNN | 529.13 | 21.22 | 21.39 | 147.85 | 103.3
 | RF | 529.11 | 21.25 | 21.4 | 147.87 | 103.31
 | SVM | 529.12 | 21.24 | 21.39 | 147.86 | 103.3
seed | DT | 932.83 | 10.67 | 8.52 | 276.44 | 33.33
 | KNN | 932.85 | 10.57 | 8.51 | 276.42 | 33.31
 | RF | 932.85 | 10.64 | 8.52 | 276.44 | 33.3
 | SVM | 932.8 | 10.65 | 8.55 | 276.45 | 33.33
Average Execution Time | | 751.65 | 22.31 | 20.71 | 402.37 | 90.10
Table 9. The execution time (s) of the feature-weighted sampling methods (ANOVA).

Dataset | Classifier | Random Sampling | FWS (EMD) | Energy Distance | dcor-Hypothesis Testing | K-S Test
breastcancer | DT | 1117.94 | 23.72 | 23.46 | 1025.99 | 243.46
 | KNN | 1117.93 | 23.69 | 23.45 | 1025.98 | 243.48
 | RF | 1117.94 | 24.46 | 25.18 | 1027.73 | 245.18
 | SVM | 1117.96 | 23.69 | 23.45 | 1025.96 | 243.48
breastTissue | DT | 641.98 | 6.08 | 1.93 | 345.7 | 12.58
 | KNN | 641.97 | 6.1 | 1.95 | 345.6 | 12.48
 | RF | 641.98 | 6.15 | 1.94 | 345.72 | 12.48
 | SVM | 641.96 | 6.13 | 1.95 | 345.8 | 12.52
ecoli | DT | 536.14 | 6.97 | 5.47 | 163.83 | 14.97
 | KNN | 536.12 | 6.94 | 5.46 | 163.85 | 14.94
 | RF | 536.14 | 6.95 | 5.44 | 163.88 | 14.92
 | SVM | 536.13 | 6.94 | 5.48 | 163.84 | 14.97
pima_diabetes | DT | 529.11 | 12.84 | 12.99 | 139.45 | 94.92
 | KNN | 529.13 | 12.83 | 12.96 | 139.4 | 94.95
 | RF | 529.11 | 12.84 | 12.97 | 139.42 | 94.93
 | SVM | 529.12 | 12.83 | 12.96 | 139.48 | 94.95
seed | DT | 932.83 | 6.47 | 4.31 | 272.2 | 29.35
 | KNN | 932.85 | 6.45 | 4.3 | 272.3 | 29.31
 | RF | 932.85 | 6.44 | 4.32 | 272.23 | 29.33
 | SVM | 932.8 | 6.47 | 4.31 | 272.25 | 29.32
Average Execution Time | | 751.65 | 11.25 | 9.71 | 389.53 | 79.13
Table 10. Average accuracy (AEV) and standard deviation (SD) of each classifier on categorical datasets.

Dataset | Classifier | AEV | SD
balance-scale | DT | 0.745 | 0.034
 | KNN | 0.744 | 0.030
 | RF | 0.845 | 0.026
 | SVM | 0.862 | 0.025
congressional_voting_records | DT | 0.951 | 0.025
 | KNN | 0.922 | 0.032
 | RF | 0.963 | 0.021
 | SVM | 0.963 | 0.022
Qualitative_Bankruptcy | DT | 0.995 | 0.012
 | KNN | 0.996 | 0.008
 | RF | 0.9995 | 0.005
 | SVM | 0.996 | 0.009
SPEC_Heart | DT | 0.747 | 0.049
 | KNN | 0.796 | 0.046
 | RF | 0.824 | 0.040
 | SVM | 0.828 | 0.041
Vector_Borne_Disease | DT | 0.709 | 0.064
 | KNN | 0.673 | 0.057
 | RF | 0.934 | 0.030
 | SVM | 0.914 | 0.037
Table 11. The accuracy of different feature-weighted sampling methods on categorical datasets.

Dataset | Classifier | KLD (P∥Q) | KLD (Q∥P) | JSD | FWS (EMD, Shapley)
balance-scale | DT | 0.741 | 0.722 | 0.734 | 0.747
 | KNN | 0.747 | 0.747 | 0.747 | 0.747
 | RF | 0.823 | 0.848 | 0.829 | 0.816
 | SVM | 0.816 | 0.816 | 0.816 | 0.816
congressional_voting_records | DT | 0.949 | 0.949 | 0.949 | 0.898
 | KNN | 0.915 | 0.915 | 0.915 | 0.949
 | RF | 0.983 | 0.983 | 0.983 | 0.966
 | SVM | 0.983 | 0.983 | 0.983 | 0.966
Qualitative_Bankruptcy | DT | 1.000 | 1.000 | 1.000 | 1.000
 | KNN | 1.000 | 1.000 | 1.000 | 0.984
 | RF | 1.000 | 1.000 | 1.000 | 1.000
 | SVM | 0.984 | 0.984 | 0.984 | 1.000
SPEC_Heart | DT | 0.716 | 0.716 | 0.761 | 0.761
 | KNN | 0.776 | 0.776 | 0.776 | 0.821
 | RF | 0.776 | 0.791 | 0.791 | 0.791
 | SVM | 0.821 | 0.821 | 0.821 | 0.791
Vector_Borne_Disease | DT | 0.761 | 0.761 | 0.776 | 0.761
 | KNN | 0.687 | 0.687 | 0.687 | 0.687
 | RF | 0.940 | 0.955 | 0.940 | 0.955
 | SVM | 0.955 | 0.955 | 0.955 | 0.955
Table 12. The MAI of different feature-weighted sampling methods on categorical datasets.

Dataset | Classifier | KLD (P∥Q) | KLD (Q∥P) | JSD | FWS (EMD, Shapley)
balance-scale | DT | 0.881 | 0.881 | 0.478 | 0.064
 | KNN | 0.795 | 0.795 | 0.795 | 0.098
 | RF | 0.272 | 0.818 | 0.272 | 1.078
 | SVM | 0.673 | 0.673 | 0.673 | 1.825
congressional_voting_records | DT | 0.088 | 0.088 | 0.088 | 2.141
 | KNN | 0.193 | 0.193 | 0.193 | 0.853
 | RF | 0.974 | 0.974 | 0.974 | 0.165
 | SVM | 0.921 | 0.921 | 0.921 | 0.154
Qualitative_Bankruptcy | DT | 0.474 | 0.474 | 0.474 | 0.474
 | KNN | 0.548 | 0.548 | 0.548 | 1.503
 | RF | 0.087 | 0.087 | 0.087 | 0.087
 | SVM | 1.335 | 1.335 | 1.335 | 0.467
SPEC_Heart | DT | 0.626 | 0.626 | 0.286 | 0.286
 | KNN | 0.429 | 0.429 | 0.429 | 0.536
 | RF | 1.193 | 0.817 | 0.817 | 0.817
 | SVM | 0.174 | 0.174 | 0.174 | 0.902
Vector_Borne_Disease | DT | 0.806 | 0.806 | 1.039 | 0.806
 | KNN | 0.234 | 0.234 | 0.234 | 0.234
 | RF | 0.205 | 0.704 | 0.205 | 0.704
 | SVM | 1.121 | 1.121 | 1.121 | 1.121
Average MAI | | 0.601 | 0.635 | 0.557 | 0.716
Table 13. The execution time (s) of the feature-weighted sampling methods on categorical datasets.

Dataset | Classifier | Random Sampling | KLD (P∥Q) | KLD (Q∥P) | JSD | FWS (EMD, Shapley)
balance-scale | DT | 640.33 | 31.8 | 15.38 | 32.49 | 83.9
 | KNN | 640.31 | 31.85 | 15.36 | 32.5 | 83.87
 | RF | 640.3 | 31.87 | 15.36 | 32.48 | 83.8
 | SVM | 640.3 | 31.82 | 15.37 | 32.5 | 83.88
congressional_voting_records | DT | 528.14 | 20.81 | 4.72 | 21.5 | 13.28
 | KNN | 528.13 | 20.8 | 4.7 | 21.53 | 13.26
 | RF | 528 | 20.79 | 4.72 | 21.54 | 13.28
 | SVM | 528.2 | 20.79 | 4.74 | 21.51 | 13.25
Qualitative_Bankruptcy | DT | 799.2 | 19.17 | 3.33 | 17.48 | 11.28
 | KNN | 799.27 | 19.19 | 3.35 | 17.45 | 11.27
 | RF | 799.25 | 19.15 | 3.33 | 17.46 | 11.28
 | SVM | 799.24 | 19.18 | 3.34 | 17.48 | 11.25
SPEC_Heart | DT | 649.3 | 38.03 | 6.01 | 45.12 | 87.5
 | KNN | 649.35 | 38.05 | 6.05 | 45.15 | 87.47
 | RF | 649.32 | 38.04 | 6.04 | 45.12 | 87.51
 | SVM | 649.36 | 38.07 | 6.05 | 45.15 | 87.52
Vector_Borne_Disease | DT | 1072.6 | 106.5 | 9.44 | 96.01 | 208.7
 | KNN | 1072.5 | 106.52 | 9.43 | 96 | 208.5
 | RF | 1072.63 | 106.53 | 9.45 | 96.01 | 208.68
 | SVM | 1072.66 | 106.51 | 9.44 | 96.05 | 208.5
Average Execution Time | | 737.92 | 43.27 | 7.78 | 42.53 | 80.90
Table 14. Average accuracy (AEV) and standard deviation (SD) of each classifier on mixed-type datasets.

Dataset | Classifier | AEV | SD
credit_approval | DT | 0.817 | 0.029
 | KNN | 0.862 | 0.024
 | RF | 0.873 | 0.023
 | SVM | 0.862 | 0.023
Differentiated_Thyroid_Cancer_Recurrence | DT | 0.942 | 0.022
 | KNN | 0.922 | 0.027
 | RF | 0.960 | 0.017
 | SVM | 0.955 | 0.019
Fertility | DT | 0.846 | 0.064
 | KNN | 0.853 | 0.055
 | RF | 0.869 | 0.057
 | SVM | 0.881 | 0.057
Heart_Disease | DT | 0.749 | 0.048
 | KNN | 0.829 | 0.038
 | RF | 0.822 | 0.038
 | SVM | 0.839 | 0.037
Wholesale_customers_data | DT | 0.524 | 0.045
 | KNN | 0.635 | 0.037
 | RF | 0.709 | 0.036
 | SVM | 0.718 | 0.037
Table 15. MAI and test accuracy for mixed-type datasets.

Dataset | Classifier | Our Hybrid Method (Accuracy) | dcor-Hypothesis Testing + ANOVA (Accuracy) | Our Hybrid Method (MAI) | dcor-Hypothesis Testing + ANOVA (MAI)
credit_approval | DT | 0.793 | 0.768 | 0.828 | 1.669
 | KNN | 0.896 | 0.896 | 1.414 | 1.414
 | RF | 0.854 | 0.866 | 0.867 | 0.329
 | SVM | 0.854 | 0.854 | 0.376 | 0.376
Differentiated_Thyroid_Cancer_Recurrence | DT | 0.938 | 0.938 | 0.189 | 0.189
 | KNN | 0.938 | 0.938 | 0.596 | 0.596
 | RF | 0.948 | 0.948 | 0.699 | 0.699
 | SVM | 0.938 | 0.938 | 0.910 | 0.910
Fertility | DT | 0.885 | 0.885 | 0.610 | 0.610
 | KNN | 0.808 | 0.808 | 0.820 | 0.820
 | RF | 0.846 | 0.846 | 0.398 | 0.398
 | SVM | 0.885 | 0.885 | 0.059 | 0.059
Heart_Disease | DT | 0.792 | 0.805 | 0.903 | 1.172
 | KNN | 0.870 | 0.870 | 1.085 | 1.085
 | RF | 0.857 | 0.857 | 0.918 | 0.918
 | SVM | 0.857 | 0.857 | 0.486 | 0.486
Wholesale_customers_data | DT | 0.495 | 0.514 | 0.633 | 0.228
 | KNN | 0.676 | 0.676 | 1.105 | 1.105
 | RF | 0.712 | 0.703 | 0.096 | 0.152
 | SVM | 0.739 | 0.739 | 0.558 | 0.558
Average MAI | | | | 0.678 | 0.689