Article

Feature-Weighted Sampling for Proper Evaluation of Classification Models

1 Department of Computer Science, Dankook University 152, Yongin 16890, Korea
2 Department of Software Science, Dankook University, Yongin 16890, Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2021, 11(5), 2039; https://doi.org/10.3390/app11052039
Submission received: 4 February 2021 / Revised: 20 February 2021 / Accepted: 23 February 2021 / Published: 25 February 2021
(This article belongs to the Special Issue Feature Engineering for Machine Learning)

Abstract:
In machine learning applications, classification schemes have been widely used for prediction tasks. Typically, to develop a prediction model, the given dataset is divided into training and test sets; the training set is used to build the model and the test set is used to evaluate the model. Furthermore, random sampling is traditionally used to divide datasets. The problem, however, is that the performance of the model is evaluated differently depending on how we divide the training and test sets. Therefore, in this study, we proposed an improved sampling method for the accurate evaluation of a classification model. We first generated numerous candidate cases of train/test sets using the R-value-based sampling method. We evaluated the similarity of distributions of the candidate cases with the whole dataset, and the case with the smallest distribution difference was selected as the final train/test set. Histograms and feature importance were used to evaluate the similarity of distributions. The proposed method produces more appropriate training and test sets than previous sampling methods, including random and non-random sampling.

1. Introduction

Classification problems in machine learning are easily found in the real world. Doctors diagnose patients as either diseased or healthy based on the symptoms of a specific disease observed in the past, and in online commerce, security experts decide whether transactions are fraudulent or normal based on the patterns of previous transactions. As in these examples, the purpose of classification in machine learning is to predict an unknown class label based on past data. An explicit classification target, such as “diseased” or “healthy”, is called a class label. Classification belongs to supervised learning because it uses class labels. Representative classification algorithms include decision trees, artificial neural networks (ANNs), naive Bayes (NB) classifiers, support vector machine (SVM), and k-nearest neighbors (KNN) [1].
In general, the development of a classification model comprises two phases, as shown in Figure 1, starting with data partitioning. The entire dataset is divided into a training set and a test set, each of which is used at a different stage and for a different purpose. The first is the learning or training phase, which uses the training set; part of the training set is used as a validation set. The second is the model evaluation phase, which uses the test set. The evaluation result on the test set is considered the final performance of the trained model. The inherent problem in the development of a classification model is that the model’s performance (accuracy) inevitably depends on how the data are divided into training and test sets. This is because the model reflects the characteristics of the training set, while the measured accuracy is influenced by the characteristics of the test set. If a model with poor actual performance is evaluated on an easy-to-classify test set, its performance will look good. Conversely, if a model with good performance is evaluated on a difficult-to-classify test set, its performance will be underestimated. In our previous work [2], we showed that 1000 train/test splits generated by random sampling produced classification accuracies ranging from 0.848 to 0.975. This phenomenon is due to the difference in data distribution between the training and test sets, and it emphasizes that how the entire dataset is divided into training and test sets has a significant impact on model performance evaluation.
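To make this split dependence concrete, the short sketch below (not from the paper) repeatedly applies random 75/25 splits to a public benchmark dataset and reports the spread of the resulting test accuracies; the dataset, classifier, and number of repetitions are arbitrary choices made only for illustration.

# Minimal sketch (not the paper's experiment): measure how test accuracy
# varies across repeated random train/test splits of the same dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
accuracies = []
for seed in range(100):  # 100 random 75/25 splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=seed)
    model = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
    accuracies.append(model.score(X_te, y_te))
print("accuracy range over 100 splits:", min(accuracies), "-", max(accuracies))
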
The ideal goal of splitting train/test sets is that the distributions of both the training and test sets become the same as that of the whole dataset. However, this is a difficult task for multi-dimensional datasets. Various methods have been proposed to solve this problem. Random sampling is an easy and widely used method. In random sampling, each data instance has the same probability of being chosen, which can reduce the bias of model performance. However, it produces a high variance in model performance if a dataset has an unusual distribution or the sample size is small [3,4]. Systematic sampling is a method of extracting data by randomly ordering the data and then selecting instances at regular intervals [5]. Stratified sampling is a method of first dividing a population into non-overlapping layers and then sampling from each layer; it uses the internal structure (layers) and the distribution of a dataset [4]. D-optimal [6] and the most descriptive compound method (MDC) [7] are advanced stratified sampling methods; the potential error of the descriptors and the rank sum of distances between compounds serve as the internal structures of D-optimal and MDC, respectively.
R-value-based sampling (RBS) [2] is a type of stratified sampling. It divides the entire dataset into n groups (layers) according to the degree of “class overlap”, and applies systematic sampling to each group. In general, the classification accuracy for a dataset is strongly influenced by the degree of overlap of the classes in the dataset [4,8]. The degree of class overlap is measured using the R-value [8]. Let us suppose that p is a data instance and q1, q2, …, qk are the k-nearest-neighbor instances of p. If r is the number of these k nearest neighbors whose class labels differ from that of p, the degree of overlap of p is r (0 ≤ r ≤ k); in other words, p belongs to group r. The experimental results in [2] confirm that RBS produces better training and test sets than random and several non-random sampling methods.
In the machine learning area, k-fold cross-validation has been used to overcome the overfitting problem in classification. It builds k training models, and the mean of their test accuracies is used as an evaluation measure for tuning the parameters of a model or for comparing different models. The repeated holdout method, also known as Monte Carlo cross-validation, is also available for model evaluation [3,9]. During the iterations of the holdout process, the dataset is repeatedly and randomly divided into training and test sets, and the mean of the model accuracy gradually converges to one value [2]. The purpose of k-fold cross-validation and the holdout method is different from that of the sampling methods: both produce multiple train/test sets and, as a result, multiple prediction models, and we cannot know which of these models is the desirable one. Therefore, they are excluded from the discussion of the sampling issue.
In this study, we propose an improved sampling method based on RBS. We generated candidate train/test sets using the modified RBS algorithm and evaluated the distribution similarity between the candidates and the whole dataset. In the evaluation process, a data histogram and feature importance were considered. Finally, the case with the smallest deviation of the distribution was selected. We compared the proposed method with RBS, and we confirmed that the proposed method shows better performance than the previous RBS.

2. Materials and Methods

As mentioned earlier, the ideal training and test sets should have the same distribution as the original dataset. To achieve this goal, we propose a method called feature-weighted sampling (FWS). Our main idea is as follows:
(1) Generate numerous candidate train/test sets using the modified RBS.
(2) Evaluate the similarity between the original dataset and each candidate; the similarity is measured as a distance.
(3) Choose the candidate that has the smallest distance to the original dataset.
Figure 2 summarizes the proposed method in detail. The first phase generates n train/test set candidates with stratified random sampling. The stratified sampling uses the modified RBS method, which reflects an intrinsic property of the data called class overlap. The second phase selects the candidate whose distribution is most similar to that of the original dataset. To evaluate the similarity of distributions, we measured the distance between the train/test sets and the original dataset. For this distance calculation, we tested the Bhattacharyya distance [10], histogram intersection [11], and the Earth Mover’s Distance [12], and we finally adopted the Earth Mover’s Distance. Feature importance was used to weight the features during the distance calculation. As a result, the train/test sets with the smallest distance from the original dataset were selected. For the evaluation of the sampling method, we devised a metric named the mean accuracy index (MAI). Using the MAI, we compared the proposed FWS and RBS. Twenty benchmark datasets and four classifiers, including k-nearest neighbor (KNN), support vector machine (SVM), random forest (RF), and C50, were used for the comparison.

2.1. Phase 1: Generate Candidates

In the candidate generation step, 1000 candidate pairs of train/test sets were generated using the modified RBS. In each candidate, 25% of the total instances were sampled into the test set and 75% into the training set. Class overlap is the key concept of RBS; we first summarize class overlap and then explain the modified RBS.

2.1.1. Concept of Class Overlap

Class overlap refers to the overlap of data instances among classes, and a wide class overlap makes classification difficult [8]. The overlap number of an instance p is calculated by counting the number of instances with different class labels among its k nearest neighbors. Figure 3 shows the class overlap value for a data instance (red cross in Figure 3) when k = 3. If the overlap number exceeds a threshold, we can determine that p is located in the overlapped area. The ratio of instances located in the overlapped area is the R-value [8], which can be used to evaluate the quality of a dataset. In RBS, the overlap number of an instance is used to group the instance; if k = 3, an instance can belong to one of four groups. RBS then samples train/test instances from each of the four groups.
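As a concrete illustration, the sketch below computes the overlap number of every instance with a k-nearest-neighbor search; the Euclidean distance and the function and variable names are assumptions made for illustration, since the paper does not spell out these implementation details.

# Sketch (assumption: Euclidean distance, k = 3) of the class-overlap number:
# for each instance, count how many of its k nearest neighbors carry a
# different class label; instances with the same count form one group.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def overlap_numbers(X, y, k=3):
    # k + 1 neighbors are requested because the nearest neighbor of each point is the point itself
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neighbor_labels = y[idx[:, 1:]]                       # drop the self-neighbor
    return (neighbor_labels != y[:, None]).sum(axis=1)    # overlap number r, 0 <= r <= k

# Instances with the same overlap number form one group (k = 3 gives groups 0..3):
# r = overlap_numbers(X, np.asarray(y), k=3)
# groups = {g: np.where(r == g)[0] for g in range(4)}
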

2.1.2. Modified RBS

The original RBS adopts a stratified sampling scheme: it groups each instance according to its class-overlap number and then draws instances from each group systematically. As a result, the original RBS always produces the same training and test sets. In the modified RBS, we replaced this within-group systematic sampling with random sampling, so the modified RBS produces different training/test sets depending on the random seed. Figure 4 shows the pseudocode for the modified RBS [2].
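A minimal sketch of this modified step is shown below, under the assumption that 25% of each overlap group is drawn at random into the test set; the function name and interface are illustrative, not the authors' implementation (see Figure 4 for the actual pseudocode).

# Sketch of the modified RBS: within each class-overlap group, draw a random
# 25% sample for the test set, so different random seeds yield different splits
# while the group proportions are preserved.
import numpy as np

def modified_rbs_split(X, y, overlap, test_ratio=0.25, seed=0):
    rng = np.random.default_rng(seed)
    test_idx = []
    for g in np.unique(overlap):
        members = np.where(overlap == g)[0]
        n_test = int(round(len(members) * test_ratio))
        test_idx.extend(rng.choice(members, size=n_test, replace=False))
    test_idx = np.array(sorted(test_idx))
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
    return train_idx, test_idx
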

2.2. Phase 2: Evaluate the Candidates and Select Best Train/Test Sets

The main goal of Phase 2 is to find the best train/test sets among the 1000 candidates. We evaluated each candidate according to the workflow shown in Figure 5. Each feature in the dataset was scaled to a value between 0 and 1, and histograms were then generated for the whole dataset and for the candidate train/test sets. Based on the histogram data, the similarity of distribution between the whole dataset and the training set, and between the whole dataset and the test set, was measured using the Earth Mover’s Distance. The final similarity distance for each candidate was obtained by summing the per-feature distances, each weighted by the importance of the corresponding feature. Once the similarity distances for all candidates were obtained, we selected the candidate with the smallest distance as the output of the FWS method. We explain the histogram generation, similarity calculation, and feature weighting in the following sections.

2.2.1. Generation of Histograms

A histogram approximates a distribution by mapping a set of real values into equally wide intervals called bins. For example, a histogram with n bins can be defined as histogram = {(bin_i, value_i) | 1 ≤ i ≤ n, where bin_k < bin_j when k < j}. This definition allows a histogram to be drawn as a bar chart for data visualization, but it is more advantageous to treat it as a pure mathematical object containing an approximate data distribution [13,14]. Histograms are mathematical tools that extract compressed characteristic information from a dataset and play an important role in various fields such as computer vision, image retrieval, and databases [12,13,14,15]. We also confirmed that the histogram approach works better than a statistical quantile-based approach.
This work also views the histogram as a mathematical object and attempts to measure the quantitative similarity between the entire dataset and each candidate dataset. By transforming the real distributions into histograms, finding the train/test sets whose distribution is most similar to that of the entire dataset can be treated as an image retrieval problem: our goal is to find, among 1000 candidate histogram images, the one most similar to the histogram image of the entire dataset.
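The following sketch shows one way to build such per-feature histograms after min–max scaling to [0, 1]; the bin width of 0.2 follows Section 3, while the normalization by subset size and the helper name are assumptions made for illustration.

# Sketch: scale each feature to [0, 1] and summarize it as a fixed-bin histogram
# (bin width 0.2 -> 5 bins). Counts are normalized so that subsets of different
# sizes (whole dataset, training set, test set) remain comparable.
import numpy as np

def feature_histograms(X, bin_width=0.2, lo=None, hi=None):
    # lo/hi should come from the whole dataset so that the training and test
    # sets are scaled with the same min-max parameters
    lo = X.min(axis=0) if lo is None else lo
    hi = X.max(axis=0) if hi is None else hi
    X01 = (X - lo) / np.where(hi - lo == 0, 1, hi - lo)   # scale each feature to [0, 1]
    edges = np.linspace(0.0, 1.0, int(round(1 / bin_width)) + 1)
    # one normalized histogram per feature
    return [np.histogram(X01[:, j], bins=edges)[0] / len(X01) for j in range(X.shape[1])]
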

2.2.2. Measurement of Histogram Similarity

We evaluated the similarity of histograms in terms of their distance. Although there are several methods and metrics for measuring distances between histograms [14,15], we exploited the Earth Mover’s Distance [12], which adopts a cross-bin scheme. Unlike bin-by-bin methods, a cross-bin measurement evaluates not only exactly corresponding bins but also non-corresponding bins (Figure 6) [12]. It is less sensitive to the location of bins and better reflects human-perceived similarity [12]. The Earth Mover’s Distance is a cross-bin method based on optimal transport theory, and several studies have demonstrated its superiority [12,15]. In addition, it has the properties of a true distance metric, satisfying non-negativity, symmetry, and the triangle inequality [15]. In this study, the similarity between datasets is defined as the sum of the histogram distances of all features. The Earth Mover’s Distance was calculated using the emdist package in CRAN (https://cran.r-project.org/web/packages/emdist/index.html, accessed on 25 February 2021).
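For one-dimensional feature histograms, the computation can be sketched as below; the paper computes the Earth Mover’s Distance with the emdist R package, so SciPy's wasserstein_distance (which implements the 1-D EMD) is used here only as a Python stand-in, with bin centers assumed as the histogram locations.

# Sketch: 1-D Earth Mover's Distance between two per-feature histograms,
# using bin centers as locations and normalized bin counts as weights.
import numpy as np
from scipy.stats import wasserstein_distance

def histogram_emd(hist_a, hist_b, bin_width=0.2):
    centers = (np.arange(len(hist_a)) + 0.5) * bin_width
    return wasserstein_distance(centers, centers, u_weights=hist_a, v_weights=hist_b)
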

2.2.3. Feature Weighting

Previously, we conceptually defined the similarity between datasets as a sum of per-feature distances; however, using the raw distances can be a problem. Each feature not only has a different distribution of values but also contributes to model accuracy to a different degree. In other words, the same similarity distance on different features has different effects on predictive power. For example, even if features A and B show the same similarity distance, A may have a very strong effect on model accuracy, whereas B may have a weak effect. Therefore, when summing the distances over features, we must weight each feature according to its effect.
There are many methods to evaluate the effects of features, such as information gain and the chi-square statistic. We used the Shapley-value-based feature importance method [16]. The Shapley value evaluates the contribution of each feature value in an instance to the model; it borrows from game theory the idea of distributing profit fairly according to each player’s contribution. Recently, Covert et al. [16] proposed a method, called SAGE, that measures feature importance from a global, dataset-level perspective of the Shapley value rather than for each instance; it has also been released as a Python package. We used this method to obtain feature importance values and to assign weights when calculating the similarity distances. The weighted distance between the entire dataset and a given pair of train/test sets is defined as follows:
d = \sum_{i=1}^{n} w(f_i) \times \left( d_{train}(f_i) + d_{test}(f_i) \right)
  • d: similarity distance of the given train/test sets
  • w(f_i): weight of the i-th feature
  • d_train(f_i): similarity distance between the whole dataset and the training set for the i-th feature
  • d_test(f_i): similarity distance between the whole dataset and the test set for the i-th feature
The pseudocode for the proposed FWS method is shown in Figure 7.
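A compact sketch of this weighted aggregation is given below. Note that the paper derives the weights w(f_i) from SAGE (Shapley-based) importance values; scikit-learn's permutation importance is substituted here purely as an illustrative stand-in, and the function names are assumptions.

# Sketch of the weighted similarity distance d = sum_i w(f_i) * (d_train(f_i) + d_test(f_i)).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

def feature_weights(X, y, seed=0):
    # Stand-in for the SAGE importances used in the paper: permutation importance
    # of a random forest, clipped to be non-negative and normalized to sum to 1.
    model = RandomForestClassifier(random_state=seed).fit(X, y)
    imp = permutation_importance(model, X, y, n_repeats=5, random_state=seed).importances_mean
    imp = np.clip(imp, 0, None)
    return imp / imp.sum() if imp.sum() > 0 else np.full(len(imp), 1 / len(imp))

def weighted_distance(w, d_train, d_test):
    # d_train[i], d_test[i]: per-feature EMD between the whole dataset and the
    # candidate training/test set (see the histogram_emd sketch above)
    return float(np.sum(w * (np.asarray(d_train) + np.asarray(d_test))))

The candidate with the smallest weighted distance is then selected as the final train/test split.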

2.3. Evaluation of FWS Method

To confirm the performance of the proposed sampling method, we compared it with the original RBS, because RBS has already been shown to outperform other sampling methods [2]; comparisons with those other methods are therefore omitted. MAI was used as the evaluation metric. For the benchmark test, 20 datasets and 4 classification algorithms were employed.

2.3.1. Evaluation Metric: MAI

Measuring the quality of given train/test sets is a difficult issue, because we do not know the ideal train/test sets that completely reflect the entire dataset. Kang and Oh [2] proposed MAI as a solution: they generated 1000 train/test sets by random sampling, measured the mean accuracy of a classification algorithm over them, and considered this mean accuracy to be the accuracy of the ideal train/test sets, since in statistics the mean of a large sample converges to the mean of the population. Let AEV be the mean accuracy from n train/test sets; it is defined as follows:
AEV = \left( \sum_{i=1}^{n} test\_acc_i \right) / n
where test_acc_i is the test accuracy generated by the i-th random sampling. MAI is defined by the following equation:
MAI = \frac{\left| ACC - AEV \right|}{SD}
where ACC refers to the test accuracy derived from the classification model for a given test set, and SD is the standard deviation of the test accuracies (test_acc_i) used in the AEV. The intuitive meaning of MAI is “how far the given ACC is from the AEV”. Therefore, the smaller the MAI, the better. We used the MAI as an evaluation metric for the train/test sets.
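As a minimal sketch with hypothetical names, AEV, SD, and MAI can be computed from the list of test accuracies obtained over n random splits; whether the sample or population standard deviation is meant is not stated in the paper, so the sample version is assumed here.

# Sketch of the MAI computation for one candidate split's accuracy.
import numpy as np

def mai(acc, random_split_accuracies):
    # AEV: mean test accuracy over n random train/test splits (the paper uses n = 1000)
    aev = np.mean(random_split_accuracies)
    # SD: spread of those accuracies (sample standard deviation assumed)
    sd = np.std(random_split_accuracies, ddof=1)
    # MAI: distance of the candidate split's accuracy from AEV, in units of SD
    return abs(acc - aev) / sd
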

2.3.2. Benchmark Datasets and Classifiers

To compare the proposed FWS and RBS, we used 20 benchmark datasets with various numbers of features (attributes), classes, and instances. The datasets were collected from the UCI Machine Learning Repository (http://archive.ics.uci.edu/mL/, accessed on 25 February 2021) and the Kaggle site (https://www.kaggle.com/, accessed on 25 February 2021), and are listed in Table 1. Four classification algorithms, KNN, SVM, RF, and C50, were tested; all are supported by R packages, and the packages and parameters used are listed in Table 2. We divided the entire dataset into training and test sets at a 75:25 ratio to build and evaluate the classification models.
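The experiments themselves were run with the R packages in Table 2; purely as an illustration of the evaluation loop, the sketch below uses scikit-learn classifiers as Python stand-ins (a plain decision tree replaces C50, which has no direct scikit-learn counterpart), so the models and default parameters do not match the paper exactly.

# Illustrative stand-in (not the paper's R code): score one train/test split
# with the four classifier families used in the benchmark.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

def evaluate_split(X_tr, y_tr, X_te, y_te):
    models = {
        "KNN": KNeighborsClassifier(n_neighbors=5),   # k = 5, as in Table 2
        "SVM": SVC(),                                 # default parameters
        "RF": RandomForestClassifier(),               # default parameters
        "C50-like tree": DecisionTreeClassifier(),    # stand-in for C50 (trials = 1)
    }
    return {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
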

3. Results

In the first phase of generating candidate train/test sets, we examined the MAI while adjusting the value of K, which determines the sensitivity of class overlap during the balanced sampling. Figure 8 and Table A1 in Appendix A describe the results: the average MAI was measured for each K, with the bin width fixed at 0.2. The overall performance was best when K was 3. As K increases, the number of groups also increases, and the instances in a specific group tend to become sparse; when a group contains too few instances, the diversity of the distribution cannot be secured. Therefore, a small K is advantageous for the proposed method.
The bin width is another important parameter of the proposed FWS, so we also experimented with its influence. We tested bin widths of 0.2 (5 bins), 0.1 (10 bins), and 0.05 (20 bins), with K fixed at 3. Figure 9 and Table A2 in Appendix A summarize the results. The performance was slightly better when the bin width was 0.2, but there was no significant difference overall. In a further experiment, we confirmed that 0.2 was best for multi-class datasets (number of classes > 2), whereas 0.05 was best for binary-class datasets. Therefore, the final FWS method uses a hybrid setting: 0.05 for binary-class datasets and 0.2 for multi-class datasets.
Table 3 shows the final experimental results when K = 3 and the bin width is the hybrid setting. In 61 of the 80 cases (76%), the MAI of FWS was better than that of RBS, and RBS was better than FWS in 19 cases (24%). This result indicates that FWS improves on the original RBS and the previous methods. The details are discussed in the next section.

4. Discussion

RBS is an efficient sampling method compared to previous methods, and the proposed FWS improves on RBS further. Figure 10 shows how much FWS improves on RBS. As shown in Figure 10a, the average MAI of FWS was 0.460, whereas that of RBS was 0.920; since a smaller MAI is better, FWS improved the average MAI by 56% compared to RBS. Figure 10b compares the standard deviations of the MAI: the standard deviation of FWS was 0.403, whereas that of RBS was 0.779. In other words, the spread of MAI values for FWS was smaller than that for RBS, so FWS yielded more stable sampling results. Figure 10c shows the range of MAI, calculated as (maximum of MAI) − (minimum of MAI), which also reflects the spread of MAI values. The ranges of FWS and RBS were 2.359 and 4.619, respectively, so the fluctuation of FWS was smaller than that of RBS. All statistics in Figure 10 show that FWS is a more stable and accurate method than RBS. Furthermore, they confirm that the similarity of distribution between the train/test sets and the whole dataset is an important factor for an ideal split.
In the development of a prediction model, the quality of the features determines the performance of the model; in general, the influence of the features is greater than that of the classification algorithm [17]. Therefore, considering feature weights in the distance calculation is reasonable. Figure 11 shows the influence of feature weighting in the FWS method, comparing FWS with and without feature weights. The average MAI without feature weighting was 0.634, whereas with feature weighting it was 0.490 (Figure 11a), which means that feature weighting improved the performance of FWS. In terms of the standard deviation, the two cases were similar (Figure 11b). The ranges with and without feature weighting were 2.314 and 1.801, respectively (Figure 11c); the larger range with weighting is caused by a single large maximum value and is less important than the standard deviation.
We also analyzed the variation in MAI according to the number of classes. With RBS, the MAI of binary-class datasets was higher than that of multi-class datasets, whereas the difference was small with FWS (Figure 12). This means that FWS is not influenced by the number of classes and is a more stable method than RBS.
In this study, we confirmed that the similarity of distribution between the original dataset and the train/test sets is an important factor for proper sampling, and that feature-weighted distance calculation can improve sampling performance. If the proposed FWS is used to split train/test sets, classification models can be evaluated more accurately. In our experiment, FWS performed better than RBS in 61 of the 80 cases of train/test sets, which shows that FWS still has room for further improvement; this is a topic for future research.

Author Contributions

Conceptualization, S.O.; methodology, H.S. and S.O.; software, H.S.; validation, H.S. and S.O.; formal analysis, H.S. and S.O.; investigation, H.S. and S.O.; resources, H.S.; data curation, H.S.; writing—original draft preparation, H.S. and S.O.; writing—review and editing, H.S. and S.O.; visualization, H.S. and S.O.; supervision, S.O.; project administration, S.O.; funding acquisition, S.O. All authors have read and agreed to the published version of the manuscript.

Funding

The present research was supported by the research fund of Dankook University in 2019.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. MAI of FWS according to K (bin width = 0.2).
No | Classifier | K = 3 | K = 4 | K = 5 | K = 6 | K = 7
1 | C50 | 2.383 | 2.368 | 0.499 | 0.499 | 0.499
1 | KNN | 0.552 | 0.534 | 1.252 | 0.480 | 0.480
1 | RF | 0.479 | 0.479 | 0.479 | 1.020 | 0.479
1 | SVM | 0.792 | 0.775 | 0.940 | 1.568 | 0.543
2 | C50 | 0.076 | 0.491 | 0.232 | 0.719 | 0.120
2 | KNN | 0.474 | 0.241 | 0.515 | 2.064 | 0.390
2 | RF | 1.233 | 0.211 | 1.328 | 0.153 | 3.192
2 | SVM | 1.075 | 1.557 | 0.058 | 1.175 | 0.418
3 | C50 | 0.078 | 0.446 | 1.189 | 0.835 | 0.164
3 | KNN | 1.766 | 0.166 | 1.195 | 0.659 | 0.639
3 | RF | 0.351 | 0.027 | 0.379 | 0.414 | 0.407
3 | SVM | 0.721 | 0.682 | 0.682 | 0.663 | 0.644
4 | C50 | 0.161 | 1.010 | 0.932 | 1.293 | 0.079
4 | KNN | 0.845 | 0.532 | 1.135 | 0.458 | 0.313
4 | RF | 0.760 | 0.055 | 1.117 | 1.495 | 1.112
4 | SVM | 1.213 | 0.332 | 0.023 | 1.407 | 0.107
5 | C50 | 0.538 | 0.078 | 0.414 | 0.322 | 1.122
5 | KNN | 0.214 | 0.286 | 0.326 | 0.165 | 0.430
5 | RF | 0.618 | 0.251 | 1.441 | 0.502 | 1.316
5 | SVM | 0.485 | 0.576 | 0.182 | 0.730 | 0.731
6 | C50 | 0.268 | 0.837 | 0.832 | 1.184 | 1.943
6 | KNN | 0.243 | 0.027 | 0.025 | 0.576 | 0.845
6 | RF | 0.159 | 0.503 | 0.167 | 0.948 | 0.274
6 | SVM | 0.737 | 0.447 | 0.729 | 0.447 | 0.727
7 | C50 | 0.347 | 0.524 | 0.337 | 0.883 | 0.160
7 | KNN | 0.122 | 0.466 | 0.241 | 0.066 | 0.235
7 | RF | 0.317 | 0.905 | 0.484 | 1.097 | 0.084
7 | SVM | 0.252 | 1.006 | 0.810 | 0.620 | 0.246
8 | C50 | 0.145 | 0.920 | 1.169 | 2.073 | 0.344
8 | KNN | 0.231 | 0.346 | 0.361 | 0.689 | 1.287
8 | RF | 0.439 | 1.023 | 0.502 | 0.140 | 0.502
8 | SVM | 1.185 | 2.047 | 0.529 | 1.490 | 0.217
9 | C50 | 0.807 | 0.075 | 7.890 | 1.820 | 0.818
9 | KNN | 1.039 | 0.482 | 0.517 | 0.719 | 0.144
9 | RF | 0.690 | 0.633 | 0.830 | 0.778 | 0.085
9 | SVM | 1.652 | 2.145 | 0.086 | 1.531 | 0.086
10 | C50 | 0.020 | 0.448 | 0.508 | 0.832 | 1.027
10 | KNN | 0.675 | 0.183 | 0.781 | 0.123 | 0.241
10 | RF | 0.500 | 0.054 | 0.115 | 1.340 | 0.948
10 | SVM | 0.359 | 1.253 | 0.232 | 1.212 | 1.123
11 | C50 | 0.365 | 0.398 | 0.273 | 0.429 | 0.273
11 | KNN | 0.267 | 0.230 | 0.533 | 0.195 | 0.230
11 | RF | 0.152 | 0.114 | 0.114 | 0.844 | 0.114
11 | SVM | 0.316 | 0.278 | 0.500 | 0.242 | 0.500
12 | C50 | 0.930 | 0.702 | 1.616 | 0.899 | 1.128
12 | KNN | 0.312 | 0.912 | 0.912 | 0.422 | 0.067
12 | RF | 0.514 | 0.768 | 0.512 | 1.283 | 0.512
12 | SVM | 0.839 | 1.615 | 2.160 | 0.294 | 1.112
13 | C50 | 0.641 | 0.311 | 0.825 | 0.634 | 1.249
13 | KNN | 0.058 | 0.498 | 1.366 | 0.460 | 0.133
13 | RF | 0.256 | 1.089 | 1.402 | 0.063 | 0.787
13 | SVM | 0.784 | 1.250 | 0.619 | 0.414 | 1.642
14 | C50 | 0.462 | 0.185 | 1.220 | 0.720 | 0.605
14 | KNN | 1.024 | 0.295 | 0.100 | 0.630 | 1.349
14 | RF | 0.328 | 0.375 | 0.840 | 0.018 | 0.135
14 | SVM | 0.405 | 0.619 | 0.746 | 0.502 | 0.453
15 | C50 | 0.454 | 0.201 | 0.585 | 0.280 | 0.042
15 | KNN | 0.318 | 0.149 | 0.161 | 0.918 | 0.537
15 | RF | 0.667 | 0.091 | 0.129 | 1.118 | 1.073
15 | SVM | 0.425 | 0.150 | 0.643 | 0.506 | 0.678
16 | C50 | 0.100 | 0.138 | 0.176 | 0.138 | 0.247
16 | KNN | 0.066 | 1.184 | 0.391 | 0.976 | 0.723
16 | RF | 0.085 | 0.614 | 0.819 | 1.364 | 0.217
16 | SVM | 0.023 | 0.063 | 0.101 | 0.063 | 0.174
17 | C50 | 1.062 | 2.067 | 0.677 | 0.145 | 2.023
17 | KNN | 0.424 | 0.382 | 1.763 | 0.807 | 1.272
17 | RF | 0.613 | 1.997 | 0.570 | 1.117 | 1.959
17 | SVM | 0.826 | 1.581 | 1.552 | 0.512 | 1.727
18 | C50 | 0.078 | 0.446 | 1.189 | 0.835 | 0.164
18 | KNN | 1.766 | 0.166 | 1.195 | 0.659 | 0.639
18 | RF | 0.351 | 0.027 | 0.379 | 0.414 | 0.407
18 | SVM | 0.721 | 0.682 | 0.682 | 0.663 | 0.644
19 | C50 | 0.636 | 0.062 | 0.347 | 1.515 | 0.388
19 | KNN | 0.099 | 2.152 | 0.724 | 0.747 | 0.243
19 | RF | 1.309 | 0.751 | 2.264 | 0.566 | 0.302
19 | SVM | 0.033 | 1.142 | 0.343 | 1.126 | 2.135
20 | C50 | 0.226 | 0.459 | 0.366 | 0.737 | 0.381
20 | KNN | 0.693 | 0.656 | 0.104 | 0.110 | 0.266
20 | RF | 0.086 | 0.961 | 0.827 | 1.319 | 0.120
20 | SVM | 0.610 | 1.173 | 0.819 | 0.448 | 0.440
Mean | | 0.554 | 0.654 | 0.775 | 0.754 | 0.645
Table A2. MAI of FWS according to bin width (K = 3).
No | Classifier | bw = 0.05 | bw = 0.1 | bw = 0.2
1 | C50 | 2.383 | 2.383 | 2.383
1 | KNN | 0.552 | 0.552 | 0.552
1 | RF | 0.479 | 0.479 | 0.479
1 | SVM | 0.792 | 0.792 | 0.792
2 | C50 | 0.076 | 1.519 | 1.519
2 | KNN | 0.474 | 0.615 | 0.615
2 | RF | 1.233 | 1.016 | 1.016
2 | SVM | 1.075 | 1.040 | 1.040
3 | C50 | 0.741 | 0.078 | 0.078
3 | KNN | 0.202 | 1.766 | 1.766
3 | RF | 0.061 | 0.351 | 0.351
3 | SVM | 0.155 | 0.721 | 0.721
4 | C50 | 0.161 | 0.941 | 0.161
4 | KNN | 0.188 | 0.188 | 0.845
4 | RF | 0.871 | 1.687 | 0.760
4 | SVM | 0.054 | 0.719 | 1.213
5 | C50 | 0.538 | 0.538 | 0.538
5 | KNN | 0.214 | 0.214 | 0.214
5 | RF | 0.618 | 0.618 | 0.618
5 | SVM | 0.485 | 0.485 | 0.485
6 | C50 | 0.268 | 0.268 | 0.268
6 | KNN | 0.243 | 0.243 | 0.243
6 | RF | 0.159 | 0.159 | 0.159
6 | SVM | 0.737 | 0.737 | 0.737
7 | C50 | 0.347 | 0.347 | 0.347
7 | KNN | 0.122 | 0.122 | 0.122
7 | RF | 0.317 | 0.317 | 0.317
7 | SVM | 0.252 | 0.252 | 0.252
8 | C50 | 0.148 | 0.148 | 0.145
8 | KNN | 0.888 | 0.888 | 0.231
8 | RF | 0.236 | 0.236 | 0.439
8 | SVM | 0.479 | 0.479 | 1.185
9 | C50 | 0.807 | 0.807 | 0.807
9 | KNN | 0.025 | 0.640 | 1.039
9 | RF | 0.065 | 0.060 | 0.690
9 | SVM | 0.297 | 1.662 | 1.652
10 | C50 | 0.344 | 0.344 | 0.020
10 | KNN | 0.071 | 0.071 | 0.675
10 | RF | 0.500 | 0.500 | 0.500
10 | SVM | 0.848 | 0.848 | 0.359
11 | C50 | 0.365 | 0.365 | 0.365
11 | KNN | 0.514 | 0.514 | 0.267
11 | RF | 0.152 | 0.152 | 0.152
11 | SVM | 0.316 | 0.316 | 0.316
12 | C50 | 0.213 | 0.930 | 0.930
12 | KNN | 1.047 | 0.312 | 0.312
12 | RF | 0.257 | 0.514 | 0.514
12 | SVM | 1.112 | 0.839 | 0.839
13 | C50 | 1.667 | 1.667 | 0.641
13 | KNN | 0.483 | 0.483 | 0.058
13 | RF | 1.975 | 1.975 | 0.256
13 | SVM | 1.056 | 1.056 | 0.784
14 | C50 | 0.289 | 0.462 | 0.462
14 | KNN | 0.349 | 1.024 | 1.024
14 | RF | 0.078 | 0.328 | 0.328
14 | SVM | 0.209 | 0.405 | 0.405
15 | C50 | 1.290 | 0.197 | 0.454
15 | KNN | 0.206 | 0.802 | 0.318
15 | RF | 0.839 | 0.839 | 0.667
15 | SVM | 0.134 | 0.134 | 0.425
16 | C50 | 0.348 | 0.348 | 0.100
16 | KNN | 1.033 | 1.033 | 0.066
16 | RF | 0.418 | 0.418 | 0.085
16 | SVM | 1.127 | 1.127 | 0.023
17 | C50 | 1.697 | 0.419 | 1.062
17 | KNN | 1.368 | 0.896 | 0.424
17 | RF | 0.240 | 1.467 | 0.613
17 | SVM | 1.843 | 0.210 | 0.826
18 | C50 | 0.741 | 0.078 | 0.078
18 | KNN | 0.202 | 1.766 | 1.766
18 | RF | 0.061 | 0.351 | 0.351
18 | SVM | 0.155 | 0.721 | 0.721
19 | C50 | 0.695 | 0.636 | 0.636
19 | KNN | 0.842 | 0.099 | 0.099
19 | RF | 0.955 | 1.309 | 1.309
19 | SVM | 0.783 | 0.033 | 0.033
20 | C50 | 0.226 | 0.055 | 0.226
20 | KNN | 0.693 | 0.080 | 0.693
20 | RF | 0.448 | 1.362 | 0.086
20 | SVM | 0.610 | 0.244 | 0.610
Mean | | 0.827 | 0.831 | 0.829

References

  1. Kotsiantis, S.B. Supervised Machine Learning: A Review of Classification Techniques. Informatica 2007, 31, 249–268. [Google Scholar]
  2. Kang, D.; Oh, S. Balanced Training/Test Set Sampling for Proper Evaluation of Classification Models. Intell. Data Anal. 2020, 24, 5–18. [Google Scholar] [CrossRef]
  3. Reitermanova, Z. Data Splitting. In Proceedings of the WDS, Prague, Czech Republic, 1–4 June 2010; Volume 10, pp. 31–36. [Google Scholar]
  4. Ditrich, J. Data Representativeness Problem in Credit Scoring. Acta Oeconomica Pragensia 2015, 2015, 3–17. [Google Scholar] [CrossRef] [Green Version]
  5. Elsayir, H. Comparison of Precision of Systematic Sampling with Some Other Probability Samplings. Stat. J. Theor. Appl. Stat. 2014, 3, 111–116. [Google Scholar] [CrossRef]
  6. Martin, E.J.; Critchlow, R.E. Beyond Mere Diversity: Tailoring Combinatorial Libraries for Drug Discovery. J. Comb. Chem. 1999, 1, 32–45. [Google Scholar] [CrossRef] [PubMed]
  7. Hudson, B.D.; Hyde, R.M.; Rahr, E.; Wood, J.; Osman, J. Parameter Based Methods for Compound Selection from Chemical Databases. Quant. Struct. Act. Relatsh. 1996, 15, 285–289. [Google Scholar] [CrossRef]
  8. Oh, S. A New Dataset Evaluation Method Based on Category Overlap. Comput. Biol. Med. 2011, 41, 115–122. [Google Scholar] [CrossRef] [PubMed]
  9. Raschka, S. Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning. arXiv 2018, arXiv:181112808. [Google Scholar]
  10. Wu, B.; Nevatia, R. Tracking of multiple, partially occluded humans based on static body part detection. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; Volume 1, pp. 951–958. [Google Scholar]
  11. Shi, X.; Ling, H.; Xing, J.; Hu, W. Multi-target tracking by rank-1 tensor approximation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2387–2394. [Google Scholar]
  12. Rubner, Y.; Tomasi, C.; Guibas, L.J. The Earth Mover’s Distance as a Metric for Image Retrieval. Int. J. Comput. Vis. 2000, 40, 99–121. [Google Scholar] [CrossRef]
  13. Ioannidis, Y. The History of Histograms (abridged). In Proceedings of the 2003 VLDB Conference; Berlin, Germany, 9–12 September 2003, Freytag, J.-C., Lockemann, P., Abiteboul, S., Carey, M., Selinger, P., Heuer, A., Eds.; Morgan Kaufmann: San Francisco, CA, USA, 2003; pp. 19–30. ISBN 978-0-12-722442-8. [Google Scholar]
  14. Bityukov, S.I.; Maksimushkina, A.V.; Smirnova, V.V. Comparison of Histograms in Physical Research. Nucl. Energy Technol. 2016, 2, 108–113. [Google Scholar] [CrossRef] [Green Version]
  15. Bazan, E.; Dokládal, P.; Dokladalova, E. Quantitative Analysis of Similarity Measures of Distributions. 2019; ⟨hal-01984970⟩. [Google Scholar]
  16. Covert, I.; Lundberg, S.; Lee, S.-I. Understanding Global Feature Contributions with Additive Importance Measures. Adv. Neural Inf. Process. Syst. 2020, 33, 17212–17223. [Google Scholar]
  17. Zheng, A.; Casari, A. Feature Engineering for Machine Learning, 1st ed.; O’Reilly: Sebastopol, CA, USA, 2018; pp. 1–4. [Google Scholar]
Figure 1. Development process of a classification model.
Figure 2. Overall process of the proposed sampling method.
Figure 3. An example of grouping of data instances according to the class overlap number where k = 3.
Figure 4. Pseudocode of modified R-value-based sampling (RBS).
Figure 5. The measurement of similarity between original dataset and given train/test sets.
Figure 6. Comparison of two methods for measurement of histogram similarity.
Figure 7. The pseudocode of feature-weighted sampling (FWS) method.
Figure 8. Mean accuracy index (MAI) of FWS according to K (bin width = 0.2).
Figure 9. MAI of FWS according to bin width (K = 3).
Figure 10. Comparison of RBS and FWS.
Figure 11. FWS with and without feature-weighted distance.
Figure 12. MAI of FWS according to number of classes (K = 3, bin width = hybrid).
Table 1. List of benchmark datasets.
No | Name | # of Features | # of Instances | # of Classes
1 | audit | 25 | 772 | 2
2 | avila | 10 | 10,430 | 12
3 | breastcancer | 30 | 569 | 2
4 | breastTissue | 9 | 106 | 6
5 | ecoil | 7 | 336 | 8
6 | Frogs_MFCCs | 22 | 7127 | 3
7 | gender_classification | 7 | 5001 | 2
8 | glass | 9 | 214 | 6
9 | hill_Valley | 100 | 1212 | 2
10 | ionosphere | 33 | 351 | 2
11 | iris | 4 | 150 | 3
12 | liver | 6 | 345 | 2
13 | music_genre | 26 | 1000 | 10
14 | pima_diabetes | 8 | 768 | 2
15 | satimage | 36 | 4435 | 6
16 | seed | 7 | 210 | 3
17 | statlog_segment | 16 | 2310 | 7
18 | wdbc | 30 | 569 | 2
19 | winequality | 11 | 4893 | 6
20 | Wireless_Indoor | 7 | 2000 | 4
Table 2. Summary of classifier and applied parameters.
Classifier | R Package | Parameter Values
KNN | class | k = 5
SVM | e1071 | Default
RF | randomForest | Default
C50 | C50 | trials = 1
Table 3. Final experimental results.
Dataset | Classifier | MA | SD | RBS Accuracy | FWS Accuracy | RBS MAI | FWS MAI
1 | C50 | 0.998 | 0.004 | 0.99479 | 0.990 | 0.972 | 2.383
1 | KNN | 0.957 | 0.014 | 0.94792 | 0.949 | 0.628 | 0.552
1 | RF | 0.998 | 0.003 | 0.99479 | 1.000 | 1.083 | 0.479
1 | SVM | 0.969 | 0.012 | 0.9375 | 0.959 | 2.623 | 0.792
2 | C50 | 0.974 | 0.004 | 0.97694 | 0.974 | 0.645 | 0.076
2 | KNN | 0.698 | 0.007 | 0.69254 | 0.702 | 0.758 | 0.474
2 | RF | 0.979 | 0.003 | 0.97771 | 0.975 | 0.388 | 1.233
2 | SVM | 0.69 | 0.011 | 0.68255 | 0.678 | 0.652 | 1.075
3 | C50 | 0.936 | 0.021 | 0.986 | 0.951 | 2.389 | 0.741
3 | KNN | 0.968 | 0.013 | 0.986 | 0.965 | 1.348 | 0.202
3 | RF | 0.959 | 0.017 | 0.972 | 0.958 | 0.740 | 0.061
3 | SVM | 0.974 | 0.012 | 0.993 | 0.972 | 1.532 | 0.155
4 | C50 | 0.665 | 0.069 | 0.708 | 0.676 | 0.632 | 0.161
4 | KNN | 0.661 | 0.078 | 0.625 | 0.595 | 0.458 | 0.845
4 | RF | 0.699 | 0.066 | 0.583 | 0.649 | 1.746 | 0.760
4 | SVM | 0.598 | 0.070 | 0.583 | 0.514 | 0.215 | 1.213
5 | C50 | 0.803 | 0.036 | 0.783 | 0.822 | 0.546 | 0.538
5 | KNN | 0.850 | 0.027 | 0.855 | 0.856 | 0.209 | 0.214
5 | RF | 0.860 | 0.029 | 0.831 | 0.878 | 0.987 | 0.618
5 | SVM | 0.807 | 0.061 | 0.735 | 0.778 | 1.191 | 0.485
6 | C50 | 0.963 | 0.005 | 0.967 | 0.961 | 0.754 | 0.268
6 | KNN | 0.992 | 0.002 | 0.991 | 0.992 | 0.319 | 0.243
6 | RF | 0.987 | 0.003 | 0.984 | 0.987 | 0.973 | 0.159
6 | SVM | 0.992 | 0.002 | 0.990 | 0.990 | 1.037 | 0.737
7 | C50 | 0.972 | 0.004 | 0.970 | 0.970 | 0.369 | 0.347
7 | KNN | 0.965 | 0.005 | 0.970 | 0.965 | 1.086 | 0.122
7 | RF | 0.974 | 0.004 | 0.977 | 0.973 | 0.655 | 0.317
7 | SVM | 0.972 | 0.004 | 0.973 | 0.970 | 0.298 | 0.252
8 | C50 | 0.686 | 0.055 | 0.615 | 0.694 | 1.275 | 0.145
8 | KNN | 0.634 | 0.049 | 0.596 | 0.645 | 0.768 | 0.231
8 | RF | 0.779 | 0.048 | 0.750 | 0.758 | 0.608 | 0.439
8 | SVM | 0.686 | 0.048 | 0.673 | 0.629 | 0.276 | 1.185
9 | C50 | 0.505 | 0.002 | 0.505 | 0.503 | 0.110 | 0.807
9 | KNN | 0.548 | 0.025 | 0.558 | 0.549 | 0.380 | 0.025
9 | RF | 0.600 | 0.026 | 0.653 | 0.601 | 2.061 | 0.065
9 | SVM | 0.515 | 0.017 | 0.545 | 0.510 | 1.776 | 0.297
10 | C50 | 0.9 | 0.03 | 0.862 | 0.890 | 1.276 | 0.344
10 | KNN | 0.844 | 0.029 | 0.839 | 0.846 | 0.169 | 0.071
10 | RF | 0.934 | 0.022 | 0.954 | 0.923 | 0.882 | 0.500
10 | SVM | 0.942 | 0.022 | 0.943 | 0.923 | 0.017 | 0.848
11 | C50 | 0.938 | 0.036 | 0.944 | 0.951 | 0.174 | 0.365
11 | KNN | 0.96 | 0.031 | 0.944 | 0.951 | 0.484 | 0.267
11 | RF | 0.956 | 0.03 | 0.944 | 0.951 | 0.376 | 0.152
11 | SVM | 0.961 | 0.031 | 0.944 | 0.951 | 0.538 | 0.316
12 | C50 | 0.648 | 0.048 | 0.756 | 0.637 | 2.252 | 0.213
12 | KNN | 0.607 | 0.045 | 0.605 | 0.560 | 0.062 | 1.047
12 | RF | 0.725 | 0.043 | 0.767 | 0.714 | 0.983 | 0.257
12 | SVM | 0.693 | 0.04 | 0.721 | 0.648 | 0.689 | 1.112
13 | C50 | 0.485 | 0.03 | 0.484 | 0.466 | 0.029 | 0.641
13 | KNN | 0.616 | 0.027 | 0.568 | 0.617 | 1.790 | 0.058
13 | RF | 0.645 | 0.027 | 0.628 | 0.652 | 0.610 | 0.256
13 | SVM | 0.654 | 0.028 | 0.648 | 0.633 | 0.229 | 0.784
14 | C50 | 0.736 | 0.029 | 0.724 | 0.745 | 0.423 | 0.289
14 | KNN | 0.734 | 0.026 | 0.719 | 0.724 | 0.570 | 0.349
14 | RF | 0.762 | 0.025 | 0.734 | 0.760 | 1.109 | 0.078
14 | SVM | 0.761 | 0.026 | 0.724 | 0.755 | 1.405 | 0.209
15 | C50 | 0.857 | 0.010 | 0.859 | 0.852 | 0.225 | 0.454
15 | KNN | 0.901 | 0.008 | 0.898 | 0.898 | 0.343 | 0.318
15 | RF | 0.910 | 0.008 | 0.914 | 0.915 | 0.548 | 0.667
15 | SVM | 0.891 | 0.008 | 0.892 | 0.894 | 0.056 | 0.425
16 | C50 | 0.908 | 0.039 | 0.882 | 0.912 | 0.664 | 0.100
16 | KNN | 0.928 | 0.032 | 0.882 | 0.930 | 1.421 | 0.066
16 | RF | 0.927 | 0.035 | 0.882 | 0.930 | 1.276 | 0.085
16 | SVM | 0.929 | 0.030 | 0.882 | 0.930 | 1.533 | 0.023
17 | C50 | 0.964 | 0.008 | 0.977 | 0.973 | 1.644 | 1.062
17 | KNN | 0.959 | 0.007 | 0.963 | 0.956 | 0.661 | 0.424
17 | RF | 0.978 | 0.006 | 0.980 | 0.974 | 0.465 | 0.613
17 | SVM | 0.944 | 0.008 | 0.955 | 0.950 | 1.340 | 0.826
18 | C50 | 0.936 | 0.021 | 0.986 | 0.951 | 0.895 | 0.741
18 | KNN | 0.968 | 0.013 | 0.986 | 0.965 | 0.965 | 0.202
18 | RF | 0.959 | 0.017 | 0.972 | 0.958 | 0.951 | 0.061
18 | SVM | 0.974 | 0.012 | 0.993 | 0.972 | 0.979 | 0.155
19 | C50 | 0.573 | 0.014 | 0.636 | 0.581 | 4.636 | 0.636
19 | KNN | 0.543 | 0.012 | 0.571 | 0.541 | 2.329 | 0.099
19 | RF | 0.684 | 0.011 | 0.723 | 0.699 | 3.404 | 1.309
19 | SVM | 0.571 | 0.011 | 0.577 | 0.572 | 0.497 | 0.033
20 | C50 | 0.97 | 0.007 | 0.972 | 0.968 | 0.296 | 0.226
20 | KNN | 0.984 | 0.005 | 0.98 | 0.980 | 0.732 | 0.693
20 | RF | 0.984 | 0.005 | 0.978 | 0.984 | 1.040 | 0.086
20 | SVM | 0.981 | 0.005 | 0.98 | 0.984 | 0.158 | 0.610
MA: mean classification accuracy, SD: standard deviation.
