Feature-Weighted Sampling for Proper Evaluation of Classiﬁcation Models

Abstract: In machine learning applications, classification schemes are widely used for prediction tasks. Typically, to develop a prediction model, the given dataset is divided into training and test sets; the training set is used to build the model and the test set is used to evaluate it. Traditionally, random sampling is used to divide datasets. The problem, however, is that the performance of the model is evaluated differently depending on how the training and test sets are divided. Therefore, in this study, we propose an improved sampling method for the accurate evaluation of a classification model. We first generate numerous candidate train/test sets using the R-value-based sampling method. We then evaluate how similar the distribution of each candidate is to that of the whole dataset, and the case with the smallest distribution difference is selected as the final train/test set. Histograms and feature importance are used to evaluate the similarity of distributions. The proposed method produces more appropriate training and test sets than previous sampling methods, including random and non-random sampling.


Introduction
Classification problems in machine learning are easily found in the real world. Doctors diagnose patients as either diseased or healthy based on past symptoms of a specific disease, and in online commerce, security experts decide whether transactions are fraudulent or normal based on the patterns of previous transactions. As these examples show, the purpose of classification in machine learning is to predict an unknown class label based on past data. An explicit classification target, such as "diseased" or "healthy", is called a class label. Classification belongs to supervised learning because it uses class labels. Representative classification algorithms include decision trees, artificial neural networks (ANNs), naive Bayes (NB) classifiers, support vector machines (SVM), and k-nearest neighbors (KNN) [1].
In general, the development of a classification model comprises two phases, as shown in Figure 1, starting with data partitioning. The entire dataset is divided into a training set and a test set, each of which is used during a different stage and for a different purpose. The first is the learning (training) phase, which uses the training set; part of the training set is used as a validation set. The second is the model evaluation phase, which uses the test set. The evaluation result on the test set is considered the final performance of the trained model. The inherent problem in the development of a classification model is that the model's performance (accuracy) inevitably depends on how the training and test sets are divided. This is because the model reflects the characteristics of the training set, but the accuracy of the model is influenced by the characteristics of the test set. If a model with poor actual performance is evaluated on an easy-to-classify test set, its performance will look good. Conversely, if a model with good performance is evaluated on a difficult-to-classify test set, its performance will be underestimated. In our previous work [2], we showed that 1000 train/test splits generated by random sampling produced classification accuracies ranging from 0.848 to 0.975. This phenomenon is due to the difference in data distribution between the training and test sets, emphasizing that dividing the entire dataset into training and test sets has a significant impact on model performance evaluation. The ideal goal of splitting train/test sets is that the distributions of both the training and test sets become the same as that of the whole dataset. However, this is a difficult task for multi-dimensional datasets. Various methods have been proposed to solve this problem. Random sampling is an easy and widely used method.
In random sampling, each data instance has the same probability of being chosen, which can reduce the bias of model performance. However, it produces a high variance in model performance if a dataset has an abnormal distribution or the sample size is small [3,4]. Systematic sampling is a method of extracting data by randomly arranging the data and then selecting instances at regular intervals [5]. Stratified sampling first divides a population into non-overlapping layers and then samples from each layer; it uses the internal structure (layers) and the distribution of a dataset [4]. D-optimal [6] and the most descriptive compound method (MDC) [7] are advanced stratified sampling methods; the potential error of the descriptor and the rank sum of the distances between compounds are the internal structures of D-optimal and MDC, respectively. R-value-based sampling (RBS) [2] is a type of stratified sampling. It divides the entire dataset into n groups (layers) according to the degree of "class overlap" and applies systematic sampling to each group. In general, the classification accuracy for a dataset is strongly influenced by the degree of overlap of the classes in the dataset [4,8], and this degree of class overlap is measured using the R-value [8]. Let p be a data instance and q1, q2, ..., qk its k nearest neighbors. If r of these neighbors have a class label different from that of p, the degree of overlap of p is r (0 ≤ r ≤ k); in other words, p belongs to group r. Experimental results confirm that RBS produces better training and test sets than random and several non-random sampling methods.
In the machine learning area, k-fold cross-validation has been used to overcome the overfitting problem in classification. It builds k training models, and the mean of the test accuracies is used as an evaluation measure for tuning the parameters of a model or for comparing different models. The repeated holdout method, also known as Monte Carlo cross-validation, is also available for model evaluation [3,9]. During the iteration of the holdout process, the dataset is repeatedly divided at random into training and test sets, and the mean of the model accuracy gradually converges to one value [2]. The purpose of k-fold cross-validation and the holdout method is different from that of the sampling methods. Both k-fold cross-validation and the holdout method produce multiple train/test sets and, as a result, multiple prediction models; we cannot know which of these models is the desirable one. Therefore, they were excluded from the discussion of the sampling issue.
In this study, we propose an improved sampling method based on RBS. We generated candidate train/test sets using the modified RBS algorithm and evaluated the distribution similarity between the candidates and the whole dataset. In the evaluation process, a data histogram and feature importance were considered. Finally, the case with the smallest deviation of the distribution was selected. We compared the proposed method with RBS, and we confirmed that the proposed method shows better performance than the previous RBS.

Materials and Methods
As mentioned earlier, the ideal training and test sets should have the same distribution as the original dataset. To achieve this goal, we propose a method called feature-weighted sampling (FWS). Our main idea is as follows: (1) generate numerous candidate train/test sets using the modified RBS; (2) evaluate the similarity between the original dataset and the candidate cases, where similarity is measured by distance; (3) choose the case that has the smallest distance to the original dataset. Figure 2 summarizes the proposed method in detail. The first phase generates n train/test set candidates with stratified random sampling. The stratified sampling uses the modified RBS method, which reflects an intrinsic property of the data called class overlap. The second phase selects the candidate whose distribution is most similar to that of the original dataset. To evaluate the similarity of distributions, we measured the distance between the train/test sets and the original dataset. For this distance, we tested the Bhattacharyya distance [10], histogram intersection [11], and the Earth Mover's Distance [12]; we finally adopted the Earth Mover's Distance. Feature importance was applied to weight each feature during the distance calculation. As a result, the train/test sets that had the smallest distance from the original dataset were selected. For the evaluation of the sampling method, we devised a metric named the mean accuracy index (MAI). Using the MAI, we compared the proposed FWS and RBS. Twenty benchmark datasets and four classifiers, including k-nearest neighbor (KNN), support vector machine (SVM), random forest (RF), and C50, were used for the comparison.


Phase 1: Generate Candidates
In the candidate generation step, 1000 candidates, each a pair of train/test sets, were generated using the modified RBS; 25% and 75% of the total instances were assigned to the test set and training set, respectively. Class overlap is the key concept of RBS. We first summarize class overlap and then explain the modified RBS.

Concept of Class Overlap
Class overlap refers to the overlap of data instances among classes, and a wide class overlap makes classification tasks difficult [8]. The overlap number of an instance p is calculated by counting the number of instances with different class labels among its k nearest neighbors. Figure 3 shows the class overlap value for a data instance (the red cross in Figure 3) when k = 3. If the overlap number exceeds a threshold, we can determine that p is located in the overlapped area. The ratio of instances located in the overlapped area is the R-value [8], which can be used to evaluate the quality of a dataset. In RBS, the overlap number of an instance is used to group the instance. If k = 3, then an instance can belong to one of four groups (r = 0, 1, 2, or 3). RBS samples train/test instances from these four groups.
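As a concrete illustration, the overlap number described above can be computed with a brute-force k-nearest-neighbor search. The following is a minimal sketch; the function name `overlap_numbers` and the use of Euclidean distance are our assumptions, not taken from the paper:

```python
import numpy as np

def overlap_numbers(X, y, k=3):
    """For each instance, count how many of its k nearest neighbors
    have a different class label (the overlap number r, 0 <= r <= k)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    r = np.zeros(len(X), dtype=int)
    for i in range(len(X)):
        d = np.sum((X - X[i]) ** 2, axis=1)  # squared Euclidean distances
        d[i] = np.inf                        # exclude the instance itself
        nn = np.argsort(d)[:k]               # indices of the k nearest neighbors
        r[i] = np.sum(y[nn] != y[i])         # neighbors with a different label
    return r
```

An instance deep inside its own class gets r = 0, while an instance surrounded by the other class gets r = k, so r directly indexes the group the instance falls into.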


Modified RBS
The original RBS adopts a stratified sampling method. It groups each instance according to the class overlap number and then samples each group in a stratified manner. As a result, the original RBS always produces the same training and test sets. We replaced the stratified sampling with random sampling in the original RBS. The modified RBS produces various training/test sets according to the random seed. Figure 4 shows the pseudocode for the modified RBS [2].
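The grouping-then-random-sampling scheme described above can be sketched as follows. This is a simplified illustration, not the paper's Figure 4 pseudocode; the per-group rounding rule and the function name `modified_rbs_split` are our assumptions:

```python
import numpy as np

def modified_rbs_split(r, test_ratio=0.25, seed=0):
    """Split instance indices into train/test sets by randomly sampling
    each class-overlap group (r = 0..k) at the same test ratio, so the
    split varies with the random seed."""
    rng = np.random.default_rng(seed)
    r = np.asarray(r)
    test_idx = []
    for g in np.unique(r):
        members = np.flatnonzero(r == g)            # instances in group g
        n_test = int(round(len(members) * test_ratio))
        test_idx.extend(rng.choice(members, size=n_test, replace=False))
    test_idx = np.array(sorted(test_idx))
    train_idx = np.setdiff1d(np.arange(len(r)), test_idx)
    return train_idx, test_idx
```

Calling this 1000 times with different seeds yields the 1000 candidate splits used in Phase 1.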

Phase 2: Evaluate the Candidates and Select Best Train/Test Sets
The main goal of Phase 2 is to find the best train/test sets among the 1000 candidates. We evaluated each candidate according to the workflow shown in Figure 5. Each feature in the dataset was scaled to a value between 0 and 1, and histograms were then generated for the whole dataset and for the candidate train/test sets. Based on the histogram data, the similarity in distribution between the whole dataset and the training set, and between the whole dataset and the test set, was measured using the Earth Mover's Distance. The final similarity distance for each candidate was obtained by summing the similarity distances of the individual features, each weighted by the importance of the feature. Once the similarity distances for all candidates were obtained, we selected the candidate with the smallest distance as the output of the FWS method. We explain histogram generation, similarity calculation, and feature weighting in the following sections.
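The candidate-selection step of this workflow can be sketched end-to-end as follows, assuming per-feature histograms have already been built; the 1-D Earth Mover's Distance is computed from cumulative sums, and all names are illustrative rather than taken from the paper:

```python
import numpy as np

def select_best_candidate(whole_hists, candidates, weights):
    """Choose the candidate (train_hists, test_hists) pair whose per-feature
    histograms are closest to the whole dataset's, by a feature-weighted
    sum of 1-D Earth Mover's Distances."""
    def emd(h1, h2):
        # EMD between 1-D histograms on identical bins:
        # L1 distance between their cumulative distributions
        return np.abs(np.cumsum(np.asarray(h1, float) - np.asarray(h2, float))).sum()

    best_idx, best_dist = None, np.inf
    for idx, (train_hists, test_hists) in enumerate(candidates):
        dist = sum(w * (emd(wh, tr) + emd(wh, te))
                   for w, wh, tr, te in zip(weights, whole_hists,
                                            train_hists, test_hists))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx, best_dist
```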

Figure 5. The measurement of similarity between the original dataset and given train/test sets.

Generation of Histograms
A histogram represents an approximate distribution by mapping a set of real values to equally wide intervals, called bins. For example, a histogram configured with n bins can be defined as histogram = {(bin_i, value_i) | 1 ≤ i ≤ n, where bin_k < bin_j when k < j}. This definition allows the histogram to be represented as a bar chart for data visualization; however, it is more advantageous here to use it as a pure mathematical object containing an approximate data distribution [13,14]. Histograms are mathematical tools that extract compressed characteristic information from a dataset and play an important role in various fields such as computer vision, image retrieval, and databases [12-15]. We confirmed that the histogram approach is better than a statistical quantile-based approach.
This work also views the histogram as a mathematical object and attempts to measure the quantitative similarity between the entire dataset and the candidate datasets. By transforming the real distribution into a histogram, finding the train/test sets whose distribution is most similar to that of the entire dataset becomes equivalent to an image retrieval problem: our goal is to find, among the 1000 candidate histogram images, the one most similar to the histogram image of the entire dataset.
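A per-feature histogram of the kind described here might be built as follows. This is a sketch; min-max scaling over the given data and normalizing counts to relative frequencies are our assumptions:

```python
import numpy as np

def feature_histograms(X, bin_width=0.2):
    """Min-max scale each feature to [0, 1] and build a normalized
    histogram per feature with the given bin width."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    Xs = (X - lo) / np.where(hi > lo, hi - lo, 1.0)   # scale to [0, 1]
    n_bins = int(round(1.0 / bin_width))
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    hists = []
    for j in range(Xs.shape[1]):
        counts, _ = np.histogram(Xs[:, j], bins=edges)
        hists.append(counts / counts.sum())           # relative frequencies
    return hists
```

With bin_width = 0.2 this yields the 5-bin histograms used in the main experiments; 0.1 and 0.05 give 10 and 20 bins.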

Measurement of Histogram Similarity
We evaluated the similarity of histograms in terms of distance-based closeness. Although there are several methods and metrics for obtaining similarity distances between histograms [14,15], we exploited the Earth Mover's Distance [12], which adopts a cross-bin scheme. Unlike bin-by-bin methods, cross-bin measurement evaluates not only exactly corresponding bins but also non-corresponding bins (Figure 6) [12]. This makes it less sensitive to the location of bins and better reflects human-perceived similarity [12]. The Earth Mover's Distance is a cross-bin method based on optimal transport theory, and several studies have demonstrated its superiority [12,15]. In addition, it has the properties of a true distance metric, satisfying non-negativity, symmetry, and the triangle inequality [15]. In this study, the similarity between datasets is defined as the sum of the histogram distances of all features. The Earth Mover's Distance was calculated using the emdist package in CRAN (https://cran.r-project.org/web/packages/emdist/index.html, accessed on 25 February 2021).
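The paper computes the Earth Mover's Distance with the emdist R package. For 1-D histograms on identical, equally spaced bins, an equivalent value (in bin-width units) can be obtained from the difference of the cumulative distributions, as this sketch shows:

```python
import numpy as np

def emd_1d(h1, h2):
    """Earth Mover's Distance between two normalized 1-D histograms that
    share identical, equally spaced bins, measured in bin-width units:
    the L1 distance between the two cumulative distributions."""
    diff = np.asarray(h1, dtype=float) - np.asarray(h2, dtype=float)
    return float(np.abs(np.cumsum(diff)).sum())
```

For example, moving all the mass from the first bin to the third costs 2 bin widths, while identical histograms have distance 0.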

Feature Weighting
Previously, we conceptually defined the similarity between datasets by the distances between features; however, using simple (unweighted) distances can be a problem. Each feature not only has a different distribution of values but also contributes to model accuracy to a different degree. In other words, the same similarity distance between features can have different effects on predictive power. For example, although features A and B may have equal similarity distances, A may have a very strong effect on model accuracy, whereas B may have a weak effect. Therefore, when calculating the distance for each feature, we must apply a weight according to the effect of that feature.
There are many methods to evaluate the effects of features, such as information gain and chi-square. We used the Shapley value-based feature importance method [16]. The Shapley value is a method for evaluating the contribution of each feature value in an instance to the model; it takes the idea from game theory of distributing profits fairly according to the contribution of each player. Recently, Covert [16] proposed a method, called SAGE, that measures feature importance from a global (dataset-level) perspective of the Shapley value rather than per instance; it has also been published as a Python package. We used this method to obtain feature importance and assign weights when calculating the similarity distances. The weighted distance between the entire dataset and the given train/test sets was defined as follows:

distance = Σ_{i=1}^{n} w_i × (d_train(f_i) + d_test(f_i))

where:
• w_i: importance weight of the i-th feature
• d_train(f_i): similarity distance between the whole dataset and the training set for the i-th feature
• d_test(f_i): similarity distance between the whole dataset and the test set for the i-th feature

The pseudocode for the proposed FWS method is described in Figure 7.
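The weighted distance defined above maps directly to code; a minimal sketch with illustrative argument names:

```python
def weighted_distance(w, d_train, d_test):
    """Feature-importance-weighted distance between the whole dataset and
    a candidate train/test pair: sum over features of
    w_i * (d_train(f_i) + d_test(f_i))."""
    return sum(wi * (dtr + dte) for wi, dtr, dte in zip(w, d_train, d_test))
```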

Evaluation of FWS Method
To confirm the performance of the proposed sampling method, we compared it with the original RBS. Because RBS has already been shown to outperform other sampling methods [2], we omit direct comparisons with those methods. MAI was used as the evaluation metric. For the benchmark test, 20 datasets and 4 classification algorithms were employed.

Evaluation Metric: MAI
Measuring the quality of given train/test sets is a difficult issue, because we do not know the ideal train/test sets that completely reflect the entire dataset. Kang [2] proposed MAI as a solution. He generated 1000 train/test sets by random sampling, measured the mean of the resulting test accuracies of a classification algorithm, and considered this mean accuracy to be the accuracy of the ideal train/test sets; in statistics, the mean of a large sample converges to the mean of the population. Let us suppose that AEV is the mean accuracy from n train/test sets. The AEV can be defined as follows:

AEV = (1/n) Σ_{i=1}^{n} test_acc_i

where test_acc_i is the test accuracy generated by the i-th random sampling. MAI is defined by the following equation:

MAI = |ACC − AEV| / SD

where ACC refers to the test accuracy derived from the classification model for the given test set, and SD is the standard deviation of the test accuracies (test_acc_i) used in the AEV. The intuitive meaning of MAI is "how far the given ACC is from the AEV"; therefore, the smaller the MAI, the better. We used the MAI as an evaluation metric for the train/test sets.
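Under this definition, MAI can be computed as follows. This is a sketch; the use of the sample standard deviation (ddof=1) is our assumption, since the text does not specify it:

```python
import numpy as np

def mai(acc, test_accs):
    """Mean accuracy index: how far a given test accuracy (ACC) lies from
    the mean accuracy of many random splits (AEV), in units of their
    standard deviation (SD)."""
    test_accs = np.asarray(test_accs, dtype=float)
    aev = test_accs.mean()        # AEV: mean of the n test accuracies
    sd = test_accs.std(ddof=1)    # SD: sample standard deviation (assumption)
    return abs(acc - aev) / sd
```

A model split whose accuracy equals the AEV scores an MAI of 0; a split one standard deviation away scores 1.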

Results
In the first phase of generating candidate train/test sets, the MAI value was examined while adjusting the value of K, which determines the sensitivity of class overlap during stratified sampling. We experimented with the influence of K; Figure 8 and Table A1 in Appendix A describe the results, where the average MAI was measured according to K. In this experiment, the bin width was fixed at 0.2. As we can see, the overall performance was best when K was 3. When the value of K increased, the number of groups also increased, and the instances in a specific group tended to become sparse. When the instances of each group are insufficient, the diversity of the distribution cannot be secured. Therefore, a small value of K is advantageous for the proposed method.

The value of the bin width is another important parameter for the proposed FWS. Therefore, we experimented with the influence of the bin width. We tested the values 0.2 (5 bins), 0.1 (10 bins), and 0.05 (20 bins), with K fixed at 3. Figure 9 and Table A2 in Appendix A summarize the results. When the bin width was 0.2, the performance was slightly better, but there was no significant difference overall.
In another experiment, we confirmed that 0.2 was best for multi-class datasets (number of classes > 2), whereas 0.05 was best for binary-class datasets. Therefore, the final FWS method uses 0.05 and 0.2 as a hybrid, choosing the bin width according to the number of classes. Table 3 shows the final experimental results when K = 3 and the bin width is hybrid. The MAI of FWS was better than that of RBS in 61 cases (76%), and RBS was better than FWS in 19 cases (24%). This result indicates that FWS improves on the original RBS and on previous methods. The details are discussed in the next section.
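The hybrid rule can be stated as a one-line heuristic (the function name is ours, not the paper's):

```python
def hybrid_bin_width(n_classes):
    """Bin width reported as best in the experiments:
    0.05 (20 bins) for binary-class datasets,
    0.2 (5 bins) for multi-class datasets."""
    return 0.05 if n_classes == 2 else 0.2
```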

Discussion
RBS is an efficient sampling method compared with previous methods, and the proposed FWS improves on RBS further. Figure 10 quantifies this improvement. As shown in Figure 10a, the average MAI value of FWS was 0.460, whereas that of RBS was 0.920; since a smaller MAI is better, FWS improved the MAI by 50% compared with RBS. Figure 10b compares the standard deviations of the MAI: 0.403 for FWS versus 0.779 for RBS, meaning that the MAI values of FWS were less dispersed and the sampling results more stable. Figure 10c shows the range of the MAI, calculated as (maximum MAI) − (minimum MAI), which also reflects the spread of the MAI values. The ranges of FWS and RBS were 2.359 and 4.619, respectively, so the fluctuation of FWS was smaller than that of RBS. All statistics in Figure 10 show that FWS is a more stable and accurate method than RBS. Furthermore, they support the view that the similarity of distribution between the train/test sets and the whole dataset is an important factor for an ideal split.
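The three statistics compared in Figure 10 are straightforward to compute; a minimal sketch follows. Whether the paper uses the population or sample standard deviation is not stated, so the population form is assumed here.

```python
import statistics

def mai_summary(mai_values):
    """Mean, standard deviation (population form assumed), and
    range (max - min) of a list of MAI scores, as in Figure 10."""
    return {
        "mean": statistics.mean(mai_values),
        "stdev": statistics.pstdev(mai_values),
        "range": max(mai_values) - min(mai_values),
    }
```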
In the development of a prediction model, the quality of the features determines the performance of the model. In general, the influence of features is greater than that of classification algorithms [17]. Therefore, it is reasonable to consider feature weights in the distance calculation. Figure 11 shows the influence of feature weighting in the FWS method: we compared FWS with and without feature weights. The average MAI of the "without" case was 0.634, whereas that of the "with" case was 0.490 (Figure 11a), meaning that feature weighting improved the performance of FWS. In terms of the standard deviation, the two cases were similar (Figure 11b). The ranges of the "with" and "without" cases were 2.314 and 1.801, respectively (Figure 11c). This is because the maximum value of the "with" case was large, which is less important than the standard deviation.

Figure 11. FWS with and without feature-weighted distance.

We analyzed the variance in the MAI according to the number of classes. In the results of RBS, the MAI of binary-class datasets was higher than that of multi-class datasets, whereas the difference was small in FWS (Figure 12). This means that FWS is not influenced by variation in the number of classes; FWS is a more stable method than RBS.
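The idea of weighting the distance calculation by feature importance can be sketched as a weighted Euclidean distance. The exact distance used by FWS is defined earlier in the paper; the form and names below are illustrative assumptions, with importances taken, for example, from a tree ensemble.

```python
import numpy as np

def weighted_distance(x, y, importances):
    """Euclidean distance in which each feature's squared difference
    is scaled by its (normalized) importance, so influential features
    dominate the comparison. Illustrative sketch, not the paper's
    exact formula."""
    w = np.asarray(importances, dtype=float)
    w = w / w.sum()                    # normalize weights to sum to 1
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum(w * d * d)))
```

With equal importances this reduces to a scaled ordinary Euclidean distance, while a feature with zero importance is ignored entirely.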
We analyzed the variance in the MAI according to the number of classes. In the resu of RBS, the MAI value of binary-class datasets was higher than that of multi-class dataset whereas the difference was not large in FWS (Figure 12). This means that FWS is not in fluenced by the variance in the class number. FWS is a more stable method than RBS. In this study, we confirmed that the similarity of distribution between the origina dataset and train/test sets is an important factor for accurate sampling. Furthermore, fea ture-weighted distance calculation can improve the sampling performance. If we use th proposed FWS for splitting train/test sets, we can more accurately evaluate the classifica tion models. In our experiment, FWS performed better than RBS in 61 of 80 cases o train/test sets. This shows that the FWS has room for further improvement and is a top for further research.  In this study, we confirmed that the similarity of distribution between the original dataset and train/test sets is an important factor for accurate sampling. Furthermore, feature-weighted distance calculation can improve the sampling performance. If we use the proposed FWS for splitting train/test sets, we can more accurately evaluate the classification models. In our experiment, FWS performed better than RBS in 61 of 80 cases of train/test sets. This shows that the FWS has room for further improvement and is a topic for further research.