Optimal Feature Aggregation and Combination for Two-Dimensional Ensemble Feature Selection

Feature selection is a way of reducing the features of data such that, when the classification algorithm runs, it produces better accuracy. In general, conventional feature selection is quite unstable when faced with changing data characteristics, and implementing individual feature selection can be inefficient in some cases. Ensemble feature selection exists to overcome these problems. However, alongside its advantages, ensemble feature selection still has issues of its own, such as stability, thresholding, and feature aggregation. We propose a new framework to deal with stability and feature aggregation, and we also applied an automatic threshold to see whether it was efficient. The results showed that the proposed method always produced the best performance in terms of both accuracy and feature reduction: it improved accuracy over other methods by 0.5–14% and reduced 50% more features than other methods. The stability of the proposed method was also excellent, with an average of 0.9. However, applying the automatic threshold brought no beneficial improvement compared to omitting it. Overall, the proposed method performed excellently compared to previous work and standard ReliefF.


Introduction
Feature selection is a way of reducing the dimensions/features of data such that, when the classification algorithm runs, it produces better accuracy. The common approach is to recognize the domain of the data and to form a set of more relevant features. However, as the amount of data increases, it becomes exhausting to sort relevant features manually. There are several benefits of feature selection, i.e., facilitating data visualization and data understanding, reducing computing time and data storage, and reducing overfitting due to the curse of dimensionality, thereby improving performance [1].
There are many ways of building feature selection algorithms, but most fall into three types. Filter types use feature rank to determine the relevance of each feature [2][3][4][5][6][7][8]. Feature rank is obtained by calculating the correlation between each feature and its predictor class; consequently, this type has minimal computational time. The second type is the wrapper. In this type, a classification algorithm is used to determine the most relevant features, which are obtained by looking at the results of the classification algorithm [9][10][11][12]. In line with the wrapper type, the embedded type also uses a classification algorithm to determine the relevant features. The difference is that the feature selection algorithm is embedded in the classification algorithm, such as a decision tree, random forest, or neural network [13][14][15].
There are several open problems in ensemble feature selection, as follows:

1. Optimal number of ensembles: because the basis of an ensemble is a partition, it is necessary to know the optimal number of partitions. Our research [32] on ensemble feature selection showed that five partitions are better than three or seven.

2. Stability of feature selection: this relates to how consistently the ensemble feature selection produces the same selected features each time.

3. Scalability: conventional feature selection is less efficient in handling big-data problems. Logically, ensemble feature selection can handle this problem because of the partition.

4. Threshold for rankers: every feature selection algorithm that uses a filter approach must determine a threshold for the ranker. This threshold determines the number of reduced features.

5. Feature aggregation: this problem concerns how to combine features from each subset in the ensemble to produce the most relevant features.

6. Explainability: the main problem faced by any algorithm beyond feature selection is the clarity of the results obtained. Researchers usually use two approaches, i.e., mathematical proofing or empirical proofing.
Our previous research [32], which focused on how to improve accuracy and computational time, still had a few limitations. The first involved how to calculate the stability of the ensemble feature. The second involved the determination of the threshold for the ranker. The third involved how to aggregate the subsets of features to produce the best result. The focus of this research is creating a new framework that can overcome the problems of stability, threshold, and aggregation of features.
The organization of this paper is as follows: Section 2 describes the dataset, evaluation measurement, and the proposed technique. Section 3 displays the results obtained from several experiments and contains a discussion of the results obtained. Finally, Section 4 concludes the paper.

Resources
In this research, the experiments were carried out using a Hewlett-Packard laptop with an Intel(R) Core(TM) i5-7200U central processing unit (CPU) @ 2.50 GHz (2712 MHz), with two cores and four logical processors, and 8 GB of random-access memory (RAM). This research used MATLAB with several libraries included.

Dataset
The datasets used in this research were taken from four sources: the UCI Machine Learning Repository, the Arizona State University feature selection datasets, the NIPS 2003 challenge datasets, and Vanderbilt University's gene expression datasets. There were 14 different datasets with multivariate characteristics and no missing data. These datasets were chosen based on differences in the number of samples, features, and classes, as well as because they come from different fields of knowledge. There are three categories or fields of knowledge, i.e., artificial data, image data, and medical record data. The aim was to see whether the proposed method could overcome variations in these characteristics. Table 1 shows the characteristics of the datasets and their sources. MADELON is an artificial dataset consisting of 32 clusters. MADELON has five hypercube dimensions (an analog of an n-dimensional square and cube), labeled +1 and −1 at random. The five dimensions represent the five informative features. Then, from these five features, 15 additional combinations are made to produce a total of 20 informative and redundant features. The sequence of features and patterns in this dataset is randomized. MADELON is also one of the five datasets in NIPS 2003.

Image Data
In this research, the proposed method was tested on five image datasets with different criteria, one of which was the number of classes. The first dataset was the Columbia University Image Library (COIL20). COIL20 is an object image dataset consisting of 20 objects. Each object has 72 images, taken five degrees apart as the object rotated on a turntable. Each image is 32 × 32 pixels, represented by a 1024-dimensional vector.
The second dataset was GISETTE. GISETTE is a handwritten digit recognition dataset. The problem involves differentiating between the digits four and nine. The data are processed (normalized and centered) to a fixed size of 28 × 28. The sequence of features and patterns in this dataset is randomized, and information about the features is not provided, to avoid bias in the feature selection process. GISETTE is one of the five datasets in NIPS 2003.
The third dataset was USPS. USPS is also a handwritten digit dataset. It is similar to GISETTE, but USPS covers all digits from 0 to 9. The digits are converted to 16 × 16 images. Figure 1 shows sample images from the USPS dataset.
The fourth dataset was YALE. YALE is a face image dataset from 15 individuals. Each individual has 11 image variations: center-light, with glasses, happy, left-light, without glasses, normal, right-light, sad, sleepy, surprised, and winking. The dataset totals 165 grayscale images in GIF format. Figure 2 shows sample images from the YALE dataset.
Similar to YALE, ORL is also a face image dataset. ORL contains 10 different images of each of 40 distinct subjects. The images were taken at several times, varying the illumination, facial looks (open/closed eyes), facial emotions (smiling/not smiling), and facial appearances (glasses/no glasses). The images were taken against a dark background with the subjects facing the camera (with tolerance for some side movement). Figure 3 shows sample images from the ORL dataset.

Medical Record Data
The proposed method was also tested using medical record datasets. There were six datasets tested, five of which were gene expression datasets. The first was a cardiotocography (CTG) dataset. CTG comprises medical record data for fetal heart rate and uterine contractions: it measures the fetal heart rate and, at the same time, monitors contractions in the uterus. CTG is different from an electrocardiogram (ECG). An ECG detects the heart rate by measuring the electrical activity produced by the heart during contractions, whereas CTG uses ultrasound waves, called Doppler waves, to measure fetal movements. It works by sending ultrasound waves into the mother's body; when they hit the fetus, they bounce back with varying strength, and the bounced waves are measured as the fetal heart rate. Contractions are measured using the tocodynamometer found on the CTG, which measures the tension in the mother's abdominal wall.
The 11-TUMORS dataset was from the Gene Expression Model Selector. It consists of 11 types of human tumors placed on a microarray. The 11 classes in this dataset include prostate, bladder/ureter, breast, colorectal, gastroesophageal, kidney, liver, ovary, and pancreatic cancer, as well as lung adenocarcinoma and lung squamous cell carcinoma.
LUNG CANCER was also a dataset from the Gene Expression Model Selector. It consists of four types of lung cancer plus normal samples, totaling 203 specimens: 186 lung tumors and 17 healthy lung specimens. Of these, 125 adenocarcinoma samples were associated with clinical data and with histological slides from adjacent sections.
The other gene expression datasets were TOX_171, PROSTATE_GE, GLI_85, LYMPHOMA, and SMK_CAN_187. The TOX_171 dataset concerns the effect of influenza on plasmacytoid dendritic cells. PROSTATE_GE is a prostate cancer dataset. GLI_85 stands for glioma, a malignant tumor of the glial tissue of the nervous system. LYMPHOMA is a cancer of the lymph nodes. SMK_CAN_187 concerns cancer caused by smoking.


Methods
Firstly, the training data were partitioned into several subsets. Then, feature selection was performed on each subset of the data. The results of feature selection and feature ranking were then aggregated to get several new subsets of selected features. Subsets of selected features were then combined to get the most optimal feature subset. Guyon and Elisseeff [1] showed that selecting a subset of features is more useful for excluding redundant features than selecting the most relevant feature. Figure 4 shows a detailed illustration of the proposed framework.
Data normalization was carried out before partitioning the data. The purpose of data normalization is to make the distribution of values in the data uniform. Equation (1) shows the simplest way of achieving data normalization.

The normalized data, of dimension N samples × M features, were then divided into training data and testing data with a ratio of 7:3. The training data were then partitioned into several subsets; Equation (2) shows how the data partition was achieved. Here, p | N and q | M, where both p and q are non-zero positive integers; if p = q, the partition grid is square and the equation simplifies accordingly.
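The two-dimensional partition described above can be sketched as follows. This is an illustrative Python sketch (the paper's experiments used MATLAB), and the helper name `partition_2d` is our own; it assumes p divides N and q divides M evenly.

```python
import numpy as np

def partition_2d(X, p, q):
    """Split an N-samples x M-features matrix into a p x q grid of subsets.

    Illustrative sketch of a two-dimensional partition: samples are split
    into p groups, then each group's features are split into q groups.
    Assumes p | N and q | M.
    """
    N, M = X.shape
    assert N % p == 0 and M % q == 0, "p must divide N and q must divide M"
    row_blocks = np.split(X, p, axis=0)                           # p sample groups
    return [np.split(block, q, axis=1) for block in row_blocks]   # q feature groups each

# Example: a 6x4 matrix split into a 2x2 grid of 3x2 subsets
X = np.arange(24).reshape(6, 4)
grid = partition_2d(X, p=2, q=2)
print(grid[0][0].shape)  # (3, 2)
```

Each `grid[i][j]` is one subset D_ij of the ensemble; feature selection then runs independently on every subset.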

Feature Ranker
ReliefF [5] is a filter method derived from Relief [3]. ReliefF improves on Relief in that it can deal with noisy, multiclass datasets with low bias. The algorithm estimates feature quality according to how well feature values distinguish neighboring samples. ReliefF is a ranker method; thus, a threshold is needed to obtain the subset of features. The following equation shows how the weight is calculated in Relief, where W is the weight, x is the feature vector, nearHit is the feature vector closest to x within the same class, and nearMiss is the feature vector closest to x in a different class. Weight W decreases if the difference between feature vectors in the same class is higher than that between feature vectors in different classes, and vice versa. The calculation of diff(x, nearHit) and diff(x, nearMiss) in ReliefF differs from that in standard Relief: whereas standard Relief uses Euclidean distance, ReliefF uses Manhattan distance. Equation (5) shows the calculation using Manhattan distance in ReliefF.
After the weight W is obtained, the next step is to sort W from the most significant value downward to obtain the feature ranking, using the following equation: sort(w_{i,j}, "descend").
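The Relief weight update described above can be sketched as follows. This is a hedged, single-nearest-neighbour Python illustration (the paper used MATLAB, and full ReliefF averages over k neighbours and handles multiple classes); the function name and toy data are our own.

```python
import numpy as np

def relief_weights(X, y, n_iter=None, rng=None):
    """Minimal sketch of the basic Relief weight update (binary classes).

    For each sampled instance, each feature's weight decreases with the
    per-feature difference to the nearest same-class neighbour (nearHit)
    and increases with the difference to the nearest other-class
    neighbour (nearMiss). Manhattan distance finds the neighbours,
    as in ReliefF.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, m = X.shape
    n_iter = n if n_iter is None else n_iter
    rng = np.random.default_rng(0) if rng is None else rng
    W = np.zeros(m)
    for _ in range(n_iter):
        i = rng.integers(n)
        d = np.abs(X - X[i]).sum(axis=1)   # Manhattan distance to instance i
        d[i] = np.inf                      # exclude the instance itself
        same = np.where(y == y[i])[0]
        other = np.where(y != y[i])[0]
        hit = same[np.argmin(d[same])]
        miss = other[np.argmin(d[other])]
        W += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / n_iter
    return W

# Column 0 separates the classes; column 1 is noise, so W[0] should dominate
X = np.array([[0.0, 0.3], [0.1, 0.9], [1.0, 0.5], [0.9, 0.1]])
y = np.array([0, 0, 1, 1])
W = relief_weights(X, y)
print(W.argmax())  # 0
```

Sorting `W` in descending order then yields the feature ranking used by the aggregator.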

Ranked Feature Aggregator
After ranking the features in all subsets, the next step is to aggregate these features by index. Let us assume that the number of partitions in a row and in a column is the same (p = q). Aggregation is performed per column, because subsets in the same column share the same feature indices and can therefore be compared. Figure 5 shows an illustration of feature aggregation. As illustrated in Figure 5, a group of new features, "New.Feat.Idx_j", is obtained by finding the mode value of the feature in each subset D in row i and column j. Equation (7) shows how the feature aggregation works.
The threshold k was then applied to these groups. This threshold is a percentage value of how many features to reduce. There is a difference between the use of thresholds in ensemble and non-ensemble feature selection: in non-ensemble feature selection, the threshold is applied to all features, whereas in ensemble feature selection, it is applied within each subset of features.
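The aggregation and per-subset threshold steps above can be sketched as follows. This is an illustrative Python sketch under our own naming (`aggregate_ranked`, `ranked_columns`); it assumes the p ranked lists in one column are aligned rank-by-rank and the mode is taken at each rank position.

```python
from statistics import mode

def aggregate_ranked(ranked_columns, k=0.5):
    """Sketch of the ranked-feature aggregator.

    `ranked_columns[j]` holds the ranked feature-index lists produced by
    the p subsets sharing column j (same feature indices, so comparable).
    At each rank position, the mode across the p lists is taken; then a
    percentage threshold k keeps the top fraction of the aggregated list,
    i.e., the threshold is applied per subset, not globally.
    """
    aggregated = []
    for lists in ranked_columns:
        positions = zip(*lists)                  # align lists rank-by-rank
        agg = [mode(feats) for feats in positions]
        keep = max(1, int(len(agg) * k))         # per-subset threshold
        aggregated.append(agg[:keep])
    return aggregated

# Two column groups, each with rankings from p = 3 subsets
cols = [
    [[2, 0, 1, 3], [2, 1, 0, 3], [2, 0, 3, 1]],
    [[5, 4, 6, 7], [4, 5, 6, 7], [5, 4, 7, 6]],
]
print(aggregate_ranked(cols, k=0.5))  # [[2, 0], [5, 4]]
```

With k = 0.5, half of each aggregated list survives, mirroring the percentage threshold described in the text.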

Feature Combinator
In our previous research [32], combination was done by combining all features in each subset. However, combining all subsets of features does not produce the best performance. To solve this problem, we searched for the minimum loss over all possible combinations of the subsets of features. Figure 6 shows all possible feature combinations when n = 4, and Equation (7) shows all possible combinations of subsets of features for n subsets.
Best.FeatSubs = min.Loss(All.Comb), where n has the same value as p and q. Since every non-empty combination of n subsets is considered, there are 2^n − 1 candidates; thus, if n = 4, the total number of possible combinations is 15.
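The feature combinator can be sketched as follows. This is an illustrative Python sketch: `best_feature_subsets` and the toy symmetric-difference loss are our own constructions, and in the paper the loss would come from the classifier's performance rather than this stand-in.

```python
from itertools import combinations

def best_feature_subsets(subsets, loss_fn):
    """Enumerate every non-empty combination of the n aggregated feature
    subsets (2^n - 1 candidates) and keep the union of subsets whose
    loss, under a user-supplied `loss_fn`, is minimal."""
    best, best_loss = None, float("inf")
    for r in range(1, len(subsets) + 1):
        for combo in combinations(range(len(subsets)), r):
            features = sorted(set().union(*(subsets[i] for i in combo)))
            loss = loss_fn(features)
            if loss < best_loss:
                best, best_loss = (combo, features), loss
    return best, best_loss

# Toy loss: pretend features 2 and 5 are the only relevant ones
subsets = [[2], [1, 3], [5], [4]]
relevant = {2, 5}
loss = lambda feats: len(set(feats) ^ relevant)  # symmetric difference
(combo, feats), l = best_feature_subsets(subsets, loss)
print(combo, feats)  # (0, 2) [2, 5]
```

With n = 4 subsets there are 15 candidate combinations, matching the count in the text, and the winner here is the union of the first and third subsets.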


Evaluations
There are several ways to evaluate the performance of ensemble feature selection. The first involves the overall performance of the algorithm. In this evaluation, we can use metrics such as accuracy, precision, recall, specificity, and F1-score, computed in the standard way: Accuracy = (TP + TN)/(TP + TN + FP + FN), Precision = TP/(TP + FP), Recall = TP/(TP + FN), Specificity = TN/(TN + FP), and F1-score = 2 × Precision × Recall/(Precision + Recall), where TP is true positive, TN is true negative, FN is false negative, and FP is false positive.
The second evaluation approach involves the stability of the ensemble feature selection itself. There are three categories of stability measurement: stability by index/subset, stability by rank, and stability by weight [38,39]. Stability by rank and by weight has a major drawback in that it does not allow stability calculations on two subsets of features with different numbers of features. On the contrary, stability by index/subset can deal with feature vectors of different sizes. The mechanism represents a subset of features as a binary vector, where selected features are represented as 1 and non-selected features as 0. However, stability by rank and by weight is more representative when measuring the stability of ranking-based feature selection.
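The confusion-matrix metrics listed above follow their textbook definitions, which can be written out directly. The sketch below is illustrative (the function name and example counts are our own):

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard confusion-matrix metrics used in the overall evaluation
    (textbook definitions, included here for reference)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)           # a.k.a. sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "specificity": specificity, "f1": f1}

m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(round(m["accuracy"], 3), round(m["f1"], 3))  # 0.85 0.842
```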
We used these three types of stability to see variations in their stability values. Equation (15) shows the measure of stability by index/subset, i.e., Hamming distance, where S_i is feature subset i, S_j is feature subset j, and M is the total number of features in the dataset. The drawback of this stability measure is that it does not depend on feature rank.
Equation (17) shows the measure of stability by rank, i.e., Spearman's correlation, where R_i is ranked feature list i, R_j is ranked feature list j, and d is the distance between the same feature in R_i and R_j. The drawback of this stability measure is that it cannot handle subsets of features of different cardinality: the two lists must be the same size.
Equation (18) shows the measure of stability by weight, i.e., Pearson's correlation, where W_i is weight vector i, W_j is weight vector j, and µ_Wi is the mean of W_i. The drawback of this stability measure is that the two subsets of features must have the same size. For the Spearman and Pearson correlations, we used interpolation to overcome differences in the number of features.
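The three stability measures can be sketched as follows. This is an illustrative Python sketch: the Hamming measure is expressed here as an agreement fraction in [0, 1], the Spearman version uses the standard rank-difference formula, and Pearson is computed via `numpy.corrcoef`; the exact normalizations in Equations (15), (17), and (18) may differ.

```python
import numpy as np

def hamming_stability(s_i, s_j):
    """Index/subset stability: fraction of agreeing positions in two
    binary selection vectors of length M (1 = selected, 0 = not)."""
    s_i, s_j = np.asarray(s_i), np.asarray(s_j)
    return 1.0 - float(np.mean(s_i != s_j))

def spearman_stability(r_i, r_j):
    """Rank stability: Spearman's rho from rank differences d."""
    d = np.asarray(r_i) - np.asarray(r_j)
    n = len(d)
    return 1.0 - 6.0 * float(np.sum(d**2)) / (n * (n**2 - 1))

def pearson_stability(w_i, w_j):
    """Weight stability: Pearson's correlation between weight vectors."""
    return float(np.corrcoef(np.asarray(w_i, float),
                             np.asarray(w_j, float))[0, 1])

print(hamming_stability([1, 1, 0, 0, 1], [1, 0, 0, 0, 1]))  # 0.8
print(spearman_stability([1, 2, 3, 4], [1, 2, 4, 3]))       # 0.8
```

Only the Hamming measure accepts selections of different sizes directly; the other two need equal-length inputs, which is why the text interpolates the weight/rank vectors first.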

Results and Discussion
In this section, we describe the results obtained. We evaluated the proposed method based on several criteria. First, the overall performance was judged based on accuracy, recall, specificity, precision, F1-score, and the number of features selected; in this evaluation, we compared the proposed method with the previous two-dimensional (2D) ensemble method and standard ReliefF. The most important outcome of feature selection is knowing which features/subsets of features are relevant; by using the combination method to combine subsets of features and obtain the features producing the smallest loss, we could deduce which subset of features was the most relevant. The next evaluation approach measured the stability of the proposed method, and the last examined the effect of the automatic threshold on the proposed method.
Table 2 shows the performance evaluation of the baseline feature selection methods. Four methods were compared: ReliefF, correlation feature selection (CFS), minimum-redundancy maximum-relevancy (mRMR), and fast correlation-based filter (FCBF). We tested them on five datasets representing each field of knowledge. From the comparison results, ReliefF had the best performance among the methods in three datasets; therefore, ReliefF was used as the baseline in this paper.
Table 3 shows the performance evaluation of the proposed method, compared with the previous 2D ensemble method and standard ReliefF. The proposed method outperformed the two comparison methods on all datasets except one, MADELON. On the MADELON dataset, the proposed method improved accuracy over the previous method by 3%, although it was still inferior to standard ReliefF by 2%. Exploring further, we found some unsatisfactory results, especially for the F1-score.
The F1-score for the YALE and ORL datasets was very low, ranging from 0.05 to 0.17. These results occurred because the recall was too high while the precision was small. This problem could be overcome by using other classification methods.

Overall Performance
Another point of performance evaluation was the number of relevant features selected. The proposed method produced the fewest features of the three methods. This result relates to the aggregation and combination method used: as stated earlier, aggregation was done per subset of features, not on the full feature set, which is akin to applying multiple thresholds across the ensemble partition. For combination, the mechanism chooses the subset combination with minimum loss, and the selected combination is the smallest, automatically having the fewest features. Overall, the proposed method outperformed the two other methods by 0.5–14% in terms of accuracy and reduced 50% more features than the other methods.

Subset of Relevant Features
The primary purpose of feature selection is to determine the features/subsets of features that are relevant and not relevant in a dataset. Therefore, in this evaluation, we describe which subsets of features were relevant in each tested dataset. Table 4 shows the most relevant subsets of features (those with minimum loss) over 10 trials of each dataset.

Table 4. Combination with minimum loss per trial (runs 1-10) and the feature subsets with the highest intersection.

Dataset        Runs 1-10                            Subsets
MADELON        10  8 11 15 10  8 14 14 10 13        2 and 4
YALE           12 11  9 10 11 13 15 10  5 15        1 and 4
ORL             5  3  6  3  7 12  7 12  3 12        1 and 3
CTG            13 12  9 11  8  9  9 11  9  9        1 and 4
TOX_171         1 13  7  4  5  8  1 15 13  9        1 and 3
PROSTATE_GE     8  8  8  5  2  1  7  8  8 13        1 and 4
GLI_85          1  8  6  1  5  2  1  1  2  1        1 and 2
LYMPHOMA        1  7  8  9 11  6  4  8  7 12        1 and 4
SMK_CAN_187     2 14  9  3  3  7  2 12 15 10        2 and 4

For the MADELON dataset, the first run resulted in minimum loss with the 10th combination; referring to Figure 6, this means the combination contained the second and fourth feature subsets. Over the 10 trials, the highest intersection involved the second and fourth feature subsets. For the CTG dataset, most intersections were in the first and fourth feature subsets. The features listed in the first subset included the first feature, and those in the fourth subset included the 20th and 22nd features. The first feature in the CTG dataset is the fetal heart rate (FHR) baseline, the 20th is the histogram variance, and the 22nd is the FHR pattern. These results indicate that, using this combination approach, we can also determine which subsets of features are most relevant in a dataset.

Stability Measurement
Each stability measurement has its advantages and disadvantages. This evaluation was carried out to measure the stability of the proposed method and to elaborate on the capabilities of the considered stability measures. For the Hamming distance, we converted the feature ranking into a binary representation. Table 5 shows the features generated on the CTG dataset by the proposed method over 10 iterations. Table 6 shows the comparison of stability measurements on the CTG dataset. From these results, the measurement of stability using Hamming distance had an outstanding value; this is because the difference is based only on binary values. Spearman stability showed that, if the feature lists had the same sizes and similarities, the result was 1. Stability using Pearson's correlation had more varied values in this experiment. Overall, the proposed method had excellent stability, ranging from 0.8 to 1.

Applying Automatic Threshold
We also applied an automatic threshold to the proposed method; the automatic threshold used was the mean of the ranking weights. Figures 7 and 8 show a comparison between the proposed method without and with the automatic threshold. The difference was not significant: in some cases, the results with the automatic threshold surpassed those without it, and vice versa.
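The mean-weight automatic threshold can be sketched in a few lines. This is an illustrative Python sketch under our own naming (`auto_threshold_select`); it assumes ReliefF-style weights where a larger weight means a more relevant feature.

```python
import numpy as np

def auto_threshold_select(weights):
    """Automatic threshold: keep every feature whose ranking weight
    exceeds the mean weight, removing the need to pick k by hand."""
    weights = np.asarray(weights, dtype=float)
    return np.where(weights > weights.mean())[0]

# Features 0 and 3 clearly outweigh the mean and survive the threshold
w = np.array([0.9, 0.1, 0.05, 0.7, 0.2])
print(auto_threshold_select(w))  # [0 3]
```

This replaces the fixed percentage threshold k with a data-driven cutoff, which is the variant compared in Figures 7 and 8.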

Conclusions and Future Works
In this paper, we presented an improvement of homogeneous distribution ensemble feature selection with a two-dimensional partition method. The improvement lies in the feature aggregation and feature combination. From the results obtained, the proposed method consistently produced the best performance in terms of both accuracy and feature reduction: it improved accuracy over other methods by 0.5–14% and reduced 50% more features than the other methods. The stability of the proposed method was also excellent, with an average of 0.95. Finally, using the proposed method, we could determine which combination of feature subsets produced the better result.

Although the proposed method gave excellent performance, some limitations still need to be addressed. Future work will focus on how to implement an effective and efficient automatic threshold with this method. We will also study how to improve F1-scores by implementing other classification methods, such as deep learning.