Applied Sciences
  • Article
  • Open Access

15 November 2025

Feature Importance Ranking Using Interval-Valued Methods and Aggregation Functions for Machine Learning Applications

Faculty of Exact and Technical Sciences, University of Rzeszów, Al. Rejtana 16C, 35-959 Rzeszów, Poland
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(22), 12130; https://doi.org/10.3390/app152212130
This article belongs to the Special Issue Engineering Applications of Hybrid Artificial Intelligence Tools

Abstract

Feature selection is one of the key stages in the process of creating machine learning models and conducting data analysis. This paper presents the results of research related to the implementation of a new algorithm for feature selection and ranking based on weighted interval aggregations. The algorithm takes into account interval importance values obtained by dividing the dataset into subsets and proved highly effective in identifying relevant features. The results of comparative studies with nine known methods of feature importance assessment are presented. Ten synthetic datasets and five real datasets were used for the experiments, and statistical significance tests of the obtained results were also performed. In most experiments, the IVWFR algorithm proved to be the best, achieving the best classification results after identifying subsets of relevant features.

1. Introduction

Feature selection is one of the key stages in the process of building machine learning models and conducting data analysis. The appropriate selection of a subset of input variables not only allows for dimensionality reduction and improved computational efficiency, but also increases the interpretability of models and reduces the risk of overfitting [,]. The literature on this subject describes many approaches to feature selection, including filtering, wrapper, and embedded methods [,]. Each of them differs in the mechanism of evaluating the usefulness of individual variables, as well as the trade-off between accuracy and computational costs.
One promising direction for feature selection is methods based on feature relevance aggregation using intervals. Their basic idea is to analyze the stability and repeatability of feature ranks by determining relevance measures in multiple repetitions or data divisions, and then grouping the results in the form of confidence intervals or rank aggregations [,,,,]. This makes it possible not only to identify the most important features, but also to estimate the degree of certainty assigned to their position in the ranking.
The interval aggregation approach is used in situations where individual significance measures may be unstable, especially when there are a large number of features and a limited sample size []. The use of this type of method increases the resilience of the selection process to random data fluctuations and provides a more reliable picture of the significance of individual variables in the model [,].
The aim of this contribution is to carefully examine the Interval-Valued Weighted Feature Ranking (IVWFR) method proposed in []. IVWFR is a feature selection method built upon the feature importance derived from a classification model used within the method. The proposed algorithm aims to improve the stability and interpretability of feature selection by incorporating uncertainty (represented by intervals, following the idea of interval-valued fuzzy sets) and aggregating information across multiple data folds. In [], only preliminary results on the new method were provided, and only synthetic datasets were applied in the performed experiments. In this contribution, a thorough analysis of the IVWFR method and its performance on different datasets is presented. Both synthetic and real datasets are used in the experiments. Statistical tests are performed to show the significant differences in the classification metric, balanced accuracy, in favor of the new IVWFR method. The results of the IVWFR method are compared with known feature selection methods: RFECV, RFE, ANOVA, Pearson correlation, Spearman correlation, Mutual Information, Random Forest, and XGBoost. In the experiments, RF denotes Random Forest trained on all features, while RF_Importance refers to a two-stage approach: first computing feature importance, then training a new Random Forest on the selected features only. Furthermore, the number of features selected by each method is compared on each dataset and the Jaccard similarity index is computed. The results show that the IVWFR method outperforms many of the existing methods. IVWFR yields results comparable to well-established methods such as RFECV and RFE, while its computational complexity remains very competitive against these time-consuming approaches.
The structure of this paper is as follows: In Section 2, the basic notions connected with interval-valued calculus are recalled. In Section 3, details of the considered method are provided. Section 4 describes the datasets applied in the experiments. Finally, in Section 5 and Section 6, the results of the experiments and the discussion are presented.

2. Interval Methods and Aggregation Functions

In the 1970s, Zadeh introduced interval-valued fuzzy sets (IVFSs) to represent ambiguity and uncertainty in practical applications []. Since then, IVFSs have been applied in various domains, including decision-making, pattern recognition, and control systems. Compared to conventional fuzzy sets, the use of interval-valued membership degrees allows for a more flexible and comprehensive representation of uncertainty and vagueness. For instance, due to an element’s inherent ambiguity or variability, assigning a single membership degree may not be appropriate in certain applications. In such cases, an IVFS can be employed to capture the range of possible values or the degree of uncertainty associated with an element’s membership.
One of the key advantages of IVFSs is their ability to handle imprecise and uncertain information more robustly than standard fuzzy sets. In real-world applications, IVFSs provide a more realistic representation of ambiguous and vague information, which can lead to more accurate and reliable decision-making. Furthermore, IVFSs can be effectively used to model complex and multidimensional uncertainty in situations where conventional fuzzy sets may be insufficient [].
Interval Calculus is a mathematical framework designed for studying and analyzing real number intervals. The set $L^I$ is the set of all subintervals of the unit interval $I = [0, 1]$, i.e., $L^I = \{[\underline{x}, \overline{x}] : \underline{x} \le \overline{x},\ \underline{x}, \overline{x} \in [0, 1]\}$, where each interval can be represented by a pair of numbers $[\underline{x}, \overline{x}]$ such that $\underline{x} \le \overline{x}$.
An interval-valued fuzzy set $F$ in $X$ is defined as a function $F: X \to L^I$, $F(x) = [\underline{F}(x), \overline{F}(x)] \in L^I$ for $x \in X$, where $\underline{F}$ and $\overline{F}$ denote fuzzy sets.
In our experiments, we represent the data with the use of intervals and we need to aggregate intervals. As a result, we use interval-valued aggregations.
Let $x = [\underline{x}, \overline{x}]$, $y = [\underline{y}, \overline{y}] \in L^I$. A mapping $A: (L^I)^n \to L^I$ is called an interval-valued aggregation function (cf. []) in $L^I$ if it is increasing, i.e., for $x_i, y_i \in L^I$,
$$x_i \le y_i \ \Rightarrow\ A(x_1, \ldots, x_n) \le A(y_1, \ldots, y_n),$$
and $A(\underbrace{\mathbf{0}, \ldots, \mathbf{0}}_{n\times}) = \mathbf{0}$, $A(\underbrace{\mathbf{1}, \ldots, \mathbf{1}}_{n\times}) = \mathbf{1}$, where $\mathbf{0} = [0, 0]$ and $\mathbf{1} = [1, 1]$. The classical partial order for intervals is of the form $[\underline{x}, \overline{x}] \le [\underline{y}, \overline{y}] \Leftrightarrow \underline{x} \le \underline{y},\ \overline{x} \le \overline{y}$; however, other orders (cf. []) or comparability relations for intervals may be used in the above monotonicity condition.
In our experiments, the interval-valued aggregation function based on the weighted arithmetic mean was used:
$$A(x_1, x_2, \ldots, x_n) = \left[\sum_{i=1}^{n} \hat{w}_i \cdot \underline{x}_i,\ \sum_{i=1}^{n} \hat{w}_i \cdot \overline{x}_i\right], \tag{1}$$
where $\hat{w}_i$ denote weights and $\sum_{i=1}^{n} \hat{w}_i = 1$; however, other weighted interval-valued aggregation functions may also be applied.
Below we present the methods of creating weights for aggregating $n$ intervals, representing the importance of a feature $F$ (obtained across $n$ folds), denoted as
$$x_i = [\underline{x}_i, \overline{x}_i] \quad \text{for } i = 1, 2, \ldots, n,$$
where $\underline{x}_i$ and $\overline{x}_i$ represent the lower and upper bounds of the $i$-th interval, respectively. The goal is to combine these intervals into a single interval, where narrower intervals carry greater weight, reflecting their higher certainty. The strategy detailed below was adapted.
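As a minimal illustration of Equation (1), the sketch below aggregates $n$ intervals with pre-normalized weights; the function name is illustrative and not part of the published implementation.

```python
import numpy as np

def interval_weighted_mean(lower, upper, weights):
    """Weighted arithmetic mean of intervals [lower_i, upper_i], as in Equation (1).

    `weights` are assumed to be non-negative and to sum to 1.
    Returns the aggregated interval as a (lower, upper) pair.
    """
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.dot(weights, lower)), float(np.dot(weights, upper))

# Example: three importance intervals for one feature, aggregated with equal weights.
print(interval_weighted_mean([0.10, 0.12, 0.08],
                             [0.20, 0.16, 0.30],
                             [1/3, 1/3, 1/3]))
```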

3. Algorithm and Experimental Details

3.1. Interval-Valued Weighted Feature Ranking

The Interval-Valued Weighted Feature Ranking (IVWFR) algorithm, proposed in [], is a feature selection approach based on feature importance measures obtained from a classification model. In this study, the Random Forest classifier was employed for this purpose. The proposed algorithm aims to enhance the stability and interpretability of feature selection by incorporating uncertainty and aggregating importance information across multiple data folds. The IVWFR procedure is summarized in Algorithm 1.
This approach prioritizes robust feature selection and emphasizes the stability of feature importance while mitigating noise across data folds.
IVWFR is especially suitable for high-dimensional datasets. Comparing the computational efficiency of IVWFR to wrapper-based methods, for a dataset with $f$ features, $n$ samples, and $k$ cross-validation folds, the complexity of IVWFR is approximately $O(k \cdot T_{\mathrm{RF}}(n, f))$, where $T_{\mathrm{RF}}(n, f)$ is the training time of a Random Forest classifier. In contrast, methods like RFE and RFECV perform repeated training on multiple feature subsets, resulting in a typical complexity of $O(f \cdot T_{\mathrm{RF}}(n, f))$ or higher.
Algorithm 1: IVWFR: Interval-based Weighted Feature Ranking and Selection

3.2. Cross-Validation and Importance Interval Estimation

The input dataset is divided into several folds using stratified k-fold cross-validation, ensuring that each fold reflects the overall data distribution. The size of each training fold can be adjusted to a desired proportion of the total dataset. For each training fold, a Random Forest classifier is trained. Rather than relying on single-point estimates of feature importance, the algorithm computes confidence intervals for the importance of each feature. These intervals are derived from the distribution of importance values across individual trees within the Random Forest, providing a quantitative measure of the uncertainty associated with each feature’s importance.
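The per-fold interval estimation can be sketched as follows. This is a simplified illustration: the interval for each feature is taken here as the central 90% range of the per-tree Gini importances, whereas the exact interval construction in IVWFR follows []; all names and the percentile choice are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

def fold_importance_intervals(X, y, n_splits=5, random_state=0):
    """For each fold, train a Random Forest and derive per-feature importance
    intervals from the distribution of importances across individual trees.
    Returns an array of shape (n_splits, n_features, 2) holding [lower, upper]."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    intervals = []
    for train_idx, _ in skf.split(X, y):
        rf = RandomForestClassifier(n_estimators=200, random_state=random_state)
        rf.fit(X[train_idx], y[train_idx])
        # Per-tree Gini importances, shape (n_trees, n_features).
        per_tree = np.array([tree.feature_importances_ for tree in rf.estimators_])
        # Interval = central 90% range across trees (assumed choice).
        lower = np.percentile(per_tree, 5, axis=0)
        upper = np.percentile(per_tree, 95, axis=0)
        intervals.append(np.stack([lower, upper], axis=1))
    return np.array(intervals)

X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)
print(fold_importance_intervals(X, y).shape)  # (5, 10, 2)
```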

3.3. Interval-Valued Aggregation, Normalization, and Feature Scoring

The importance intervals for each feature obtained from different cross-validation folds are combined into a single representative interval using interval-valued aggregation functions. This aggregation process employs an interval version of the weighted arithmetic mean, where the weights are determined by the width of each interval and a selected representative point within it (such as the lower bound, upper bound, or midpoint). A small smoothing factor, denoted as ϵ , is incorporated to handle intervals with very small or zero widths. The aggregated intervals are then normalized to a common scale, ensuring comparability across features.
Based on the normalized intervals, a composite score is calculated for each feature. This score reflects a combination of the interval’s central tendency (representing the average importance) and its width (representing the uncertainty). The relative contribution of the center and the width in the score calculation can be adjusted, and various types of averaging methods may be employed, such as arithmetic, geometric, or harmonic means. The details of the proposed procedure are summarized below.
  • Step 1: Calculate the width of each interval
For each interval x i , the width is calculated as:
$$\text{width}_i = \overline{x}_i - \underline{x}_i \tag{3}$$
  • Step 2: Determine the representative value for each interval
A representative value r i is chosen for each interval, which can be one of the following:
  • Lower bound: $r_i = \underline{x}_i$;
  • Upper bound: $r_i = \overline{x}_i$;
  • Midpoint: $r_i = \dfrac{\underline{x}_i + \overline{x}_i}{2}$.
The midpoint is used as the default representative value in the presented method.
  • Step 3: Calculate the weight for each interval
The weight w i for each interval is defined as the product of the inverse of the interval width and the representative value. This approach considers both the interval’s precision and its magnitude:
$$w_i = \frac{r_i}{\text{width}_i + \epsilon} = \frac{r_i}{\overline{x}_i - \underline{x}_i + \epsilon}$$
where $\epsilon$ is a small positive constant (e.g., $10^{-5}$) added to prevent division by zero when dealing with zero-width intervals.
  • Step 4: Normalize the weights
To ensure that the weights sum to 1, they are normalized as follows:
$$\hat{w}_i = \frac{w_i}{\sum_{j=1}^{n} w_j}$$
  • Step 5: Calculate the resulting interval
The final aggregated interval is computed according to Equation (1) as the weighted arithmetic mean of the x i = [ x ̲ i , x ¯ i ] using the normalized weights determined as explained above.
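Steps 1 to 5 can be condensed into a short NumPy sketch that aggregates the per-fold intervals of a single feature; function and variable names are illustrative, and the midpoint is used as the default representative value, as described above.

```python
import numpy as np

def aggregate_feature_intervals(lower, upper, representative="midpoint", eps=1e-5):
    """Aggregate the n per-fold importance intervals of one feature (Steps 1-5).

    lower, upper : arrays with the n interval bounds for this feature.
    Returns the aggregated interval [L, U] as a weighted arithmetic mean,
    where narrower (more certain) intervals receive larger weights."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    width = upper - lower                                   # Step 1
    if representative == "lower":                           # Step 2
        r = lower
    elif representative == "upper":
        r = upper
    else:                                                    # default: midpoint
        r = (lower + upper) / 2.0
    w = r / (width + eps)                                    # Step 3
    w_hat = w / w.sum()                                      # Step 4
    return float(np.dot(w_hat, lower)), float(np.dot(w_hat, upper))  # Step 5

# Three folds: the narrow interval [0.11, 0.13] dominates the aggregation.
print(aggregate_feature_intervals([0.10, 0.11, 0.05], [0.20, 0.13, 0.35]))
```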

3.4. Feature Selection and Final Model Training

Features are ranked in descending order based on their calculated scores. Given a feature $F_k$, we assume that $m$ importance intervals $y_k = [\underline{y}_k, \overline{y}_k]$, $k = 1, 2, \ldots, m$, are available. The objective is to derive a score that links both the feature's importance (represented by the interval's center) and its stability (represented by the interval's width). This score will then be used to rank the features in terms of their overall significance.
  • Step 1: Interval Center and Width Calculation
For each interval $y_k$, calculate $\text{center}_k$ as the midpoint of the interval:
$$\text{center}_k = \frac{\underline{y}_k + \overline{y}_k}{2}.$$
Calculate $\text{width}_k$ of interval $y_k = [\underline{y}_k, \overline{y}_k]$ as the difference between the upper and lower bounds (see Equation (3)), i.e., $\text{width}_k = \overline{y}_k - \underline{y}_k$.
  • Step 2: Normalization of Center and Width
To ensure comparability across features, we normalize the center and width values to the range [ 0 , 1 ] :
Normalize the center:
$$\text{norm\_center}_k = \frac{\text{center}_k - \min(\text{center})}{\max(\text{center}) - \min(\text{center})}$$
where min ( center ) and max ( center ) are the minimum and maximum center values across all intervals and all features, respectively.
Normalize the width:
$$\text{norm\_width}_k = 1 - \frac{\text{width}_k - \min(\text{width})}{\max(\text{width}) - \min(\text{width})}$$
where $\min(\text{width})$ and $\max(\text{width})$ are the minimum and maximum width values across all intervals and all features, respectively. Note that we use $1 - \text{normalized width}$ so that smaller widths result in larger values, reflecting higher stability.
  • Step 3: Score Calculation
The score k for each interval is computed as a weighted combination of the normalized center and the normalized width. The weights α and β dictate the relative influence of these two components. We investigate several averaging methods to combine these components:
  • Weighted arithmetic mean
$$\text{score}_k^{(A)} = \alpha \cdot \text{norm\_center}_k + \beta \cdot \text{norm\_width}_k$$
  • Weighted geometric mean
$$\text{score}_k^{(G)} = (\text{norm\_center}_k)^{\alpha} \cdot (\text{norm\_width}_k)^{\beta}$$
  • Weighted harmonic mean
$$\text{score}_k^{(H)} = \frac{\alpha + \beta}{\dfrac{\alpha}{\text{norm\_center}_k + \epsilon} + \dfrac{\beta}{\text{norm\_width}_k + \epsilon}}$$
where $\epsilon$ is a small positive constant (e.g., $10^{-5}$) to prevent division by zero.
These averaging methods offer different sensitivities to the magnitude of the normalized center and width:
  • Arithmetic mean—provides a linear balance between importance and stability.
  • Geometric mean—exhibits higher sensitivity to smaller values (a near-zero value in either component will significantly lower the score).
  • Harmonic mean—amplifies the impact of small values even further than the geometric mean, making it suitable when low stability or importance will drastically reduce the score.
  • Step 4: Feature Ranking
Finally, the features are ranked in descending order according to their $\text{score}_k$ values. Features with higher scores are considered more important, as they exhibit both high relevance (a large central value) and high stability (a small width, resulting in a large normalized width). A detailed numerical example illustrating the proposed algorithm can be found in [].
Subsequently, a subset of features is selected according to a chosen selection strategy. This strategy may involve choosing features whose scores exceed the average or median score. The algorithm also ensures that a minimum number of features are retained. Finally, a new IVWFR model is trained on the entire dataset using only the selected subset of features. This final model can then be used to perform predictions on previously unseen data.
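A minimal sketch of the scoring and ranking stage is given below. It assumes each feature is represented by one aggregated interval, normalizes centers and widths over all features, and combines them with the weighted means defined above; the function name, the default $\alpha$ and $\beta$, and the small safeguard in the normalization denominators are illustrative assumptions.

```python
import numpy as np

def score_features(intervals, alpha=0.7, beta=0.3, mean_type="arithmetic", eps=1e-5):
    """Score and rank features from their aggregated importance intervals.

    intervals : array of shape (n_features, 2) with [lower, upper] per feature.
    Returns feature indices ordered from most to least important."""
    intervals = np.asarray(intervals, dtype=float)
    center = intervals.mean(axis=1)                  # midpoint of each interval
    width = intervals[:, 1] - intervals[:, 0]
    # Min-max normalization across all features; 1 - width so that
    # narrower (more stable) intervals get larger values.
    nc = (center - center.min()) / (center.max() - center.min() + eps)
    nw = 1.0 - (width - width.min()) / (width.max() - width.min() + eps)
    if mean_type == "geometric":
        score = nc ** alpha * nw ** beta
    elif mean_type == "harmonic":
        score = (alpha + beta) / (alpha / (nc + eps) + beta / (nw + eps))
    else:  # arithmetic
        score = alpha * nc + beta * nw
    return np.argsort(-score)                        # descending order of score

ranking = score_features([[0.30, 0.34], [0.28, 0.45], [0.05, 0.06], [0.20, 0.22]])
print(ranking)  # the narrow interval of feature 0 keeps it ahead of feature 1
```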

3.5. Hyperparameters and Their Optimization Using Optuna

The IVWFR algorithm uses the following hyperparameters:
  • n_splits—number of folds for stratified k-fold cross-validation.
  • fold_size—desired proportion of the dataset used for each training fold.
  • random_state—seed for the random number generator, ensuring reproducibility.
  • selection_method—method for selecting the features: mean or median (features with scores above the mean or median score are retained).
  • alpha—weight given to the interval's center (average importance) in the feature score.
  • representative—representative point in interval: left, right or center.
  • mean_type—type of mean to combine center and width: arithmetic, geometric or harmonic.
The performance of the IVWFR algorithm depends on its hyperparameter configuration. To automate the search for optimal settings, the Optuna framework [] is employed. At the core of the optimization process lies the objective function, which receives a set of hyperparameters as input and returns a performance score that Optuna seeks to maximize.
For each set of hyperparameters (a trial), Optuna uses its suggest methods to sample values for each parameter within predefined ranges. It then performs 3-fold stratified cross-validation on the input dataset. In each fold, the IVWFR model is trained on the training portion using the current hyperparameter values, and its feature ranking performance is evaluated on the test portion using a custom metric.
The custom metric is designed to assess how well the model ranks features according to their known types. It assigns a higher score when truly relevant features are ranked at the top, redundant features appear near the top, and irrelevant features are placed lower in the ranking. Specifically, the metric awards 1 point if a relevant feature appears among the top-ranked ones, 0.5 points if it is ranked slightly lower but still within the top set of relevant and redundant features, and 0.5 points if a redundant feature is included among the top-ranked relevant and redundant features (cf. []). The final score is normalized to the range [ 0 , 1 ] , and the objective function computes the mean score across all three folds. Optuna then maximizes this average score during the optimization process.
After the optimization is complete, Optuna returns the hyperparameter combination that achieves the highest average metric value. These optimal hyperparameters are subsequently used to train the final IVWFR model for feature selection on the entire dataset.
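The optimization loop can be sketched as follows; `IVWFR` and `ranking_metric` are placeholders for the authors' implementation and the custom metric described above, and the search ranges are assumptions, so only the Optuna wiring is shown.

```python
import optuna
from sklearn.model_selection import StratifiedKFold

def make_objective(X, y, IVWFR, ranking_metric):
    """Build an Optuna objective over the hyperparameters listed above.
    `IVWFR` and `ranking_metric` are placeholders for the authors' code."""
    def objective(trial):
        params = {
            "n_splits": trial.suggest_int("n_splits", 3, 10),
            "fold_size": trial.suggest_float("fold_size", 0.5, 0.9),
            "selection_method": trial.suggest_categorical("selection_method", ["mean", "median"]),
            "alpha": trial.suggest_float("alpha", 0.1, 0.9),
            "representative": trial.suggest_categorical("representative", ["left", "right", "center"]),
            "mean_type": trial.suggest_categorical("mean_type", ["arithmetic", "geometric", "harmonic"]),
        }
        skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
        scores = []
        for train_idx, test_idx in skf.split(X, y):
            model = IVWFR(**params, random_state=0).fit(X[train_idx], y[train_idx])
            scores.append(ranking_metric(model, X[test_idx], y[test_idx]))
        return sum(scores) / len(scores)   # mean custom metric over the 3 folds
    return objective

# study = optuna.create_study(direction="maximize")
# study.optimize(make_objective(X, y, IVWFR, ranking_metric), n_trials=100)
# best_params = study.best_params
```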

3.6. Performance Metrics

To evaluate experiments, the following classification metrics are used: accuracy, balanced accuracy, recall, precision, and F1 score.
Accuracy [] is calculated as the number of correct predictions divided by the total number of predictions and represents the overall correctness of the model.
$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$
Balanced accuracy provides a more informative evaluation of model performance than conventional accuracy, particularly when dealing with imbalanced class distributions. It is defined as the arithmetic mean of the recall scores for each class []. This metric allows the performance on each class to be assessed independently while accounting for class imbalance, making it especially suitable for datasets with unequal class representation.
Let $C$ denote the total number of classes, $i$ the index of a class, $\mathrm{TP}_i$ the number of true positives for class $i$, $\mathrm{FN}_i$ the number of false negatives for class $i$, and $\mathrm{FP}_i$ the number of false positives for class $i$.
The following formula is used for calculating multiclass balanced accuracy:
$$\text{Balanced Accuracy} = \frac{1}{C} \sum_{i=1}^{C} \frac{\mathrm{TP}_i}{\mathrm{TP}_i + \mathrm{FN}_i}.$$
Recall [] is defined as the ratio of true positive predictions to the total number of actual positive instances. It represents the proportion of positive cases that the model correctly identifies. Several variants of this metric exist for multiclass classification, including micro, macro, and weighted versions. In this study, we employ the weighted recall, which calculates the average recall across all classes by weighting each class according to its relative frequency in the dataset:
$$\text{Weighted Recall} = \sum_{i=1}^{C} \frac{\mathrm{TP}_i}{\mathrm{TP}_i + \mathrm{FN}_i} \cdot \frac{\text{Instances of class } i}{\text{Total instances}}.$$
Another metric used in the experiments is weighted precision. It is defined as the ratio of true positive predictions to the total number of true positive and false positive predictions. Precision [] represents the proportion of positive predictions that are truly correct. The average precision across all classes is computed using weighted averaging, where the contribution of each class is weighted according to its relative frequency in the dataset:
$$\text{Weighted Precision} = \sum_{i=1}^{C} \frac{\mathrm{TP}_i}{\mathrm{TP}_i + \mathrm{FP}_i} \cdot \frac{\text{Instances of class } i}{\text{Total instances}}.$$
Finally, the F1 score [] is defined as the harmonic mean of the model’s precision and recall. Both precision and recall contribute equally to this metric. A model achieving a perfect F1 score demonstrates both perfect precision and perfect recall, meaning that it makes no false positive or false negative predictions. In this study, we use the weighted F1 score, which accounts for class imbalance by computing the class-wise F1 scores and averaging them according to the number of samples in each class. The formula for the weighted F1 score is given below:
$$\text{Weighted } F_1 \text{ score} = \frac{2 \cdot \text{Weighted Precision} \cdot \text{Weighted Recall}}{\text{Weighted Precision} + \text{Weighted Recall}}.$$
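All five metrics are available in scikit-learn (the implementations referenced in []); the brief usage sketch below, with illustrative label vectors, shows the weighted variants used in this study.

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             recall_score, precision_score, f1_score)

# Illustrative multiclass labels (not taken from the datasets in this study).
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0]

print(accuracy_score(y_true, y_pred))                       # overall accuracy
print(balanced_accuracy_score(y_true, y_pred))              # mean per-class recall
print(recall_score(y_true, y_pred, average="weighted"))     # weighted recall
print(precision_score(y_true, y_pred, average="weighted"))  # weighted precision
print(f1_score(y_true, y_pred, average="weighted"))         # weighted F1 score
```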

4. Datasets

In the experiments, two types of datasets are used: real datasets and synthetic datasets. Characteristics of these datasets are presented in Table 1 and Table 2, respectively.
Table 1. Summary of real datasets.
Table 2. Summary of synthetic datasets. In this case, Rel, Red, and Irr denote the ground truth number of relevant, redundant, and irrelevant features in the dataset, respectively.
The synthetic datasets used in this study were obtained from the SynthSelect toolkit []. This toolkit provides a collection of ten benchmark datasets, each containing predefined relevant, redundant, and irrelevant features. These datasets exhibit varying levels of complexity, making them particularly suitable for evaluating feature selection algorithms. Among them, the Double Spiral and 5D XOR datasets are considered the most challenging in terms of correctly identifying feature types (see Table 3 and Figures 11 and 12 in []). All datasets provided by the SynthSelect toolkit were used in our experiments, and to avoid potential overfitting, cross-validation was applied during both feature selection and performance evaluation phases.
In addition, five real-world datasets were employed: Climate, Estimation of Obesity Levels, Iranian Churn, and Sonar from the UCI Machine Learning Repository, as well as the Water Potability dataset from Kaggle. Using both synthetic and real datasets enables a comprehensive assessment of the proposed method’s effectiveness.
All experiments were conducted in Python 3.13 using open-source libraries, including NumPy 2.3.3, pandas 2.3.2, SciPy 1.16.2, scikit-learn 1.7.2, XGBoost 3.0.5, Optuna 4.5.0, and matplotlib 3.10.6.

5. Results

This section presents a comparative analysis of the performance of various feature selection models on both real-world and synthetic datasets. We begin with a visual examination of the models’ performance, followed by a statistical evaluation assessing the significance of the obtained results.

5.1. Performance Analysis

Figure A1, Figure A2, Figure A3, Figure A4, Figure A5, Figure A6, Figure A7, Figure A8, Figure A9 and Figure A10 present the results for the ten synthetic datasets, which were designed to evaluate model behavior under controlled conditions. Figure 1 consolidates these results to illustrate overall performance trends.
Figure 1. Summary of performance on synthetic datasets. This figure provides a consolidated view of the results across all synthetic datasets.
Similarly, Figure A11, Figure A12, Figure A13, Figure A14 and Figure A15 show the performance of each model on the five real-world datasets, while Figure 2 summarizes the results obtained on this practical, non-ideal data.
Figure 2. Summary of performance on real datasets. This figure provides a consolidated view of the results across all real datasets.
The models are ranked in descending order according to their balanced accuracy. For most of the real-world datasets, the proposed IVWFR method achieved the best performance. This observation is further confirmed in Figure 2, which aggregates the results from all real datasets.
A similar trend is observed in the case of synthetic datasets. The proposed IVWFR method appears particularly effective for the Double Spiral and 5D XOR datasets, which, as previously mentioned, are among the most challenging due to the difficulty of correctly identifying feature types. For these datasets, the performance differences in favor of IVWFR are the most pronounced. As shown in Figure 1, which consolidates all synthetic results, IVWFR consistently outperforms the competing methods.

5.2. Statistical Significance and Discussion

To assess the statistical significance of the performance differences, the Wilcoxon signed-rank test was conducted separately for the real and synthetic datasets. This non-parametric test compares pairs of models based on their balanced accuracy scores. The resulting p-values, corrected for multiple comparisons using the Bonferroni method, are reported in Table 3 and Table 4.
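The testing procedure can be reproduced with SciPy as sketched below; the balanced-accuracy values are illustrative placeholders, not the scores reported in this paper, and the Bonferroni correction is applied by scaling each p-value by the number of pairwise tests.

```python
from itertools import combinations
from scipy.stats import wilcoxon

def pairwise_wilcoxon(scores):
    """Pairwise Wilcoxon signed-rank tests on balanced-accuracy scores,
    with Bonferroni correction for multiple comparisons.

    scores : dict mapping model name -> list of per-dataset balanced accuracies."""
    pairs = list(combinations(scores, 2))
    n_tests = len(pairs)
    results = {}
    for a, b in pairs:
        stat, p = wilcoxon(scores[a], scores[b])
        results[(a, b)] = min(p * n_tests, 1.0)   # Bonferroni-corrected p-value
    return results

# Illustrative scores on five datasets (not the values reported in the paper).
scores = {"IVWFR": [0.910, 0.880, 0.830, 0.800, 0.620],
          "ANOVA": [0.850, 0.840, 0.800, 0.780, 0.610],
          "RFE":   [0.895, 0.862, 0.818, 0.771, 0.603]}
print(pairwise_wilcoxon(scores))
```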
Table 3. p-values for synthetic datasets from the Wilcoxon signed-rank test *.
Table 4. p-values for real datasets from the Wilcoxon signed-rank test *.
For the real-world datasets (Table 4), performance varied across models; however, IVWFR demonstrated a statistically significant advantage over traditional methods such as ANOVA, Pearson, and Spearman correlation. Nonetheless, many models—including RFECV, RFE, and the baseline classifiers—did not differ significantly from one another, suggesting that for these datasets, the use of more sophisticated feature selection methods may not be crucial. The lack of significant differences between some methods on real data may result from reaching a “ceiling” in model quality, while IVWFR stands out with fewer selected features while maintaining the same effectiveness.
In contrast, the analysis of the synthetic datasets (Table 3) revealed more pronounced distinctions among the evaluated methods. In this case, the IVWFR model significantly outperformed nearly all competing approaches, indicating that the algorithm is particularly effective at identifying relevant features in datasets with complex and well-defined structures. As expected, the performance of RFE and RFECV was statistically indistinguishable, since RFE was configured to select the same number of features that RFECV determined as optimal.

5.3. Number of Features Selected

Before analyzing feature overlap, we first examine the number of features selected by each method, as this provides essential context for interpreting the Jaccard similarity scores. Table 5 and Table 6 summarize the feature counts for the synthetic and real-world datasets, respectively.
Table 5. Number of features selected for each synthetic dataset and each method. RF-Imp. is an abbreviation for RF Importance and Mutual Inf. for Mutual Information.
Table 6. Number of features selected for each real dataset and each method. RF-Imp. is an abbreviation for RF Importance and Mutual Inf. for Mutual Information.
Several key patterns emerge from these results. By experimental design, most conventional methods (from RFECV to RFE) select the same number of features, as this count is fixed to the optimum determined by RFECV. In contrast, the IVWFR method frequently selects a different number of features, reflecting its distinct selection strategy. Finally, the RF and XGBoost models show the highest feature counts because they serve as baseline classifiers that utilize all available features. This variation in the number of selected features is an important factor influencing the subsequent overlap analysis. The number of features for comparative methods is set at the optimal level determined by RFECV to enable a reliable comparison with an identical feature space size. In order to assess sensitivity, we conduct an additional analysis in which the number of selected features is changed and the quality of the models is measured. The results of this analysis for individual datasets are presented in the graphs in Figure A16, Figure A17, Figure A18, Figure A19, Figure A20, Figure A21, Figure A22, Figure A23, Figure A24, Figure A25, Figure A26, Figure A27, Figure A28, Figure A29 and Figure A30 in Appendix A.
Simultaneously with the analysis of classification quality, in the case of synthetic data, an analysis of ground truth feature identification was performed. The TPRel, TPRed and TPIrr evaluation measures were used for this purpose, which stand for true positive relevant, true positive redundant, and true positive irrelevant []. In addition, the TCC measure, or total correct classification, also defined in [], was used, combining TPRel and TPRed as the quality of identification of all relevant features in the set. The results of individual measures are presented in Figure 3, and a summary of these results for the six best feature selection methods is presented in Figure 4.
Figure 3. Results of ground truth feature identification for synthetic datasets in the context of the feature selection method used. Individual graphs show the identification of relevant (TPRel), redundant (TPRed), and irrelevant (TPIrr) features, and total correct classification (TCC), i.e., the correct identification of relevant and redundant features [].
Figure 4. Summary results of ground truth feature identification for synthetic databases in the context of the six best feature selection methods used. The results for the RFE and RFECV methods are identical and overlap.

5.4. Feature Selection Overlap Analysis

Beyond performance metrics, we also examined the degree of agreement among feature selection models regarding which features they identified as important. The Jaccard similarity index was used to quantify the overlap between the feature subsets selected by each pair of models, where a value of 1.0 indicates identical sets and a value of 0.0 indicates no shared features.

5.4.1. Overlap on Synthetic Datasets

The analysis of the synthetic datasets (Table 7, Table 8, Table 9, Table 10, Table 11, Table 12, Table 13, Table 14, Table 15 and Table 16) makes the algorithmic differences between the models more apparent. Notably, IVWFR often selects a different number of features compared to the other methods, which are constrained by the RFECV output. This confirms that IVWFR relies on a fundamentally different principle for identifying the optimal feature set. While certain groups of methods (e.g., statistical filters) still show some degree of agreement, the overall similarity is lower than that observed for the real-world datasets. Consequently, the complex structures of the synthetic data are more effective in revealing the distinctive behavior of each feature selection strategy.
Table 7. Jaccard similarity on the 5Class Multicut dataset.
Table 8. Jaccard similarity on the 10Class Multicut dataset.
Table 9. Jaccard similarity on the Trig dataset.
Table 10. Jaccard similarity on the Hypersphere 3D dataset.
Table 11. Jaccard similarity on the 4D AND dataset.
Table 12. Jaccard similarity on the Double Spiral dataset.
Table 13. Jaccard similarity on the Cone dataset.
Table 14. Jaccard similarity on the 5D XOR dataset.
Table 15. Jaccard similarity on the y = x dataset.
Table 16. Jaccard similarity on the Yinyang dataset.

5.4.2. Overlap on Real Datasets

For the real-world datasets, the Jaccard similarity matrices (Table 17, Table 18, Table 19, Table 20 and Table 21) show that methods based on Random Forest feature importance (RFECV, RF-Imp., and RFE) often exhibit perfect or near-perfect similarity, as they rely on the same underlying feature ranking. Similarly, filter-based approaches such as ANOVA and Pearson correlation frequently produce comparable feature subsets. In contrast, the IVWFR model consistently demonstrates a more distinct selection pattern, underscoring its unique selection strategy. The lower agreement between the IVWFR and RFE/RFECV methods in feature selection results from a different selection philosophy—IVWFR focuses on stability and interval significance, thus selecting a smaller but more relevant set of features, which translates into classification efficiency.
Table 17. Jaccard similarity on the Climate dataset.
Table 18. Jaccard similarity on the Iranian Churn dataset.
Table 19. Jaccard similarity on the Sonar dataset.
Table 20. Jaccard similarity on the Water Potability dataset.
Table 21. Jaccard similarity on the Estimation of Obesity dataset.

5.4.3. Feature Importance Stability Analysis

To evaluate the robustness and repeatability of the feature selection process, a stability analysis was performed across multiple data splits during 5-fold cross-validation. For each feature selection method, two complementary stability indices were computed to quantify the consistency of the selected feature subsets across different folds (see Figure A31 and Figure A32 for synthetic datasets and Figure A33 and Figure A34 for real-world datasets).
  • Jaccard Similarity Index (J):
    $$J(A_i, A_j) = \frac{|A_i \cap A_j|}{|A_i \cup A_j|}$$
    where A i and A j denote the sets of selected features obtained in two independent folds. The Jaccard index measures the proportion of common features between two selections and ranges from 0 (no overlap) to 1 (identical feature subsets).
  • Kuncheva Index ( κ ):
    $$\kappa(A_i, A_j) = \frac{|A_i \cap A_j| \cdot N - d^2}{d \cdot (N - d)}$$
    where N is the total number of available features and d is the number of features selected. The Kuncheva index corrects for random overlap between subsets, providing a more conservative estimate of stability.
Both indices were calculated for all pairs of folds and then averaged to obtain the overall stability of each feature selection method for synthetic (Figure 5) and real-world (Figure 6) datasets. Higher values of J and κ indicate greater robustness of the feature selection process to data perturbations and random partitioning.
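Both indices can be computed directly from the per-fold feature subsets, as in the sketch below; the fold subsets shown are illustrative, and the Kuncheva index assumes that all folds select the same number of features $d$.

```python
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def kuncheva(a, b, n_total):
    a, b = set(a), set(b)
    d = len(a)                       # assumes both subsets have the same size d
    r = len(a & b)
    return (r * n_total - d * d) / (d * (n_total - d))

def average_stability(fold_subsets, n_total):
    """Average pairwise Jaccard and Kuncheva indices over all fold pairs."""
    pairs = list(combinations(fold_subsets, 2))
    j = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    k = sum(kuncheva(a, b, n_total) for a, b in pairs) / len(pairs)
    return j, k

# Feature subsets selected in 3 folds out of N = 10 features (illustrative).
print(average_stability([{0, 1, 2, 5}, {0, 1, 2, 7}, {0, 1, 3, 5}], n_total=10))
```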
Figure 5. Average stability across synthetic datasets.
Figure 6. Average stability across real-world datasets.
Among the real-world datasets used, one set, Water Potability, contains approximately 4.86% missing data, which affects three of the nine features. Analysis of this issue (Figure A37) indicates that a 5% missing value rate for this database leads to a loss in classification quality from approximately 61.9% to only 61.7%, which confirms its applicability. We use the following strategy for imputing missing values: each missing feature value is replaced with the mean value of that feature. This strategy is applied before feature scaling and selection and is consistent across all feature selection methods. The summary results of the simulation analysis of missing values in the range of 5–20% for all sets are presented in Figure A37, while detailed values are presented in Table A1, Table A2, Table A3, Table A4 and Table A5.
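The mean-imputation strategy can be reproduced with scikit-learn's SimpleImputer; a minimal sketch with made-up values (not taken from the Water Potability dataset), applied before scaling and feature selection as described above:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Illustrative data matrix with missing entries.
X = np.array([[7.0, 204.9, np.nan],
              [np.nan, 181.1, 3.5],
              [6.8, np.nan, 4.1]])

# Replace each missing value with the mean of its feature (column),
# then scale; feature selection would follow these steps.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
X_scaled = StandardScaler().fit_transform(X_imputed)
print(X_imputed)
```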

6. Discussion

The research conclusions drawn from the results obtained are very optimistic. Analysis of synthetic data (Figure A1, Figure A2, Figure A3, Figure A4, Figure A5, Figure A6, Figure A7, Figure A8, Figure A9 and Figure A10) shows that the IVWFR algorithm achieves the best classification results among the selected methods, or ranks among the top results (e.g., Cone or Trig dataset). The summary classification chart (Figure 1) proves its effectiveness, where the results obtained by the IVWFR algorithm are significantly better than those of the other methods. Significant differences in the results achieved are noticeable, for example, for the 10Class Multicut dataset, for which the algorithm achieved the best classification results and, at the same time, identified 39 significant features compared to 11 features identified by the RFECV and RFE algorithms (see Table 5), which may indicate its sensitivity to less significant but important features that improve classification. A similarly sensitive result was obtained for the synthetic Double Spiral dataset, where the classification results were much better than those obtained using other methods (Figure A6). At the same time, the IVWFR algorithm identified only 152 significant features, while other methods identified between 227 and 303 features (Table 5). This may indicate the sensitivity of the algorithm in terms of removing irrelevant information noise from the dataset.
The results obtained for five real-world datasets are also very promising. The IVWFR algorithm achieved the best results or was among the best methods (Figure A11, Figure A12, Figure A13, Figure A14 and Figure A15). The summary results (Figure 2) also show that the IVWFR algorithm achieved the best classification results. At the same time, for example, for the Sonar dataset, it identified 30 significant features, while the other methods identified between 47 and 60 (Table 6) out of all 60 features in the dataset (Table 1), and the classification results were clearly better (Figure A14). Moreover, the p-values obtained for the IVWFR algorithm results also confirm its effectiveness for both synthetic data (Table 3) and real data (Table 4). The IVWFR method achieved comparable accuracy with significantly fewer features, confirming its effectiveness and ability to reduce complexity.
The results in Figure 5 and Figure 6 clearly demonstrate that for the synthetic datasets the proposed IVWFR method achieves high feature selection stability among all compared approaches. The average Jaccard index reaches 0.61 and the Kuncheva index reaches 0.64 (the best result) for the synthetic datasets, while for the real-world datasets the corresponding values are 0.83 (Jaccard) and 0.80 (Kuncheva), indicating that IVWFR consistently selects largely the same subset of features across different cross-validation folds.
However, simple statistical filters (ANOVA, Pearson, and Spearman correlation) exhibit similar repeatability, suggesting a comparable sensitivity to data perturbations. In contrast, wrapper-based methods achieve moderate (RFE) or the lowest (RFECV) stability, at the cost of considerably higher computational complexity (Figure A35 and Figure A36).
The appropriate stability of IVWFR, both for synthetic and real-world data, can be attributed to its interval-valued aggregation mechanism, which smooths out the fluctuations in feature importance values obtained from multiple data partitions. By incorporating uncertainty intervals into the ranking process, the method effectively suppresses random noise and yields a more reproducible and interpretable feature set.
Future research directions will focus on developing the proposed approach, i.e., the use of innovative, adaptive methods for aggregating significance intervals; developing new weighting methods, nonlinear methods that take into account the influence of individual intervals; developing new methods for scoring significance, incorporating interval stability; and using optimization to determine feature contributions.

Author Contributions

Conceptualization, W.P. and U.B.; methodology, A.W. and U.B.; software, A.W.; validation, A.W. and W.P.; formal analysis, U.B.; investigation, A.W.; resources, A.W. and W.P.; data curation, A.W. and W.P.; writing—original draft preparation, A.W., W.P. and U.B.; writing—review and editing, U.B. and W.P.; visualization, A.W.; supervision, W.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data available in a publicly accessible repository.

Acknowledgments

This work was partially supported by the Centre for Innovation and Transfer of Natural Sciences and Engineering Knowledge of University of Rzeszów, Poland.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Figure A1. Performance on 10Class Multicut dataset.
Figure A2. Performance on 4D AND dataset.
Figure A3. Performance on 5Class Multicut dataset.
Figure A4. Performance on 5D XOR dataset.
Figure A5. Performance on Cone dataset.
Figure A6. Performance on Double Spiral dataset.
Figure A7. Performance on Hypersphere 3D dataset.
Figure A8. Performance on Trig dataset.
Figure A9. Performance on Yinyang dataset.
Figure A10. Performance on y = x dataset.
Figure A11. Performance on Climate dataset.
Figure A12. Performance on Estimation of Obesity Levels dataset.
Figure A13. Performance on Iranian Churn dataset.
Figure A14. Performance on Sonar dataset.
Figure A15. Performance on Water Potability dataset.
Figure A16. Performance compared to number of features using y = x dataset.
Figure A17. Performance compared to number of features using 10Class Multicut dataset.
Figure A18. Performance compared to number of features using 4D AND dataset.
Figure A19. Performance compared to number of features using 5Class Multicut dataset.
Figure A20. Performance compared to number of features using 5D XOR dataset.
Figure A21. Performance compared to number of features using Cone dataset.
Figure A22. Performance compared to number of features using Double Spiral dataset.
Figure A23. Performance compared to number of features using Hypersphere 3D dataset.
Figure A24. Performance compared to number of features using Trig dataset.
Figure A25. Performance compared to number of features using Yinyang dataset. The results for the Pearson, Spearman, RF_Importance and Mutual_Information methods and for the IVWFR and ANOVA methods are almost identical and overlap.
Figure A26. Performance compared to number of features using Climate dataset. The results for the ANOVA and IVWFR methods and for the Pearson and Spearman correlation methods are identical and overlap.
Figure A27. Performance compared to number of features using Iranian Churn dataset. The results for the ANOVA and Pearson correlation methods are identical and overlap.
Figure A28. Performance compared to number of features using Estimation of Obesity dataset.
Figure A29. Performance compared to number of features using Sonar dataset. The results for the ANOVA and Pearson correlation methods are identical and overlap.
Figure A30. Performance compared to number of features using Water potability dataset. The results for the RF_Importance and IVWFR methods and for the RFE and RFECV methods are identical and overlap.
Figure A31. Jaccard stability across synthetic datasets.
Figure A32. Kuncheva stability across synthetic datasets.
Figure A33. Jaccard stability across real-world datasets.
Figure A34. Kuncheva stability across real-world datasets.
Figure A35. Average runtime across synthetic datasets.
Figure A36. Average runtime across real-world datasets.
Figure A37. IVWFR algorithm robustness to missing data.
Table A1. Results of performance with artificial missing data for the Estimation of Obesity dataset.
Missing % | Mean Acc | Std Dev | Degradation | Avg Features
5% | 0.9178 | 0.0049 | 0.0376 (3.9%) | 8.0
10% | 0.8912 | 0.0032 | 0.0643 (6.7%) | 8.0
15% | 0.8635 | 0.0032 | 0.0920 (9.6%) | 8.0
20% | 0.8352 | 0.0070 | 0.1203 (12.6%) | 8.0
Table A2. Results of performance with artificial missing data for the Water Potability dataset.
Missing % | Mean Acc | Std Dev | Degradation | Avg Features
5% | 0.6172 | 0.0055 | 0.0021 (0.3%) | 5.0
10% | 0.6113 | 0.0084 | 0.0080 (1.3%) | 5.0
15% | 0.5923 | 0.0087 | 0.0271 (4.4%) | 5.0
20% | 0.5948 | 0.0052 | 0.0246 (4.0%) | 5.0
Table A3. Results of performance with artificial missing data for the Iranian Churn dataset.
Missing % | Mean Acc | Std Dev | Degradation | Avg Features
5% | 0.8377 | 0.0156 | 0.0411 (4.7%) | 7.0
10% | 0.7991 | 0.0171 | 0.0797 (9.1%) | 7.0
15% | 0.7858 | 0.0046 | 0.0930 (10.6%) | 7.0
20% | 0.7724 | 0.0166 | 0.1064 (12.1%) | 7.0
Table A4. Results of performance with artificial missing data for the Sonar dataset.
Missing % | Mean Acc | Std Dev | Degradation | Avg Features
5% | 0.8014 | 0.0123 | 0.0041 (0.5%) | 30.0
10% | 0.7891 | 0.0207 | 0.0163 (2.0%) | 30.0
15% | 0.7910 | 0.0190 | 0.0144 (1.8%) | 30.0
20% | 0.7931 | 0.0219 | 0.0124 (1.5%) | 30.0
Table A5. Results of performance with artificial missing data for the Climate dataset.
Missing % | Mean Acc | Std Dev | Degradation | Avg Features
5% | 0.6405 | 0.0181 | 0.0440 (6.4%) | 9.0
10% | 0.6386 | 0.0365 | 0.0460 (6.7%) | 9.0
15% | 0.6069 | 0.0298 | 0.0777 (11.3%) | 9.0
20% | 0.5926 | 0.0232 | 0.0920 (13.4%) | 9.0

References

  1. Dash, M.; Liu, H. Feature selection for classification. Intell. Data Anal. 1997, 1, 131–156. [Google Scholar] [CrossRef]
  2. Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003, 3, 1157–1182. [Google Scholar]
  3. Saeys, Y.; Inza, I.; Larrañaga, P. A review of feature selection techniques in bioinformatics. Bioinformatics 2007, 23, 2507–2517. [Google Scholar] [CrossRef] [PubMed]
  4. Chandrashekar, G.; Sahin, F. A survey on feature selection methods. Comput. Electr. Eng. 2014, 40, 16–28. [Google Scholar] [CrossRef]
  5. Kuncheva, L.I. A stability index for feature selection. In Proceedings of the 25th IASTED International Multi-Conference: Artificial Intelligence and Applications, Innsbruck, Austria, 12–14 February 2007; pp. 390–395. [Google Scholar]
  6. Haury, A.C.; Gestraud, P.; Vert, J.P. The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures. PLoS ONE 2011, 6, e28210. [Google Scholar] [CrossRef] [PubMed]
  7. Nogueira, S.; Sechidis, K.; Brown, G. On the stability of feature selection algorithms. J. Mach. Learn. Res. 2018, 18, 1–54. [Google Scholar]
  8. Cawley, G.C.; Talbot, N.L. On over-fitting in model selection and subsequent selection bias in performance evaluation. J. Mach. Learn. Res. 2010, 11, 2079–2107. [Google Scholar]
  9. Ahmad, J.; Latif, S.; Khan, I.U.; Alshehri, M.S.; Khan, M.S.; Alasbali, N.; Jiang, W. An interpretable deep learning framework for intrusion detection in industrial Internet of Things. Internet Things 2025, 33, 101681. [Google Scholar] [CrossRef]
  10. Strobl, C.; Boulesteix, A.L.; Kneib, T.; Augustin, T.; Zeileis, A. Conditional variable importance for random forests. BMC Bioinform. 2008, 9, 307. [Google Scholar] [CrossRef] [PubMed]
  11. Meinshausen, N.; Bühlmann, P. Stability selection. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2010, 72, 417–473. [Google Scholar] [CrossRef]
  12. Kalousis, A.; Prados, J.; Hilario, M. Stability of feature selection algorithms: A study on high-dimensional spaces. Knowl. Inf. Syst. 2007, 12, 95–116. [Google Scholar] [CrossRef]
  13. Wojtowicz, A.; Bentkowska, U.; Paja, W. The interval-valued weighted feature ranking method based on aggregation functions. In Proceedings of the 2025 IEEE International Conference on Fuzzy Systems (FUZZ), Reims, France, 6–10 July 2025; pp. 1–7. [Google Scholar] [CrossRef]
  14. Zadeh, L.A. The concept of a linguistic variable and its application to approximate reasoning—I. Inf. Sci. 1975, 8, 199–249. [Google Scholar] [CrossRef]
  15. Dubois, D.; Prade, H. Interval-valued fuzzy sets, possibility theory and imprecise probability. In Proceedings of the 4th Conference of the European Society for Fuzzy Logic and Technology (EUSFLAT 2005) and 11èmes Rencontres Francophones sur la Logique Floue et ses Applications (LFA 2005), Barcelona, Spain, 7–9 September 2005; pp. 314–319. [Google Scholar]
  16. Komorníková, M.; Mesiar, R. Aggregation functions on bounded partially ordered sets and their classification. Fuzzy Sets Syst. 2011, 175, 48–56. [Google Scholar] [CrossRef]
  17. Bustince, H.; Fernandez, J.; Kolesárová, A.; Mesiar, R. Generation of linear orders for intervals by means of aggregation functions. Fuzzy Sets Syst. 2013, 220, 69–77. [Google Scholar] [CrossRef]
  18. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 2623–2631. [Google Scholar] [CrossRef]
  19. Reference Page for the Accuracy Score Implementation in Scikit-Learn. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score (accessed on 13 June 2023).
  20. Reference Page for the Recall Score Implementation in Scikit-Learn. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html (accessed on 13 June 2023).
  21. Reference Page for the Precision Score Implementation in Scikit-Learn. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html (accessed on 13 June 2023).
  22. Reference Page for the F1 Score Implementation in Scikit-Learn. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html (accessed on 13 June 2023).
  23. Mitra, R.; Ali, E.; Varam, D.; Sulieman, H.; Kamalov, F. Variable Selection in Data Analysis: A Synthetic Data Toolkit. Mathematics 2024, 12, 570. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
