Mathematics
  • Article
  • Open Access

27 November 2024

Enhancing Explainable Artificial Intelligence: Using Adaptive Feature Weight Genetic Explanation (AFWGE) with Pearson Correlation to Identify Crucial Feature Groups

1 Computer Science Department, Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh 11564, Saudi Arabia
2 Computer Science Department, College of Computer and Information Sciences, King Saud University, Riyadh 11451, Saudi Arabia
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Machine Learning Theory and Applications

Abstract

The ‘black box’ nature of machine learning (ML) approaches makes it challenging to understand how most artificial intelligence (AI) models make decisions. Explainable AI (XAI) aims to provide analytical techniques to understand the behavior of ML models. XAI utilizes counterfactual explanations that indicate how variations in input features lead to different outputs. However, existing methods should also highlight the importance of features in order to provide more actionable explanations that help identify the key drivers behind model decisions and, hence, yield more reliable interpretations with better accuracy. The method we propose utilizes feature weights obtained through adaptive feature weight genetic explanation (AFWGE) with the Pearson correlation coefficient (PCC) to determine the most crucial group of features. The proposed method was tested on four real datasets with nine different classifiers for evaluation against a nonweighted counterfactual explanation method (CERTIFAI) and the original feature values’ correlation. The results show significant enhancements in accuracy, precision, recall, and F1 score for most datasets and classifiers; this indicates the superiority of the feature weights selected via AFWGE with the PCC over CERTIFAI and the original data values in determining the most important group of features. Focusing on important feature groups elucidates the behavior of AI models and enhances decision making, resulting in more reliable AI systems.

1. Introduction

The decision-making process of the majority of AI models is difficult to understand since they employ ‘black box’ machine learning (ML) methodologies. Users mostly find it hard to trust these systems due to their lack of clarity. Moreover, the inferences of intelligent models from instances cannot be monitored in detail because the training dataset is usually too large, and the learned model is too complicated for users [1]. As a result of these difficulties, researchers have become interested in explainable artificial intelligence (XAI). By offering analytical techniques to investigate and comprehend the behavior of the model, this field of research aims to validate important ML system properties that developers optimize, such as robustness, causality, usability, and trust. Explainable ML is concerned with providing post hoc explanations for existing black box models [2]. Explainability can answer explanatory questions such as ‘what’ questions (e.g., what event happened?), ‘how’ questions (e.g., how did that event happen?), and ‘why’ questions (e.g., why did that event happen?) [3].
A counterfactual explanation is a commonly used explanation approach that determines how the output changes by changing input features [4]. Simple explanations that are reflective of human cognition may be provided via counterfactual explanations, which can also encourage spontaneous causal reasoning about the result of a specific model [5]. Moreover, counterfactual explanations encourage analyzing the features of data instances and capturing the relationships between them and their impact on the outcomes [6]. Recent studies provide counterfactual explanations by perturbing input data to understand the limitations and biases of AI models and their training data rather than changing their predictions [7]. These explanations should change a few features, making them easy to understand and follow for end users, to improve human trust in AI models [8]. In fact, counterfactual explanations serve as significant mechanisms for assisting users in understanding automated systems and their decision-making processes. They emphasize that when the purpose of XAI is to enhance human–machine team performance, rather than just explaining model prediction, it is essential to study the user’s understanding of these decisions. For that, a more user-centered and psychologically grounded approach to XAI is required to study how explanations change the way users understand AI models [9].
For the AI model itself, providing counterfactual explanations highlighting the most important features as well as a group of features that affect the decision not only enhances the explainability of AI models for users but also boosts their performance. However, to the best of our knowledge, none of the proposed explanation techniques provide counterfactuals that include the importance of individual features or groups of features in the explanation. Our proposed method, thus, aims to bridge this gap by utilizing the feature weights from counterfactual explanations to identify the most crucial group of features that may have a significant impact on the decision-making process. Feature selection (FS) is one of the important steps in ML that keeps only the most meaningful group of features of a dataset while excluding others considered either irrelevant or redundant. This process enhances the performance of any ML model by increasing its accuracy, decreasing the complexity of the computational process, and reducing the amount of overfitting that might occur during training [10,11]. Some approaches that are frequently used for FS are correlation-based methods, which evaluate the associations between features in order to determine the most important groups of features. The Pearson correlation coefficient (PCC) is a widely used method for FS in ML, often integrated with other techniques, such as genetic algorithms (GAs), to enhance performance [12]. Therefore, in the current research, we used the PCC together with the feature weights obtained through adaptive feature weight genetic explanation (AFWGE) [13] to identify the most important group of features underlying a model’s decision. AFWGE is a novel technique that embeds feature weights while evolving and optimizing the counterfactual explanations, aiming to produce more accurate explanations than their nonweighted counterparts, as used in [14]. Therefore, it is intriguing to explore the potential of this technique in identifying the most important group of features that collectively have a significant impact on the decision-making process, which was the primary objective of this study.
The remaining parts of this paper are organized as follows: Section 2 provides an overview of related work, while Section 3 describes the formulation of the problem. Then, the proposed approach is explained in Section 4, followed by the provision of comprehensive experimental results in Section 5. Finally, Section 6 presents a summary and recommendations for future research.

3. Problem Formulation

The feature selection problem is concerned with finding the best subgroup of features from a dataset to train the learning model, such that this subgroup improves performance compared with using all features, or at least matches it. This can be described mathematically as follows:
Let $X \in \mathbb{R}^{n \times p}$ be the dataset with $n$ samples and $p$ features, and let $y \in \mathbb{R}^{n}$ be the target variable. Define $F = \{1, 2, \ldots, p\}$ as the set of all features. We want to select a subgroup $S \subseteq F$ that optimizes the performance of the learning model. Let $X_S$ denote the subset of $X$ containing only the features in $S$. Let $L(X_S, y)$ be the performance metric (e.g., accuracy, precision, recall, and F1 score) of the learning model trained on $X_S$ and evaluated on $y$. The feature selection problem can be formulated as an optimization problem [38]:
$$\max_{S \subseteq F} \; L(X_S, y)$$
$$\text{subject to} \quad L(X_S, y) \geq L(X, y) - \epsilon$$
where $\epsilon \geq 0$ is a small tolerance value that allows subgroup $S$ to perform slightly worse than using all features $F$. This formulation seeks the subset $S$ that maximizes the performance of the learning model with a guarantee that it is at least as good as when using all features, within tolerance $\epsilon$.
Subgroup $S$ must contain the features most relevant to the prediction problem that are not redundant. Feature $f_i$ is relevant if a change in the feature’s value results in a change in the value of the predicted class variable $C$. Feature $f_i$ is strongly relevant if the use of $f_i$ in the predictive model eliminates the ambiguity in the classification of instances. Feature $f_i$ is weakly relevant if $f_i$ becomes strongly relevant when a subset of the features is removed from the set of available features. This implies that a feature is irrelevant if it is neither strongly relevant nor weakly relevant [39]. For redundancy, feature $f_i$ is redundant relative to class variable $C$ and a second feature $f_j$ if $f_i$ has stronger predictive power for $f_j$ than for the class variable $C$ [40]. However, to make feature selection decisions, several researchers have defined a redundant feature as one that has a high correlation with all the other features [41].
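To make the formulation concrete, the following minimal sketch evaluates a candidate feature subgroup against the full feature set under a tolerance $\epsilon$. The dataset, classifier, tolerance value, and example subgroup are illustrative assumptions, not the study's experimental configuration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

def performance(feature_subset):
    """L(X_S, y): mean cross-validated accuracy using only the given features."""
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return cross_val_score(clf, X[:, feature_subset], y, cv=5, scoring="accuracy").mean()

epsilon = 0.01                                   # tolerance from the constraint above
full = performance(list(range(X.shape[1])))      # L(X, y) using all p features
subset = list(range(10))                         # an example candidate subgroup S
sub = performance(subset)

print(f"L(X, y)   = {full:.3f}")
print(f"L(X_S, y) = {sub:.3f}")
print("constraint satisfied:", sub >= full - epsilon)
```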

4. Proposed Method

To optimize the feature selection and identify the most important group of features, we compute the PCC of the feature weights that are produced via the adaptive feature weight genetic explanation (AFWGE) in the generated counterfactuals, as explained in detail in [13]. AFWGE is a method that extends the standard GA with adaptive feature weights, allowing for improved counterfactual explanations. A standard GA operates on a population of random candidate solutions, usually called chromosomes. Every chromosome represents one potential solution to the problem under study. The solutions are iteratively updated by applying operations such as selection, crossover, and mutation, with the goal of optimizing a fitness function. This iterative process repeats until some termination condition is met, such as when a certain number of generations is reached. AFWGE follows this structure but introduces adaptive feature weights that evolve together with the counterfactual solutions.
AFWGE analyzes the explainability of a predictive model given only black-box access to the model and the data instances. First, AFWGE initializes the population of chromosomes, with each chromosome composed of two parts: the feature values and the weights assigned to those features (Figure 1). The feature weights are initialized uniformly and evolve adaptively throughout the search process, influencing the algorithm’s behavior. Next, the candidates are ranked according to an objective function; solutions are ranked higher when they lie at a smaller distance from the original instance. This is followed by the selection step, where a subset of the fittest chromosomes in the population is selected for crossover. During the crossover phase, pairs of chromosomes exchange feature values over selected features and create offspring. Mutation is then applied to the offspring; in this stage, both feature values and weights are mutated. After mutation, the population is filtered to remove all chromosomes that return the same prediction as the original instance. This ensures that only chromosomes whose predictions differ from that of the original instance remain in the population, so that the algorithm focuses on constructing valid counterfactual explanations. The process iterates until a stopping condition is met, such as when the best fitness score has not improved over five generations. Finally, AFWGE provides a list of counterfactual explanations with the final feature weights. The analysis highlights the role of feature weights as reliable indicators of feature importance, providing valuable insights for understanding and interpreting the mechanisms underlying the model.
Figure 1. AFWGE flowchart.
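The following is a deliberately simplified sketch of the loop just described, assuming numerical features only. The fitness function, operator details, population handling, and hyperparameters are illustrative stand-ins rather than the actual AFWGE implementation, which is given in [13]; `predict` stands for the black box model's prediction function (e.g., a fitted scikit-learn classifier's `predict` method) and `x_orig` is the instance being explained.

```python
import numpy as np

rng = np.random.default_rng(0)

def afwge_like_search(predict, x_orig, n_pop=50, patience=5, p_mut=0.1):
    """Toy GA over chromosomes = (feature values, feature weights)."""
    d = len(x_orig)
    values = x_orig + rng.normal(scale=0.5, size=(n_pop, d))   # candidate feature values
    weights = np.full((n_pop, d), 1.0 / d)                     # uniformly initialized weights
    y_orig = predict(x_orig[None, :])[0]

    def fitness(v, w):
        # Higher fitness = smaller weighted distance to the original instance.
        return -np.sum(w * (v - x_orig) ** 2, axis=1)

    best, stall = -np.inf, 0
    while stall < patience:
        # Selection: keep the fittest half as parents.
        order = np.argsort(fitness(values, weights))[::-1]
        parents = order[: n_pop // 2]

        # Crossover: offspring mix feature values (and weights) from two parents.
        pa, pb = rng.choice(parents, n_pop), rng.choice(parents, n_pop)
        mask = rng.random((n_pop, d)) < 0.5
        values = np.where(mask, values[pa], values[pb])
        weights = np.where(mask, weights[pa], weights[pb])

        # Mutation: perturb both feature values and feature weights.
        mut = rng.random((n_pop, d)) < p_mut
        values = values + mut * rng.normal(scale=0.3, size=(n_pop, d))
        weights = np.abs(weights + mut * rng.normal(scale=0.05, size=(n_pop, d)))
        weights = weights / weights.sum(axis=1, keepdims=True)

        # Filtering: keep only candidates whose prediction differs from the original one.
        keep = predict(values) != y_orig
        if keep.any():
            idx = rng.choice(np.flatnonzero(keep), n_pop)      # refill the population
            values, weights = values[idx], weights[idx]

        gen_best = fitness(values, weights).max()
        stall = 0 if gen_best > best else stall + 1
        best = max(best, gen_best)

    top = np.argmax(fitness(values, weights))
    return values[top], weights[top]   # one counterfactual and its learned feature weights
```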
Therefore, given a black box predictive model and a dataset as inputs, the AFWGE method produces superior counterfactual explanations together with the final feature weights (Figure 2). These feature weights are then used as the input and combined with correlation-based redundancy reduction, where the correlation between the features is computed using the PCC [42] to obtain the most crucial group of features. This technique calculates the linear relationship between each pair of features’ weights, thus detecting possible redundancy. If two features are found to have high absolute PCC values with each other, then we retain the one that has lower absolute PCC values with the other features and eliminate the other. The PCC is given by Equation (3):
$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E\left[(X - \mu_X)(Y - \mu_Y)\right]}{\sigma_X \sigma_Y} \tag{3}$$
Figure 2. Process flow diagram of the proposed method.
The PCC, computed using Equation (3), is used to assess the linear correlation between two variables, X and Y. The function $\mathrm{cov}(X, Y)$ is the covariance of X and Y; $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y, respectively; $\mu_X$ and $\mu_Y$ are the respective means; and $E$ is the expected value.
We can obtain a formula for $\rho_{X,Y}$ by substituting sample-based estimates of the covariance and variances into the formula above. Given paired data $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, consisting of $n$ pairs, $\rho_{X,Y}$ is defined as per Equation (4), where $n$ is the sample size, $x_i$ and $y_i$ are the individual sample points indexed by $i$, and $\bar{x}$ and $\bar{y}$ are the sample means of variables X and Y, respectively.
$$\rho_{X,Y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{4}$$
$\rho_{X,Y}$ ranges from +1 to −1, where a value of +1 indicates that X is completely positively linearly correlated to Y. Conversely, a value of 0 signifies that X is not linearly correlated to Y at all. Finally, a value of −1 indicates that X is completely negatively linearly correlated to Y. X and Y are typically considered to exhibit an extremely strong correlation with each other when $\rho_{X,Y}$ is greater than 0.8. Moreover, X and Y exhibit a strong correlation to each other when $\rho_{X,Y}$ is greater than 0.6 [43].
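A minimal sketch of this correlation-based redundancy reduction is shown below, applied to a matrix of per-counterfactual feature weights (rows: generated counterfactuals; columns: features). The input matrix, feature names, and threshold value are illustrative assumptions, not the study's actual data.

```python
import numpy as np

def select_feature_group(weight_matrix, feature_names, threshold=0.8):
    """Drop one member of every highly correlated pair of feature-weight columns."""
    abs_corr = np.abs(np.corrcoef(weight_matrix, rowvar=False))   # pairwise |PCC|
    p = abs_corr.shape[0]
    keep = set(range(p))
    for i in range(p):
        for j in range(i + 1, p):
            if i in keep and j in keep and abs_corr[i, j] >= threshold:
                # Retain the feature that is, on average, less correlated with the rest.
                mean_i = (abs_corr[i].sum() - 1.0) / (p - 1)
                mean_j = (abs_corr[j].sum() - 1.0) / (p - 1)
                keep.discard(i if mean_i > mean_j else j)
    return [feature_names[k] for k in sorted(keep)]

# Illustrative call with random stand-in weights for five features.
rng = np.random.default_rng(1)
names = [f"feature_{k}" for k in range(5)]
print(select_feature_group(rng.random((200, 5)), names, threshold=0.8))
```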

5. Computational Experiments

We empirically tested the usefulness of our proposed method for users, model developers, and regulators in two steps: first, training and testing each classifier with the Original Data (containing all the features); second, training and testing it with the updated dataset containing only the group of crucial features obtained via our proposed method. In the following subsections, we explain all the details concerning the experimental setup and the implementation environment used to test our approach, followed by all the details of the results obtained.

5.1. Experimental Setup

5.1.1. Datasets

We took the same four real datasets used in AFWGE [13] and CERTIFAI [14]: (1) the Adult dataset [24], used to predict whether the income of an adult exceeds 50,000 USD/year using U.S. census data from 1994; (2) the Diagnostic Wisconsin Breast Cancer dataset [25], used to diagnose whether a breast tumor is malignant or benign; (3) the Pima Indians Diabetes dataset [26], used to predict whether or not a patient has diabetes based on certain diagnostic measurements, where all patients are women at least 21 years old and of Pima Indian heritage; and (4) the Iris dataset [27], used to classify an Iris plant into one of three types. Table 2 provides the details of the datasets considered for the experiments.
Table 2. Characteristics of the datasets.

5.1.2. Rival Methods

The aim of the experiments was to select the most important group of features underlying a counterfactual explanation by computing the PCC for the feature weights from AFWGE, as explained in Section 4. After that, the dataset was updated by keeping the important group of features that was obtained and excluding the rest. Next, we trained and tested the classifier on the updated dataset.
As previously mentioned, both CERTIFAI and AFWGE use custom GAs. However, the main difference between them is that AFWGE assesses feature importance by incorporating feature weights as part of the solution and dynamically evolves them alongside feature values throughout the evolutionary process, while CERTIFAI assesses feature importance by computing the number of feature changes in the produced counterfactual explanations. Therefore, for each dataset, we compared each classifier’s performance using three different feature groups:
  • Original data feature groups: The important group of features identified using the PCC for feature values from Original Data;
  • AFWGE feature groups: The important group of features identified using the PCC for the feature weights produced via AFWGE [13];
  • CERTIFAI feature groups: The important group of features identified using the PCC for the number of feature changes, as produced via CERTIFAI [14].

5.1.3. Evaluation Metrics

To measure the performance of each classifier, we used the following widely accepted performance statistics: accuracy, precision, recall, and F1 score.
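These four metrics can be computed as in the short sketch below; the synthetic predictions are purely illustrative.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
```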

5.1.4. The Classification Models

We chose nine ML classifiers to evaluate the same four datasets: Adult, Breast Cancer, Pima Indians Diabetes, and Iris. These included three simple learners, namely LR, linear and radial SVM (L-SVM and R-SVM), and DT, as well as an ensemble learner, RF, which uses a set of homogeneous DTs as its base classifiers. Additionally, we used AdaBoost, K-NN, Gradient Boosting, and MLP.
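A sketch of the nine classifiers with scikit-learn defaults, as assumed from the description above, is given below; the exact parameter settings used in the study may differ.

```python
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

classifiers = {
    "LR":                LogisticRegression(),
    "L-SVM":             SVC(kernel="linear"),
    "R-SVM":             SVC(kernel="rbf"),
    "DT":                DecisionTreeClassifier(),
    "RF":                RandomForestClassifier(),
    "AdaBoost":          AdaBoostClassifier(),
    "K-NN":              KNeighborsClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "MLP":               MLPClassifier(),
}
```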

5.1.5. Setup

We used the scikit-learn library within a Python implementation environment on macOS. The classifier parameters were left at the defaults provided by the scikit-learn library. All datasets were evaluated with the same model configurations. To obtain the feature weights selected via AFWGE and the number of feature changes of CERTIFAI, we reimplemented AFWGE and CERTIFAI in Python 3.10 on a Mac Studio (Apple, Cupertino, CA, USA) with an M1 Ultra chip and 128 GB of RAM, using the same four real datasets used in AFWGE and CERTIFAI with the same parameters as in [13]. We implemented all three methods for finding the most important group of features, using either AFWGE, CERTIFAI, or Original Data, in Python 3.10. All experiments were run on a Mac with a 1.4 GHz quad-core Intel Core i5 and 8 GB of RAM. For the PCC threshold, we applied multiple tests with values equal to 0.6, 0.7, 0.8, 0.9, 0.95, and 0.99. These threshold values were chosen to systematically evaluate the impact of varying levels of correlation on feature reduction. They provide a balanced range from high to extremely high correlation, allowing us to explore the trade-off between feature redundancy and model interpretability. Studies aiming to improve model performance have also reported better results with a PCC threshold of 0.6 or higher, which is a commonly used threshold in feature selection studies [44]. While these thresholds performed well across the datasets used in this study, their applicability may vary depending on dataset characteristics, such as feature dimensionality and the degree of correlation among features. Selecting an optimal threshold for a given dataset may therefore require further analysis of its correlation structure.
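The following sketch reproduces the shape of this threshold sweep on synthetic stand-in data; the random weight matrix, the generated dataset, and the Random Forest evaluator are placeholders rather than the study's actual inputs and code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
weights = rng.random((200, 8))   # placeholder for the AFWGE feature-weight matrix

def crucial_features(w, threshold):
    """Indices kept after PCC-based redundancy filtering at the given threshold."""
    corr = np.abs(np.corrcoef(w, rowvar=False))
    keep = list(range(corr.shape[1]))
    for i in range(corr.shape[1]):
        for j in range(i + 1, corr.shape[1]):
            if i in keep and j in keep and corr[i, j] >= threshold:
                # Drop whichever member of the pair is more correlated overall.
                keep.remove(j if corr[j].mean() > corr[i].mean() else i)
    return keep

for t in (0.6, 0.7, 0.8, 0.9, 0.95, 0.99):
    selected = crucial_features(weights, t)
    acc = cross_val_score(RandomForestClassifier(random_state=0), X[:, selected], y, cv=5).mean()
    print(f"threshold={t:<4} kept {len(selected)}/8 features, accuracy={acc:.3f}")
```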

5.2. Experimental Results

This section presents the outcomes of training and testing the classifiers using the three different methods for calculating the most important group of features and comparing them according to the aforementioned evaluation criteria. First, we present the PCC results of each method. Then, we assess and compare the classifier metrics, each in a separate subsection: accuracy, precision (Appendix A), recall (Appendix B), and F1 score. Next, we provide an extensive evaluation of the proposed method’s performance. Finally, we provide insights into the relationship between the collective importance of feature groups and the importance of individual features.

5.2.1. PCC Results

Examining the heatmaps of the Pearson correlation coefficients for the Adult, Breast Cancer, Pima Indians Diabetes, and Iris datasets was crucial for understanding the relationships between features in each dataset. These heatmaps are visual representations of the strength and direction of the linear relationship between each pair of features, giving insights into the relevance of features and possible redundancies. Moreover, the heatmaps demonstrate how the different inputs used to compute the PCC for each method can impact the correlation values for each dataset, as shown next. This comparison highlights how well the AFWGE feature weights serve as indicators of feature correlations compared with the number of feature changes in CERTIFAI and the original feature values.
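For reference, a heatmap of this kind can be produced as in the sketch below; the random weight matrix stands in for each method's input (original feature values, AFWGE feature weights, or CERTIFAI change counts) and is not the study's data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Stand-in weight matrix: rows = counterfactuals, columns = features.
rng = np.random.default_rng(7)
weights = pd.DataFrame(rng.random((150, 5)),
                       columns=[f"feature_{k}" for k in range(5)])

sns.heatmap(weights.corr(method="pearson"), annot=True, fmt=".2f",
            cmap="coolwarm", vmin=-1, vmax=1)
plt.title("PCC heatmap (illustrative data)")
plt.tight_layout()
plt.show()
```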
  • Adult:
For the Adult dataset, the heatmap in Figure 3 shows weak correlations, with maximum PCC values of around ±0.3, which might have been caused by the notable class imbalance. About 75% of the instances belong to the ‘income ≤ 50 K’ class, while the remaining 25% belong to the ‘income > 50 K’ class. On the other hand, the feature weights produced by AFWGE (Figure 4) and the number of feature changes produced by CERTIFAI (Figure 5) show higher correlations, with maximum PCC values of around ±0.8 and ±0.7, respectively. These high correlations are not due to the unified numerical type of the inputs (i.e., the weights from AFWGE or the number of changes from CERTIFAI); rather, the AFWGE feature weights and the CERTIFAI feature change counts appear to encode categorical features, and all features in general, more effectively, allowing correlations to be interpreted more accurately.
Figure 3. Adult heatmap of Original Data PCC using feature values from Original Data.
Figure 4. Adult heatmap of AFWGE PCC using the feature weights produced by AFWGE.
Figure 5. Adult heatmap of CERTIFAI PCC using the number of feature changes produced by CERTIFAI.
  • Breast Cancer:
This dataset is balanced with two classes (malignant and benign), consisting entirely of numerical features related to medical measurements, and is likely to show stronger and more meaningful correlations. The heatmap of the Original Data PCC using the feature values from the Breast Cancer dataset in Figure 6 shows strong correlations, with maximum PCC values of around 0.9 and PCC values of at least 0.7 for most features; it represents 21% of all feature correlation pairs (92 pairs out of 435). Ambiguity is introduced by the high correlations among the features since feature selection methods might struggle to distinguish which feature to keep when multiple features are highly correlated. This can lead to uncertain decisions about which features are truly significant, potentially leading to the exclusion of valuable features or the inclusion of redundant ones [37]. On the other hand, the feature weights of AFWGE (Figure 7) and the number of feature changes of CERTIFAI (Figure 8) show how these high correlations are filtered by eliminating a number of them. Feature correlation pairs with PCC values of at least 0.7 were eliminated by AFWGE (15 pairs out of 435) and CERTIFAI (20 pairs out of 435). This filtration prevents ambiguity in the feature selection process.
Figure 6. Breast Cancer heatmap of Original Data PCC using feature values from Original Data.
Figure 7. Breast Cancer heatmap of AFWGE PCC using the feature weights produced by AFWGE.
Figure 8. Breast Cancer heatmap of CERTIFAI PCC using the number of feature changes produced by CERTIFAI.
  • Pima Indians Diabetes:
The Pima Indians Diabetes dataset consists of numerical features, yet it shows weak correlations among these features, as shown in Figure 9, with maximum PCC values of around 0.5. The weak correlations might have been due to dataset imbalance, with 35% labeled as diabetic and 65% of the instances labeled as nondiabetic. Nevertheless, the features might not be strongly correlated with each other, but they are still relevant for predicting diabetes. On the other hand, the feature weights of AFWGE (Figure 10) and the number of feature changes of CERTIFAI (Figure 11) show higher correlations than Original Data. However, they still do not have strong PCC values, i.e., not exceeding 0.6 for either AFWGE or CERTIFAI.
Figure 9. Pima Indians Diabetes heatmap of Original Data PCC using feature values from Original Data.
Figure 10. Pima Indians Diabetes heatmap of AFWGE PCC using the feature weights produced by AFWGE.
Figure 11. Pima Indians Diabetes heatmap of CERTIFAI PCC using the number of feature changes produced by CERTIFAI.
  • Iris
The Iris dataset’s simplicity, having a small number of features and balanced classes, shows clear and strong correlations among the features, as shown in the Original Data heatmap (Figure 12). The figure shows a maximum PCC value of around 0.9 and PCC values of at least 0.8 for most of the features, i.e., 50% of all feature correlation pairs (three pairs out of six). Again, as previously discussed in the Breast Cancer case, this high amount of feature correlations can lead to ambiguous decisions in feature selection. However, the feature weights of AFWGE (Figure 13) and the number of feature changes by CERTIFAI (Figure 14) show how these high correlations are filtered by eliminating a number of them, which helps to prevent ambiguity in the feature selection process. Feature correlation pairs with PCC values of at least 0.8 were eliminated in AFWGE and in CERTIFAI (i.e., from three pairs to one pair out of six).
Figure 12. Iris heatmap of Original Data PCC using feature values from Original Data.
Figure 13. Iris heatmap of AFWGE PCC using the feature weights produced by AFWGE.
Figure 14. Iris Heatmap of CERTIFAI PCC using the number of feature changes produced by CERTIFAI.

5.2.2. Accuracy

  • Adult dataset:
When comparing the performance using all features against the performance using only the important group of features across the three methods—Original Data, CERTIFAI, and AFWGE—on the Adult dataset, we obtained distinct outcomes, as shown in Table 3. For Original Data, there was no change in the accuracy of any of the nine classifiers. In contrast, the AFWGE method achieved enhanced accuracies for the L-SVM, R-SVM, LR, K-NN, and MLP classifiers, while the accuracies for RF, AdaBoost, DT, and Gradient Boosting remained unchanged. Meanwhile, the CERTIFAI method did not affect the accuracies of most classifiers; however, it led to an improvement in the K-NN classifier’s accuracy and a reduction in the MLP classifier’s accuracy.
Table 3. The percentage of average accuracy change when using each method on the Adult dataset for each classifier, where a positive change is in purple, a negative change is in orange, and the best for each classifier is indicated in bold.
The percentage change in accuracy across Original Data, AFWGE, and CERTIFAI reflects how each of the resulting groups of features affects the model’s ability to classify instances correctly. When comparing the values in Table 3, we can identify which method performs best and which performs worst. A positive percentage change in accuracy (highlighted in purple) indicates improvement, while a negative percentage change (highlighted in orange) indicates a reduction. The method with the highest percentage change (in bold) had the most significant positive impact on accuracy. In Table 3, the group of features selected by AFWGE shows the highest positive changes across multiple classifiers.
The threshold values for the PCCs were 0.6 and 0.7 for selecting the crucial group of features. The performance in terms of accuracy was the best for the most affected classifiers with the selected features at these thresholds, as shown in Figure 15. This increase in accuracy suggests that the features chosen at these PCC thresholds capture the main characteristics of the data, leading to the enhanced predictive performance of the classifiers.
Figure 15. Classifier accuracies on the Adult dataset for three rival methods using different PCC threshold values. (a) Linear SVM. (b) LR. (c) K-NN. (d) MLP.
Moreover, we conducted a Wilcoxon signed rank test to see if there was a statistically significant difference between the classifiers’ accuracies before and after using the group of features selected by AFWGE. The Wilcoxon test uses the following null (h0) and alternative (hA) hypotheses: h0: the average value of accuracies is equal between the two groups; hA: the average value of accuracies is not equal between the two groups. The test indicates that there was, indeed, a significant difference (at a significance level of α = 0.05); thus, the null hypothesis was rejected. This is verification that the average value of accuracies after using the selected feature group of AFWGE is significantly better than using all the Adult dataset features (Figure 16).
Figure 16. Classifier accuracies on the Adult dataset before and after using the group of features selected by AFWGE.
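A minimal sketch of such a paired Wilcoxon signed rank test with SciPy is shown below; the accuracy values are illustrative placeholders, not the reported results.

```python
from scipy.stats import wilcoxon

# Illustrative paired accuracies for the nine classifiers:
# before (all features) vs. after (the AFWGE-selected feature group).
acc_all_features      = [0.80, 0.79, 0.81, 0.78, 0.80, 0.79, 0.82, 0.80, 0.81]
acc_selected_features = [0.82, 0.81, 0.83, 0.79, 0.82, 0.80, 0.83, 0.83, 0.84]

stat, p_value = wilcoxon(acc_all_features, acc_selected_features)
print(f"Wilcoxon statistic = {stat}, p-value = {p_value:.4f}")
# The null hypothesis of equal average accuracy is rejected when p_value < 0.05.
```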
  • Breast Cancer dataset:
Table 4 shows the performance results of the three methods (Original Data, CERTIFAI, and AFWGE) on the Breast Cancer dataset. The results in Table 4 reveal the following findings: For Original Data, all nine classifiers, except L-SVM and LR, dropped in accuracy. In contrast, using AFWGE resulted in better accuracies for the L-SVM, R-SVM, LR, AdaBoost, DT, and Gradient Boosting classifiers, while that of the K-NN classifier remained the same; however, there was a decrease in accuracy for the RF and MLP classifiers. Meanwhile, when using the CERTIFAI method, the L-SVM, LR, DT, and Gradient Boosting classifiers showed enhanced accuracy, while there was a decrease in the accuracy of the R-SVM, RF, AdaBoost, K-NN, and MLP classifiers.
Table 4. The percentage of average accuracy change using each method on the Breast Cancer dataset for each classifier, where a positive change is in purple, a negative change is in orange, and the best for each classifier is in bold.
Again, from Table 4, the percentage changes in accuracy for the Original Data, AFWGE, and CERTIFAI methods reveal how each method’s resulting group of features impacts the model’s ability to classify instances correctly. From this table, it is obvious that the group of features of AFWGE shows the highest positive changes across multiple classifiers.
For the Breast Cancer dataset, the most important group of features was selected when the PCC threshold was equal to 0.6, 0.7, and 0.8. The selected features at these thresholds provided the best performance in terms of accuracy for the most affected classifiers, as shown in Figure 17. The selected features at these PCC thresholds capture the important characteristics of the data, enhancing the accuracy performance of the classifiers.
Figure 17. Classifier accuracies on the Breast Cancer dataset using three rival methods with different PCC threshold values. (a) Linear SVM. (b) LR. (c) K-NN. (d) MLP.
Again, the results were subjected to a Wilcoxon signed rank test to test the statistically significant difference between the classifier’s accuracies before and after using AFWGE’s selected group of features. Additionally, the test indicated that there was a significant difference at a significance level of α = 0.05; thus, the null hypothesis was rejected. This is verification that the average value of accuracy after using AFWGE’s selected crucial features is significantly better than that using all the Breast Cancer dataset features (Figure 18).
Figure 18. Classifier accuracies on the Breast Cancer dataset before and after using the group of features selected by AFWGE.
  • Pima Indians Diabetes dataset:
Table 5 demonstrates the accuracy results on the Pima Indians Diabetes dataset for the three compared methods. When evaluating the results in Table 5, we observed distinct patterns: for Original Data, there was no change in accuracy for any of the nine classifiers. In contrast, AFWGE improved the accuracies of the L-SVM, LR, RF, and MLP classifiers, though it had no effect on the R-SVM, AdaBoost, or DT classifiers; however, the accuracies of the K-NN and Gradient Boosting classifiers went down. For the CERTIFAI approach, the accuracy of the LR, AdaBoost, and K-NN classifiers increased, and the accuracy of the L-SVM, R-SVM, RF, DT, Gradient Boosting, and MLP classifiers decreased.
Table 5. The percentage of average accuracy change using each method for the Pima Indians Diabetes dataset for each classifier, where a positive change is in purple, a negative change is in orange, and the best for each classifier is in bold.
Again, from Table 5, the most important group of features of AFWGE shows the highest positive changes across multiple classifiers for the Pima Indians Diabetes dataset, indicating the best performance among the three rival methods.
As shown in Figure 19, the selected features at a PCC threshold of 0.6 provided the best performance in terms of accuracy for the most affected classifiers.
Figure 19. Classifier accuracies on the Pima Indians Diabetes dataset using three rival methods with different PCC threshold values. (a) Linear SVM. (b) LR. (c) K-NN. (d) MLP.
Once more, we performed a Wilcoxon signed rank test between the classifiers’ accuracies before and after using the most important group of features of AFWGE. The results, again, showed that there was a significant difference at α = 0.05; thus, the null hypothesis was rejected. Thus, we can conclude that the average value of accuracy after AFWGE is significantly better than using all the Pima Indians Diabetes dataset’s features (Figure 20).
Figure 20. Classifier accuracies on the Pima Indians Diabetes dataset before and after using the group of features selected by AFWGE.
  • Iris dataset:
Prior to utilizing the selected group of features, all classifiers employed on the Iris dataset demonstrated exceptionally high accuracies, nearly equivalent to 1.0, often indicating overfitting. Certain characteristics of the Iris dataset, such as its small size, low dimensionality, and distinct class separability, can contribute to overfitting. However, when the Original Data method was used, the accuracies of all nine classifiers were reduced, as shown in Table 6. Similarly, the AFWGE method resulted in a moderate decrease in accuracies for all nine classifiers, with the exception of the R-SVM classifier, which maintained the same accuracy. The CERTIFAI method also led to reduced accuracies for all nine classifiers. Nevertheless, the results in Table 6 demonstrate that the most important group of features of AFWGE exhibited the minimum reduction in accuracy across multiple classifiers.
Table 6. The percentage of average accuracy change using each method for the Iris dataset for each classifier, where a negative change is in orange, and the best for each classifier is in bold.
As shown in Figure 21, the most important group of features was selected when the PCC threshold was equal to 0.9, 0.95, and 0.99. This limited the reduction in accuracy at these PCC thresholds; however, it effectively minimized overfitting, demonstrating that the selected features retained the essential characteristics of the data without allowing the classifiers to overfit. Therefore, the selected features are the most important features, supporting both generalizability and interpretability.
Figure 21. Classifier accuracies on the Iris dataset using the three rival methods with different PCC threshold values. (a) Linear SVM. (b) LR. (c) K-NN. (d) MLP.
We ran another Wilcoxon signed rank test to check whether there was a statistically significant difference between the classifiers’ accuracies before and after using the most important group of features of AFWGE. Once again, the findings demonstrated a significant difference at a significance level of α = 0.05; thus, the null hypothesis was rejected. Therefore, it can be concluded that the average value of the accuracies after using AFWGE is significantly better than using all the Iris dataset features since it reduces overfitting (Figure 22).
Figure 22. Classifier accuracies on the Iris dataset before and after using the group of features selected by AFWGE.

5.2.3. Precision

The results demonstrate a clear improvement in classifiers’ performance when using the selected group of features, particularly with the AFWGE method. This highlights the effectiveness of AFWGE in enhancing feature selection. The details and comprehensive analysis of these results are provided in the Appendix (See Appendix A).

5.2.4. Recall

Using the selected set of features, especially the AFWGE method, significantly improves classifier performance, according to the results. This demonstrates how effectively AFWGE improves feature selection. Results are thoroughly analyzed and detailed in the Appendix (See Appendix B).

5.2.5. F1-Score

The results show that classifier performance is much enhanced by using the selected group of features, especially with the AFWGE method. This illustrates that AFWGE enhances feature selection effectively. The results are presented in the appendix with analysis (See Appendix C).

5.3. Results: Analysis and Discussion

The detailed results presented above show how the classifiers’ performance improved using the resulting selected group of features, especially when using the AFWGE method. The important group of features obtained through the weights of AFWGE with the PCC improved all performance metrics for all the tested classifiers. Since AFWGE provided the highest positive percentage change for all the classifiers’ performance metrics, we used the Wilcoxon test to evaluate the significance of AFWGE’s superiority over the Original Data and CERTIFAI methods. The results of this testing are presented in Table 7, where the method(s) with statistically significant performance is (are) indicated in each case.
Table 7. Superior method(s) based on the Wilcoxon test for each performance criterion for all nine classifiers.
Moreover, a summary of the statistical significance results using the Wilcoxon test to indicate whether performance was significantly enhanced when using the group of features selected by AFWGE is shown in Table 8. From this table, it is demonstrated that AFWGE provided a significant improvement across all four performance metrics for all nine classifiers on the four datasets, except for precision and recall on the Pima Indians Diabetes dataset. Nonetheless, on this dataset, the Wilcoxon test indicated that AFWGE provided a significant improvement in the F1 score, which is defined as the harmonic mean of precision and recall and is derived from the confusion matrix; thus, it can completely describe the outcome of a classification task [45].
Table 8. Summary of the Wilcoxon test for all performance criteria used to assess the enhancement in performance before and after using the selected group of features for the AFWGE method and each dataset. Yes denotes a significant enhancement, and No denotes no significant enhancement.
In fact, out of the four datasets, only the Pima Indians Diabetes dataset has an imbalance toward the negative class (nondiabetic), since 65% of the instances are classified as not having diabetes, and 35% are classified as having diabetes. An ML model’s ability to accurately detect instances of the minority class (diabetic) can be negatively affected by this imbalance if the model becomes biased toward the majority class (nondiabetic). The overall accuracy and F1 score may have improved for this reason; these measures can conceal weak performance on minority classes. Because the majority class is larger and dominates the measure, accuracy can improve. Since the F1 score is the harmonic mean of precision and recall, it can still improve even when one of its components is lower, provided the other becomes sufficiently high. Precision and recall place greater emphasis on correctly identifying the minority class. Therefore, if the model’s ability to correctly predict the positive class (diabetic) worsens, this has a stronger effect on precision and recall [46].
Finally, it is worth noting that lower metric results do not necessarily mean worse performance. For example, on the Iris dataset, most performance criteria were approximately equal to one (i.e., 100%) before using the selected group of features, which indicated overfitting. In fact, FS is a widely recognized task in ML, which has the aim of reducing the chances of overfitting a model on a dataset [47]. Therefore, the performance criteria with the Iris dataset were reduced significantly using the selected group of features.

5.4. Analysis of the Importance of a Group of Features vs. Individual Feature Importance

The strategy for identifying the contribution of each feature through feature importance vs. the contribution of a group of features was also significant in the current analysis. Table 9 shows the selected group of features for each dataset obtained via the AFWGE method with the PCC, as explained previously. Correlation analysis is commonly performed to identify hidden patterns in data (e.g., as in [37]) and to distinguish the most important group of features, i.e., the group of features that collectively have a crucial impact on performance. For example, as can be seen for the Pima Indians Diabetes dataset, the feature importance ranking using AFWGE [13], LIME (local interpretable model-agnostic explanations), and LPI (local permutation importance) explanations [48] of “Blood Pressure” is higher than that of “Insulin”. However, the most important group of features of AFWGE inferred from our FS results, as shown in Table 9, does not reveal the same information. From this table, we can see that the best performance of the classifiers on the Pima Indians Diabetes dataset included “Insulin” instead of “Blood Pressure”, although the latter had higher importance than “Insulin” as an individual feature (as mentioned above).
Table 9. The most important group of features selected by AFWGE on each dataset.
Thus, we inferred that determining which group of features in a learning model’s dataset is most important for performance is distinct from simply calculating the importance of each individual feature and selecting the highest-ranked ones. The concept of feature importance focuses on the contribution of individual features to the predictions made by the model. On the other hand, determining which features constitute an important group requires an understanding of how different combinations of features interact with one another and collectively influence the model’s performance. As previously mentioned, an example is the case of the Pima Indians Diabetes dataset, where “Insulin” was included in the most important group of features instead of “Blood Pressure”, even though it had lower weight as an individual feature of importance. It seems that, somehow, “Insulin”, together with the other selected important features, influenced the decision of the model and compensated for the absence of “Blood Pressure” in the selected feature set. When using the PCC, we could identify the feature groups that best captured the most important patterns to gain a deeper understanding of the data’s fundamental structure. For instance, in high-stakes domains such as healthcare or finance, users can rely on AFWGE-derived feature weights to focus on the most relevant relationships among the features, making counterfactual explanations more actionable and aligned with domain-specific requirements.
The utilization of the counterfactual explanations of AFWGE, with their embedded feature weights, contributes to the effectiveness of the PCC approach. Thus, it is possible to identify the most important group of features more effectively by combining these weights with correlation analysis. AFWGE provides interpretability by adaptively learning feature weights throughout the optimization process. This continuous adaptation allows AFWGE to capture the dynamic influence of each feature within the counterfactual generation process. Furthermore, the incorporation of the PCC allows the method to reduce feature redundancy, providing more stable and relevant insights for end users to understand feature relationships and dependencies. Thus, our proposed method helps end users focus on the most critical features and recognize which features most significantly influence the model’s predictions, rather than being confused by redundant or less informative ones.
This helps to ensure that the selected groups of features are both individually relevant and collectively significant in improving the learning model’s performance.

6. Conclusions

Feature selection (FS) is an important process in machine learning (ML), where the goal is to find and select the most relevant features from a dataset to enhance the model’s performance and reduce overfitting. Our proposed method leverages the feature weights obtained using the adaptive feature weight genetic explanation (AFWGE) method, combined with the Pearson correlation coefficient (PCC), to identify the most important group of features, thereby optimizing FS and enhancing model performance. Extensive testing was performed using the proposed approach on four real-world datasets that encompass both continuous and categorical features, each with nine separate classifiers. By adopting a comprehensive approach, we were able to evaluate the effectiveness of the proposed method with different models and with various types of data.
The results demonstrated that the important group of features identified by AFWGE, together with the PCC, offers a significant improvement in terms of accuracy, precision (Appendix A), recall (Appendix B), and F1 score on most datasets and for most classifier models. Moreover, the results demonstrated the superiority of weights from AFWGE over the number of feature changes from CERTIFAI and the Original Data values with the PCC in enhancing the model’s performance.
This study aimed to improve the interpretability of AI models through crucial features’ selection. Machine learning interpretability is becoming increasingly important, especially as ML algorithms grow more complex. In many cases, less performant but explainable models are preferred over more performant black box models, highlighting the need for transparency in decision making. This is why research around XAI has recently become a growing field achieving significant advances. Would users, model developers, and regulators feel confident using a machine learning model if they cannot explain what it does? XAI techniques in FS, such as AFWGE, aim to address this by explaining why certain features are selected or considered important by the model, helping to clarify the relationship between input features and the model’s decision-making process. This is where our AFWGE feature weights with the PCC can be highly beneficial; they provide interpretability by indicating the influence of each feature on the target, making it applicable to any ML model. Focusing on the most important group of features for reversing the decision and producing the desired outcome will help not only individuals, model developers, and regulators to enhance the model’s behavior but also with understanding its behavior and making more informed decisions. This effort will help bridge the gap between model accuracy and explainability, leading to more trustworthy AI systems.
Future work will involve extending the assessment of the proposed method against various existing XAI methods, with additional testing on higher-dimensional and noisier datasets to ensure the method’s robust applicability and performance across diverse scenarios. Moreover, the adaptive nature of the feature weights and the repeated evaluations required for optimization increased the processing time of the AFWGE method. Thus, one of the future goals is to optimize computational efficiency, possibly by leveraging external libraries to parallelize the process. Additionally, we plan to improve the accuracy of the techniques and investigate the performance of hybrid approaches that identify the crucial group of features by integrating AFWGE with other XAI techniques, such as LIME or SHAP.

Author Contributions

Conceptualization, E.A. and M.H.; methodology, E.A. and M.H.; software, E.A.; validation, E.A. and M.H.; formal analysis, E.A. and M.H.; investigation, E.A.; resources, E.A.; data curation, E.A.; writing—original draft preparation, E.A.; writing—review and editing, M.H.; visualization, E.A. and M.H.; supervision, M.H.; project administration, M.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are openly available: [UCI Machine Learning & kaggle] [10.24432/C5XW20, 10.24432/C5DW2B, www.kaggle.com/datasets/uciml/pima-indians-diabetes-database (accessed on 1 July 2024), 10.24432/C56C76] [24,25,26,27].

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Precision

  • Adult Dataset:
In analyzing the precision of the classifiers using the three methods, we observed the following results, presented in Table A1. With the Original Data method, there was no change in the precision of any of the nine classifiers. The AFWGE method, however, led to enhanced precision for the L-SVM, R-SVM, LR, K-NN, and MLP classifiers, while the precision for the RF, AdaBoost, DT, and Gradient Boosting classifiers remained unchanged. On the other hand, the CERTIFAI method produced no change in the precision of any of the nine classifiers. In Table A1, the most important group of features selected by AFWGE shows the highest positive changes across multiple classifiers.
Table A1. The percentage of average precision change using each method on the Adult dataset for each classifier, where a positive change is in purple, and the best for each classifier is in bold.
| Classifier | Original Data | AFWGE | CERTIFAI |
| --- | --- | --- | --- |
| L-SVM | 0.0% | 1.45% | 0.0% |
| R-SVM | 0.0% | 136.91% | 0.0% |
| LR | 0.0% | 3.0% | 0.0% |
| RF | 0.0% | 0.0% | 0.0% |
| AdaBoost | 0.0% | 0.0% | 0.0% |
| DT | 0.0% | 0.0% | 0.0% |
| K-NN | 0.0% | 24.53% | 0.0% |
| Gradient Boosting | 0.0% | 0.0% | 0.0% |
| MLP | 0.0% | 5.6% | 0.0% |
Once again, the Wilcoxon signed rank test for the precision results before and after using the most important group of features selected by AFWGE indicated there was a significant difference at a significance level of α = 0.05; thus, the null hypothesis was rejected. Therefore, the average precision achieved after using the features selected by AFWGE was significantly better than that using all the Adult dataset’s features (Figure A1).
Figure A1. Classifier precisions on the Adult dataset before and after using the group of features selected by AFWGE.
  • Breast Cancer Dataset:
In evaluating the precision of the classifiers for the Breast Cancer dataset, we found varied results, as shown in Table A2. For the Original Data method, the precisions of all nine classifiers were reduced, except for those of the L-SVM, LR, and AdaBoost classifiers, which showed enhanced precision. The AFWGE method resulted in enhanced precision for all nine classifiers, except for the RF classifier, which experienced a reduction, and the K-NN classifier, which showed no change. With the CERTIFAI method, the precisions of the L-SVM, LR, AdaBoost, DT, and Gradient Boosting classifiers were enhanced, while the precisions of the R-SVM, RF, K-NN, and MLP classifiers were reduced. Once again, the results in Table A2 demonstrate that the group of features selected by AFWGE exhibited the highest positive changes across multiple classifiers.
Table A2. The percentage of average precision change using each method on the Breast Cancer dataset for each classifier, where a positive change is in purple, a negative change is in orange, and the best for each classifier is in bold.
| Classifier | Original Data | AFWGE | CERTIFAI |
| --- | --- | --- | --- |
| L-SVM | 4.09% | 12.49% | 2.29% |
| R-SVM | −1.29% | 3.12% | −0.08% |
| LR | 1.38% | 11.54% | 2.69% |
| RF | −2.81% | −1.89% | −0.37% |
| AdaBoost | 0.42% | 7.41% | 0.43% |
| DT | −1.54% | 3.78% | 0.59% |
| K-NN | −4.0% | 0.0% | −1.85% |
| Gradient Boosting | −0.87% | 1.96% | 0.29% |
| MLP | −4.67% | 3.12% | −1.46% |
The Wilcoxon signed rank test for the precision results before and after the group of features selected by AFWGE showed, once again, that there was a significant difference at α = 0.05. Thus, it was verified that using the group of features selected by AFWGE significantly enhanced the average precision as compared to using all the Breast Cancer dataset’s features (Figure A2).
Figure A2. Classifier precisions on the Breast Cancer dataset before and after using the group of features selected by AFWGE.
  • Pima Indians Diabetes Dataset:
Table A3 shows the precision results on the Pima Indians Diabetes dataset. It can be observed from this table that the Original Data method did not produce any change in the precision for any of the nine classifiers. The AFWGE method, on the other hand, resulted in enhanced precision for the L-SVM, LR, and AdaBoost classifiers; the precisions of the K-NN, Gradient Boosting, and MLP classifiers were reduced; and no change was observed for the R-SVM, RF, or DT classifiers. With the CERTIFAI method, the precisions of the LR, AdaBoost, K-NN, and MLP classifiers were enhanced, whereas the precisions of the L-SVM, R-SVM, RF, DT, and Gradient Boosting classifiers were reduced. Again, Table A3 demonstrates that using the group of features selected by AFWGE achieved high positive precision changes across multiple classifiers.
Table A3. The percentage of average precision change using each method on the Pima Indians Diabetes dataset for each classifier, where a positive change is in purple, a negative change is in orange, and the best for each classifier is in bold.
| Classifier | Original Data | AFWGE | CERTIFAI |
| --- | --- | --- | --- |
| L-SVM | 0.0% | 3.23% | −0.18% |
| R-SVM | 0.0% | 0.0% | −0.28% |
| LR | 0.0% | 4.55% | 1.01% |
| RF | 0.0% | 0.0% | −1.29% |
| AdaBoost | 0.0% | 0.55% | 0.34% |
| DT | 0.0% | 0.0% | −0.38% |
| K-NN | 0.0% | −1.23% | 1.44% |
| Gradient Boosting | 0.0% | −7.31% | −0.61% |
| MLP | 0.0% | −11.36% | 1.56% |
Nevertheless, the Wilcoxon signed rank test for the precisions before and after using the group of features selected by AFWGE this time showed that there was no significant difference at a significance level of α = 0.05; thus, we failed to reject the null hypothesis. There was no verification that the average precision after using the group of features selected by AFWGE was significantly better than that using all the Pima Indians Diabetes dataset features (Figure A3).
Figure A3. Classifier precisions for the Pima Indians Diabetes dataset before and after using AFWGE’s selected group of features.
  • Iris Dataset:
On the Iris dataset, once again, before using any important group of features, all classifiers applied exhibited very high precision (almost equal to 1.0, indicating overfitting). On the contrary, when using the Original Data method, the precision of all nine classifiers was reduced, as shown in Table A4. Similarly, the AFWGE method resulted in reduced precisions for all nine classifiers, except for the R-SVM classifier, which maintained the same precision. The CERTIFAI method also led to reduced precision for all nine classifiers. Nevertheless, as shown in Table A4, the group of features selected by AFWGE provided the minimum negative changes across multiple classifiers.
Table A4. The percentage of average precision change using each method on the Iris dataset for each classifier, where a negative change is in orange, and the best for each classifier is in bold.
| Classifier | Original Data | AFWGE | CERTIFAI |
| --- | --- | --- | --- |
| L-SVM | −9.66% | −3.57% | −5.2% |
| R-SVM | −8.47% | 0.0% | −2.82% |
| LR | −8.93% | −1.39% | −3.75% |
| RF | −12.04% | −1.28% | −4.73% |
| AdaBoost | −18.41% | −7.5% | −10.3% |
| DT | −18.52% | −1.39% | −6.94% |
| K-NN | −9.68% | −2.9% | −4.84% |
| Gradient Boosting | −12.81% | −3.57% | −6.25% |
| MLP | −11.16% | −3.28% | −5.54% |
The Wilcoxon signed rank test for this experiment showed, once again, that the results were significantly different at α = 0.05. Therefore, we concluded that the precision after using the selected features from AFWGE was significantly better than that using all the Iris dataset’s features due to reduced overfitting (Figure A4).
Figure A4. Classifier precisions for the Iris dataset before and after using the group of features selected by AFWGE.
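The near-perfect precisions reported for the Iris dataset before feature selection can be illustrated by contrasting training-set precision with cross-validated precision, as in the sketch below. This is only our illustration of the overfitting indication, not the authors' procedure; the macro averaging, the 5-fold split, and the unpruned decision tree are assumptions.

```python
# Sketch: contrast training precision with cross-validated precision on Iris
# to illustrate the overfitting signal described above (assumptions: macro
# averaging, 5-fold cross-validation, an unpruned decision tree).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import precision_score

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0).fit(X, y)

train_precision = precision_score(y, model.predict(X), average="macro")
cv_precision = cross_val_score(model, X, y, cv=5, scoring="precision_macro").mean()

print(f"training precision:        {train_precision:.3f}")  # typically 1.000
print(f"cross-validated precision: {cv_precision:.3f}")     # noticeably lower
```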

Appendix B

Recall

  • Adult Dataset:
In evaluating the recall of the classifiers shown in Table A5 for the Adult dataset, we observed no change in the recall for any of the nine classifiers using the Original Data method. On the other hand, AFWGE led to enhanced recall for the L-SVM, R-SVM, LR, K-NN, and MLP classifiers, while the recall for the RF, AdaBoost, DT, and Gradient Boosting classifiers remained unchanged. With the CERTIFAI method, the recalls of the L-SVM and LR classifiers were enhanced, the recalls of the K-NN and MLP classifiers were reduced, and there was no change in the recall of the R-SVM, RF, AdaBoost, DT, or Gradient Boosting classifiers. Again, we found that using the most important group of features selected by AFWGE exhibited the highest positive recall changes across multiple classifiers.
Table A5. The percentage of average recall change using each method on the Adult dataset for each classifier, where a positive change is in purple, a negative change is in orange, and the best for each classifier is in bold.
Classifier | Original Data | AFWGE | CERTIFAI
L-SVM | 0.0% | 2.11% | 1.53%
R-SVM | 0.0% | 96.1% | 0.0%
LR | 0.0% | 2.53% | 0.59%
RF | 0.0% | 0.0% | 0.0%
AdaBoost | 0.0% | 0.0% | 0.0%
DT | 0.0% | 0.0% | 0.0%
K-NN | 0.0% | 27.83% | −0.05%
Gradient Boosting | 0.0% | 0.0% | 0.0%
MLP | 0.0% | 4.34% | −4.05%
To determine if there was a statistically significant difference between the classifiers’ recalls before and after using the most important group of features selected by AFWGE, we once again carried out a Wilcoxon signed rank test. As before, the results indicated a significant difference at a significance level of α = 0.05. Thus, it was concluded that the average recall when using the most important group of features selected by AFWGE was significantly better than that using all the Adult dataset’s features (Figure A5).
Figure A5. Classifier recalls on the Adult dataset before and after using the group of features selected by AFWGE.
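The per-classifier recall comparisons summarized in Table A5 can be reproduced with a loop of the following form. The classifier settings, the 5-fold cross-validation, and the hypothetical `selected` index list are our assumptions for illustration, not the paper's exact configuration.

```python
# Sketch: average recall before vs. after keeping only a selected group of features.
# `selected` stands for a hypothetical list of column indices produced by the
# explanation method; X and y are the (binary-labeled) dataset arrays.
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier)
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

classifiers = {
    "L-SVM": SVC(kernel="linear"),
    "R-SVM": SVC(kernel="rbf"),
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "DT": DecisionTreeClassifier(),
    "K-NN": KNeighborsClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "MLP": MLPClassifier(max_iter=500),
}

def avg_recall(model, X, y):
    """Mean recall over 5-fold cross-validation (binary classification)."""
    return cross_val_score(model, X, y, cv=5, scoring="recall").mean()

def recall_changes(X, y, selected):
    """Percentage change in average recall per classifier after keeping
    only the selected feature columns (X is a NumPy array)."""
    changes = {}
    for name, model in classifiers.items():
        before = avg_recall(model, X, y)               # all features
        after = avg_recall(model, X[:, selected], y)   # selected group only
        changes[name] = (after - before) / before * 100
    return changes
```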
  • Breast Cancer Dataset:
In assessing the recall of the classifiers using the three methods on the Breast Cancer dataset shown in Table A6, we noted that for the Original Data method, the recall of all nine classifiers was reduced, except for the LR and Gradient Boosting classifiers, which exhibited improvements. The AFWGE method showed no change in the recall for most classifiers, except for the DT and Gradient Boosting classifiers, which experienced enhancements, while the MLP classifier’s recall was reduced. For the CERTIFAI method, the recalls of the LR, RF, DT, and Gradient Boosting classifiers were enhanced, whereas the recalls of the L-SVM, R-SVM, AdaBoost, K-NN, and MLP classifiers were reduced. It is obvious from the aforementioned results that the most important group of features selected by AFWGE had a positive effect across multiple classifiers.
Table A6. The percentage of average recall change using each method on the Breast Cancer dataset for each classifier, where a positive change is in purple, a negative change is in orange, and the best for each classifier is in bold.
Classifier | Original Data | AFWGE | CERTIFAI
L-SVM | −0.63% | 0.0% | −0.31%
R-SVM | −1.89% | 0.0% | −0.31%
LR | 1.28% | 0.0% | 0.32%
RF | −0.33% | 0.0% | 0.33%
AdaBoost | −5.35% | 0.0% | −1.26%
DT | −1.63% | 1.96% | 0.65%
K-NN | −3.46% | 0.0% | −0.63%
Gradient Boosting | 0.65% | 1.96% | 0.65%
MLP | −14.42% | −3.85% | −3.53%
Regarding statistical significance, once again, applying the Wilcoxon signed rank test to the recall before and after using the most important group of features selected by AFWGE indicated a significant difference at a significance level of α = 0.05. Thus, the average recall after using the group of features selected by AFWGE was, again, significantly better than that using all the Breast Cancer dataset’s features (Figure A6).
Figure A6. Classifier recalls on the Breast Cancer dataset before and after using AFWGE’s selected group of features.
  • Pima Indians Diabetes Dataset:
In analyzing the recall of the classifiers in Table A7 on the Diabetes dataset, we observed the following outcomes: For the Original Data method, there was no change in the recall of any of the nine classifiers. The AFWGE method resulted in enhanced recalls for the RF and MLP classifiers, reduced recalls for the AdaBoost and K-NN classifiers, and no change in the recall for the L-SVM, R-SVM, LR, DT, or Gradient Boosting classifiers. For the CERTIFAI method, the recalls of all nine classifiers were reduced, except for that of the MLP classifier, which saw an enhancement. Overall, again, the results in Table A7 indicate that using AFWGE had the most positive impact on recall across multiple classifiers.
Table A7. The percentage of average recall change using each method on the Pima Indians Diabetes dataset for each classifier, where a positive change is in purple, a negative change is in orange, and the best for each classifier is in bold.
Classifier | Original Data | AFWGE | CERTIFAI
L-SVM | 0.0% | 0.0% | −1.98%
R-SVM | 0.0% | 0.0% | −2.5%
LR | 0.0% | 0.0% | −1.55%
RF | 0.0% | 6.82% | −1.52%
AdaBoost | 0.0% | −2.33% | −0.39%
DT | 0.0% | 0.0% | −2.54%
K-NN | 0.0% | −2.7% | −3.15%
Gradient Boosting | 0.0% | 0.0% | −1.02%
MLP | 0.0% | 50.0% | 46.43%
With respect to statistical significance, when applying the Wilcoxon signed rank test to the classifiers’ recalls before and after using the most important group of features selected by AFWGE, this time, the test indicated that there was no significant difference at a significance level of α = 0.05; thus, we failed to reject the null hypothesis. Therefore, it was concluded that there was no verification that the average recall after using AFWGE was significantly better than that using all the Pima Indians Diabetes dataset’s features (Figure A7).
Figure A7. Classifier recalls on the Pima Indians Diabetes dataset before and after using the group of features selected by AFWGE.
  • Iris Dataset:
The recall results in Table A8 indicate that the classifiers trained on the Iris dataset exhibited overfitting, with recalls close to 1.0 before any important set of features was used. When the Original Data method was used, the recalls of all nine classifiers were reduced. Similarly, the AFWGE method reduced the recalls of all classifiers except the R-SVM classifier, which maintained the same recall. The CERTIFAI method also led to reduced recalls for all nine classifiers. Nevertheless, it is clear that the most important group of features selected by AFWGE produced the least negative changes across multiple classifiers.
Table A8. The percentage of average recall change using each method on the Iris dataset for each classifier, where a negative change is in orange, and the best for each classifier is in bold.
Classifier | Original Data | AFWGE | CERTIFAI
L-SVM | −10.23% | −4.17% | −5.72%
R-SVM | −8.84% | 0.0% | −2.95%
LR | −9.3% | −1.39% | −3.87%
RF | −12.12% | −1.52% | −4.88%
AdaBoost | −20.08% | −12.5% | −13.64%
DT | −19.0% | −1.39% | −7.1%
K-NN | −9.68% | −2.9% | −4.84%
Gradient Boosting | −13.01% | −4.17% | −6.65%
MLP | −10.97% | −3.17% | −5.42%
When the Wilcoxon signed rank test was applied to the AFWGE results, the outcome showed that there was a significant difference at a significance level of α = 0.05. Thus, it was verified, once again, that the average recall after using the most important group of features selected by AFWGE was significantly better than that using all the Iris dataset’s features due to a reduction in overfitting (Figure A8).
Figure A8. Classifier recalls on the Iris dataset before and after using the group of features selected by AFWGE.
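One detail worth noting for reproduction: the Iris dataset is a three-class problem, so a binary recall scorer does not apply directly and an averaged variant must be chosen. The paper does not state which averaging was used, so the macro averaging in the short sketch below is purely an assumption.

```python
# Sketch (assumption: macro averaging, 5-fold CV) of multiclass recall on the Iris dataset.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
recall_macro = cross_val_score(KNeighborsClassifier(), X, y, cv=5,
                               scoring="recall_macro").mean()
print(f"macro-averaged recall: {recall_macro:.3f}")
```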

Appendix C

F1 Score

  • Adult Dataset:
Table A9 presents the evaluation results for the F1 scores of the classifiers when the three methods were applied to the Adult dataset. For the Original Data method, there was no change in the F1 scores for any of the nine classifiers. In contrast, the AFWGE method led to enhanced F1 scores for the L-SVM, R-SVM, LR, K-NN, and MLP classifiers, while the F1 scores for the RF, AdaBoost, DT, and Gradient Boosting classifiers remained unchanged. With the CERTIFAI method, the F1 scores of the L-SVM and LR classifiers were enhanced, the F1 scores of the K-NN and MLP classifiers were reduced, and there was no change in the F1 scores of the R-SVM, RF, AdaBoost, DT, or Gradient Boosting classifiers. Again, the results here indicated that the most important group of features of AFWGE had the highest positive changes across multiple classifiers with respect to the F1 score.
Table A9. The percentage of average F1 score change using each method on the Adult dataset for each classifier, where a positive change is in purple, a negative change is in orange, and the best for each classifier is in bold.
Classifier | Original Data | AFWGE | CERTIFAI
L-SVM | 0.0% | 5.17% | 1.74%
R-SVM | 0.0% | 129.68% | 0.0%
LR | 0.0% | 5.3% | 1.59%
RF | 0.0% | 0.0% | 0.0%
AdaBoost | 0.0% | 0.0% | 0.0%
DT | 0.0% | 0.0% | 0.0%
K-NN | 0.0% | 25.3% | −0.09%
Gradient Boosting | 0.0% | 0.0% | 0.0%
MLP | 0.0% | 6.23% | −3.14%
As in the previous experiments, the results were subjected to a Wilcoxon signed rank test to check whether there was a statistically significant difference between the classifiers’ F1 scores before and after using the most important group of features of AFWGE. A significant difference at a significance level of α = 0.05 was demonstrated by the test results; thus, the null hypothesis was rejected. This was verification that the average value of the F1 scores after using AFWGE’s selected features was significantly better than that using all the Adult dataset’s features (Figure A9).
Figure A9. Classifier F1 scores on the Adult dataset before and after using the group of features selected by AFWGE.
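The F1 score is the harmonic mean of precision and recall, so the F1 changes in this appendix reflect the combined behavior of the precision and recall results reported earlier. A minimal sketch of the relationship, with purely illustrative numbers:

```python
# F1 as the harmonic mean of precision and recall (illustrative values only).
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

p_before, r_before = 0.60, 0.20
p_after,  r_after  = 0.65, 0.40   # hypothetical improvement after feature selection

f1_before, f1_after = f1(p_before, r_before), f1(p_after, r_after)
print(f"F1 before: {f1_before:.3f}, F1 after: {f1_after:.3f}")
print(f"relative change: {(f1_after - f1_before) / f1_before * 100:.1f}%")
```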
  • Breast Cancer Dataset:
Comparing the F1 scores of the classifiers on the Breast Cancer dataset using the three methods in Table A10, we noticed the following outcomes: With the Original Data method, all classifiers except L-SVM and LR showed reduced F1 scores. The AFWGE method increased the F1 scores of all classifiers except the RF, K-NN, and MLP classifiers, where a reduction was noted. In contrast, the CERTIFAI method led to enhanced F1 scores for the L-SVM, LR, DT, and Gradient Boosting classifiers, while the F1 scores for the R-SVM, RF, AdaBoost, K-NN, and MLP classifiers were reduced. As can be seen in Table A10, it is evident, again, that the most important group of features of AFWGE showed the highest positive changes in F1 scores across multiple classifiers.
Table A10. The percentage of average F1 score change using each method on the Breast Cancer dataset for each classifier, where a positive change is in purple, a negative change is in orange, and the best for each classifier is in bold.
Classifier | Original Data | AFWGE | CERTIFAI
L-SVM | 1.77% | 5.52% | 0.96%
R-SVM | −1.6% | 0.89% | −0.23%
LR | 1.3% | 5.66% | 1.47%
RF | −1.57% | −0.93% | −0.06%
AdaBoost | −2.46% | 3.7% | −0.44%
DT | −1.71% | 1.96% | 0.59%
K-NN | −3.75% | −1.89% | −1.29%
Gradient Boosting | −0.12% | 1.96% | 0.43%
MLP | −9.68% | −1.37% | −2.59%
Again, the Wilcoxon signed rank test was applied to the F1 scores before and after using the most important group of features of AFWGE, indicating a significant difference at a significance level of α = 0.05. It was evident that the average F1 scores obtained when using the features selected by AFWGE were significantly better than those obtained when using all the Breast Cancer dataset’s features (Figure A10).
Figure A10. Classifier F1 scores on the Breast Cancer dataset before and after using the group of features selected by AFWGE.
  • Pima Indians Diabetes Dataset:
Assessing the F1 scores of the classifiers using the three methods on the Diabetes dataset, we observed distinct results, as shown in Table A11. For the Original Data method, there was no change in the F1 scores of any of the nine classifiers. The AFWGE method resulted in enhanced F1 scores for the L-SVM, LR, RF, and MLP classifiers, while the F1 scores for the AdaBoost, K-NN, and Gradient Boosting classifiers were reduced, and no change was observed for the R-SVM or DT classifiers. In contrast, the CERTIFAI method led to reduced F1 scores for all nine classifiers, except for the MLP classifier, which saw an enhancement. Once again, Table A11 indicates the advantage of using the most important group of features of AFWGE, since this method exhibited the highest positive changes in F1 scores across multiple classifiers.
Table A11. The percentage of average F1 score change using each method on the Pima Indians Diabetes dataset for each classifier, where a positive change is in purple, a negative change is in orange, and the best for each classifier is in bold.
Classifier | Original Data | AFWGE | CERTIFAI
L-SVM | 0.0% | 1.53% | −1.17%
R-SVM | 0.0% | 0.0% | −1.56%
LR | 0.0% | 2.22% | −0.37%
RF | 0.0% | 3.08% | −1.4%
AdaBoost | 0.0% | −0.9% | −0.03%
DT | 0.0% | 0.0% | −1.43%
K-NN | 0.0% | −1.98% | −1.22%
Gradient Boosting | 0.0% | −6.76% | −0.8%
MLP | 0.0% | 26.11% | 21.03%
Regarding statistical significance, when the Wilcoxon signed rank test was applied to the classifiers’ F1 scores before and after using the most important group of features from AFWGE, the test indicated that there was, indeed, a significant difference at a significance level of α = 0.05. In other words, the average F1 score after using the AFWGE’s selected features was significantly better than that using all the Pima Indians Diabetes dataset’s features (Figure A11).
Figure A11. Classifier F1 scores on the Pima Indians Diabetes dataset before and after using the group of features selected by AFWGE.
  • Iris Dataset:
All the models used on the Iris dataset had very high F1 scores, almost equal to 1.0, before the selected group of features was used, which indicated overfitting. However, as shown in Table A12, when the Original Data method was used, the F1 scores for all nine classifiers were reduced. Similarly, the AFWGE method reduced the F1 scores of all classifiers except the R-SVM classifier, which maintained the same F1 score. The CERTIFAI method also led to reduced F1 scores for all nine classifiers. Nevertheless, from Table A12, it is, again, evident that the most important group of features of AFWGE provided the smallest negative changes across multiple classifiers.
Table A12. The percentage of average F1 score change using each method on the Iris dataset for each classifier, where a negative change is in orange, and the best for each classifier is in bold.
Classifier | Original Data | AFWGE | CERTIFAI
L-SVM | −10.31% | −4.38% | −5.87%
R-SVM | −8.85% | 0.0% | −2.95%
LR | −9.33% | −1.45% | −3.91%
RF | −12.1% | −1.46% | −4.85%
AdaBoost | −20.89% | −14.84% | −15.21%
DT | −18.81% | −1.45% | −7.08%
K-NN | −9.68% | −2.9% | −4.84%
Gradient Boosting | −13.08% | −4.38% | −6.79%
MLP | −11.01% | −3.2% | −5.45%
We used the Wilcoxon signed rank test to see if there was a statistically significant difference between the classifiers’ F1 scores before and after using the most important group of features of AFWGE. This time, the test results also confirmed that there was a significant difference at a significance level of α = 0.05. Therefore, it was concluded that the average F1 scores after using the group of features selected by AFWGE were significantly better than those using all the Iris dataset’s features, since it reduced overfitting (Figure A12).
Figure A12. Classifier F1 scores on the Iris dataset before and after using the group of features selected by AFWGE.
