Towards a More Reliable Interpretation of Machine Learning Outputs for Safety-Critical Systems using Feature Importance Fusion

When machine learning supports decision-making in safety-critical systems, it is important to verify and understand the reasons why a particular output is produced. Although feature importance calculation approaches assist in interpretation, there is a lack of consensus regarding how feature importance is quantified, which makes the explanations offered for the outcomes mostly unreliable. A possible solution to address the lack of agreement is to combine the results from multiple feature importance quantifiers to reduce the variance of the estimates. Our hypothesis is that this will lead to more robust and trustworthy interpretations of the contribution of each feature to machine learning predictions. To test this hypothesis, we propose an extensible Framework divided into four main parts: (i) traditional data pre-processing and preparation for predictive machine learning models; (ii) predictive machine learning; (iii) feature importance quantification; and (iv) feature importance decision fusion using an ensemble strategy. We also introduce a novel fusion metric and compare it to the state of the art. Our approach is tested on synthetic data, where the ground truth is known. We compare different fusion approaches and their results for both training and test sets. We also investigate how different characteristics within the datasets affect the feature importance ensembles studied. Results show that our feature importance ensemble Framework overall produces 15% less feature importance error compared to existing methods. Additionally, results reveal that different levels of noise in the datasets do not affect the feature importance ensembles' ability to accurately quantify feature importance, whereas the feature importance quantification error increases with the number of features and the number of orthogonal informative features.


Introduction
Extensive advances in Machine Learning (ML) have demonstrated its potential to successfully address complex problems in safety-critical areas, such as healthcare [1,2,3], aerospace [4,5,6], driver distraction [7,8,9], civil engineering [10,11], and manufacturing [12,13]. Historically, however, many ML models, especially those involving neural networks, are viewed as 'black boxes', where little is known about how the decision-making process takes place. The lack of adequate interpretability and verification of ML models [14,15] has therefore prevented an even wider adoption and integration of those approaches in high-integrity systems. For those domains, where mistakes are unacceptable due to the safety, security and economic issues they may cause, the need to accurately interpret the predictions and inference of ML models becomes imperative. The lack of consensus on how to quantify the importance of data attributes for ML decision making is a problem for safety-critical systems, as the explanation offered for the outcomes obtained is likely to be unreliable.
There is therefore the need for more reliable and accurate ways of establishing feature importance. One possible strategy is to combine the results of multiple feature importance quantifiers to reduce the variance of the estimates, leading to a more robust and trustworthy interpretation of the contribution of each feature to the final ML model prediction. In this paper, we propose a general, adaptable and extensible Framework, which we have named Multi-Method Ensemble (MME), for feature importance fusion, with the objective of reducing the variance in current feature importance estimates. The MME Framework is divided into four main parts: 1. the application of traditional data pre-processing and preparation approaches for computational modelling; 2. predictive modelling using ML approaches, such as random forest (RF), gradient boosted trees (GBT), support vector machines (SVM) and deep neural networks (DNN); 3. feature importance calculation for the ML models adopted, including values obtained from Permutation Importance (PI) [23], SHapley Additive exPlanations (SHAP) [24] and Integrated Gradients (IG) [25]; 4. feature importance fusion using an ensemble strategy.
For part 4, we also introduce a fusion metric, namely Rank correlation with mAjority voTE (RATE), and compare its performance to existing ensemble methods from the literature. The Framework and the ensemble strategies are tested on datasets with different noise levels, numbers of important features and total numbers of features. To ensure our results are reliable, we conduct our experiments on synthetic data, where the features and their importance are determined beforehand. We compare the performance of our Framework to the more common approach that ensembles feature importance using a single feature importance method with multiple ML models, which we name Single Method Ensemble (SME). Additionally, we contrast and evaluate feature importance obtained from both training and test datasets to assess how sensitive ensemble feature importance methods are to the sets investigated and how much agreement is achieved. We also explore how different characteristics within the data affect feature importance ensembles.
The remainder of the paper is structured as follows: Section 2 provides background on different feature importance techniques and reviews the current state of ensemble feature importance. Section 3 then introduces the methodology, describing how the datasets are generated and presenting the proposed ensemble technique. Section 4 presents the results, along with a discussion of how different dataset characteristics affect the feature importance techniques and the performance of the proposed ensemble feature importance. Finally, Section 5 draws conclusions and suggests potential future work.

Background
Early approaches to feature importance quantification utilise interpretable models, such as linear regression [26] and logistic regression [27], or ensembles, such as generalised linear models (GLM) [28] and decision trees (DT) [29], to determine how each feature contributes to the model's output [30]. As data problems become more challenging and convoluted, simpler, interpretable models need to be replaced by complex ML solutions, for which interpreting predictions without the use of additional tools becomes far more difficult. Model-agnostic interpretation methods are commonly used to help determine feature importance for complex ML models. They are a class of techniques that determine feature importance while treating models as black-box functions. Our objective in this paper is to propose a Framework for ensembling the feature importance estimates produced by these tools. In this section we review the current literature on ensemble feature importance, including the basic concepts and the rationale for choosing the model-agnostic approaches used in our Framework experiments.

Related Work
One of the earliest ensemble techniques to calculate feature importance, Random Forest (RF), was proposed by Breiman [29]. RF is an ML model built from an ensemble of decision trees via random subspace methods [31]. Besides prediction, RF computes the overall feature importance by averaging the importance determined by each decision tree in the ensemble. RF quantifies the importance of a feature according to how often the feature is used to branch in the decision trees, based on the Gini impurity metric. Alternatively, decision trees can also calculate feature importance as Mean Decrease Accuracy (MDA), more commonly known as PI, by permuting the subset of features in each decision tree and measuring how much the accuracy decreases as a consequence. Building on the idea of ensembling feature importance from weak learners, De Bock et al. [32] propose an ensemble learning method based on generalised additive models (GAM) to estimate feature importance and the confidence intervals of the prediction output. Similarly to bagging, the ensemble predictions are generated by averaging the outputs of the weaker additive models. The feature importance scores are generated using the following steps: (i) generate outputs and calculate the performance of individual predictions based on a specific performance criterion; (ii) permute each feature and recalculate the error of the out-of-bag (OOB) predictions; (iii) calculate the partial importance score based on the OOB predictions; and (iv) repeat steps (i) to (iii) for each additive model and each form of evaluation. The authors argue that the importance of each feature should be estimated according to the performance criteria most relevant to the application in order to obtain the most accurate feature importance score. The GAM ensemble-based feature importance is subsequently applied to identify essential features in churn prediction, that is, to determine which customers are likely to stop paying for particular goods and services. To determine the ten most relevant features, the authors use the Receiver Operating Characteristic (ROC) and top-decile lift. The authors observed that the sets of important features overlapped, but their rank order differed when using ROC and lift. The different rank orders show that the evaluation criterion used influences the resulting feature ranking. Zhai and Chen ensemble feature importance across multiple ML models, quantifying importance with Mean Decrease Impurity (MDI), which is based on the Gini importance; SFS is based on maximum feature scores using the Bayesian Information Criterion [34]. The top ten features are selected for evaluation. While Zhai and Chen used multiple ML models and one feature importance approach, to the best of our knowledge there have been no further investigations into improving feature importance quantification using multiple models and multiple feature importance methods. Finally, as this section shows, there have been few in-depth, systematic investigations of how ensemble feature importance fusion works. Therefore, it is imperative that we investigate interpretability methods and ensemble feature importance fusion under different data conditions.

Permutation Importance

The magnitude of the difference between the baseline performance and the error in Algorithm 1 signifies the importance of a feature. A feature has high importance if the performance of the ML model deviates significantly from the baseline after shuffling; it has low importance if the performance does not change significantly. PI can be run on training or test data, but test data is usually chosen because it avoids retraining the ML models and therefore saves computational overhead. If the computational cost is not an important consideration, a drop-feature importance approach can be adopted to achieve greater accuracy.
This is because in PI there is a possibility that a shuffled feature does not differ much from the unshuffled feature for some instances. In contrast, drop-feature importance excludes a feature entirely, as opposed to permuting it. However, as it requires the ML model to be retrained every time a different feature is dropped, it is computationally expensive for high-dimensional data. In this paper we use PI over drop-feature importance to reduce the computational cost.

Additionally, we choose PI over MDI based on the Gini importance. MDI tends to disproportionately increase the importance of continuous or high-cardinality categorical variables [23], leading to biased and unreliable feature importance measurements. PI is a model-agnostic approach to feature importance and can be used with GBT, RF, and DNN.
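
For illustration, permutation importance can be computed on held-out data using scikit-learn's permutation_importance; the following minimal sketch is illustrative only and is not tied to our exact experimental setup.

```python
# Illustrative permutation importance on held-out data (sketch, not the
# exact pipeline used in this paper).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Shuffle each feature several times and record the drop in score;
# a large drop in performance indicates an important feature.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)
```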

Shapley Additive Explanations
In addition to PI and MDA, another widespread model-agnostic feature importance approach is SHAP. SHAP is an ML interpretability method based on Shapley values from cooperative game theory. Following Equation 1, the SHAP value of a feature x of an instance is the feature's marginal contribution to the model output, averaged over all possible coalitions of the remaining features; the process is repeated for each feature to obtain the feature importance. Another model-agnostic feature importance method is Local Interpretable Model-agnostic Explanations (LIME) [37]. We did not choose LIME as a model interpretation method because it does not provide a good global approximation of feature importance and is only able to provide feature importance for individual instances. Furthermore, LIME is sensitive to small perturbations in the input, leading to different feature importance for similar input data [38].
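
To make the Shapley attribution idea concrete, the brute-force sketch below averages a feature's marginal contribution over all coalitions of the remaining features for a toy model; practical SHAP implementations [24] approximate this efficiently, and the toy model and background values here are purely illustrative.

```python
# Toy illustration of the Shapley-value idea behind SHAP: a feature's
# importance is its marginal contribution to the model output, averaged
# over all coalitions of the remaining features, with missing features
# replaced by a background value. This brute-force version is exponential
# in the number of features and is for illustration only.
from itertools import combinations
from math import factorial

import numpy as np

def model(x):                               # toy "black-box" model
    return 3 * x[0] + 2 * x[1] * x[2]

def shapley_values(x, background):
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in combinations(others, size):
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                with_i = background.copy()
                without_i = background.copy()
                idx = list(S)
                with_i[idx + [i]] = x[idx + [i]]
                without_i[idx] = x[idx]
                phi[i] += weight * (model(with_i) - model(without_i))
    return phi

x = np.array([1.0, 2.0, 0.5])
background = np.zeros(3)                    # information-less reference point
print(shapley_values(x, background))        # attributions sum to f(x) - f(background)
```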

Integrated Gradients
IG is a gradient-based method for feature importance. It determines the importance of the features of a deep learning model by calculating the change in output, f(x), relative to the change in input x. The change in the input features is measured against an information-less baseline, b. In the case of regression, the baseline input is a vector of zeros, so that the baseline prediction is neutral and functions as a counterfactual. The importance of a feature, denoted by the subscript i, is given by the difference between the deep learning model's output when the input features are used and when the baseline is used. This difference can also be written in a gradient-based form: IG obtains the feature importance values by accumulating the gradients of the output with respect to the features, interpolated between the baseline and the input. The interpolation is controlled by a constant, α, ranging from zero to one, giving

IG_i(x) = (x_i - b_i) \int_{\alpha=0}^{1} \frac{\partial f(b + \alpha(x - b))}{\partial x_i} \, d\alpha,

which is the final form of IG used to calculate feature importance in a deep learning model.
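
The following numerical sketch approximates the IG integral with a Riemann sum for a toy differentiable function; a practical implementation would use the deep learning framework's automatic differentiation rather than the finite-difference gradients assumed here.

```python
# Numerical sketch of Integrated Gradients for a toy differentiable function f;
# a real implementation would use the deep learning framework's automatic
# differentiation rather than the finite-difference gradients used here.
import numpy as np

def f(x):                                    # toy "model"
    return np.sin(x[0]) + x[1] ** 2 + 0.5 * x[2]

def grad(x, eps=1e-5):                       # numerical gradient of f at x
    g = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = eps
        g[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return g

def integrated_gradients(x, baseline, steps=100):
    # Riemann-sum approximation of (x_i - b_i) * integral_0^1 df(b + a(x - b))/dx_i da
    alphas = np.linspace(0.0, 1.0, steps)
    avg_grad = np.mean([grad(baseline + a * (x - baseline)) for a in alphas], axis=0)
    return (x - baseline) * avg_grad

x = np.array([0.8, 1.5, -2.0])
baseline = np.zeros_like(x)                  # information-less baseline
attributions = integrated_gradients(x, baseline)
print(attributions, attributions.sum(), f(x) - f(baseline))  # sum ~ f(x) - f(b)
```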

The Proposed Feature Importance Fusion Framework
This section introduces our proposed MME Framework, in which we employ multiple ML models and multiple feature importance methods. We describe the different ensemble strategies investigated, including the mean, median, mode, box-whiskers, Modified Thompson Tau test, and majority vote, and we present Rank correlation with mAjority voTE (RATE) fusion as a new ensemble strategy.

Ensemble Strategies
Within our MME Framework, the feature importance values calculated by each combination of ML model and feature importance method are stored in a matrix, V. The ensemble estimate can then be obtained from V in several ways, for example by taking the mean, median, or mode of the importance values for each feature, by removing outliers with the box-whiskers method, or by applying the Modified Thompson Tau test, a statistical anomaly detection method that uses a t-test to eliminate values lying more than two standard deviations away.
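
As a sketch of how the rows of V can be fused, the snippet below computes the mean and median ensembles and a simplified two-standard-deviation outlier filter in the spirit of the Modified Thompson Tau test (the full test uses a critical value from the t-distribution); the matrix values are illustrative.

```python
# Sketch of fusing the feature importance matrix V (N method/model pairs by
# M features): per-feature mean and median, plus a simplified outlier filter
# in the spirit of the Modified Thompson Tau test (values further than two
# standard deviations from the mean are discarded before averaging).
import numpy as np

V = np.array([[0.40, 0.30, 0.05, 0.25],     # e.g. importance from RF + PI
              [0.45, 0.28, 0.02, 0.25],     # e.g. importance from GBT + SHAP
              [0.10, 0.35, 0.05, 0.50]])    # e.g. importance from DNN + IG

ensemble_mean = V.mean(axis=0)
ensemble_median = np.median(V, axis=0)

def filtered_mean(column, n_std=2.0):
    """Mean of the column after dropping values > n_std standard deviations away."""
    mu, sigma = column.mean(), column.std()
    kept = column[np.abs(column - mu) <= n_std * sigma] if sigma > 0 else column
    return kept.mean()

ensemble_filtered = np.array([filtered_mean(V[:, j]) for j in range(V.shape[1])])
print(ensemble_mean, ensemble_median, ensemble_filtered)
```
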
RATE is our novel fusion approach, combining a statistical test, feature ranks and majority voting. It combines the advantages of using a statistical approach for ranking feature importance and removing anomalies with those of a majority vote.
The steps used in RATE are illustrated in Figure 3. The input to RATE is the feature importance matrix V, which contains the individual feature importance estimates from the different models and has shape N × M, where N is the number of feature importance vectors obtained from the different importance calculation methods and M is the number of features. We calculate the pairwise rank correlation between the feature importance vectors in V to obtain the ranks and the general correlation coefficient values [39]. Using the ranks and correlation values, we determine whether the correlation between each pair of vectors is statistically significant (p-value less than 0.05); a majority vote over these pairwise tests then determines which feature importance vectors agree with the others and are retained before the remaining vectors are fused.
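
One possible reading of RATE is sketched below using Kendall's rank correlation from SciPy, the 0.05 significance threshold mentioned above, and a majority vote over the pairwise tests; details beyond what is stated in the text are assumptions.

```python
# One possible RATE-style fusion: pairwise Kendall rank correlations between
# the importance vectors in V, a majority vote on which vectors are
# significantly correlated (p < 0.05) with the others, and an average of the
# surviving vectors. Details beyond the text are assumptions.
import numpy as np
from scipy.stats import kendalltau

def rate_fusion(V, alpha=0.05):
    n = V.shape[0]
    agreements = np.zeros(n, dtype=int)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            tau, p = kendalltau(V[i], V[j])
            if p < alpha and tau > 0:        # significant positive rank agreement
                agreements[i] += 1
    keep = agreements >= (n - 1) / 2         # majority vote across pairwise tests
    kept = V[keep] if keep.any() else V      # fall back to all vectors if none agree
    return kept.mean(axis=0)

V = np.random.rand(6, 20)                    # 6 model/method pairs, 20 features
print(rate_fusion(V))
```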

Experimental Design
This section introduces our experimental design, including the benchmark datasets and how their generation is modulated. Subsequently, we discuss the data pre-processing, ML models employed, the evaluation metrics used and the experiments conducted.

Data Generation
The data investigated are generated using Python's Scikit-learn library [40], with different characteristics to mimic a variety of real-world scenarios; the characteristics of the generated datasets are summarised in Table 1. We add Gaussian-distributed noise with different standard deviations to the output, as noise in the output has a more significant effect on prediction accuracy than noise in the features [41]. Noise also increases the ML models' estimation error [41,42]. The data pre-processing in our case consists of normalising the input data to the range between zero and one. Normalisation accelerates the learning process and reduces model error for neural networks [43]. Additionally, it allows all features to be weighted equally and therefore reduces bias during learning.
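
A sketch of this kind of data generation is shown below; the exact parameter grid of Table 1 is not reproduced, only the mechanism, and the parameter values are illustrative.

```python
# Sketch of the synthetic data generation mechanism; the parameter values are
# illustrative and do not reproduce the grid of Table 1.
from sklearn.datasets import make_regression
from sklearn.preprocessing import MinMaxScaler

# n_informative controls how many features truly drive the output;
# noise is the standard deviation of Gaussian noise added to the output.
X, y, coef = make_regression(n_samples=1000, n_features=50, n_informative=20,
                             noise=2.0, coef=True, random_state=0)

ground_truth_importance = abs(coef) / abs(coef).sum()   # known feature importance
X = MinMaxScaler().fit_transform(X)                     # normalise inputs to [0, 1]
print(ground_truth_importance.max(), X.min(), X.max())
```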

Machine Learning Models
The ML approaches employed in our experiments are RF, GBT, and DNN; the hyperparameters adopted are shown in Table 2. We optimise the models using random hyperparameter search [44], using conditions at the upper and lower limits of the parameters used for dataset generation. The models are not optimised for individual datasets: we keep the hyperparameters constant because hyperparameter tuning might itself be a factor that affects feature importance accuracy. We therefore limit our objectives to investigating how data characteristics affect interpretability methods and how an appropriate fusion of different feature importance methods produces less biased estimates. Furthermore, we try to minimise overfitting in the deep learning model by adopting dropout [45] and L2 kernel regularisation [46].
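
For illustration, a DNN with dropout and L2 kernel regularisation can be defined in Keras as below; the layer sizes, dropout rate and regularisation strength are placeholders rather than the values of Table 2.

```python
# Illustrative DNN with dropout and L2 kernel regularisation; the layer sizes,
# dropout rate and regularisation strength are placeholders, not Table 2 values.
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_dnn(n_features):
    model = keras.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(64, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4)),
        layers.Dropout(0.2),                 # reduce overfitting
        layers.Dense(64, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4)),
        layers.Dropout(0.2),
        layers.Dense(1),                     # regression output
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

dnn = build_dnn(n_features=50)
dnn.summary()
```
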
The feature importance methods employed by each ML model are listed in Table 3. For SHAP, we employ weighted k-means to summarise the data before estimating the SHAP values, with each cluster weighted by the number of points it represents. Summarising the data with k-means has the advantage of a lower computational cost, at the expense of slightly decreasing the accuracy of the SHAP values.
However, we compared the SHAP values obtained with and without k-means summarisation for several datasets and found them to be almost identical.
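
A minimal sketch of this summarisation step with the shap library is shown below; the model, data and number of clusters are illustrative and not those of our experiments.

```python
# Sketch of summarising the background data with weighted k-means before
# computing SHAP values; the model, data and number of clusters are illustrative.
import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=20, n_informative=8, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

background = shap.kmeans(X, 50)                        # 50 weighted cluster centres
explainer = shap.KernelExplainer(model.predict, background)
shap_values = explainer.shap_values(X[:20])            # SHAP values for a small sample
global_importance = np.abs(shap_values).mean(axis=0)   # mean |SHAP| per feature
print(global_importance)
```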

Evaluation Metrics
To evaluate the performance of the ensemble feature importance methods, as well as of the individual feature importance methods, we compare the estimated feature importance against the known ground-truth importance of the synthetic data using the mean absolute error (MAE); results for the root mean squared error (RMSE) and the coefficient of determination (R²) are reported in the supplementary material.
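
A minimal sketch of this evaluation, assuming an estimated importance vector and the known ground-truth importance of a synthetic dataset, is:

```python
# Minimal sketch of the evaluation: compare an estimated importance vector
# against the known ground-truth importance of a synthetic dataset.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

ground_truth = np.array([0.40, 0.30, 0.20, 0.10, 0.00])   # known importance
estimated = np.array([0.35, 0.32, 0.18, 0.12, 0.03])      # importance from an ensemble

mae = mean_absolute_error(ground_truth, estimated)
rmse = np.sqrt(mean_squared_error(ground_truth, estimated))
r2 = r2_score(ground_truth, estimated)
print(mae, rmse, r2)
```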

Single-Method Ensemble vs our Multi-Method Ensemble Framework
To compare the results of SME and our MME Framework, Figure 4 shows the average MAE of all SME methods and of the multiple fusion implementations of our MME Framework across all datasets. Results using the RMSE and R² metrics are shown in the supplementary material. The ensemble method with the least feature importance error is our MME Framework using majority vote for fusion, followed by the MME Framework with mean and with RATE. SME methods such as SHAP, PI and IG produce the worst results. The circles and bars in the figure represent the feature importance errors on the training and test datasets, respectively. The feature importance errors on the training dataset are slightly lower than the errors on the test dataset.

Effect of Noise Level, Informative Level, and Number of Features on All Feature Importance

Figure 5 shows the feature importance results of SME and of the MME Framework with different fusion strategies, averaged across three different noise levels in the data. RMSE and R² results can be found in the supplementary material.
The best performing ensemble method averaged across all noise levels is the MME Framework using majority vote, which outperforms the best SME method, SHAP, by 14.2%. In addition, Table 4 and Figure 6 show how the feature importance errors of all SME and MME Framework methods change as the noise level increases. In Table 5 we observe that the MAE decreases marginally from a noise level of 0 standard deviations to 2 standard deviations, and then increases again at 4 times the standard deviation.
The addition of noise at 2 standard deviations to the dataset improves the generalisation performance of the ML models, leading to lower errors [47]. However, the feature importance errors increase when the noise in the dataset reaches 4 times the standard deviation, indicating that this noise level has negatively impacted the ML models' performance. Overall, however, the noise levels have little effect on the feature importance errors. Results also reveal that the MME Framework with majority vote achieves the best feature importance estimates for our data.

Figure 7 presents the feature importance MAE for all SME and MME Framework methods for the different feature informative levels. The best performing method is MME with majority vote, followed by MME with mean and with RATE. All MME Framework methods except for the one with mode outperform SME by more accurately quantifying the feature importance. The error bars (standard deviations) for the effect of feature informative levels in Figure 7 have a smaller range than those for the effect of noise and the effect of the number of features. The low standard deviation indicates that the feature informative levels explain most of the variance. From Table 5 we observe that when 20% of the features are informative, the best feature importance methods are the MME Framework with the Modified Thompson Tau test and with the median, and they outperform the best SME method, SHAP, by 9.1%. For a 40% informative level, the MME Framework with RATE (Kendall and Spearman) and with the mean have the lowest error, with a 6.9% improvement over SME with SHAP. For a 60% informative level, the MME Framework using RATE (Kendall and Spearman) has the best results, with a 4.1% improvement over SME with SHAP. Furthermore, the MME Framework with majority vote obtains the lowest error for both the 80% and 100% informative levels. The best performing SME uses PI; MME with majority vote outperforms the best SME results by 25.4% and 23.0% for the 80% and 100% informative levels, respectively.

The error bars for the effect of the number of features in Figure 9 have a smaller range than those for the effect of noise, but a larger range than those for the effect of informative levels. From Table 6 we observe that for 20 and 40 features the method with the lowest feature importance errors is the MME Framework with majority vote, which outperforms the best SME method, SHAP, by 8.2% and 26.2%, respectively. For 100 features, the MME Framework using mode is the most accurate method, outperforming the best SME method, PI, by 20.5%. For SME, PI has lower feature importance errors than SHAP as the number of informative features and the total number of features increase.

Discussion
Overall, the results show that our MME Framework outperforms SME for our case studies, with the MME Framework using majority vote as the ensemble strategy producing the most accurate feature importance estimates. One advantage of using our MME Framework is that it avoids the worst-case scenario for feature importance estimates. This can also be a disadvantage, as the best performing feature importance estimate is moderated by worse methods. However, for real-world, safety-critical systems where the ground truth is unknown, it is important not to rely solely on a single method and the potential bias it might produce. In scenarios where the feature importance values determined by the different methods disagree, the safest option is to further investigate the reasons for the disagreement.
For our experiments, we keep the hyperparameters of each ML model constant, as case-specific optimisation is likely to affect the feature importance estimates for different data characteristics. In the future, the interplay between hyperparameter optimisation and feature importance should be investigated further. In addition, other factors such as covariate drift in the test dataset and imbalanced datasets should be included in the framework tests.

Conclusions and Future Work
In this paper, we presented MME, a novel framework to perform feature importance fusion using an ensemble of different ML models and different feature importance methods. Overall, our framework performed better than the traditional single-method ensemble approach, although when the number of features and the number of informative features are low, the performance of the MME Framework is only marginally better than that of SME.
We showed that the MME Framework avoids the worst-case scenario in predicting feature importance. Despite its advantages, there are several shortcomings in the proposed method and in our experimental approach that can be addressed in future work. One improvement would be to include more feature importance methods and ML models in the MME Framework, to investigate their impact on the variance and on the overall consensus of the importance estimates. Another possibility is to integrate the feature importance ensemble into a neural network to internally bias the network towards the relevant features, decreasing training time and improving accuracy.
Finally, on the experimental side, several other factors that could potentially affect the feature importance estimates should be investigated, such as hyperparameter optimisation of the ML models, imbalance in the datasets, the presence of anomalies in the data, and the efficacy of the framework on real-world, complex data.

Table 2: Summary of feature importance R² between different SME and MME for different noise levels.
Table 4: Summary of feature importance R² between different SME and MME for different percentages of informative level.
Table 6: Summary of feature importance R² between different SME and MME for different numbers of features.