# Towards a More Reliable Interpretation of Machine Learning Outputs for Safety-Critical Systems Using Feature Importance Fusion


## Abstract


## 1. Introduction

- The application of traditional data pre-processing and preparation approaches for computational modelling;
- Predictive modelling using ML approaches, such as random forest (RF), gradient-boosted trees (GBT), support vector machines (SVM), and deep neural networks (DNN);
- Feature importance fusion strategy using ensembles.

## 2. Related Work

## 3. Background

#### 3.1. Permutation Importance

Algorithm 1: Permutation importance algorithm.
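Permutation importance scores a feature by how much the model's error grows when that feature's column is randomly shuffled, breaking its relationship with the target. A minimal model-agnostic sketch (the function name and the assumption that `metric` is an error measure where lower is better are illustrative, not the paper's exact algorithm):

```python
import numpy as np

def permutation_importance(model, X, y, metric, n_repeats=5, seed=0):
    """Model-agnostic permutation importance (illustrative sketch).

    Importance of feature j = average increase in the error metric when
    column j of X is randomly permuted, relative to the unpermuted baseline.
    """
    rng = np.random.default_rng(seed)
    baseline = metric(y, model.predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        scores = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # permute one column only
            scores.append(metric(y, model.predict(X_perm)))
        # For an error metric (lower is better), importance is the average
        # increase in error caused by the permutation.
        importances[j] = np.mean(scores) - baseline
    return importances
```

Shuffling a feature the model ignores leaves the error unchanged, so its importance is zero by construction.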

#### 3.2. Shapley Additive Explanations

#### 3.3. Integrated Gradients

## 4. Methodology

#### 4.1. The Proposed Feature Importance Fusion Framework

#### 4.1.1. Ensemble Feature Importance

#### 4.1.2. Ensemble Strategies

#### 4.2. Data Generation

#### 4.3. Machine Learning Models

#### 4.4. Evaluation Metrics

## 5. Results and Discussion

#### 5.1. Single-Method Ensemble vs. Our Multi-Method Ensemble Framework

#### 5.2. Effect of Noise Level, Informative Level, and Number of Features on All Feature Importance

#### 5.3. Discussion

## 6. Conclusions and Future Work

## Supplementary Materials

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References


**Figure 1.** A graph representation of the power set for features $\{x,y,z\}$. The Ø symbol represents the null set, corresponding to the average of all model outputs. Each vertex represents a possible combination of features, and each edge represents the addition of a feature not included in the preceding combination.
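The Shapley value of a feature averages its marginal contribution over every coalition in this power set. A brute-force sketch for small feature sets (cost grows as $2^n$, which is why SHAP uses approximations in practice; the `value_fn` coalition interface here is an illustrative assumption, not the SHAP library API):

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, features):
    """Exact Shapley values by power-set enumeration (illustrative only).

    value_fn(S) returns the model's value for a coalition S (a frozenset);
    value_fn(frozenset()) is the null set Ø, i.e. the average output.
    """
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                S = frozenset(S)
                # Shapley weight for a coalition of size |S|
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                # Marginal contribution of f to coalition S
                total += weight * (value_fn(S | {f}) - value_fn(S))
        phi[f] = total
    return phi
```

For an additive game the Shapley values recover each feature's own contribution exactly, and they always satisfy the efficiency axiom: the values sum to `value_fn(all features) - value_fn(Ø)`.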

**Figure 2.** The four stages of the proposed feature importance fusion (MME) Framework. The first stage pre-processes the data; the second trains multiple ML models on the data. The third stage calculates feature importance from each trained ML model using multiple feature importance methods. Finally, the fourth stage fuses all feature importance values generated in the third stage using an ensemble strategy to produce the final feature importance values.
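The four stages can be sketched with scikit-learn, here assuming the mean ensemble strategy for the final fusion step (the models, data sizes, and normalisation shown are illustrative choices, not the paper's exact configuration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.inspection import permutation_importance

# Stage 1: pre-process the data
X, y = make_regression(n_samples=300, n_features=5, n_informative=3, random_state=0)
X = StandardScaler().fit_transform(X)

# Stage 2: train multiple ML models
models = [RandomForestRegressor(random_state=0).fit(X, y),
          GradientBoostingRegressor(random_state=0).fit(X, y)]

# Stage 3: one feature importance vector per (model, method) pair
fi = [permutation_importance(m, X, y, n_repeats=5, random_state=0).importances_mean
      for m in models]

# Stage 4: fuse with an ensemble strategy (mean) and normalise to sum to 1
fused = np.mean(fi, axis=0)
fused = np.abs(fused) / np.abs(fused).sum()
```

In the full framework, stage 3 would also include SHAP (and Integrated Gradients for the DNN), giving more vectors to fuse in stage 4.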

**Figure 3.** The workings of the RATE feature importance ensemble strategy. The feature importance (FI) vectors undergo pairwise rank-correlation comparisons to determine whether the similarity between FI vectors is statistically significant (p-value < 0.05). A value of ‘TRUE’ is entered in a truth table if the two vectors are similar; otherwise, ‘FALSE’ is entered. Each row of the truth table then goes through a majority vote to determine whether the corresponding FI vector is included when calculating the final FI.
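A sketch of this strategy, under stated assumptions: Kendall's tau as the rank correlation, a one-sided test so anti-correlated vectors do not count as similar, self-agreement counted as TRUE, and the mean of the surviving vectors as the final fusion (the `alternative` argument requires SciPy ≥ 1.8):

```python
import numpy as np
from scipy.stats import kendalltau

def rate_fusion(fi_vectors, alpha=0.05):
    """Sketch of the RATE ensemble strategy of Figure 3 (assumed details
    noted above; not the paper's exact implementation).

    Pairs of feature importance vectors are compared with a rank
    correlation test; a significant positive correlation (p < alpha)
    marks the pair TRUE in a truth table. A vector is kept if a majority
    of its row is TRUE; the final FI is the mean of the kept vectors.
    """
    fi = np.asarray(fi_vectors, dtype=float)
    n = len(fi)
    truth = np.eye(n, dtype=bool)              # a vector always agrees with itself
    for i in range(n):
        for j in range(i + 1, n):
            _, p = kendalltau(fi[i], fi[j], alternative="greater")
            truth[i, j] = truth[j, i] = p < alpha
    keep = truth.sum(axis=1) > n / 2           # majority vote per row
    if not keep.any():                         # fall back to using everything
        keep[:] = True
    return fi[keep].mean(axis=0)
```

With three rank-consistent vectors and one reversed outlier, the outlier's row fails the majority vote and it is excluded from the final mean.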

**Figure 4.**Average feature importance error between SME and our MME Framework with the training and test datasets.

**Table 1.** Parameters used to generate the synthetic datasets.

| Parameters | Description | Parameters’ Value |
|---|---|---|
| Noise | Standard deviation of Gaussian noise applied to the output. | 0, 2, 4 |
| Informative level (%) | Percentage of informative features. Non-informative features do not contribute to the output. | 20, 40, 60, 80, 100 |
| Number of features | Total number of features used to generate output values. | 20, 60, 100 |
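These parameters map naturally onto scikit-learn's `make_regression`, used here as an assumed stand-in for the paper's data generator (the sample count is an illustrative choice):

```python
from sklearn.datasets import make_regression

def generate_dataset(n_features=20, informative_pct=40, noise_std=2.0, seed=0):
    """Synthetic regression data following the Table 1 parameters.

    noise_std is the standard deviation of Gaussian noise added to the
    output; non-informative features do not contribute to the output.
    """
    n_informative = int(n_features * informative_pct / 100)
    X, y, coef = make_regression(
        n_samples=1000,
        n_features=n_features,
        n_informative=n_informative,
        noise=noise_std,          # std of Gaussian noise on the output
        coef=True,                # ground-truth coefficients, useful as
                                  # reference importances for evaluation
        random_state=seed,
    )
    return X, y, coef
```

Returning `coef` makes it possible to compute the MAE between estimated and ground-truth feature importances reported in the results tables.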

**Table 2.**Hyperparameter values for Random Forest, Gradient Boosted Trees, Support Vector Regression, and Deep Neural Network for all experiments.

| Models | Hyperparameters | Values |
|---|---|---|
| Random Forest | Number of trees | 700 |
| | Maximum depth of trees | 7 levels |
| | Minimum samples before split | 2 |
| | Maximum features | $\sqrt{p}$ |
| | Bootstrap | True |
| Gradient Boosted Trees | Number of trees | 700 |
| | Learning rate | 0.1 |
| | Maximum depth of trees | 7 levels |
| | Loss function | Least squares |
| | Maximum features | $\sqrt{p}$ |
| | Splitting criterion | Friedman MSE |
| Support Vector Regressor | Kernel | Linear |
| | Regularisation parameter | 2048 |
| | Gamma | 1 × ${10}^{-7}$ |
| | Epsilon | 0.5 |
| Deep Neural Network | Number of layers | 8 |
| | Number of nodes for each layer | 64, 64, 32, 16, 8, 6, 4, 1 |
| | Activation function for each layer | ReLU, except linear for the output layer |
| | Loss function | MSE |
| | Optimiser | Rectified Adam with LookAhead |
| | Learning rate | 0.001 |
| | Kernel regulariser | L2 (0.001) |
| | Dropout | 0.2 |
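For reference, the three non-neural models can be instantiated with the Table 2 hyperparameters in recent scikit-learn versions (the DNN is omitted, as it requires a deep-learning framework; $\sqrt{p}$ maps to `max_features="sqrt"`, and the `"squared_error"` loss name assumes scikit-learn ≥ 1.0):

```python
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR

# Random Forest: 700 trees, depth 7, sqrt(p) features per split, bootstrap on
rf = RandomForestRegressor(n_estimators=700, max_depth=7,
                           min_samples_split=2, max_features="sqrt",
                           bootstrap=True)

# Gradient Boosted Trees: least-squares loss, Friedman MSE split criterion
gbt = GradientBoostingRegressor(n_estimators=700, learning_rate=0.1,
                                max_depth=7, loss="squared_error",
                                criterion="friedman_mse", max_features="sqrt")

# Support Vector Regressor: linear kernel, C = 2048, gamma = 1e-7, epsilon = 0.5
svr = SVR(kernel="linear", C=2048, gamma=1e-7, epsilon=0.5)
```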

**Table 3.** Interpretability methods applied to each ML model.

| Models | Interpretability Methods |
|---|---|
| Random Forest | Permutation Importance, SHAP |
| Gradient Boosted Trees | Permutation Importance, SHAP |
| Support Vector Regressor | Permutation Importance, SHAP |
| Deep Neural Network | Permutation Importance, SHAP, and Integrated Gradients |

**Table 4.** Summary of feature importance MAE between the different SME methods and the MME Framework ensemble strategies for different noise levels.

| Models | Method | 0 (${10}^{-2}$) | 2 (${10}^{-2}$) | 4 (${10}^{-2}$) |
|---|---|---|---|---|
| SME | PI | 10.1 ± 2.0 | 9.8 ± 1.9 | 10.7 ± 2.6 |
| | SHAP | 9.8 ± 2.2 | 9.7 ± 2.2 | 10.0 ± 2.3 |
| | IG | 15.8 ± 9.5 | 16.7 ± 9.5 | 16.5 ± 9.5 |
| MME | RATE (Kendall) | 8.8 ± 3.2 | 8.8 ± 3.2 | 9.4 ± 3.6 |
| | RATE (Spearman) | 8.8 ± 3.2 | 8.8 ± 3.2 | 9.4 ± 3.6 |
| | Median | 9.5 ± 3.7 | 9.0 ± 3.4 | 10.1 ± 4.0 |
| | Mean | 8.8 ± 3.2 | 8.7 ± 3.2 | 9.4 ± 3.6 |
| | Mode | 12.2 ± 3.4 | 10.7 ± 3.0 | 11.1 ± 3.0 |
| | Box-whiskers | 9.1 ± 3.3 | 9.1 ± 3.3 | 9.5 ± 3.6 |
| | Tau test | 8.9 ± 3.3 | 8.8 ± 3.2 | 9.5 ± 3.6 |
| | Majority vote | 8.1 ± 2.7 | 8.6 ± 2.8 | 8.6 ± 3.0 |
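The simpler ensemble strategies in these tables reduce to element-wise statistics over the stack of FI vectors. A sketch of three of them, where the box-whiskers reading (averaging only the values inside the Tukey fences) is an assumption on our part:

```python
import numpy as np

def fuse(fi_vectors, strategy="mean"):
    """Minimal sketches of three fusion strategies from Tables 4-6.
    Mode, tau test, majority vote and RATE need the fuller treatment
    described in the paper and are not reproduced here.
    """
    fi = np.asarray(fi_vectors, dtype=float)
    if strategy == "mean":
        return fi.mean(axis=0)
    if strategy == "median":
        return np.median(fi, axis=0)
    if strategy == "box-whiskers":
        # Assumed reading: per feature, average only the values inside
        # the Tukey fences [Q1 - 1.5*IQR, Q3 + 1.5*IQR], discarding
        # outlying importance estimates.
        q1, q3 = np.percentile(fi, [25, 75], axis=0)
        iqr = q3 - q1
        inside = (fi >= q1 - 1.5 * iqr) & (fi <= q3 + 1.5 * iqr)
        return np.nanmean(np.where(inside, fi, np.nan), axis=0)
    raise ValueError(f"unknown strategy: {strategy}")
```

The robust strategies (median, box-whiskers) shield the fused importance from a single wildly different FI vector, which the plain mean cannot.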

**Table 5.** Summary of feature importance MAE between the different SME methods and our MME Framework for different percentages of informative features.

| Models | Method | 20 (${10}^{-2}$) | 40 (${10}^{-2}$) | 60 (${10}^{-2}$) | 80 (${10}^{-2}$) | 100 (${10}^{-2}$) |
|---|---|---|---|---|---|---|
| SME | PI | 2.7 ± 0.4 | 6.7 ± 0.8 | 11.2 ± 2.0 | 13.8 ± 1.0 | 16.5 ± 1.3 |
| | SHAP | 2.2 ± 0.4 | 5.8 ± 0.8 | 9.7 ± 1.0 | 14.2 ± 1.5 | 17.3 ± 1.9 |
| | IG | 6.3 ± 7.2 | 10.7 ± 7.2 | 15.8 ± 7.2 | 22.2 ± 7.2 | 26.7 ± 7.2 |
| MME | RATE (Kendall) | 2.1 ± 0.4 | 5.4 ± 1.1 | 9.3 ± 1.5 | 12.3 ± 1.6 | 15.9 ± 3.0 |
| | RATE (Spearman) | 2.1 ± 0.5 | 5.4 ± 1.1 | 9.3 ± 1.5 | 12.3 ± 1.6 | 15.9 ± 3.0 |
| | Median | 2.0 ± 0.6 | 5.9 ± 1.4 | 10.2 ± 1.8 | 13.0 ± 2.4 | 16.7 ± 3.5 |
| | Mean | 2.1 ± 0.5 | 5.4 ± 1.1 | 9.4 ± 1.5 | 12.3 ± 1.6 | 15.7 ± 3.0 |
| | Mode | 7.0 ± 2.0 | 9.0 ± 3.1 | 12.4 ± 3.1 | 13.3 ± 1.9 | 15.0 ± 3.0 |
| | Box-whiskers | 2.1 ± 0.6 | 5.6 ± 1.1 | 9.5 ± 1.4 | 12.9 ± 1.7 | 16.1 ± 2.9 |
| | Tau test | 2.0 ± 0.6 | 5.6 ± 1.2 | 9.5 ± 1.5 | 12.4 ± 1.7 | 15.7 ± 3.0 |
| | Majority vote | 3.1 ± 0.9 | 6.5 ± 1.5 | 9.4 ± 1.6 | 10.3 ± 2.1 | 12.7 ± 3.4 |

**Table 6.** Summary of feature importance MAE between the different SME methods and our MME Framework for different numbers of features.

| Models | Method | 20 (${10}^{-2}$) | 60 (${10}^{-2}$) | 100 (${10}^{-2}$) |
|---|---|---|---|---|
| SME | PI | 7.5 ± 1.8 | 10.8 ± 2.0 | 12.2 ± 2.4 |
| | SHAP | 6.1 ± 1.5 | 10.3 ± 2.2 | 13.1 ± 2.4 |
| | IG | 15.8 ± 9.5 | 16.1 ± 9.5 | 17.1 ± 9.5 |
| MME | RATE (Kendall) | 6.3 ± 2.3 | 9.4 ± 3.1 | 11.3 ± 3.8 |
| | RATE (Spearman) | 6.37 ± 2.3 | 9.4 ± 3.1 | 11.3 ± 3.8 |
| | Median | 6.0 ± 2.3 | 10.6 ± 3.8 | 12.0 ± 3.9 |
| | Mean | 6.2 ± 2.2 | 9.3 ± 3.1 | 11.3 ± 3.8 |
| | Mode | 14.8 ± 3.6 | 9.6 ± 2.3 | 9.7 ± 2.4 |
| | Box-whiskers | 6.5 ± 2.5 | 9.8 ± 3.2 | 11.4 ± 3.7 |
| | Tau test | 6.1 ± 2.4 | 9.7 ± 3.2 | 11.4 ± 3.7 |
| | Majority vote | 5.6 ± 1.9 | 7.6 ± 1.5 | 12.1 ± 3.3 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Rengasamy, D.; Rothwell, B.C.; Figueredo, G.P.
Towards a More Reliable Interpretation of Machine Learning Outputs for Safety-Critical Systems Using Feature Importance Fusion. *Appl. Sci.* **2021**, *11*, 11854.
https://doi.org/10.3390/app112411854

**AMA Style**

Rengasamy D, Rothwell BC, Figueredo GP.
Towards a More Reliable Interpretation of Machine Learning Outputs for Safety-Critical Systems Using Feature Importance Fusion. *Applied Sciences*. 2021; 11(24):11854.
https://doi.org/10.3390/app112411854

**Chicago/Turabian Style**

Rengasamy, Divish, Benjamin C. Rothwell, and Grazziela P. Figueredo.
2021. "Towards a More Reliable Interpretation of Machine Learning Outputs for Safety-Critical Systems Using Feature Importance Fusion" *Applied Sciences* 11, no. 24: 11854.
https://doi.org/10.3390/app112411854