There are many real-world problems where machine learning enables accurate predictions and, subsequently, the efficient use of resources, including in agriculture [1], medicine [2], and logistics [3], to name a few examples beyond geomatics. In geomatics itself, artificial intelligence and machine learning have traditionally been used in remote sensing and image processing and have started to transform the entire field with respect to the collection, management, and analysis of spatial data [4,5,6]. The range of possible machine learning models is large, and in addition to deep learning, random forests are used as an explainable and easy-to-train alternative [7]. Random forests consist of multiple binary trees that provide predictions by dividing the data along a cascade of simple yes or no questions about the input. Although improving accuracy is one research direction across all fields, it is not the only desirable goal when designing a machine learning solution. High accuracy often comes at the cost of a large number of input features, creating an entry barrier for end users and hindering real-world adoption, as acquiring some features might require expensive specialized sensors or human labor.
Given an existing machine learning solution based on random forests, we want to answer the question of whether we can reduce the number of features without losing modeling accuracy in the process. The naive solution is to train and test a dedicated machine learning model on every subset of features and to evaluate the accuracy that a model trained exclusively on that subset can offer. As the number of possible feature subsets is exponential in the number of features, this is not feasible for real-world problems. The natural question is whether we can instead attribute an importance to each feature and greedily select the features that should be included in the final prediction model.
Shapley values [8] have gained recognition as a unified framework for feature attribution, where feature attributions indicate how much each feature in a machine learning model contributed to the prediction for each data point. In explainable AI (XAI), where explainable predictions are requested, predictions therefore often come along with feature attribution information. Shapley values were developed by Shapley [9] in the field of game theory as a solution for fairly attributing resources in cooperative games, and with relevance in very diverse fields [10,11], the concept can be translated to feature attribution in machine learning by defining a game in which the input features are the players and the prediction of the model defines the payout. Although the computation is NP-hard in general, Lundberg et al. [12] showed how to calculate Shapley value feature attributions for random forests in polynomial time, allowing the adoption of the approach in many machine learning pipelines. Naturally, feature attributions are also used as a tool to select meaningful features, as is successfully demonstrated in related research [13,14,15]. All of these works deploy a definition of Shapley values for feature attribution in which the value function used to evaluate subsets of features is a so-called conditional value function: it estimates the expected conditional output of the model for a data point, assuming only a subset of the features is known to the model, while the missing information is inferred from the data distribution of the training data.
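For reference, with feature set N and a value function v, the Shapley value of a feature i is the weighted average of its marginal contributions over all subsets of the remaining features; under the conditional value function described above, a subset S is evaluated by the expected model output given only the features in S (the notation here is generic and not tied to any particular implementation):

\[
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,\bigl(|N|-|S|-1\bigr)!}{|N|!}\,\bigl(v(S \cup \{i\}) - v(S)\bigr),
\qquad
v(S) \;=\; \mathbb{E}\bigl[f(x) \mid x_S\bigr].
\]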
Although the right selection of the value function can resolve most issues with using Shapley values as a feature selection tool, one problem remains: lost potential due to the nature of Shapley values as a model averaging procedure. By definition, the Shapley value of every single feature is calculated by considering all possible subsets of features. In particular, this includes subsets of features that are not part of the optimal feature selection. In other words, a feature selection based on Shapley value feature attributions is influenced by features that should not be considered for selection at all. To evaluate the importance of this problem, we developed a novel solution for feature selection based on the conditional evaluation of the expected model output, called Conditional Feature Selection (CFS). The idea of the algorithm is to perform a feature selection similar to what a Shapley value-based feature selection would produce if the problem of model averaging did not exist. The trade-off for this improvement comes in the form of a runtime that is exponential in the number of features, as without model averaging, every subset of features needs to be examined. We performed an extensive evaluation on multiple real-world geomatics datasets to better understand the trade-off between runtime and the mitigation of the model averaging problem, and we use this understanding to highlight the differences between a feature selection based on CFS and a greedy feature selection based on Shapley values. Following the results of our experiments, we see that, especially for real-world problems with complex relations between the features, Shapley values can still be successfully used as a feature selection tool.
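To make the cost of avoiding model averaging concrete, the following Python sketch enumerates every non-empty feature subset and scores it; score_subset is a hypothetical callback standing in for however the conditional expected model output of a subset is evaluated, and the sketch illustrates only the exhaustive search idea, not the exact CFS procedure developed later in this work.

```python
# Illustrative exhaustive subset search (score_subset is a hypothetical scoring callback).
from itertools import combinations

def exhaustive_subset_search(features, score_subset):
    """Evaluate every non-empty subset of `features` and return the best-scoring one.

    The number of subsets is 2^n - 1, so this is only tractable for small feature counts.
    """
    best_subset, best_score = None, float("-inf")
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            score = score_subset(subset)  # e.g., agreement of E[f(x) | x_S] with the full-model output
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score
```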
1.2. Related Work
Feature selection, in general, is a very well-researched topic within the field of machine learning [16,17,18]. Historically, approaches to feature selection are divided into three different families [19].
Filter approaches select the feature subset based on some quality measure that is independent of any machine learning algorithm. An example of a filter method is correlation-based feature selection, where features are greedily selected by maximizing the correlation with the target variable while minimizing the correlation with already-selected features [20].
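The following sketch shows such a greedy relevance-minus-redundancy selection in Python; it is an illustrative simplification and not the exact merit criterion of [20].

```python
# Simplified greedy correlation-based filter selection (illustrative only).
import numpy as np

def greedy_correlation_selection(X, y, k):
    """Pick k features with high |corr(feature, target)| and low |corr| to already-chosen features."""
    n_features = X.shape[1]
    relevance = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_features)])
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        scores = []
        for j in range(n_features):
            if j in selected:
                scores.append(-np.inf)
                continue
            redundancy = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) for s in selected])
            scores.append(relevance[j] - redundancy)  # reward relevance, penalize redundancy
        selected.append(int(np.argmax(scores)))
    return selected
```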
Wrapper methods wrap the feature selection around a machine learning algorithm and generate a feature subset based on the algorithm's performance. For example, Huang et al. [21] use a random forest to evaluate the value of a feature based on the performance of the model when the values of that feature are randomly permuted. The highest-ranked features are then used for feature selection.
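A minimal sketch of the general permutation-importance idea, using scikit-learn's implementation on a synthetic dataset rather than the exact setup of [21]:

```python
# Sketch: rank features by the score drop caused by permuting them on held-out data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
ranking = result.importances_mean.argsort()[::-1]  # most important features first
```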
Embedded methods directly compute the importance of the features based on their contribution within the machine learning algorithm. A well-known example of embedded feature selection for random forests is a greedy feature selection based on the internal Gini importances of the forest [22].
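A minimal sketch of such an embedded selection, assuming scikit-learn's impurity-based (Gini) importances as the ranking criterion:

```python
# Sketch: embedded selection via the impurity-based importances of a random forest.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
# feature_importances_ holds the mean impurity decrease (Gini importance) per feature.
top_k = np.argsort(forest.feature_importances_)[::-1][:5]  # keep the 5 highest-ranked features
X_reduced = X[:, top_k]  # retrain downstream models on the reduced feature set
```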
Another related research direction is feature extraction, where the existing feature space is transformed to create new, more expressive features in a lower-dimensional space. Although procedures like principal component analysis (PCA) [23] are a very common paradigm in machine learning, we opt to focus on feature selection rather than feature extraction, as we lose explainability when we alter the feature space [24].
Furthermore, as we focus throughout this manuscript on problems where a functional machine learning solution already exists and should be further optimized, we restrict this literature review to wrapper and embedded methods. As random forests have been in use for several decades now, feature selection methods that wrap random forests or decision trees are a widely investigated research topic.
Deng and Runger [25] do not select a subset of features before creating the model, but rather encourage the correct choice of features during model creation. This is implemented by penalizing the selection of new features during the construction of the random forest when the information content of a feature is similar to that of a previously selected feature. Another wrapping method is proposed by Genuer et al. [26]. After creating a random forest, they evaluate the importance of features by calculating the difference in error when the feature values are randomly permuted. The deviation in error is then directly correlated with the importance of the feature, and the most important features are selected. This idea is further explored by Gazzola and Jeong [27], who add a clustering of data points to take advantage of the structure of dependencies between the input features. The advantages of selecting features based on the knowledge already provided by the previously selected features are highlighted by Alsahaf et al. [28]. Features are selected sequentially according to the internal importance of an extreme gradient boosting model, where, in each step, the probability of new features being selected is adjusted by the number of misclassified training examples for a model based on the current feature selection. A different approach is explored by Shih et al. [29], who are not directly interested in finding a subset of features well suited to retraining a machine learning model with a smaller input feature space; similar to our work, they focus on finding explanations for individual data instances. The concept of sufficient reasons explains a prediction by finding a subset of features that is already sufficient for the model output to remain the same, no matter the values of the other features. Arenas et al. [30] extend this approach by not demanding the exact same model output but a high probability of equality. Lastly, a wrapper approach for the selection of features in random forests is proposed by Zhou et al. [31]. They calculate feature weights dependent on the layers of the decision trees and select the features with the largest weights as a partition for further model creation.
With the rise of Shapley values in explainable AI (XAI), exploring the possibilities of Shapley values for feature selection also became an interesting research topic. One of the main benefits of Shapley values in explainability is the ability to provide unified feature attribution values for any kind of machine learning model. Unified explanations are important, since inconsistencies between XAI and modeling techniques can have undesirable effects [32], while model-agnostic methods allow for widespread adoption of the explanations and improved comparability of different machine learning methods. A similar benefit is offered by the popular LIME framework introduced by Ribeiro et al. [33], where an interpretable model is learned locally around a prediction to produce explanations for any classifier. As with Shapley values, LIME was also investigated as a feature selection tool [34]; in a comparison, the top-rated features of both a LIME-based and a Shapley value-based feature selection improved the performance of random forests. In another effort to create model-agnostic explanations, the authors of LIME introduced anchors to explain models [35]. An anchor is a locally computed sufficient set of features and conditions such that the model output will remain the same.
One of the first works to explore Shapley values for feature selection is [36]. After estimating the Shapley value feature importance via sampled permutations, the authors introduce the Contribution Selection Algorithm (CSA) to greedily select the most important features. They show the best results when using the algorithm to eliminate the unimportant features instead of selecting the most important ones, and they investigate their CSA approach on additional datasets in a follow-up work [37]. Marcílio and Eler [13] give a general survey of the topic. The most important features according to TreeSHAP [12] are greedily selected, and the accuracies reached are evaluated on several datasets, showing the capabilities compared with other popular feature selection methods.
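A minimal sketch of such a greedy TreeSHAP-based ranking, using the shap library and a random forest on synthetic data; it illustrates the general pattern rather than the exact procedure of [13]:

```python
# Sketch: greedy feature ranking by mean absolute TreeSHAP attribution (illustrative only).
import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=20, n_informative=5, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
# TreeSHAP computes Shapley attributions for tree ensembles in polynomial time.
shap_values = shap.TreeExplainer(model).shap_values(X)  # shape: (n_samples, n_features)
global_importance = np.abs(shap_values).mean(axis=0)    # average |attribution| per feature
top_k = np.argsort(global_importance)[::-1][:5]         # greedily keep the 5 highest-ranked features
```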
A broad overview of the use of the Shapley value in machine learning by Rozemberczki et al. [38] also discusses the possibilities of Shapley values in feature selection. They see the greatest disadvantages in the high computational complexity and note that the axioms defining the Shapley value might not hold when approximation techniques are used. Chu and Chan [39] eliminate the problem of additional features altering the Shapley value and, therefore, the final selection of features. They do so by proposing an iterative feature selection approach based not only on Shapley values but also on higher-order interactions. Although flaws in Shapley value feature selection are prevalent, as will be further reviewed in Section 2.3 of this work, we see multiple cases of Shapley value feature selection solving real-world problems. Fang et al. [40] successfully decrease the number of features needed for air pollution forecasting in China by selecting the most important features of an ensemble model, including a random forest, according to the output of Shapley value-based SAGE explanations [14]. Zacharias et al. [15] designed a framework for easy access to a wrapped feature selection based on Shapley values and random forests. The result is a feature selection process that includes user feedback, and they are able to report nearly unchanged accuracy for reduced feature counts on a variety of datasets. An overview of the different approaches for using Shapley values for feature selection is given in Table 1.

Lastly, we focus on procedures for Shapley value approximation. Strumbelj and Kononenko [41] give a different procedure to obtain feature attributions with the Shapley value. As is commonly done, they randomly sample permutations to reduce the computational complexity compared with calculating the exact Shapley value. The novelty of their approach is obtaining explanations by also randomly sampling, in every iteration of the algorithm, an instance from the data that serves as the base value of the explanation. Štrumbelj and Kononenko [42] extend this idea by improving the sampling algorithm via quasi-random and adaptive sampling and are able to show improved explainability in an experiment with human participants.
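As a generic illustration of this family of approximations (not the specific algorithms of [41,42]), the marginal contribution of each feature can be averaged over randomly drawn feature orderings; value_fn below is a hypothetical stand-in for whichever value function is chosen:

```python
# Generic Monte Carlo approximation of Shapley values via sampled permutations.
import random

def sample_shapley(features, value_fn, n_permutations=1000, seed=0):
    """Approximate each feature's Shapley value by averaging its marginal contribution
    over randomly sampled orderings of the features."""
    rng = random.Random(seed)
    attributions = {f: 0.0 for f in features}
    for _ in range(n_permutations):
        order = list(features)
        rng.shuffle(order)
        prefix = []
        prev_value = value_fn(tuple(prefix))          # value of the empty coalition
        for feature in order:
            prefix.append(feature)
            new_value = value_fn(tuple(prefix))
            attributions[feature] += (new_value - prev_value) / n_permutations
            prev_value = new_value
    return attributions
```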
In summary, we see that multiple approaches exist for calculating feature attributions with Shapley values, and they differ especially in the choice of the value function used to evaluate subsets of features for the purpose of feature selection. In addition, we see different approaches for taking advantage of the calculated values to perform feature selection, from greedy selections [40] to more elaborate algorithms [39]. Our work follows up on this established research and is the first to conceptually define the conditions that a definition of Shapley values should meet so that the resulting values can be used for feature selection. Furthermore, we give the first algorithm to evaluate the impact of the model averaging problem, which is intrinsic to any definition of Shapley values in machine learning, and we provide an extensive analysis on multiple real-world examples from the field of geomatics.