Process Variable Importance Analysis by Use of Random Forests in a Shapley Regression Framework

Abstract: Linear regression is often used as a diagnostic tool to understand the relative contributions of operational variables to some key performance indicator or response variable. However, owing to the nature of plant operations, predictor variables tend to be correlated, often highly so, and this can lead to significant complications in assessing the importance of these variables. Shapley regression is seen as the only axiomatic approach to deal with this problem but has almost exclusively been used with linear models to date. In this paper, the approach is extended to random forests, and the results are compared with some of the empirical variable importance measures widely used with these models, i.e., the permutation and Gini variable importance measures. Four case studies are considered, of which two are based on simulated data and two on real-world data from the mineral process industries. These case studies suggest that the random forest Shapley variable importance measure may be a more reliable indicator of the influence of predictor variables than the other measures that were considered. Moreover, the results obtained with the Gini variable importance measure were as reliable as or better than those obtained with the permutation measure of the random forest.


Introduction
Insight into the underlying physical phenomena in process systems is key to the development of reliable process models. These models are in turn vital to the development of advanced process control systems and optimization of process operations. With the availability of large amounts of plant data becoming more commonplace, there is a growing interest in the use of operational data to gain deeper insights into process systems and plant operations [1].
Linear regression is well established as a diagnostic tool to understand the relative contributions or the statistical significance of operational variables to some key performance indicator or response variable on process plants. However, owing to the nature of plant operations, predictor variables tend to be correlated, and this can lead to significant complications in assessing the importance of these variables. Despite a number of methods that have been proposed to surmount the problem, Shapley regression is seen as the only axiomatic approach to deal with it [2][3][4]. As a consequence, linear Shapley regression models and variants thereof [2,5,6] have seen steady growth in application over the last few decades.
However, while linear models can be used to explicitly capture nonlinear process behavior, for example by including transformed variables, this would require a sound understanding of the underlying process phenomena, which may not be available. Although there is no reason why the same approach cannot be used with machine learning models, Shapley regression has to date been applied almost exclusively with linear models. In this paper, the approach is therefore extended to random forest models.

Methodology
Shapley regression has its origin in game theory from the 1950s and has been reinvented multiple times with different terminology, including dominance analysis [5,19] and regression analysis in a game theory approach [2]. In essence, the fraction of the variance of the response variable explained by the model can be decomposed as indicated in Equation (1).
If the m predictor variables are denoted by a set V = {x_1, x_2, ..., x_m}, then S is some subset of the set of predictor variables V and |S| is the number of elements in the set S. V\{x_j} is the set of predictor variables, excluding variable x_j. R²(S) is the R²-value of the regression of the predictor variables in S on the response y, and R²_j is the marginal contribution of variable x_j to the overall model R²-value, i.e.,

R²_j = Σ_{S ⊆ V\{x_j}} [|S|!(m - |S| - 1)!/m!] [R²(S ∪ {x_j}) - R²(S)]  (1)

It is further assumed that R²(∅) = 0. The weight |S|!(m - |S| - 1)!/m! averages the marginal contributions over coalitions of the same size; for m = 3, for example, the weights attached to the marginal contributions of x_j to coalitions of size one, two and three are 1/3, 1/6 and 1/3, respectively. This is to compensate for the fact that there are twice as many two-variable coalitions as one- or three-variable ones and to ensure that the sum of the weights remains unity, i.e., 1(1/3) + 2(1/6) + 1(1/3) = 1. Evaluation of all possible coalitions is computationally expensive, since the number of models that needs to be evaluated is equal to 2^m. For 20 predictor variables or more, this means more than a million models need to be evaluated. This is a huge computational cost, which is exacerbated when Monte Carlo sampling is also performed to determine the statistical significance of the effects of variables. As a consequence, the Shapley regression framework has only started to gain acceptance in practice relatively recently, as computational power has become more widespread.
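The exhaustive decomposition in Equation (1) can be illustrated for the linear case with a short script (a sketch on toy data, not the code used in this study; the helper names `r_squared` and `shapley_r2` are ours): for each variable, the weighted marginal R² gains over all coalitions are accumulated.

```python
from itertools import combinations
from math import factorial

import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on the columns of X."""
    X1 = np.column_stack([np.ones(len(y)), X])          # add intercept
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

def shapley_r2(X, y):
    """Decompose the full-model R^2 into per-variable Shapley contributions
    by exhaustive evaluation of all 2^m coalitions (Equation (1))."""
    m = X.shape[1]
    contrib = np.zeros(m)
    for j in range(m):
        others = [i for i in range(m) if i != j]
        for size in range(m):                            # |S| = 0 .. m-1
            w = factorial(size) * factorial(m - size - 1) / factorial(m)
            for S in combinations(others, size):
                base = r_squared(X[:, S], y) if S else 0.0   # R^2(empty) = 0
                contrib[j] += w * (r_squared(X[:, S + (j,)], y) - base)
    return contrib

# Toy illustration: three independent predictors with decreasing influence.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)
phi = shapley_r2(X, y)
```

By the efficiency axiom discussed below, the contributions `phi` sum exactly to the full-model R².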

Axiomatic Properties of Shapley Values
It can be proved that Shapley regression is the only attribution method that satisfies the following axioms of credit attribution [20]:
Efficiency: as far as regression models are concerned, the variable contributions to the variance explained by the model add up to the overall variance explained by the model, i.e., Σ_{j=1}^{m} R²_j = R²_model.
Null player: if a variable never adds any marginal value, its credit portion is zero, i.e., if R²(S ∪ {x_q}) = R²(S) for all S ⊆ {x_1, x_2, ..., x_m}\{x_q}, then R²_q = 0.
Symmetry: if two variables add the same marginal value to any subset to which they are added, then their credit portions are identical, i.e., if R²(S ∪ {x_p}) = R²(S ∪ {x_q}) for all S ⊆ {x_1, x_2, ..., x_m}\{x_p, x_q}, then R²_p = R²_q.
Additivity: if a model consists of two additive models, then the pay-offs of a variable in the two submodels should add up to the pay-off of the variable in the overall model. With random forests, for example, the predicted value of the response is equal to the average of the predictions of the individual trees in the forest. This means that, for a specific variable, the Shapley value can be calculated for each tree individually and these values can be averaged to get the Shapley value for the random forest.
As a result, the Shapley decomposition of the variance explained by the variables in a model is considered to be the ground truth or gold standard against which other methods can be measured [20].

Random Forests
Developed in the 1990s, random forests [21] have become known for their state-of-the-art capability in classification or regression, and their ability to handle categorical or continuous variables, as well as dealing with missing data [22]. In addition, in most implementations, so-called out-of-bag or generalization errors are automatically calculated and their performance is not particularly sensitive to the few hyperparameters that are required to tune the models. As a consequence, the popularity of these models in the process industries is growing rapidly, with applications, for example, in predictive modeling [11,23], fault diagnosis and root cause analysis [24,25], and change point detection [26], as well as diagnostic modeling [27,28]. Random forests consist of ensembles of decision trees, as briefly summarized below.

Decision Trees
Decision trees are built based on recursive partitioning of a data set to perform classification or regression. Given a training set D with n samples, consisting of m predictor variables X ∈ R^{n×m} and a target variable y ∈ R^{n×1}, the classification and regression tree (CART) algorithm recursively partitions the input space X to obtain a tree predictor T_D(x) (with ŷ the estimated response as a function of the predictors). The algorithm accomplishes this by repeatedly seeking a binary partitioning of the input space X that increases the target purity in the subspaces formed by the partition. The partition is defined by a hyperplane perpendicular to one of the coordinate axes of X. The Gini criterion is often used as a measure of purity for classification, while a mean squared error criterion is used for regression. Recursive binary partitioning of each new subspace is terminated when some level of subspace response homogeneity is achieved. At that point, the predicted value for a particular subspace is typically obtained via the majority vote of the training targets in the subspace in the case of classification, or their average in the case of regression.
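As a minimal sketch of CART regression (using scikit-learn on arbitrary toy data, not the trees used in this study), the following fits a tree whose axis-aligned splits minimize the MSE impurity and whose leaves predict training means:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(6)
X = rng.uniform(-1.0, 1.0, size=(300, 2))
# Piecewise-constant target: a step in x_1 plus a little noise.
y = np.where(X[:, 0] > 0.0, 1.0, -1.0) + 0.05 * rng.normal(size=300)

# Axis-aligned binary splits chosen to reduce MSE impurity; each leaf
# predicts the mean of the training targets that reach it.
tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0).fit(X, y)
```

The fitted tree recovers the step in x_1 with only a few splits, since the partitioning hyperplanes are perpendicular to the coordinate axes.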

Ensembles of Decision Trees
Random forests are ensemble algorithms, where the training data seen by each tree are generated by bagging. That is, a different bootstrapped sample D(θ_k) of size n_try × m_try from the training set D of size n × m is generated for each tree T_D(θ_k). The predictions by the individual trees, T_D(θ_k)(x_i), are subsequently averaged over all the trees in the random forest to generate the prediction of the model, as indicated in Equations (3) and (4) for regression and classification, respectively.
The construction of a random forest, generally with K trees, as shown in Figure 1, proceeds as follows:
1. From the data, draw n_try bootstrap samples.
2. Grow a tree for each of the sets of bootstrap samples. For each tree (k = 1, 2, ..., K), randomly select m_try variables for splitting at each node of the tree. Each terminal node in this tree should have no fewer than n_leaf cases.
3. Aggregate information from the K trees for new data prediction, according to Equations (3) and (4).
4. Compute an out-of-bag (OOB) error rate based on the data not in the bootstrap sample.
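The four steps above map directly onto scikit-learn's RandomForestRegressor (a sketch on arbitrary toy data, not the models of this study): max_features plays the role of m_try, min_samples_leaf that of n_leaf, and oob_score=True performs step 4.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

# Steps 1-3: K = 100 bagged trees, each split drawn from m_try = 2 randomly
# selected variables, terminal nodes with at least n_leaf = 5 cases.
# Step 4: rf.oob_score_ is the R^2 estimated from out-of-bag data.
rf = RandomForestRegressor(n_estimators=100, max_features=2,
                           min_samples_leaf=5, oob_score=True,
                           random_state=0).fit(X, y)
```

The out-of-bag R² comes essentially for free, since each tree only sees its own bootstrap sample.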

Splitting Criteria
The Gini index in Equation (5) is based on the squared proportions of the classes and favors larger partitions. Perfect classification is associated with a Gini index of zero, while an even distribution of classes would yield an index of Gini = 1 - 1/C, where C is the number of classes.
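The Gini index of Equation (5) is simple to compute directly (a sketch; the helper name `gini_index` is ours):

```python
import numpy as np

def gini_index(labels):
    """Gini impurity, 1 - sum_c p_c^2, of a vector of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()                 # class proportions p_c
    return 1.0 - float(np.sum(p ** 2))
```

A pure node gives zero, while an even split over C classes gives 1 - 1/C, in agreement with the bounds stated above.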


Variable Importance Measures
Three variable importance measures are considered in this investigation, as defined by Equations (1), (7) and (9) and described in more detail below.


Shapley Variable Importance with Random Forests and Linear Regression Models
The Shapley variable importance measure is derived from Equation (1), where the R²-values were generated by both linear regression and random forest regression models in this study. Each random forest model that was constructed was based on a random selection of the data, as well as a random selection of the subset of variables seen by that specific random forest in the coalition. Likewise, the regression results were collected on an independent test set randomly constructed from the data, which was different for each model. For the random forests, the Shapley variable importance measures are denoted as VI_shap^RF(x_j) = R²_j, as calculated by Equation (1). For comparative purposes, the linear regression models were evaluated on the same basis, i.e., the models were fitted on a training data set and the regression results reported on a test set. The results were essentially the same as when the models were fitted to all the data, as would normally be done in the case of linear regression (not reported here). Likewise, for the multiple linear regression models, the Shapley variable importance measures are denoted as VI_shap^MLR(x_j) = R²_j, as calculated by Equation (1).
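One way to implement the random forest variant is to refit a forest for every coalition and score it on a held-out test set, exactly as in Equation (1). The sketch below (scikit-learn, toy data, our own helper name `rf_shapley_r2`; exhaustive, so only feasible for small m) illustrates the idea, not the authors' implementation:

```python
from itertools import combinations
from math import factorial

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_shapley_r2(X, y, m_try=2, seed=0):
    """Shapley decomposition of test-set R^2, fitting a random forest
    to every coalition of predictors (Equation (1))."""
    n, m = X.shape
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    train, test = idx[:int(0.8 * n)], idx[int(0.8 * n):]

    def r2(cols):
        if not cols:
            return 0.0                        # R^2(empty set) = 0
        rf = RandomForestRegressor(n_estimators=50,
                                   max_features=min(m_try, len(cols)),
                                   min_samples_leaf=5, random_state=seed)
        rf.fit(X[np.ix_(train, cols)], y[train])
        return rf.score(X[np.ix_(test, cols)], y[test])

    contrib = np.zeros(m)
    for j in range(m):
        others = [i for i in range(m) if i != j]
        for size in range(m):
            w = factorial(size) * factorial(m - size - 1) / factorial(m)
            for S in combinations(others, size):
                contrib[j] += w * (r2(list(S) + [j]) - r2(list(S)))
    return contrib

# Toy illustration: only the first predictor drives the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=200)
phi = rf_shapley_r2(X, y)
```

Note that test-set R² values of noise-only coalitions can be slightly negative, so the efficiency property holds only approximately in this sampled variant.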

Permutation Variable Importance
The permutation variable importance of each variable is the increase in the mean square error (MSE) in the model when the particular variable is permuted. That is, the association between the predictor and the target is destroyed by randomly shuffling the observations of the particular variable.
More formally, the permutation variable importance VI_perm^(k)(x_j) can be computed in the k'th tree for variable x_j, as indicated by Equation (6).
In Equation (6), y_i is the i'th observation of the response variable in the OOB sample seen by the k'th tree in the random forest, while ŷ_i^(k) and ŷ_(j),i^(k) are the estimates of this response by the k'th tree before and after permutation of the j'th variable, respectively. n_OOB is the number of samples in the out-of-bag (OOB) data set seen by each of the K trees in the forest.
Unlike univariate screening methods, the permutation variable importance accounts for both the individual impact of each predictor variable, as well as multivariate interaction with other predictor variables [21].
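The idea behind Equation (6) can be sketched as follows (for brevity, a test-set variant rather than the per-tree OOB average of the text; the helper name `permutation_vi` is ours):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def permutation_vi(model, X, y, rng):
    """Increase in MSE after shuffling each column of X in turn
    (a test-set analogue of Equation (6))."""
    base_mse = np.mean((y - model.predict(X)) ** 2)
    vi = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        Xp = X.copy()
        rng.shuffle(Xp[:, j])                 # destroy the x_j - y association
        vi[j] = np.mean((y - model.predict(Xp)) ** 2) - base_mse
    return vi

# Toy illustration: only the first predictor matters.
rng = np.random.default_rng(2)
X = rng.normal(size=(600, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=600)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[:400], y[:400])
vi = permutation_vi(rf, X[400:], y[400:], rng)
```

Shuffling the influential column inflates the MSE sharply, while shuffling the noise columns leaves it essentially unchanged.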

Gini Variable Importance
The third approach to estimating the importance of individual predictors is based on the changes in the node impurities at each split in each tree in the random forest. This Gini importance or mean decrease in the impurity of the node is the difference between a node's impurity and the weighted sum of the impurities of the two descendent nodes.
More formally, the importance of variable x_j for predicting a response variable y can be determined by summing the decreases in impurity (∆I) over all nodes t at which variable x_j is used to split. These impurity decreases are weighted by the fractions of samples p(t) in the nodes and averaged over all trees (k = 1, 2, ..., K) in the forest [13], as indicated in Equation (8).
In Equation (8), T_k is the number of nodes in the k'th tree in the random forest, p(t) = n_t/n is the fraction of the samples reaching node t, while v(s_t) is the variable used in the split s_t. When the Gini index is used as the impurity function I, the variable importance measure VI_gini(x_j) is referred to as the Gini variable importance.
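In scikit-learn, this mean decrease in impurity is exposed as `feature_importances_` (a toy sketch; note that for regression trees the impurity is the MSE criterion, with the Gini criterion applying in classification):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4))
# First two predictors influential, last two pure noise.
y = 3.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=500)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
mdi = rf.feature_importances_     # impurity-decrease importances, per Equation (8)
```

scikit-learn normalizes these importances so that they sum to one across the predictors.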

Linear System with Weakly Correlated Predictors
The data in the first case study were generated by the same model used by Olden et al. [29] to investigate variable importance analysis with multilayer perceptrons. One thousand samples of six random variables were generated with a mean vector of zero and a covariance matrix (Σ) as follows.
The first five variables (x_1, x_2, ..., x_5) constituted the predictors, while the last variable (y) was the target or response variable, linearly related to the predictor variables. The predictor variables were mildly correlated, as indicated by the covariance matrix, with bivariate correlation coefficients of r_{x_i,x_j} = 0.2 for all i ≠ j. The first predictor variable (x_1) was strongly correlated with the target variable, i.e., r_{x_1,y} = 0.8, with the subsequent predictors having progressively weaker correlations, i.e., r_{x_2,y} = 0.6, r_{x_3,y} = 0.4, r_{x_4,y} = 0.2, and the fifth predictor having no correlation with the target at all (r_{x_5,y} = 0). Figure 2 shows a scatter plot matrix of the predictor data set (left), as well as the relationship between the response and the two most influential predictors.
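The simulation can be reproduced along the following lines (a sketch reconstructing the covariance matrix from the stated correlations; seed and generator are arbitrary):

```python
import numpy as np

# Covariance (= correlation, unit variances) matrix implied by the text:
# predictors mutually correlated at 0.2; correlations with y of 0.8, 0.6,
# 0.4, 0.2 and 0. The last row/column corresponds to y.
Sigma = np.array([
    [1.0, 0.2, 0.2, 0.2, 0.2, 0.8],
    [0.2, 1.0, 0.2, 0.2, 0.2, 0.6],
    [0.2, 0.2, 1.0, 0.2, 0.2, 0.4],
    [0.2, 0.2, 0.2, 1.0, 0.2, 0.2],
    [0.2, 0.2, 0.2, 0.2, 1.0, 0.0],
    [0.8, 0.6, 0.4, 0.2, 0.0, 1.0],
])

rng = np.random.default_rng(4)
data = rng.multivariate_normal(np.zeros(6), Sigma, size=1000)
X, y = data[:, :5], data[:, 5]
```

With n = 1000, the empirical correlations closely match the prescribed values.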
Random forest models were fitted to the data, with the following optimal parameters: m_try = 2, n_try = 80% of the data, and minimum leaf size n_leaf = 5. On average, the random forest model could explain 93% of the variance of the response variable. The effect of the parameter m_try on the performance of the model is shown in Figure 3.
The whiskers extend to the most extreme data points not considered outliers, which are indicated by '+' markers. More specifically, points were drawn as outliers if they were larger than Q3 + W(Q3 - Q1) or smaller than Q1 - W(Q3 - Q1), where Q1 and Q3 are the 25th and 75th percentiles, respectively. W = 1.5 was used, corresponding to approximately 2.7 standard deviations of a normal distribution.
The notches in the boxes can be used to compare the median values of the importance measures. That is, non-overlapping notches indicate a difference between the median values of the variables with 95% certainty.
The Shapley regression values obtained with the random forest are the only ones that could correctly identify the ranking of the importance of the variables, although the Gini importance measure came close to correct identification as well. The Gini measure could rank variables x_1 to x_3 correctly, but the relative importance of variables x_4 and x_5 was swapped by a small margin. The Shapley regression values obtained with the linear model yielded the same results as the Gini measure. Although significant, this is a relatively minor error, bearing in mind that variable x_4 explains only approximately 4% of the variance of the response, as opposed to the zero percent of variable x_5.
The permutation index could correctly identify the relative importance of variables x_1 to x_3, but rated the relative importance of the fifth predictor as markedly higher than that of the fourth, and also higher than that of the third predictor with 95% confidence. As discussed by Gregorutti et al. [30], the higher the interpredictor correlation and the larger the number of correlated predictors, the less important a predictor will appear to be, at least for additive models. On this basis, the variable importances of the last two variables would be fairly similar and difficult to distinguish from one another by any model. However, this does not explain the higher value of the last variable shown in Figure 4 (top, right).

Nonlinear System with Strongly Correlated Predictors
In the second case study, a simulated data set with 1000 samples was generated, where X ∈ R^{1000×5} and y ∈ R^{1000×1}, with y = x_4² + x_4 x_5. The matrix X consisted of multivariate random numbers with a mean vector x = [0 0 0 0 0] and a covariance matrix Σ. As indicated by Equation (2), the first four variables (x_1, x_2, x_3, x_4) were strongly correlated with each other, while the last variable (x_5) was independent from the other four.
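This data set can be reproduced along the following lines (a sketch; seed and generator are arbitrary):

```python
import numpy as np

# Covariance per Equation (2): first four predictors mutually correlated
# at 0.9, the fifth independent of the rest.
Sigma = 0.9 * np.ones((5, 5)) + 0.1 * np.eye(5)
Sigma[4, :4] = Sigma[:4, 4] = 0.0

rng = np.random.default_rng(5)
X = rng.multivariate_normal(np.zeros(5), Sigma, size=1000)
y = X[:, 3] ** 2 + X[:, 3] * X[:, 4]      # y = x_4^2 + x_4*x_5 (1-based indices)
```

Note that cov(x_4, y) = E[x_4³] + E[x_4²]E[x_5] = 0, so the response is uncorrelated with its own driving variable, which is why a linear model cannot capture this relationship.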
The data are shown in Figure 5. As before, the random forest models were constructed with the same hyperparameters that were used in the first case study, except for m_try = 4. The effect of this parameter on the performance of the model is shown in Figure 6. On average, the random forest model could account for approximately 96% of the variance in the response variable, while the linear model was completely unable to capture the relationship.

Σ = [1    0.9  0.9  0.9  0
     0.9  1    0.9  0.9  0
     0.9  0.9  1    0.9  0
     0.9  0.9  0.9  1    0
     0    0    0    0    1]  (2)

The random forest Shapley variable importance indicator and the permutation index indicator were both able to identify the two important variables, x_4 and x_5, although they differed in the ranking of these two variables. Interestingly, as indicated in Figure 7, the Gini variable importance measure could not distinguish the important variable x_5 from the unimportant variables x_1, x_2, and x_3.
As expected, it is clear that a linear model could not give a reliable indication of the importance of the predictors.

Free Swelling Index of Coal
Chelgani et al. [31] studied the free swelling index of coal. In this case study, the same data set is used, but only eight predictor variables (x_1, x_2, ..., x_8) were considered, as indicated in Table 1. The predictor data matrix, X ∈ R^{3691×8}, and the associated response vector, the free swelling index, y ∈ R^{3691×1}, contained 3691 samples, which can be visualized based on a principal component score plot of the variables in Figure 8.

Table 1. Predictor variables in coal data set.
x_1 Moisture | x_2 Vol | x_3 Ash | x_4 H | x_5 C | x_6 N | x_7 O | x_8 S

The 100-tree random forest model, with the following optimal hyperparameters: m_try = 5, n_try = 80% of the data, and minimum leaf size n_leaf = 5, was fitted to the data. This forest could explain approximately 82.5% of the variance of the free swelling index. The correlation coefficient matrix of the predictor data set is given in Table 2. From this table, it can be seen that variables x_1, x_5, and x_7 show significant correlation with the free swelling index (FSI).
With the full set of predictor variables, the random forest could explain approximately 82.7% of the variance of the free swelling index on average on an unseen test data set. As indicated by the Shapley analysis in Figure 9 (top, left), the three most important variables are the oxygen content (x_7), the carbon content (x_5), and the moisture (x_1) in the coal. The latter two variables are practically of equal importance.
Shapley analysis with a linear regression model gave markedly similar results (bottom, right, Figure 9), as did the random forest based on the Gini variable importance measure (bottom, left). However, except for the most important variable, the random forest model using the permutation importance index (top, right) yielded different results, suggesting, for example, that the nitrogen concentration (x_6) is comparatively important, while this did not appear to be the case with the other models. Figure 10 shows the partial dependence plots of the individual variables, as generated by the random forest model. Figure 11 shows the bivariate partial dependence plot of the two most important variables on the free swelling index of the coal, i.e., the oxygen and carbon content of the coal. The observed data are indicated by 'o' markers in this figure.

Consumption of Leaching Reagent in a Gold Circuit
In the final case study, a small data set related to the leaching of gold ore is considered, as described in Aldrich [32]. The consumption of lixiviant (y) as a function of seven predictor variables (x_1, x_2, ..., x_7) was considered. The variables represented the percentage extraction of the gold (x_1), the residual grade of the ore (x_2), the cyanide concentration in the leach tank (x_3), the gold grade of the ore (x_4), the source of the ore (x_5), the agitation rate of the tank (x_6), and the leach temperature (x_7).
The predictor data matrix, X ∈ R^{54×7}, and the associated response vector, the consumption of lixiviant, y ∈ R^{54×1}, contained 54 samples. The correlations of the variables are summarized in Table 3. The 100-tree random forest model, with the following hyperparameters: m_try = 4, n_try = 80% of the data, and minimum leaf size n_leaf = 3, was fitted to the data. Figure 12 shows the decrease in the mean square error of the random forest model as a function of the number of trees in the forest.
Figure 13 shows the results obtained with the different variable importance measures. The Shapley random forest variable importance measure identified variables x_1 and x_5 as significantly more important than the other variables.
Although the percentage extraction of the gold (x_1) was also identified as the most important variable by all the measures, the results differ as far as the other variables are concerned. For example, the linear Shapley regression, permutation, and Gini variable importance indicators all flagged variable x_2 as the second most important variable. Figure 14 shows a bivariate partial dependence plot of the most important variables identified by the VI_shap^RF measure.

Discussion and Conclusions
In this paper, the use of random forest models in a Shapley regression framework was investigated, as a means to determine the importance of the predictors in these models. The results were compared with the widely used Gini and permutation importance measures used with random forests, as well as with multiple linear regression models, the latter also in the same framework.
Overall, the results suggest that the Shapley random forest variable importance measure yielded more reliable results than the other measures, at least with regard to the first two case studies based on simulated data.
The Gini importance measure is known to be biased towards variables with many categories or numeric values and even variables with many missing values [33]. However, in all the case studies considered here, the variables were of the same type, so this bias did not manifest. This may explain why the Gini importance measure arguably performed better than the permutation measure in two of the four case studies and gave comparable, mixed results in the other two.
The random forest and multiple linear regression models used in a Shapley regression framework gave similarly reliable results for the linear system considered, as would be expected. However, the linear model broke down in that it failed to capture the relationships in the nonlinear system of Case study 2, in particular.
Although the Shapley random forest measure arguably gave the most reliable results of the measures considered, these results would need to be validated by further work. Future work will also be extended to the use of other machine learning algorithms, such as multilayer perceptrons, which are also widely used as tools in variable importance analysis.
