Feature Interaction in Terms of Prediction Performance

There has been considerable development in machine learning in recent years with some remarkable successes. Although there are many high-performance methods, the interpretation of learning models remains challenging. Understanding the underlying theory behind the specific prediction of various models is difficult. Various studies have attempted to explain the working principle behind learning models using techniques like feature importance, partial dependency, feature interaction, and the Shapley value. This study introduces a new feature interaction measure. While recent studies have measured feature interaction using partial dependency, this study redefines feature interaction in terms of prediction performance. The proposed measure is easy to interpret, faster than partial dependency-based measures, and useful to explain feature interaction, which affects prediction performance in both regression and classification models.


Introduction
The emerging technology report published by Gartner in 2017 identifies that machine learning is currently located at the peak of inflated expectations [1]. Machine learning techniques have been adopted by an increasing number of people, which has led to success stories in many fields. In the Harvard Business Review, Schwartz states that machine learning is no longer restricted to experts [2]. Although machine learning has seen considerable success in recent years, most experts in this field still consider the prediction models generated by machine learning algorithms to be "black boxes": they do not know how specific prediction results are obtained or the underlying principles behind the functioning of machine learning models. The concepts behind regression and decision tree models are relatively easy to understand, whereas the interpretation of models based on neural networks and support vector machines is more complex. Thus, researchers have concentrated their efforts on explaining the working mechanism of learning models and their prediction results [3].
Feature importance, the partial dependence plot (PDP) [4], feature interaction, and the Shapley value [5] have been proposed to achieve model interpretation. All of the above measures can be calculated and interpreted only after model fitting. Feature impact or feature importance [6][7][8] measures the extent to which a feature or variable influences the prediction results of the given model. A prediction model is built on multiple features, wherein each feature makes a unique contribution toward the prediction. A model can be better understood by knowing the influence of each feature in the model. Various methods have been proposed to measure feature importance. Recently, permutation-based feature importance, which measures the importance of a feature f1 by permuting its feature values, was introduced by Breiman [9] and Fisher et al. [10]. Permutation disconnects f1 from the other features and changes the predictive performance of the given model. The importance of f1 increases with a greater degree of change in the prediction performance.
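The permutation-based importance described above can be sketched compactly. The following is a minimal Python illustration, not the paper's R implementation; the toy data and the `model_predict` stand-in are ours. One column is shuffled, accuracy is remeasured, and the average drop is reported.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the label depends on feature 0 only; feature 1 is noise.
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)

def model_predict(X):
    # A stand-in "fitted model" that thresholds feature 0.
    return (X[:, 0] > 0).astype(int)

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

def permutation_importance(X, y, feature, n_repeats=10):
    """Average drop in accuracy after shuffling one feature's column."""
    baseline = accuracy(y, model_predict(X))
    drops = []
    for _ in range(n_repeats):
        Xp = X.copy()
        Xp[:, feature] = rng.permutation(Xp[:, feature])
        drops.append(baseline - accuracy(y, model_predict(Xp)))
    return float(np.mean(drops))

print(permutation_importance(X, y, 0))  # large drop: feature 0 matters
print(permutation_importance(X, y, 1))  # exactly 0: the model ignores feature 1
```

A feature the model never consults shows a zero drop, while permuting the decisive feature destroys most of the accuracy.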
The partial dependence plot, which was introduced by Friedman, shows the marginal effect of one or two features on the predictive outcome of a machine learning model [4]. The PDP helps in determining the change in the predictive performance in relation to the change in the feature data values (observations). Furthermore, it is possible to obtain the range of data values in a feature that is either useful for prediction or shows low prediction performance (see Figure 1). The light yellow region indicates the data range of two features that produces a high prediction performance, whereas the dark blue region represents a low prediction performance data range.
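A one-feature partial dependence curve can be computed by forcing the feature to each grid value and averaging the model's predictions over the dataset. A minimal sketch with an illustrative toy model (the data and names are ours, not from the cited work):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))

def model_predict(X):
    # Stand-in fitted model with a known functional form.
    return X[:, 0] ** 2 + 0.5 * X[:, 1]

def partial_dependence(X, feature, grid):
    """PD(v) = average prediction with the chosen feature forced to v."""
    pd_values = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v          # fix the feature at grid value v
        pd_values.append(model_predict(Xv).mean())
    return np.array(pd_values)

grid = np.linspace(-1, 1, 5)
pd0 = partial_dependence(X, 0, grid)  # follows v**2 up to a constant shift
```

Because the toy model is quadratic in feature 0, the recovered curve is symmetric around zero with its minimum in the middle of the grid.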
Features in a prediction model tend to collaborate in a prediction task, which indicates the presence of feature interaction. Various methods for measuring feature interaction have been proposed. Initially, statistics-based methods were presented. Hastie and Tibshirani [11] suggested measuring feature interaction based on the analysis of variance (ANOVA) test: after the ANOVA test, the corresponding p-value for each pair of features is computed to measure feature interaction. This approach requires extensive computation time. Loh and Lou et al. [12,13] tested pairwise feature interactions based on the χ² test. Sorokina et al. proposed a grove-based method to detect statistical interactions [11]. Decision-tree-based models, such as "random forest", measure feature interaction using a tree structure: if features f1 and f2 are located on the same path of a decision tree, they can be regarded as interacting [14][15][16].
Recently, partial dependency-based methods have been proposed. The H-statistic of Friedman and Popescu [17] is a typical method. The H-statistic used to measure the interaction between features A and B is defined by

H²(A, B) = Σ_{i=1}^{n} [PD_AB(A_i, B_i) − PD_A(A_i) − PD_B(B_i)]² / Σ_{i=1}^{n} PD_AB(A_i, B_i)²,

where PD() is a partial dependence function. The variable interaction network method, introduced by Hooker, decomposes the prediction function into the main effects and feature interactions [7]. Greenwell et al. introduced partial dependence-based feature importance and interaction measures [8], which facilitate the measurement between two features. This measures the feature interaction between features A and B with the following expressions:

Interact(A, B) = (S_A + S_B)/2,
S_A = sd(i(A | B = B_1), …, i(A | B = B_n)),
S_B = sd(i(B | A = A_1), …, i(B | A = A_n)),

where sd is the standard deviation; i(A) is the feature importance of A; and A_j (j = 1, 2, 3, …, n) is the j-th feature value of A.
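For concreteness, the H-statistic can be sketched for a toy model containing a genuine A·B term. This is our own minimal illustration (centered partial dependence values evaluated at the observed data points), not code from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=300)
B = rng.normal(size=300)
X = np.column_stack([A, B])

def f(X):
    # Stand-in fitted model with a genuine A*B interaction term.
    return X[:, 0] + X[:, 1] + X[:, 0] * X[:, 1]

def pd_ab(a, b):
    Xv = X.copy(); Xv[:, 0] = a; Xv[:, 1] = b
    return f(Xv).mean()

def pd_a(a):
    Xv = X.copy(); Xv[:, 0] = a
    return f(Xv).mean()

def pd_b(b):
    Xv = X.copy(); Xv[:, 1] = b
    return f(Xv).mean()

# Centered partial dependence values at the observed points.
pab = np.array([pd_ab(a, b) for a, b in zip(A, B)]); pab -= pab.mean()
pa = np.array([pd_a(a) for a in A]); pa -= pa.mean()
pb = np.array([pd_b(b) for b in B]); pb -= pb.mean()

# Share of variance not explained by the two main effects.
h2 = ((pab - pa - pb) ** 2).sum() / (pab ** 2).sum()
```

For this toy model the multiplicative term carries roughly a third of the variance of the joint partial dependence, so H² lands well above zero; for a purely additive model it would be near zero.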
Appl. Sci. 2019, 9, x FOR PEER REVIEW  3 of 13

The Shapley value, which was introduced by Shapley, shows the contribution of each feature toward the interpretation of a single prediction result [5]. Each feature in the Shapley value is considered to be a "player" in a game. The Shapley value ensures a fair distribution of the "payout" among all features. Figure 2 shows an example of a Shapley value plot. Molnar has presented an in-depth discussion of model interpretability that covers the various methods discussed above [18], as well as implementing the iml package [19] for R and Python, which contains various functions related to model interpretability.
In this paper, a new measure for feature interaction based on prediction performance is proposed. This study understands feature interaction as a variation of prediction performance. In recent studies, feature interaction has been measured using partial dependency, wherein feature combinations are used to measure the partial dependency variance [7,8,17]. Although the concept of partial dependency is well known, it is difficult to intuitively understand how it impacts prediction performance. This motivates the need for more lucid feature interaction measures that explain prediction performance; prediction performance itself has therefore emerged as a new candidate for feature interaction measurement. While various combinations of feature interactions exist in a model, this study focuses on the interaction between any two features.

Proposed Interaction Measure

The terms "prediction performance" and "feature interaction" are discussed below prior to introducing the proposed method. In this study, two types of prediction tasks, namely, regression and classification, were considered. In the regression task, the target value of a prediction is a continuous value. In the classification task, the target value of a prediction is a categorical value. The classification accuracy used to measure classification performance is defined as follows:

Accuracy = (Number of correctly predicted cases) / (Total number of predicted cases).

The root mean square error (RMSE) used to measure regression performance is defined as follows:

RMSE = sqrt((1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²),

where n is the number of predicted cases, y_i is the observed target value, and ŷ_i is the predicted value. Thus, in the context of this study, the prediction performance refers to the classification accuracy or RMSE with respect to the type of prediction task. However, classification accuracy and RMSE are not the only performance measures; the proposed method is equally compatible with other measures such as the area under the curve (AUC) value.

Feature interaction can be described as a phenomenon in which two or more features meet and influence each other during a prediction task. Owing to the presence of feature interaction, the overall prediction performance of a model is not equal to a simple sum of the performance of each constituent feature. Feature interaction can be measured by determining the variance in the prediction performance. The following notations are defined prior to introducing the formal definition of feature interaction:

- M: a prediction model;
- DS: the dataset used to build M;
- F_U: the set of all features in DS;
- F_i: the i-th feature in F_U;
- PP_M(DS): the prediction performance of model M on dataset DS;
- DS_Perm(F_i): the dataset DS in which the values of feature F_i have been randomly permuted.

Figure 3 illustrates permuting the feature denoted by "weight". Feature permutation involves a random shuffle of the feature values and neutralizes the influence of the permuted feature in a prediction model.
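The two performance measures can be written directly from their definitions; a small self-contained sketch:

```python
import math

def accuracy(y_true, y_pred):
    # Number of correctly predicted cases / total number of predicted cases.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def rmse(y_true, y_pred):
    # Root of the mean squared difference between observed and predicted.
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
print(rmse([3.0, 5.0], [2.0, 6.0]))          # 1.0
```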

Definition 1. The reduced performance error Err obtained by removing feature F_i is defined as

Err(F_i) = PP_M(DS) − PP_M(DS_Perm(F_i)) (classification),
Err(F_i) = PP_M(DS_Perm(F_i)) − PP_M(DS) (regression).

In a classification task, PP_M represents the classification accuracy for the model M. The use of a dataset with a permuted feature F_i (DS_Perm(F_i)) as opposed to the original dataset DS may result in reduced accuracy. Thus, the reduced accuracy can be interpreted as an "error" caused by the permuted feature F_i. A large value of Err(F_i) indicates the importance of F_i in the given prediction task. A negative Err(F_i) value suggests that feature F_i hinders accurate prediction; in such cases, F_i can be removed from the training set DS to improve the performance of the model. In a regression task, PP_M represents the RMSE value for the model M, and the dataset with a permuted feature F_i (DS_Perm(F_i)) may result in a higher RMSE compared to the original dataset DS. This is why the equation defining Err(F_i) differs between the regression and classification tasks.

Previous studies have used the concept of a "permuted feature" to measure the feature impact [6][7][8]. In this study, the permuted feature is used as a measure of performance reduction, implying that feature impact and performance reduction refer to the same concept.

Definition 2. The reduced performance Err obtained by removing the feature set {F_i, F_j} ⊂ F_U is defined as

Err({F_i, F_j}) = PP_M(DS) − PP_M(DS_Perm({F_i, F_j})) (classification),
Err({F_i, F_j}) = PP_M(DS_Perm({F_i, F_j})) − PP_M(DS) (regression).

The reduced performance of two features is defined analogously to that of a single feature: we permute the two features together and consider the modified dataset DS_Perm({F_i, F_j}). Definition 2 can also be extended to Err() over three or more features.
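Definitions 1 and 2 can be sketched together, since Err() differs only in whether a single feature or a feature set is permuted. A minimal Python illustration with a toy classification model (the data and names are ours, not the paper's R implementation; for regression the sign of the difference flips, as the definitions state):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy classification task: the label depends on features 0 and 1 only.
X = rng.normal(size=(400, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

def model_predict(X):
    # Stand-in "fitted model" equal to the true labeling rule.
    return (X[:, 0] + X[:, 1] > 0).astype(int)

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

def err(X, y, features, n_repeats=10):
    """Err from Definitions 1-2: accuracy change after permuting a
    single feature or a whole feature set at once."""
    base = accuracy(y, model_predict(X))
    diffs = []
    for _ in range(n_repeats):
        Xp = X.copy()
        for f in features:
            Xp[:, f] = rng.permutation(Xp[:, f])
        permuted = accuracy(y, model_predict(Xp))
        # Classification: accuracy drops. For regression, RMSE rises,
        # so one would use permuted - base instead.
        diffs.append(base - permuted)
    return float(np.mean(diffs))

print(err(X, y, [2]))     # 0.0: feature 2 is ignored by the model
print(err(X, y, [0, 1]))  # large: both informative features are destroyed
```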

Definition 3. The feature interaction Interact between features F_i and F_j is defined as

Interact(F_i, F_j) = Err({F_i, F_j}) − (Err(F_i) + Err(F_j)).

In Definition 3, feature interaction is defined as the difference in the performance reduction; Err({F_i, F_j}) = Err(F_i) + Err(F_j) holds only when there is no interaction between the features F_i and F_j. In the context of this study, the prediction performance is influenced by the feature interaction: the prediction errors caused by F_i and F_j independently increase or decrease when the two features interact. Three cases of Interact(F_i, F_j) can be distinguished:

- Interact(F_i, F_j) > 0: positive interaction;
- Interact(F_i, F_j) = 0: no interaction;
- Interact(F_i, F_j) < 0: negative interaction.

Negative interaction implies that the connection between F_i and F_j reduces the prediction performance, whereas positive interaction leads to an increase in the prediction performance. In other words, positive or negative interactions decrease or increase the prediction error, respectively.
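Definition 3 can be sketched on a toy XOR task, where two features are informative only jointly. Note the sign convention: because permuting either XOR feature alone already destroys all accuracy, jointly permuting the pair removes less than the sum of the individual removals, so this pair comes out strongly negative, while a noise feature shows no interaction. All names and data here are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy XOR task: features 0 and 1 are informative only jointly;
# feature 2 is pure noise.
X = rng.normal(size=(400, 3))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

def model_predict(X):
    # Stand-in "fitted model" equal to the true XOR rule.
    return ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

def err(features, n_repeats=10):
    base = accuracy(y, model_predict(X))
    diffs = []
    for _ in range(n_repeats):
        Xp = X.copy()
        for f in features:
            Xp[:, f] = rng.permutation(Xp[:, f])
        diffs.append(base - accuracy(y, model_predict(Xp)))
    return float(np.mean(diffs))

def interact(i, j):
    # Definition 3: Interact(Fi, Fj) = Err({Fi, Fj}) - (Err(Fi) + Err(Fj)).
    return err([i, j]) - (err([i]) + err([j]))

print(interact(0, 1))  # far from 0: features 0 and 1 interact
print(interact(0, 2))  # near 0: the noise feature does not interact
```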

Results
The feature interaction function described in the previous section was implemented in the R programming environment (https://www.r-project.org). The CARET package [20] was used to build the prediction models. The feature permutation was repeated 10 times, and the reduced performances from the Err() function were averaged to obtain a stable measurement. Only the interaction between two features was measured for simplicity.
Feature interaction was demonstrated in two sample datasets. The PimaIndiansDiabetes [21] dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases and is used to predict whether or not a patient has diabetes. It has nine features and 768 observations and is used for the classification task. The features in the PimaIndiansDiabetes dataset are described in Table 1.
The BostonHousing [22] dataset has housing data for 506 census tracts of Boston from the 1970 census. It is used for predicting the price of houses (medv in the dataset) and is essentially a regression task. The dataset has 14 features and 506 observations. Features in the BostonHousing dataset are illustrated in Table 2. The C5.0 prediction model was constructed using the PimaIndiansDiabetes dataset to test the classification, and the entire dataset was used for model fitting. The training accuracy was 0.8112. Figure 4 shows the feature interaction for various combinations of any two features in the dataset. In the interaction table, the (i-th, j-th) cell shows the feature interaction between features F_i and F_j. The reduced prediction performance due to feature F_k, as calculated by Err(F_k), is shown in the cells along the (gray) diagonal (k-th, k-th). Figure 4 shows that "glucose" is the most effective feature for prediction, as Err(glucose) = 0.1503 is the largest value of the Err() function. This means that if "glucose" is removed from the prediction model, the model performance (prediction accuracy) decreases by 0.1503. The feature "mass" offers the best interaction with "glucose" (Interact(glucose, mass) = 0.0301), as the prediction error decreases by 0.0301. The features "age", "insulin", and "pressure" have Err() = 0 and Interact() = 0, which indicates that these features are not effective predictors of diabetes. The prediction model, when rebuilt excluding the above three features, resulted in an accuracy of approximately 0.80, which is approximately equal to that obtained with the original full dataset. Figure 5 shows the feature interaction between "glucose" and the other features in the dataset. The feature "mass" is most interactive and "pregnant" is second. "Pressure", "insulin", and "age" have no interaction with "glucose"; these three features do not co-work with "glucose" to increase or decrease the prediction performance.
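The interaction table of Figure 4, with Err on the diagonal and Interact off the diagonal, can be assembled from the Err() function alone. A minimal Python sketch with a hypothetical three-feature model (not the paper's R code):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data: the label depends on features 0 and 1; feature 2 is noise.
X = rng.normal(size=(300, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

def predict(X):
    # Stand-in "fitted model" equal to the true labeling rule.
    return (X[:, 0] + X[:, 1] > 0).astype(int)

def err(features, n_repeats=10):
    base = float(np.mean(y == predict(X)))
    diffs = []
    for _ in range(n_repeats):
        Xp = X.copy()
        for f in features:
            Xp[:, f] = rng.permutation(Xp[:, f])
        diffs.append(base - float(np.mean(y == predict(Xp))))
    return float(np.mean(diffs))

def interaction_table(n_features):
    """Err(Fk) on the (k, k) diagonal, Interact(Fi, Fj) off-diagonal,
    mirroring the layout of the paper's interaction figures."""
    single = [err([k]) for k in range(n_features)]
    T = np.diag(single)
    for i in range(n_features):
        for j in range(i + 1, n_features):
            T[i, j] = T[j, i] = err([i, j]) - (single[i] + single[j])
    return T

T = interaction_table(3)
```

The table is symmetric by construction, and a feature the model ignores gets a zero diagonal entry.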
A linear regression model using the BostonHousing dataset was built to evaluate regression. The complete BostonHousing dataset has an RMSE of 4.6792. Figure 6 shows the interaction table, where it can be observed that "lstat" (percentage of lower status of the population) is the most useful feature for predicting house price (Err(lstat) = 2.423). The features "b", "crim", "dis", "rad", and "rm" are important for prediction, whereas "age" and "indus" are less influential. The regression model also displays negative interaction. Figure 7 shows that the "dis" (weighted distances to five Boston employment centers) and "rad" (index of accessibility to radial highways) features interact negatively with "lstat". This suggests that "dis" and "rad" decrease the prediction performance of "lstat".
The features "rm", "nox", and "tax" have a positive interaction with "lstat". They co-work with "lstat" to improve the prediction performance of the regression model. decreases by 0.1503. The feature "mass" offers the best interaction with "glucose" (Interact(glucose, mass) = 0.0301) as the prediction error decreases by 0.0301. The features "age", "insulin", and "pressure" have an Err() = 0 and Interact() = 0, which indicates that these features are not effective predictors of diabetes. The prediction model when rebuilt excluding the above three features resulted in an accuracy of approximately 0.80, which is approximately equal to that obtained with the original full dataset. Figure 5 shows the feature interaction between "glucose" and other features in the dataset. The feature "mass" is most interactive and "pregnant" is second. "Pressure", "insulin", and "age" have no interaction with "glucose". The three features do not co-work with "glucose" to increase or decrease the prediction performance.  A linear regression model using the BostonHousing dataset was built to evaluate regression. The complete BostonHousing dataset has an RMSE of 4.6792. Figure 6 shows the interaction table where it can be observed that "lstat" (percentage of lower status of the population) is the most useful feature for predicting house price (Err(lstat) = 2.423). The features "b", "crim", "dis", "rad", and "rm" are important for prediction, whereas "age" and "indus" are less influential. The regression model also displays negative interaction. Figure 7 shows that the "dis" (weighted distances to five Boston employment centers) and "rad" (index of accessibility to radial highways) features interact negatively with the "lstat". This suggests that "dis" and "rad" decrease the prediction performance of "lstat". The features "rm", "nox", and "tax" have a positive interaction with "lstat". They co-work with "lstat" to improve the prediction performance of the regression model. 
Apart from the main implementations in this study, namely, the two-way feature interaction plots in Figures 5 and 7 and the feature interaction tables in Figures 4 and 6, several additional functions were also implemented. The following results were obtained from the BostonHousing dataset. Figure 8 shows the feature interaction details between two features, namely, "crim" and "zn". The output indicates that the model built on the BostonHousing dataset is a regression model. The total RMSE of the complete dataset is 4.679191. Excluding the "crim" (F1) and "zn" (F2) features increases the total RMSE by 0.1825532 and 0.256769, respectively. Therefore, the prediction error will increase if "crim" and "zn" are removed from the prediction model. The total RMSE increases by 0.4663873 when both the "crim" and "zn" features are excluded. Consequently, the degree of feature interaction between the "crim" and "zn" features is equal to 0.02706504, which is calculated as 0.4663873 − (0.1825532 + 0.256769).
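The Figure 8 numbers can be checked directly against Definition 3 (the tiny discrepancy from the reported 0.02706504 comes from rounding of the printed inputs):

```python
# Err values reported for the BostonHousing example (Figure 8).
err_crim = 0.1825532   # RMSE increase when "crim" is permuted
err_zn = 0.256769      # RMSE increase when "zn" is permuted
err_pair = 0.4663873   # RMSE increase when both are permuted together

# Definition 3: Interact = Err({crim, zn}) - (Err(crim) + Err(zn)).
interact = err_pair - (err_crim + err_zn)
print(interact)  # approximately 0.0270651
```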
This shows that the features "crim" and "zn" have a positive interaction, meaning that combining "crim" and "zn" helps decrease the prediction error for house price. Figure 9 shows the total degree of interaction corresponding to each feature. The total degree of interaction is the sum of the absolute values of the positive and negative interactions. The plot allows one to determine the features that strongly interact with each other, and it illustrates the degree of positive or negative interaction of each feature. In Figure 9, "degree_of_interaction" means the variation of prediction performance caused by each variable in the model. The features "dis" and "rad" have a strong negative interaction with other features; we may consider removing these two features to decrease the prediction error of the model. The features "age", "chas", and "indus" have little interaction with other features. From Figure 6, we confirm that they have weak prediction power; thus, they are strong candidates for elimination to improve model performance.
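The total degree of interaction in Figure 9 can be computed from an interaction table by summing absolute off-diagonal values per row. A small sketch with a hypothetical table (the numbers are invented for illustration):

```python
import numpy as np

def degree_of_interaction(T):
    """Total degree of interaction per feature: the sum of absolute
    off-diagonal interaction values in that feature's row."""
    T = np.asarray(T, dtype=float)
    off = T - np.diag(np.diag(T))  # drop the Err() diagonal
    return np.abs(off).sum(axis=1)

# Hypothetical 3-feature interaction table (diagonal = Err values).
T = [[0.30, 0.02, -0.05],
     [0.02, 0.10, 0.00],
     [-0.05, 0.00, 0.00]]
print(degree_of_interaction(T))  # approximately [0.07, 0.02, 0.05]
```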
In the case of "dis" and "rad", they have a strong negative interaction with other features. We may consider removing the two features to decrease the prediction error of the model. The features "age", "chas", and "indus" have little interaction with other features. From Figure 6, we confirm that they have weak prediction power; thus, they are strong candidates for elimination to improve model performance. Figure 8. Feature interaction details between the "crim" and "zn" features of the BostonHousing dataset. Figure 9 shows the total degrees of interaction corresponding to each feature. The total degree of interaction is the summation of absolute values of positive and negative interaction. The plot allows one to determine the features that strongly interact with each other. The plot also illustrates the degree of positive or negative interactions of each feature. In Figure 9, "degree_of_interaction" means the variations of prediction performance caused by each variable in a model. In the case of "dis" and "rad", they have a strong negative interaction with other features. We may consider removing the two features to decrease the prediction error of the model. The features "age", "chas", and "indus" have little interaction with other features. From Figure 6, we confirm that they have weak prediction power; thus, they are strong candidates for elimination to improve model performance. Appl. Sci. 2019, 9, x FOR PEER REVIEW 9 of 13

Discussion
Feature interaction can be utilized in different applications because this study defines it differently from prior work. In this section, the proposed method is compared with recent studies by Greenwell [8] and Friedman [17]. Figure 11 shows feature interaction graphs for the variable "tax" and the other variables in the BostonHousing dataset, produced by the proposed method as well as by the methods of Greenwell [8] and Friedman [17]. We can see that the graphs differ according to the definitions of feature interaction.
Greenwell and Friedman each focused on measuring feature interaction using partial dependency and H-statistics; the H-statistics were based on partial dependency. In these studies, feature interaction refers to the share of variances determined by the interaction [18]. In an interaction between features A and B, the partial dependency line in the PDP of feature A is changed/affected by that of feature B. In this study, feature interaction, which was measured using prediction performance, is defined as the variance of the prediction performance caused by the interaction.
Let us suppose variable A interacts with variable B in a specific model. Greenwell's method measures the feature importance (defined as the variance of the partial dependence function) of one feature conditional on different, fixed points of the other feature [18]. If the variance is high, then the features interact with each other; if it is zero, they do not interact. Therefore, the interaction means that variable B influences the importance of variable A or vice versa. In other words, if variable B

Discussion
Feature interaction can be utilized in different applications owing to its alternate definition in this study compared to its prior definitions. In this section, the proposed method is compared with recent studies by Greenwell [8] and Friedman [17]. Figure 11 shows feature interaction graphs for variable "tax" and other variables in the BostonHousing dataset produced by the proposed method as well as those by Greenwell [8] and Friedman [17]. We can see that the graphs differ according to the definitions of feature interaction.
Greenwell and Friedman each focused on measuring feature interaction using partial dependency and H-statistics; the H-statistics were based on partial dependency. In these studies, feature interaction refers to the share of variances determined by the interaction [18]. In an interaction between features A and B, the partial dependency line in the PDP of feature A is changed/affected by that of feature B. In this study, feature interaction, which was measured using prediction performance, is defined as the variance of the prediction performance caused by the interaction.
Let us suppose variable A interacts with variable B in a specific model. In Greenwell's method, it measures the feature importance (defined as the variance of the partial dependence function) of one feature conditional on different, fixed points of the other feature [18]. If the variance is high, then the features interact with each other; if it is zero, they do not interact. Therefore, the interaction means that variable B influences the importance of variable A or vice versa. In other words, if variable B

Discussion
Because this study defines feature interaction differently from prior work, the measure can be utilized in different applications. In this section, the proposed method is compared with the recent studies by Greenwell [8] and Friedman [17]. Figure 11 shows feature interaction graphs for the variable "tax" and the other variables in the BostonHousing dataset, produced by the proposed method as well as by the methods of Greenwell [8] and Friedman [17]. We can see that the graphs differ according to the definitions of feature interaction.
Greenwell and Friedman measured feature interaction using partial dependency and the H-statistic, respectively; the H-statistic is itself based on partial dependency. In these studies, feature interaction refers to the share of variance determined by the interaction [18]: in an interaction between features A and B, the partial dependency line in the PDP of feature A is changed by that of feature B. In this study, feature interaction is measured using prediction performance and is defined as the change in prediction performance caused by the interaction.
Let us suppose variable A interacts with variable B in a specific model. Greenwell's method measures the feature importance (defined as the variance of the partial dependence function) of one feature conditional on different, fixed points of the other feature [18]. If the variance is high, the features interact with each other; if it is zero, they do not. Therefore, interaction means that variable B influences the importance of variable A, or vice versa: if variable B increases the importance of variable A, we can say that B interacts with A. In Friedman's method, interaction means that variable B influences the partial dependency of variable A, or vice versa. The meanings of feature interaction in Greenwell's and Friedman's methods are theoretically clear but difficult to understand intuitively.
They can measure some influence between two features but do not convey its practical meaning. In the proposed method, interaction means that joining variables A and B increases or decreases prediction performance.
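This performance-based notion of interaction can be sketched in code. The snippet below is a minimal illustration of the idea rather than the paper's exact algorithm: the "joining" effect is read as the error reduction achieved by fitting two features together beyond what each achieves alone, using a toy least-squares model with product terms (the synthetic data, the model, and all names are illustrative assumptions).

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
y = a * b + 0.1 * rng.normal(size=n)   # target driven by an a-b interaction

def rmse_with(features):
    """RMSE of a least-squares fit on the given columns, their pairwise
    products, and an intercept (intercept-only when no features given)."""
    cols = list(features)
    cols += [f * g for f, g in combinations(features, 2)]
    cols.append(np.ones(n))
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sqrt(np.mean((y - X @ coef) ** 2)))

e_none, e_a, e_b, e_ab = (rmse_with(f) for f in ([], [a], [b], [a, b]))

# Interaction in the performance-based sense: the error reduction from
# joining a and b, beyond what each feature contributes on its own.
interaction = (e_none - e_ab) - (e_none - e_a) - (e_none - e_b)
print(f"interaction = {interaction:.3f}")
```

Here neither feature helps alone (the target depends only on their product), so the individual error reductions are near zero, while joining them yields a large positive interaction score.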
The partial dependency-based approaches do not support the concept of "negative interaction" because the direction of the interaction is not measured. In the proposed approach, a negative interaction refers to the case wherein merging features reduces prediction performance. In general, adding a new feature increases prediction performance, although in some cases it interferes with the existing predictions and reduces performance. This shows that feature selection should be performed carefully before fitting the prediction model. The proposed approach facilitates the measurement of negative interactions in prediction models.
Greenwell's and Friedman's methods suffer from the time complexity involved in computing feature interactions [18]. The time complexity of computing the interaction between any two features is O(N²) and O(2N²), respectively, where N is the number of data instances (data points). Instance sampling techniques, such as the Monte Carlo method, have been adopted in most implementations to overcome this expensive time complexity; this reduces the computation time but also results in a loss of accuracy. In the proposed approach, a one-time permutation is first performed for the two features, followed by a prediction test over the N instances. This reduces the time complexity of the proposed approach to O(N), achieving higher efficiency than the previous approaches.
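The O(N) cost can be made concrete with a sketch: a single permutation is drawn once, and every quantity needed is one prediction pass over the N instances. The combination rule below (individual permutation damage versus joint permutation damage) is only one plausible reading of a permutation-based interaction score, not the paper's exact formula, and the known ground-truth function stands in for a fitted black-box model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
A = rng.normal(size=n)
B = rng.normal(size=n)
y = A * B                         # target driven purely by an A-B interaction

def model(a, b):                  # stand-in for a fitted black-box model
    return a * b

def rmse(pred):
    return float(np.sqrt(np.mean((y - pred) ** 2)))

perm = rng.permutation(n)         # one-time permutation, reused throughout

e_base = rmse(model(A, B))              # O(N): one prediction pass
e_A    = rmse(model(A[perm], B))        # O(N): break A's link to the target
e_B    = rmse(model(A, B[perm]))        # O(N): break B's link
e_AB   = rmse(model(A[perm], B[perm]))  # O(N): permute both jointly

# If A and B act jointly, permuting them together (which preserves their
# pairing) damages the model less than the two individual permutations
# combined, leaving a positive interaction score.
interaction = (e_A - e_base) + (e_B - e_base) - (e_AB - e_base)
print(interaction)
```

Each of the four error terms is a single pass over the N instances, so the whole measurement stays linear in N, in contrast to the O(N²) grids required by the partial dependency-based methods.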
Friedman's method supports a standardized measure of feature interaction, based on an evaluation value between 0 and 1, for comparing features across prediction models. However, it is difficult for users to interpret the absolute evaluation values, as they only allow relative comparisons. In the proposed approach, the evaluation values of feature interaction are not standardized. In a classification task, an evaluation value of 0.01 indicates that the interaction increases the prediction accuracy by 0.01; in a regression task, it indicates that the interaction decreases the predicted RMSE by 0.01. The proposed evaluation value can also be compared across prediction models, which helps users clearly understand feature interaction effects either within a specific model or across various models. Table 3 summarizes the above discussion: feature interaction can be defined in various ways and has differing characteristics according to the definition.

Conclusions
This paper focused on two-way feature interaction. Although N-way feature interaction can be measured, it is relatively complex to interpret. The term "interaction" involves various aspects, which necessitates further research in feature interaction to identify new techniques for interpreting black-box learning models. The feature interaction measure proposed in this study is simple to understand and quick to calculate. The proposed method measures the influence between features in a prediction model in terms of model performance (accuracy or error), and it can capture negative interactions, whereas previous works captured only positive ones. The proposed method is useful for understanding the role of a feature and the interactions between features in a prediction model. Furthermore, it may be used to improve model performance; for example, if a feature has low importance and a high negative interaction, we may consider removing it from the prediction model. This is a topic for further research.