Interaction Difference Hypothesis Test for Prediction Models

: Machine learning research focuses on the improvement of prediction performance. Progress was made with black-box models that flexibly adapt to the given data. However, due to their increased complexity, black-box models are more difficult to interpret. To address this issue, techniques for interpretable machine learning have been developed, yet there is still a lack of methods to reliably identify interaction effects between predictors under uncertainty. In this work, we present a model-agnostic hypothesis test for the identification of interaction effects in black-box machine learning models. The test statistic is based on the difference between the variance of the estimated prediction function and a version of the estimated prediction function without interaction effects derived via partial dependence functions. The properties of the proposed hypothesis test were explored in simulations of linear and nonlinear models. The proposed hypothesis test can be applied to any black-box prediction model, and the null hypothesis of the test can be flexibly specified according to the research question of interest. Furthermore, the test is computationally fast to apply, as the null distribution does not require the resampling or refitting of black-box prediction models.


Background
In the context of machine learning, one of the main goals is to estimate and tune prediction models in order to optimize predefined performance criteria [1].In the ongoing academic debate, [2] argued that the attribution of causal factors may require a larger sample size than estimating a prediction model.This is in line with [3], who showed that causality can be linked with prediction robustness.The research area of interpretable machine learning (IML) tries to bridge the gap between prediction and classical statistical inference by making complex black-box predictions more understandable [4].A blackbox model is characterized by an input-output relationship between covariates and a response [5].In this approach, the internal structure of the box is not explicitly modeled and is regarded as unknown.The Rashomon effect [6] originates from a Japanese movie from 1950.The main plot is about a crime happening in the 12th century, which is shown from the perspectives of multiple people.Differences in those experiences show that it is hard to uncover later what really happened because, for a given set of facts, there are a multitude of compatible stories.Analogously, in machine learning, there are many different models that explain the observed data equally well.The problem of empirical induction has a long history in the philosophy of science.For example, the skepticism of David Hume, dating back to the 18th century [7], or Duhem's theses stated that the falsifiability of a single hypothesis is inconclusive [8].This work will not address this philosophical problem, but it takes instead a pragmatic approach [9].It is assumed that the primary goal is to optimize prediction performance [10] in a given context.There is some evidence of a trade-off between prediction performance and interpretability in the literature [11][12][13][14].However, prediction can benefit from interpretability as well because a deeper qualitative understanding of why a model produces a given output and not another can help generate more robust out-of-sample predictions.Therefore, it is recommended to use interpretability approaches that do not harm prediction performance but help incorporate human considerations into explainable artificial intelligence [15].
IML allows a researcher to benefit from advances in machine learning research and still explore the properties of the model afterwards to increase the interpretability of the model.Example applications include designing regulatorily compliant, fair [16], transparent, and trustworthy prediction models [17].Another area of IML focuses on the interpretation of the effects of covariates on prediction [18][19][20].Here, the focus is on global model interpretability, which means that the prediction function over the whole covariate distribution is the focus of interest instead of explaining single local predictions for specific covariate values [21].
The following sections, Sections 1.1-1.3,introduce the background knowledge required to understand the new proposed interaction difference hypothesis test for prediction models that is defined in Section 2. Firstly, a measure of how a prediction function changes, on average, for different values of a given set of covariates is introduced in Section 1.1.This measure is an essential component used in the definition of interaction effects.Secondly, Section 1.2 describes a general definition of interaction effects for black-box models, which is based on an additive decomposition of the predictions.The decomposition is illustrated using a linear regression example.Thirdly, Section 1.3 provides an overview of existing approaches to quantify interaction effects.Finally, the last introductory section, Section 1.4, describes the null hypothesis of the interaction test and the disadvantages of the previously described, existing approaches that will be addressed in this work.

Partial Dependence Functions
A global summary of the impact of one covariate on the predictions is the partial dependence (PD) plot [22].Let X P R nˆp be the observed matrix of p covariates with n independent observations of the multivariate random vector x P R p and f pxq be the estimated predictions of a statistical model of the prediction function f pxq on the population level.f pxq does not necessarily equal the covariate-response relationship in the datagenerating process.It is assumed that f pxq was estimated, as well as tuned, prior to IML analysis with respect to prediction performance with the test data.Define S " t1, . . ., pu to be the set of all indices of covariates, and the set s Ă S corresponds to indices of chosen covariates of interest.The term E x Szs px s q is defined as the expectation over the marginal (joint) distribution of all variables not in set s (denoted as x Szs ) for fixed values x s of the variables in set s (for a comparison, see [22] Section 8.1).Note that multiple column indices are denoted using set brackets in the subscript; for example, s " t1, 2u yields x t1,2u , and empty subscripts describe all available indices (for example, the second column with all observations, X ,2 ).The PD function is given via PDpx s q " E x Szs ´f ´xs , x Szs ¯¯and estimated by (1) For example, in the case of p " 5, s " t1, 2, 3u, the function PD ´xt1,2,3u ¯is the expected value of the predictions with respect to the covariate distribution x t4,5u , given the observed covariate values X ,t1,2,3u .If s " H; then, PD `xH ˘corresponds to the expected marginal prediction over all covariates, x t1,2,3,4,5u .Note that, in the case of s " S, the PD function equals the original model predictions, PDpx s q " f pxq, and the function argument x s values do not necessarily need to correspond to training data.

Interactions in Black-Box Models
First, we briefly recap what interaction effects are in a linear model context [23].Consider the simple case of a linear model prediction function with two independent covariates, x 1 , x 2 , main and interaction effects: ( The main effects, β 1 and β 2 , represent how the prediction function changes linearly, given β 1,2 " 0, if the covariate of interest is increased by one unit.In contrast, the interaction effect β 1,2 ‰ 0 of covariates x 1 and x 2 contributes additional flexibility that goes beyond the main effects and the global intercept β 0 .Let the difference term be d linear " f pxq ´x1 If the interaction effect is β 1,2 ‰ 0, then the variance of the difference term Var x pd linear q is greater than zero.Similarly, if β 1 ‰ 0, then Var x px 1 β 1 q ą 0 follows. One advantage of black-box models (for example, neural networks) is their capacity to fit higher-order interaction effects in a data-driven way without the need to explicitly prespecify them.Knowledge of the presence of such interaction effects would increase the scientific understanding of a given phenomenon, and the absence of interaction effects could be used to simplify black-box prediction models with little degradation in performance.In this context, interaction effects can be defined within the functional ANOVA decomposition framework [24].The prediction function f pxq is decomposed into a sum of additive orthogonal terms, f px sq, of sets s.Each term recursively subtracts all respective previously derived lower-order terms within set s.In this work, we use PD functions to define the functional ANOVA terms f px sq.In the simple linear regression example in Equation ( 3), the first ANOVA term would correspond to the expected value of the prediction f `xH ˘" PD `xH ˘( 4) with µ j being the expected value of the covariates or, in the case of µ 1,2 , the expected value of the product of the covariates.By definition, the functional ANOVA main effects, f px 1 q, f px 2 q, consist of the PD functions of x 1 , x 2 , minus the sum of all possible respective lower-order effects.In the case of one covariate, only the empty set needs to be subtracted: In the concrete scenario, the functional ANOVA main effects are given by If β 1,2 " 0, then f px 1 q and f px 2 q correspond analogously to centered main effects in the linear model.In the case of the second-order functional ANOVA term f ´xt1,2u ¯, two first-order terms that are contained in set t1, 2u need to be substracted, as well as the empty set, to ensure the orthogonality of second-and first-order ANOVA terms.The second-order interaction effect in terms of the functional ANOVA is, then, PD ´xt1,2u ¯" f pxq and ( 13) If β 1,2 ‰ 0, then Var x " f ´xt1,2u ¯ı ą 0, similar to the linear model context.If β 1,2 " 0, the functional ANOVA main effects have the property Var x " f ´xtju ¯ı ą 0 for j " t1, 2u, which is also shared within the linear model.Note that, if β 1,2 ‰ 0, then the functional ANOVA main effects include part of the linear model interaction effect in term x j µ t1,2uzj β 1,2 : j P t1, 2u.Therefore, we analogously define Var x " f px s ‹ q ‰ ą 0 : s ‹ Ă S ^|s ‹ | ě 2 as interaction effects of at least order |s ‹ | of the covariates in set s ‹ of black-box models.
One disadvantage of functional ANOVA is that those derived terms are estimators based on data, and this uncertainty has to be taken into account when conducting inference.A distribution of the functional ANOVA terms under the null hypothesis of no interaction is not available.A second disadvantage is that the complexity to compute the decomposition grows exponentially with the number of covariates to 2 p possible elements.Furthermore, this concept works best with independent covariates, which is unrealistic in practice.A generalized functional ANOVA [25] includes covariate dependencies but requires solving a system of equations that is even more computationally demanding than the functional ANOVA decomposition.This limits the practical application to lower-order interaction terms [18].

Interaction Measures Based on PD Functions
Based on the concept of PD functions, [22] derived the H 2 statistic to analyze interaction effects.The H 2 statistic measures the variance in the differences between a prediction function and its restricted form under a given null hypothesis normalized by the variance of the prediction function to detect specific interaction effects.Note that the concrete form of H 2 depends on the null hypothesis.For example, to test whether covariates in set s interact with any other covariates of set S, the statistic H assuming centered PD functions.Equation ( 15) is an extension of Equation (45) in [22] to multiple covariates.It was derived by repeatedly applying Equation (42) in [22] for each element of s.Note that the difference of Equation (15) to Equations ( 43) and ( 46) by [22] is the hypothesis that is being tested.In the latter case, the hypothesis is to test for the presence of the specific three-way interaction between covariates x j , x k , x l that allows any two-way interaction to be present in the prediction model.This work focuses on testing any interaction effects of covariates in the prediction model specified in set s.In Section 1.4, the hypothesis of this work is described in more detail.The statistic (15) was developed in the context of rule ensembles, and the flexible specification of interaction effects can be evaluated.The derived hypothesis test is a parametric bootstrap approach that simulates artificial data sets with a prediction model restricted to the null hypothesis of no interaction effects (Section 8.3 in [22]).Rule ensembles can be restricted to not include interaction effects by limiting the tree depth to one, but it does not work for different types of prediction models.Furthermore, the approach is computationally expensive due to the need to refit prediction models to artificial data sets, and the accuracy of the simulated p-value depends on the number of bootstrap replicates.The computational costs rise further due to the tuning process of hyperparameters, which are usually based on resampling methods like k-fold cross-validation.For an overview of recent developments in the field of hyperparameter optimization, we refer to [26].
Another measure to quantify interactions was developed by [27] that quantifies interactions between two covariates, x j and x k , by estimating the standard deviation of the PD function of the x j conditional on values of x k .This approach is restricted to two-way inter-actions.Generalizing this to scale higher-order interaction effects than two would reduce the number of available samples for estimating the standard deviation, and the number of possible combinations of the conditional covariates would increase exponentially.Note that there are also graphical tools to assess interaction effects, for example, [28,29]; however, these can only be meaningfully applied to illustrate lower-dimensional covariate interaction effects than three, and they do not quantify their method uncertainty analytically.Thus, an uncertainty assessment of these methods requires the usage of computer-intensive resampling methods that are not feasible with a large number of covariates.

Scope of Research
This work explored an interaction hypothesis test in model-agnostic form, meaning that it can be used with any kind of prediction model.It was assumed that the prediction model has enough capacity to potentially estimate interaction effects.In particular, consider the following null hypothesis that there is no interaction effect in the population involving any variable in s: The set s describes the covariates of interest.For example, if s " t1, 3u and S " t1, 2, 3u, then it tests whether there is any interaction involving the first and third covariates.In this special case, the statistical test includes second-and third-order interaction effects.In general, the number of elements, |s|, determines the highest order of interaction effects considered in the hypothesis test.
Generally, one could consider H 2 s ; however, using measure H 2 s as the basis for the interaction test would have some disadvantages in practice:

•
There is no asymptotic null distribution of the hypothesis test of [22] for the presence of interactions available in model-agnostic form.

•
The H 2 s interaction test is based on Monte Carlo simulations to quantify uncertainty [31], which are computationally runtime-intensive.

•
To the best of the authors' knowledge, no systematic power simulation in hypothesis interaction tests based on H 2 s was conducted.This work addresses all of these issues.Furthermore, none of the existing IML approaches provide error-rate control [32], and thus, no severe testing is possible.Ref. [33] developed a statistically sophisticated philosophy of science in which the problem of induction is reduced to the practice of severe testing.To believe in a hypothesis is not only a function of the method or data used but also concerns how well the method was critically tested to rule out potential flaws.This work is a first step towards embedding IML methods into this statistical testing framework.
As an alternative to H 2 s , the interaction difference (IAD) and the corresponding hypothesis interaction test are introduced in Section 2. It is shown how the IAD can be transformed into a test statistic that can be embedded into a two-sided, one-sample Z-test.Then, in Section 3, the asymptotic distribution of the test statistic based on test data is derived.Simulations of the proposed method are given in Section 4, which include the distribution of the proposed test statistic (Section 4.1), type 1 error, and power in linear (Section 4.2) models.The advantage of those simulation scenarios is that interaction effects can be more easily incorporated than in more complex black-box models in the design.Section 4.3 covers simulations of ẑ4 based on a random forest model.This situation is more realistic than the previous sections because, in linear models, one would not need this interaction test in practice.However, it is harder to control interaction effects in nonlinear simulation designs.The data analysis example in Section 5 focuses on a variant of the test statistic that includes covariate information.

Hypothesis Test of Interactions in Prediction Models
The concept of the proposed statistical test is to compare variances in the estimated prediction model f and the estimated prediction model without interactions represented by PD functions.That means both variances are derived from the same data and, hence, dependent.Here, we follow the framework of [34] for robust tests of scale in paired samples.Those tests convert the hypothesis to allow standard asymptotical tests to be used.An advantage of this approach is that these are far more computationally efficient than Monte Carlo permutation tests.This is especially important in high-dimensional prediction tasks to be able to analyze a larger subspace of the exponentially growing number of all possible interaction effects.The key idea is to test whether the interaction difference IAD s " Var x p f pxqq looooomooooon equals zero.IAD s measures the deviation of variability between the original prediction model, f pxq, and the prediction model under the null hypothesis.Following [22], the prediction model f pxq can be decomposed under H 0 into PD ´xSzs ¯`ř jPs PD `xj ˘if the covariates in set s do not contribute to interaction effects.Proof of this statement based on the functional ANOVA framework is given in Supplementary Materials Section S1.The decomposition of the prediction model for the purpose of testing IAD s is given via The term ζpxq includes, for example, additional interaction terms of set s that are not included in IAD PD,s .Under H 0 , it holds that Var x pζpxqq " 0 and analogous terms in the H 1 scenario, Var x pζpxqq ‰ 0. For example, in the context of a linear prediction model, f pxq " 3 , under H 0 with no interaction effect of x 1 (Supplementary Materials Section S2.1), the error term ζpxq consists of a linear combination of coefficients and their respective expectations of the covariate terms.Not all possible specifications of set s are meaningful.For example, using the empty set would give IAD PD,s " Var x pPDpx S qq " Var x p f pxqq, which results in IAD s " 0. This case is excluded.Furthermore, the cases with a number of elements |Szs| " 1 and |Szs| " 0 are equivalent.Consider the specific case S " t1, 2, 3u.Then, IAD PD,s"t1,2u " Var x ´PDpx 3 q `ř2 j"1 PD `xj ˘¯" Var x ´ř3 j"1 PD `xj ˘¯that is equal to IAD PD,s"t1,2,3u because PD `xH ˘does not depend on covariate values and is constant.In this specific case, all combinations of the set s with two covariates are excluded.Instead, the set is described as s " t1, 2, 3u.Consider the following specific example of IAD s : assuming a linear regression model with three independent, multivariate, standard, normal, distributed covariates and all possible interaction effects under restrictions of H 0 , the value of IAD s is zero, regardless of the set s (see Supplementary Materials Section S2 for details).Deviations from zero in IAD s are in favor of the alternative hypothesis H 1 .In the scenarios under the alternative hypothesis H 1 , the test statistic equals the sum of all quadratic interaction coefficients that include the covariates of set s (Supplementary Materials Section S2.5).
To test the condition under H 0 that IAD s " 0, the difference in variances in Equation ( 19) can be rewritten as covariance using Proof of this equivalence is given in Supplementary Materials Section S3 that was based on the idea of [35].The covariance in Equation ( 23) is the expectation of z 3 " pz 1 ´Ex pz 1 qqpz 2 ´Ex pz 2 qq.Let ẑ3,i be the estimated value of z 3 evaluated at the i-th observed value in the data set, and ẑ3 " p ẑ3,1 , ẑ3,2 , . . ., ẑ3,n q.The modified Pitman test [34] then evaluates a null hypothesis Epz 3 q " 0 in the framework of a one-sample, two-sided Z-test, which is equivalent to testing whether the difference of variances in Equation ( 19) is zero.In particular, the test statistic is given via ẑ3,i that estimate the term Small absolute values around zero indicate H 0 , and large absolute values favor H 1 .For testing, the value of ẑ4 is compared to the respective quantiles of a standard normal distribution.
A related but different question than testing interaction effects is how these influence the prediction performance of the prediction model.Here, we introduce a variant of ẑ4 that includes response information.Equation ( 19) is extended to interaction-difference performance (IADP) IADP s " Var x pypxq ´f pxqq loooooooooomoooooooooon The term IADP f ,s is the mean squared error of the prediction model with a quantitative response, ypxq, or the Brier score in the case of a binary response scale.IADP PD,s is the mean squared error (MSE) of the restricted prediction model under a null hypothesis of no interaction effects of covariates in set s.A one-sided test is more appropriate here because the interest is whether the interaction effects of covariates s decrease MSE (alternative hypothesis).The terms z P,1 , z P,2 , z P,3 , z P,4 for the construction of the interaction test are analogously derived to z 1 , z 2 , z 3 , z 4 via a plugin of Equation (25).

Asymptotics of Test Statistics
This section summarizes the asymptotic properties of ẑT 4 evaluated on test data.The PD functions and the prediction model are estimated from the training data.Let f denote the target function and p f the corresponding estimate.Moreover, denote gpxq " PDpx Szs q `ÿ jPS PDpx j q and let p g denote the corresponding estimate.Then, following (Equation (1)) in Hooker (2004) [24], it holds that Var x p f pxqq ě Var x pgpxqq and Var x p f pxqq " Var x pgpxqq if and only if gpxq " f pxq almost everywhere.Hence, testing the equivalence of the variances is, indeed, equivalent to testing f pxq " gpxq almost everywhere.
with 0 ă c ă 8. Define x z 1,i " p f pX i, q `p gpX i, q, and x z 2,i " p f pX i, q ´p gpX i, q.
Proof.(ii) is trivial.For (i), note that It follows from Equation (26) that Moreover, the CLT and Slutzky's lemma yield the result that converges to a normal distribution with variance c σ 2 f .The crucial assumption of the above theorem is that the convergence rate of the variance of the differences between f and ĝ is faster than pn T q ´1, where n T is the size of the test set.For most models, this will be the case when the size of the training set goes to infinity faster than the size of the test set, i.e., n T {n Ñ 0 for n, n T Ñ 8.A similar result can be derived for the test based on ẑP,4 , which measures the differences in MSE performance (for a comparison, see Equation ( 25)).
Theorem 2. Let f : R p Ñ R and g : R p Ñ R be two fixed prediction functions.Moreover, let pX 1, , ypX 1, qq, pX 2, , ypX 2, qq, . . .´Xn T , , ypX n T , q ¯denote i.i.d.samples in R p`1 .Further, assume E x r f pxq 2 s ă 8, Then, where µ di f f " E x rpypxq ´f pxqq 2 s ´Ex rpypxq ´gpxqq 2 s, and σ 2 di f f can be estimated from the given sample.
If we are interested in showing that f has a smaller expected squared prediction error than g, we can consider the testing problem In particular, in the setting of the paper, we set in the above theorem.
Then, the rejection of the null hypothesis provides evidence that the original prediction function, p f , has a smaller prediction error than the "prediction function without" interactions, x PDpX i,Szs q `řjPs x PDpX i,j q.This, in turn, suggests that there is a meaningful modeling of interaction in p f and that there are interactions in the target function f .It has to be noted that testing H 0 : Interaction effects of f do not improve MSE performance is not guaranteed to control the nominal level for the two-sample problem.However, simulations indicate that it will typically do so (and even be rather conservative).

Simulation
This section summarizes simulation results with the proposed interaction test of Section 2. All simulations use independently generated test data sets to evaluate the interaction test with the same sample size and data-generating process as the respective simulated training data sets.The first simulation analyzes the distribution of ẑ4 in the context of linear models while increasing the number of variables (Section 4.1).The second simulation conducts an analysis of type 1 error and power in the context of linear models (Section 4.2).Linear models were used in the first two simulations to demonstrate the empirical behavior in easy-to-understand scenarios where the model allows for the specification of the type of estimated interaction effects.Note that, in practical applications with estimated linear models, there would be no need to conduct the proposed interaction difference test.On the other hand, ẑ4 was developed for model-agnostic prediction models, and as such, it is desirable to check whether ẑ4 is well behaved in these scenarios, too.Then, in the third simulation, nonlinear models were explored based on a real data set (Section 4.3).Last but not least, we investigated the proposed modification ẑP,4 of the interaction test with responses.
The programming language R for the source code of the complete simulation is available as additional online supplementary material to enhance reproducibility (see the reference after Section 6).The interaction test for prediction models was implemented in the R-package IADT 1.2.1, available in the comprehensive R archive network (https://cran.r-project.org(accessed on 26 May 2024)).

Test Statistic Distribution in Linear Models
To investigate the behavior of the test statistic ẑ4 in the context of a linear model, the following data-generating process was specified: The p covariates x P R p " Np0, Σqfollow a multivariate normal distribution with correlations ρ low,j,k " 0.25 over the set tj, k P 1, . . ., p : j ‰ ku, ρ medium,j,k " 0.5 and ρ high,j,k " 0.75 (equi-correlation).The hypothesis is specified with S " t1, . . ., pu, s " t1u and the true linear model with one interaction term is This setting was chosen under the alternative hypothesis with a minimal number of interaction terms such that the test statistic was expected to be closer to zero compared to settings with more interaction terms.This simulation was conducted with a different numbers of covariates, p " t5, 10, . . ., 100u.The sample size was fixed with 1000 for both simulated training and test data sets.In each scenario, the variance σ 2 of the error term ϵ was set to 0.8 based on prior simulations with n " 10 6 .The coefficients of the datagenerating process were set to β " ´β1 , . . ., β p , β 1,2 ¯" p1, . . ., 1q P R p`1 to study power and β " p1, . . ., 0q to investigate the type I error.The linear model was correctly specified to include all covariates of the data-generating process.Each scenario was independently repeated 100 times.All together, 57,600 test statistics were simulated.
Here, the simulation results are shown for the null hypothesis that covariate one does not contribute to interaction effects (s " t1u).Figure 1 shows the difference dp ẑ4 q defined by ẑ4 , minus the normalized rank quantile of the standard normal distribution on the left side.dp ẑ4 q was estimated based on 100 independent replicates of ẑ4 , given the number of covariates and the correlation of each scenario.All boxplots fluctuate around the value of zero across different number of covariates.Furthermore, the boxplots on the left side, dp ẑ4 q, of Figure 1 are comparable to those on the right side, dpΦq, which used a standard, normal, distributed random variable, Φ, instead of ẑ4 .Note that the volatility in boxplots occurs due to the estimation of ranks, and with increasing sample sizes, the differences in dpΦq would converge to zero.The Shapiro-Wilk test [36] is considered the most powerful in detecting non-normality according to [37].If all 288 scenarios were evaluated with the Shapiro-Wilk test and adjusted for multiple comparisons with a false-discovery rate approach [38] of 0.05, then there would be no case that significantly departed from the normality distribution assumption.
The results for the alternative hypothesis specified in Equation ( 27) are shown in Figure 2.There is a decreasing trend to shift the distribution of ẑ4 more towards zero the higher the number of covariates.With low covariate correlation, the lower quartile of the distribution crosses the zero line with about 30 covariates.When the covariate correlation is higher, this happens with about 20 covariates.In such cases, it is expected that power is reduced because the H 1 distribution becomes more similar to the H 0 distribution.After about 30 covariates, the median of ẑ4 does not decrease further.For comparison, the same simulation was conducted using the t-statistic in a linear model of the interaction effect in Figure 3.This figure shows a decreasing trend in the location of the simulated t-value distribution, but the gap of the medians to zero is larger than in Figure 2, and more covariates are needed so that the lower quartile of the simulated distribution crosses the zero line.The model-specific hypothesis test that was explicitly developed for linear models can be expected to be more efficient in terms of power than a model-agnostic hypothesis test if the assumptions are justified.In conclusion, the proposed test statistic is empirically good when approximated with a normal distribution under H 0 , and small effects under H 1 result in similar behavior to t-tests with linear models.

Power Simulation in Linear Models
This section focuses on the power and type I error simulation in linear models.Due to the linear structure of the models, interaction effects can be specified separately from main effects, and thus, simulations under both hypotheses H 0 and H 1 can be more easily specified and verified than in more complex prediction models.Therefore, the setting of linear models is a good starting point to explore the properties of the interaction test based on ẑ4 .Note that, in practice, the proposed interaction test is not needed in linear models because ANOVA methods [23] were developed for the specific case of linear models to test whether the coefficients are zero.
The simulation design of the covariate distribution was the same as in the previous section, Section 4.1, with p " 5, except additionally considering the case of no correlation.The data-generating model consisted of three different scenarios with the error term ϵ " Np0, σ 2 q, and it was allowed to differ from the estimated prediction model specification: x j β j `ϵ (main effects), f pxq " p ÿ j"1 x j β j `p´1 ÿ j"1 ÿ kąj x j x k β j,k `ϵ (main effects, all second order interactions) and (main effects, all second and third order interactions).
The inference is about the population-model interaction effects (unknown in practice), but in this simulation, the interaction effects are known.The alternative hypothesis is true if the corresponding interaction effects are estimated in the prediction model and simulated in the data-generating process.In the case of the misspecification of the linear predictor, the estimated coefficients converge to the true coefficients of the data-generating process.
The error variance σ 2 was optimized on a data set with n " 10 6 prior to the simulation to approximately yield an explained variance of 0.25, 0.5, and 0.75.Sample sizes varied with n " t100, 125, . . ., 300u.Lower and upper sample sizes were chosen to avoid instabilities in the estimated coefficients and reach power levels of 1 in at least one scenario.Three different null hypotheses, s " t1u, t1, 2u, t1, 2, 3u, were investigated.The linear model was specified under H 1 to estimate all possible main and interaction effects up to the third order.In contrast, under H 0 all interaction effects that included covariates of set s were excluded from the data-generating process.Each combination of the scenarios was repeated independently 1000 times.
The rows of plots in Sections 4.2.1 and 4.2.2 correspond to different covariate correlations, 0, 0.25, 0.5, 0.75, and the columns of plots display varying explained variances, 0.25, 0.5, 0.75.The dotted-dashed lines represent the upper and lower bounds of the exact pointwise 0.95 Clopper-Pearson confidence intervals [39] that were calculated for the type I error and power proportions.The next two sections, Sections 4.2.1 and 4.2.2, summarize the type I error and power simulation results consisting of 1.08 ˆ10 6 hypothesis tests.Additional figures are available in Supplementary Materials Section S4.

Type I Error Results
Figure 4 shows the results for the correctly specified linear model under H 0 with s " t1, 2u.The estimated linear model includes the main effects of x p1,2q and additional interaction effects of the covariates x p3,4,5q up to the third order.The type I error was controlled with a significance level of α " 0.05 in all scenarios, and the hypothesis test is robust to covariate correlations, as well as explained variances.

Power Results
Figure 5 shows the power results under the alternative hypothesis based on s " t1, 2u with correctly specified linear models.The hypothesis test reaches power levels around 0.8 in zero-to low-covariate correlation scenarios with at most n " 200.The figure shows that higher covariate correlations reduce the power levels, which are influenced by the instability of the estimated linear models because of multicollinearity in this scenario.Higher explained variances result in slightly higher power.Note that the functional ANOVA decomposition theory [24] does not theoretically work well with strong covariate correlations either because great emphasis is placed on regions with a low probability mass [25].Figure 6 shows the power results under H 1 with s " t1, 2u in the context of a misspecified linear model.The data-generating model consists of all interaction effects up to order two, except those in s " t1, 2u, but in the linear model, the main effects and all possible interaction effects up to order three are estimated.Increasing covariate correlations reduces the power, and higher explained variance scenarios yield a higher power.Additional power scenarios are available in Supplementary Materials Section S4.2.

Power Simulation in Nonlinear Models
In this section, we aim to explore the power of the interaction test in a simulation study based on a data set.As an example data set, the credit approval data from the machine learning repository OpenML-CC18 [40,41] was used.The response variable was binary with categories for good and bad credit risks.The data set contains 1000 independent observations, along with 7 numeric and 13 categorical covariates.A descriptive overview of the data is given in Supplementary Materials Section S5.
The data-generating process of the simulation depends on the data set to be more realistic.Covariates were simulated without (Xind) and with dependencies (Xdep).In the former case, continuous covariates were randomly drawn from the marginal empirical distribution functions of one covariate.Discrete covariates were sampled according to observed relative frequencies.In the design Xdep, a Gaussian copula was used to simulate all continuous covariates together, considering their dependencies.The discrete covariate distribution was estimated using relative frequencies of multivariate contingency tables.
Ensemble methods like random forest were among the top-performing prediction methods with tabular data in a recent comparison to deep learning [42], and results from Kaggle competition challenges show similar trends (for example, [43]).Additionally, random forests are easy to tune, and usually, tuning the number of randomly available covariates at each split (mtry) suffices [44].First, a random forest model was tuned via 10fold cross-validation of the original data regarding out-of-sample, binomial log-likelihood function with the tuning parameter mtry (model RF interact ).Then, the absolute values of the interaction test statistic were evaluated for this model separately with each covariate.The three covariates with the highest values were chosen (age, employment, and existing credits).Among these sets, all possible pairwise sets with other covariates (excluding age, employment, and existing credits) were analyzed to determine the strongest twoway interaction effects in the data.These were "age of person interacts with housing finance", "employment status interacts with housing finance", and "number of existing credits interacts with job qualification".The sets correspond to the covariates s " t1u Ø "age of person" s " t1, 2u Ø "age of person", "employment status" s " t1, 2, 3u Ø "age of person", "employment status", "number of existing credits" To evaluate the power and type I error rates, it is necessary to be able to specify the data-generating process under both the H 0 and H 1 hypotheses.It is known that, if the random forests are restricted to only include tree stumps (only one covariate split), then there are no interaction effects.In this simulation, all data-generation processes were identical to the specification of the estimated random forest models.Under H 0 , all sets, s, were restricted to tree stumps depending on all covariates with the tuned parameter mtry (RF 0 ).For each strong interaction effect, separate random forests (RF age, housing , RF employment, housing , RF credits, job ) were estimated with an unrestricted tree depth but only including the two variables of the previously determined interaction effect with mtry " 2. If there was a strong signal of two interaction covariates in the data and the random forest model had only the option to estimate the response with those covariates, then it was quite likely that the interaction effect would be estimated in the model.Under H 1 with set s " t1u, the predictions of RF 0 and RF age, housing were averaged with the mean.Analogously, in the case of s " t1, 2u, the random forest models RF 0 , RF age, housing and RF employment, housing were averaged, and if s " t1, 2, 3u, then the average predictions of RF 0 , RF age, housing , RF employment, housing and RF credits, job were calculated.After data generation, the estimated random forest models were tuned using simulated test data analogously as model RF interact .All together, there were 120 scenarios (10 sample sizes, two covariate designs, three sets s, and two different hypotheses) that were independently repeated 1000 times.

Type I Error Results
In Figure 7, the estimated type I errors, based on random forests, are shown for independent covariate simulation.The curves fluctuate around the prespecified alpha level of 0.05.In the case of dependent covariates, Figure 8 shows that the type I error is controlled for s " t1u.Larger sets indicate a small positive trend for increasing sample sizes.This could indicate that covariate dependencies have a small influence on the type I error in nonlinear models.This is in contrast to the observed results of Section 4.2.1,where even strong covariate correlations overall did not have much of an effect on the estimated type I errors.In the design Xdep, the strongest correlation in the Gaussian copula between "credit amount" and "credit duration" was 0.6174 in the original data set.All other numeric covariates had less absolute correlation than 0.3.The simulated interaction effect between "employment" and "housing finance", measured using the corrected contingency coefficient [45], was 0.2909.The previous value is above the 0.95 empirical simulated quantile 0.1527 under independence, and thus, this case can be interpreted as low-dependency.Another difference compared to linear models is that random forests do not have continuous predictions, which means that, for certain ranges of the covariates, the prediction function stays constant.

Power Results
Figure 9 shows the estimated power based on random forest models.Power increases with the sample size, and the curve gradients decline.Several hundred observations

Interaction Test Statistic with Response
In this section, we explore the proposed extension in Equation ( 25) to include response information in ẑP,4 as a sensitivity analysis.The simulation design was based on the example given in [27,47].The response function takes the form of H 0 gpxq " 5 sinpπx 1 q `5 sinpπx 2 q `20px 3 ´0.5q 2 `10x 4 `5x 5 `ϵ and under H 1 (28) gpxq " 10 sinpπx 1 x 2 q `20px 3 ´0.5q 2 `10x 4 `5x 5 `ϵ with x P R 10 and ϵ " N `0, σ 2 ˘.Both under H 0 and H 1 , the error variance was set to achieve an explained variance of 95% based on the average of 25 independent simulated data sets of size 10 6 .The sample sizes varied from n " 100, 200, . . ., 1000.For each simulated training data set, a multivariate adaptive regression spline (MARS) was fitted [47] with a maximal degree of two.Type I error results are shown in Figure 11.Overall, the estimated type I error held the specified alpha level 0.05, but it was slightly conservative.In this example, at least 100 observations were sufficient to achieve power levels above 80% (Figure 12).The results demonstrate that the modified test statistic with the response information ẑP,4 is also able to control the type I error, and it achieves reasonable power levels similar to ẑ4 .In the next step, the impact of the previously identified covariate interaction effects can be evaluated.First, covariates with interaction effects s " t1, 5, 6, 9, 10, 11u were tested one-sided with the null hypothesis that the prediction model with possible interaction effects has an equal or higher MSE.Overall, the p-value was 0.0155, and we concluded that the interaction effects of those covariates reduce the MSE.The MSE was reduced by 5.46% relative to the prediction model without interaction effects.The next question is: Which interaction effects associated covariates are responsible for this reduction?It is answered in Figure 14.In this particular case of Boston housing prices, interaction effects with covariates NOX, RAD, TAX, and PTRATIO led to statistically significant MSE improvements in the prediction model.This means that the covariates influence the Boston Housing prices with two-way or higher-order interaction effects, and those identified interaction effects improve the prediction performance.

Discussion
This work introduced a model-agnostic statistical interaction test that a hypothesis set can be flexibly specified.An asymptotic distribution of the test statistic was derived (Section 3).The interaction test neither required the refitting of the prediction model nor the resampling of the original data.The low computational runtime cost of the interaction test allows for the exploration of multiple sets of covariates.Our recommendation is to evaluate the test statistic with test data.The distribution of the test statistic behaved well in linear models even in the case of strong covariate correlations (Section 4.1).Simulations with linear (Section 4.2) and nonlinear models (Section 4.3) show that, overall, the type I error is bounded by the prespecified alpha level in most cases and that the test achieves reasonable power levels for several hundred observations in the simulations.The interaction test can be used for black-box models along with other measures of interpretability to better understand interaction effects.Low deviations of the test statistic from zero may indicate that the prediction model could be approximated well using a simpler model without covariate interaction effects in set s.
In addition to Section 3, the evaluation of ẑ4 under the training data X 1 , X 2 , . . ., X n was discussed.In this case, the observations ẑ3,1 , ẑ3,2 , . . ., ẑ3,n are dependent because each observed value of ẑ3,i includes all training data in the estimation of the PD function in ẑ1,i , ẑ2,i .The prediction model f pxq is not constant and changes if the training sample size increases because it is estimated from the same data.As such, the uniform convergence speed of f pxq and the PD functions x PDpx s q would need to be faster than n ´1{2 , which corresponds to the convergence speed of the mean according to the Berry-Essens theorem (see, for example, [50]).However, especially nonparametric machine learning models usually have a lower convergence speed than n ´1{2 [51], and there is no guarantee that multiplications of ẑ1 , ẑ2 in ẑ3 yield faster convergence rates.Additionally, the CLT would require extensions to work under dependence between observations such as those presented in [52,53].That specific theory would require the supremum of the maximal correlation coefficient (SMCC) [54] for all possible sets of observations ẑ3,i 1 , ẑ3,i 2 with lag L " |i 1 ´i2 | to converge at least linearly to zero as L Ñ 8.This assumption is difficult to investigate with simulations and, to the best of the authors' knowledge, impossible to prove because the number of available observations with a specific lag depends on the sample size, while the supremum of the maximal correlation depends on the number of comparisons.Note that, in the case of iid random variables, higher dimensions of the covariate matrix (more comparisons) affect the distribution of the maximal estimated Pearson correlation (see [55] for asymptotic results).
Whether to use ẑ4 or ẑP,4 with a response should be decided according to the goals of data analysis.The choice may also consider the characteristics of the data-generating process of the application.For example, if the signal-to-noise ratio is low, then ẑ4 would be preferable to ẑP,4 regarding statistical power because, in this case, the usage of the response information would add more noise that would make it harder to differentiate between H 0 and H 1 .In the reverse situation with a high signal-to-noise ratio, the additional information of the response in ẑP,4 could reduce the variability of the terms IADP f ,s and IADP PD,s , and thus, hypotheses H 0 and H 1 could be more easily distinguished compared to the test statistic ẑ4 .Future research may investigate the behavior of both statistics, ẑ4 , ẑ4 P, 4, in other settings that were not considered in this work (for example, other data sets and different black-box prediction models).
From a general perspective, the choice of whether to apply IML to training or test data depends on the goals of statistical analysis [56].If the influence of covariates on the prediction model at the population level is the focus of interest, it does not matter whether training or test data are used, as long as data sets originate from the same data-generating process.The more data are available, the more powerful the proposed interaction test is, provided that all other conditions stay constant.In contrast, if the goal is to analyze the impact of covariates on prediction performance, then it is reasonable to apply IML methods to test data sets.This is in line with [18], who recommends the usage of test data in the case of permutation variable importance.Test data usage in the interaction difference test has better theoretical properties and, thus, is recommended for applications.
An alternative to H 2 s was proposed by [57] that uses accumulated local effect functions instead of PD functions.ALE curves are more computationally efficient and avoid the extrapolation problem to non-observed covariate combinations.However, ALE curves attribute part of the interaction effect to the main effect if there are interactions between correlated features [58].Extrapolations can be investigated graphically via the stratification of PD plots regarding other covariates.Furthermore, PD plots can be enhanced using individual conditional expectation curves [28], which plot each observed predicted value to investigate variability and possible interaction effects.This graphical representation is not available for ALE.Therefore, this paper focused on the analysis of PD functions.

Conclusions
This work has proposed a new model-agnostic hypothesis test to detect interaction effects in prediction models.The null hypothesis states that a given set of covariates does not contribute to any interaction effects.The concept is based on the interaction difference between the variances of the original model predictions and predictions under restricted interaction effects with the null hypothesis.The restricted form of the prediction model is IADP f ,s ´Var x ¨ypxq ´PD ´xSzs ¯´ÿ jPs PD `xj ˘‚ loooooooooooooooooooooooomoooooooooooooooooooooooon IADP PD,s .

Figure 1 .Figure 2 .Figure 3 .
Figure1.The boxplots on the left side show the distribution of d, defined according to ẑ4 , minus the normalized rank transformation of ẑ4 to the standard normal distribution.Instead of ẑ4 , the standard, normal random variable Φ was used on the right side.The graphs represent different correlation scenarios: low, medium, and high.LowMedium High 75, explained variance 0.75

Figure 4 .
Figure 4. Type I error simulations in scenario of correctly specified estimated linear model with null hypothesis s " t1, 2u.Dashed lines correspond to the standard alpha 0.05 threshold, dashed-dotted lines represent pointwise 0.95 Clopper-Pearson confidence intervals and full lines show the estimated Type I error.

Figure 5 .
Figure 5. Power simulations with correctly specified estimated linear model, main effects, and all possible interaction effects up to the third order (s " t1, 2u).Dashed lines correspond to the standard alpha 0.05 threshold, dashed-dotted lines represent pointwise 0.95 Clopper-Pearson confidence intervals and full lines show the estimated Type I error.

Figure 6 .
Figure 6.Power simulations with a misspecified estimated linear model with the main effects and all possible interaction effects up to the third order.The data-generating model consists of all interaction effects up to order two except those of s " t1, 2u.Dashed lines correspond to the standard alpha 0.05 threshold, dashed-dotted lines represent pointwise 0.95 Clopper-Pearson confidence intervals and full lines show the estimated Type I error.

Figure 7 .Figure 8 .
Figure 7.Estimated alpha error of the interaction test based on random forests with independent simulated covariates under different H 0 hypotheses.The dashed line represents the standard 0.05 significance threshold.Overall, the interaction test controls the prespecified alpha error.The dotted-dashed lines represent the upper and lower bounds of the exact pointwise 0.95 Clopper-Pearson confidence intervals.Full lines represent estimated type I error.

Figure 10 .
Figure 10.Estimated power of the interaction test based on random forests with dependent simulated covariates under different null hypotheses, s.The dashed line represents a standard power level of 0.8 assumed in sample size planning.Two hundred and fifty to four hundred observations suffice for acceptable power levels.The dotted-dashed lines represent the upper and lower bounds of the exact pointwise 0.95 Clopper-Pearson confidence intervals.Full lines represent estimated power.

Figure 11 .Figure 12 .
Figure 11.Estimated alpha error of the one-sided interaction test ẑP,4 based on MARS with s " 1.The dashed line represents a standard alpha level of 0.05 assumed in sample size planning.The dotted-dashed lines represent the upper and lower bounds of the exact pointwise 0.95 Clopper-Pearson confidence intervals.

Figure 13 .
Figure 13.Test statistic z 4 values of the gradient-boosting model for each covariate separately.The black bars highlight the passing significance threshold α ď 0.05 of the two-sided test with the null hypothesis that each covariate does not contribute to interaction effects.The dotted lines indicate positive and negative H 0 rejection thresholds.

Figure 14 .
Figure 14.Test statistic z 4 values of the gradient-boosting model for each covariate separately.The black bars highlight the passing significance threshold α ď 0.05 of the one-sided test with the null hypothesis that interaction effects associated with a specific covariate do not contribute to MSE reduction.Dotted lines indicate H 0 rejection thresholds.

Theorem 1 .
Let n T denote the sample size of the test set, and let n denote the sample size of the training set.Assume that σ 2 f " Var x p f pxqq satisfies 0 ă σ 2 f ă 8.Moreover, if Var x p f pxqq " Var x pgpxqq, then assume that, for n T Ñ 8 and some a P p1, 2q, PÝÑ c(26