On the Optimality of the LR Test for Mediation

Abstract: Testing for mediation, or indirect effects, is empirically very important in many disciplines. The problem has two obvious symmetries to which the testing procedure should be invariant. The ordered absolute t-statistics from two ordinary regressions are maximal invariant under the associated groups of transformations. Sobel's (1982) Wald-type test statistic and the LR test statistic are both functions of this maximal invariant and satisfy two logical coherence requirements: (1) size coherence: rejection at level α implies rejection at all higher significance levels; and (2) information coherence: more (less) evidence against the null implies continued (non-)rejection of the null. The LR test statistic is simply the smaller of the two absolute t-statistics, and we show that the LR test is the Uniformly Most Powerful (information and size) Coherent Invariant (UMPCI) test. In short: the LR test for mediation is simple and best.


Introduction
Testing for mediation is empirically extremely important in many scientific disciplines. For example, in psychology: Baron and Kenny [1], which has more than 100,000 citations; in accounting: Coletti et al. [2]; in marketing: MacKenzie et al. [3]; in sociology: Alwin and Hauser [4]; in economics: Huber [5], amongst many others. It has also generated much methodological interest because it is a non-standard problem. Despite its great popularity, Sobel's original test has serious shortcomings, including extremely low rejection probabilities when the indirect effect is small or estimated with large variance. See MacKinnon et al. [6] for an overview of various testing procedures.

The aim of mediation testing is to establish whether the causal mechanism of the effect that an independent variable (X) has on a dependent variable (Y) runs via a mediating variable (M). The most basic version of the model, with all variables in deviation from their means, is:

$$Y = \beta X + \theta_2 M + u, \qquad (1)$$

$$M = \theta_1 X + v, \qquad (2)$$

where the disturbances u and v are assumed to be independent, since Y does not influence M, e.g., because of the experimental set up. The causal variable X affects Y via two different pathways. The first is the direct effect of X on Y and is quantified by $\beta$. The second is the indirect effect of X on Y via the mediating variable M. This mediation effect is quantified by $\theta_1\theta_2$ and only exists if both $\theta_1$ and $\theta_2$ are non-zero.

The null hypothesis of no mediation is commonly expressed as $H_0: \theta_1\theta_2 = 0$, but it is not a standard testing problem, since the null has a singular point at the origin where the two axes ($\theta_1 = 0$ and $\theta_2 = 0$) cross (see Drton [7]), and the distribution of the test statistic is non-standard as a result.
The famous test by Sobel [8] is a Wald-type test that standardizes the estimator $\hat\theta_1\hat\theta_2$. The test can be expressed in terms of the absolute values of the ordinary t-statistics $(|T_1|, |T_2|)$ in (1) and (2). This renders a test that is invariant to the parameter value $\beta$ and to the variances of u and v. This is no coincidence: all invariant testing procedures that exploit all the information in the data and the model must be based on $(|T_1|, |T_2|)$, since it is a maximal invariant, as we will see below.
An obvious alternative to the Wald test is the Likelihood Ratio (LR) test, which provides an optimal test when the null and alternative are both simple. Testing for mediation, however, involves composite null and alternative hypotheses, as well as nuisance parameters. The Neyman-Pearson lemma is not applicable, and no uniformly most powerful test exists.
We show, nevertheless, that a uniformly most powerful test exists within a class of procedures that are information coherent and invariant. Information coherence is the logical requirement introduced in Section 3, that, when information against the null increases, a test should continue to reject. It differs from the common size coherence that requires the same when the size (maximum probability of a Type I error) is increased. Both these coherency requirements are very mild. The LR test is the best coherent invariant test. Establishing this result is the main contribution of this paper.
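A likelihood analysis of the estimators and LR test for mediation requires a distributional assumption. We will assume a normal distribution for each observation $1 \le i \le n$:

$$\begin{pmatrix} u_i \\ v_i \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_{11} & 0 \\ 0 & \sigma_{22} \end{pmatrix} \right).$$

This is convenient, but the analysis is also valid asymptotically without assuming normality of the errors, since the t-statistics used are still asymptotically normal.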
The joint density of (Y, M) given X can be written as:

$$f(y, m \mid x; \lambda) = f(y \mid m, x; \lambda_1)\, f(m \mid x; \lambda_2),$$

with $\lambda_1 = (\beta, \theta_2, \sigma_{11})'$, $\lambda_2 = (\theta_1, \sigma_{22})'$, and $\lambda = (\lambda_1', \lambda_2')'$. The parameters $\lambda_1$ and $\lambda_2$ vary freely as a result of the triangular (recursive) structure of the model. The mediation variable is the endogenous variable in (2), but is exogenous for $\theta_2$ in (1). There is no feedback or causal relation from Y to M. This can easily be extended to include more regressors/covariates. This will affect the degrees of freedom in finite samples, but not the asymptotic distribution of the t-statistics. One could also use instrumental variables, but X and M appear in both equations, and in the standard setup u and v are independent because of the experimental interpretation of M; without this independence, the parameters are not identified.

The log-likelihood given n independent observations is the sum of two Gaussian log-likelihoods corresponding to (1) and (2):

$$\ell(\lambda) = -\frac{n}{2}\log(2\pi\sigma_{11}) - \frac{1}{2\sigma_{11}}\sum_{i=1}^{n}(y_i - \beta x_i - \theta_2 m_i)^2 - \frac{n}{2}\log(2\pi\sigma_{22}) - \frac{1}{2\sigma_{22}}\sum_{i=1}^{n}(m_i - \theta_1 x_i)^2.$$

As a consequence, the Maximum Likelihood Estimators (MLEs) for $\theta_1$ and $\theta_2$ are the usual OLS estimators for the two equations separately. The MLE for the full $\lambda$ is minimal sufficient, and its dimension is equal to the number of parameters. The model is a full exponential model (see Van Garderen [9]), and the MLE is a complete sufficient statistic. Inference on the parameters can, therefore, be based on the MLE without loss of information. Randomization in the test procedure will lead to a loss of information and power. Hence, we will not consider randomized tests or bootstrap procedures. Finally, the information matrices will be block diagonal in $\lambda_1$ and $\lambda_2$ as well as in $(\beta, \theta_2)$, $\sigma_{11}$, $\theta_1$, and $\sigma_{22}$. As a result, the standard t-statistics $T_1$ and $T_2$ for $\theta_1$ and $\theta_2$ are asymptotically independent and normally distributed.

Symmetry and Invariance
It is clear that the problem has a number of symmetries and invariances. The null hypothesis of no mediation, $\theta_1\theta_2 = 0$, remains true or false if we: (i) interchange $\theta_1$ and $\theta_2$; (ii) change the signs of $\theta_1$ or $\theta_2$; (iii) change the values of the nuisance parameters $\sigma_{11}$, $\sigma_{22}$, or $\beta$.
We want a test procedure that respects these symmetry and invariance properties. This may not be straightforward because the distributions of test statistics will generally depend on nuisance parameters, even under the null. Moreover, the t-statistics have different degrees of freedom. We will first establish the maximal invariance of the two t-statistics under a group of location-scale transformations in finite samples. This covers the invariance with respect to nuisance parameters in (iii). Invariant procedures based on the minimal sufficient statistics should depend on the t-statistics only, in finite samples as well as asymptotically. By moving to the asymptotic normal distribution, we can abstract from the difference in the degrees of freedom and consider the permutation invariance in (i) and the reflection invariance in (ii).
The finite-sample distribution involves t-distributions with different degrees of freedom and is further complicated by the fact that the standard deviation of $\hat\theta_2$, $\sigma_{\hat\theta_2}$, depends on M. Even in finite samples, however, we can derive exact results that provide a strong justification for restricting attention to the t-statistics. Proposition 1, derived in Hillier et al. [10], is a case in point and establishes that $T = (T_1, T_2)'$ is a maximal invariant regardless of sample size.

Proposition 1. The testing problem is invariant under the group K of transformations acting on $(\hat\beta, \hat\theta_2, s_{11}, \hat\theta_1, s_{22})$ defined by:

$$(\hat\beta, \hat\theta_2, s_{11}, \hat\theta_1, s_{22}) \to \left(\sqrt{a_1}\,(\hat\beta + a_0),\ \sqrt{a_1/a_2}\,\hat\theta_2,\ a_1 s_{11},\ \sqrt{a_2}\,\hat\theta_1,\ a_2 s_{22}\right),$$

with $a_0 \in R$ and $a_1, a_2 > 0$. The induced group of transformations $\bar K$ acting on the parameter space is defined by:

$$(\beta, \theta_2, \sigma_{11}, \theta_1, \sigma_{22}) \to \left(\sqrt{a_1}\,(\beta + a_0),\ \sqrt{a_1/a_2}\,\theta_2,\ a_1 \sigma_{11},\ \sqrt{a_2}\,\theta_1,\ a_2 \sigma_{22}\right).$$

The vector of t-statistics is a maximal invariant statistic under the group of transformations K:

$$T = (T_1, T_2)' = \left(\hat\theta_1 / s_{\hat\theta_1},\ \hat\theta_2 / s_{\hat\theta_2}\right)',$$

where $s_{\hat\theta_1}$ and $s_{\hat\theta_2}$ denote the OLS standard errors. A parameter-space maximal invariant under the induced group $\bar K$ is:

$$\left(\theta_1/\sqrt{\sigma_{22}},\ \theta_2\sqrt{\sigma_{22}/\sigma_{11}}\right).$$

This proposition justifies restricting attention to the two t-statistics, even in finite samples. The proof is given in Hillier et al. [10] and further establishes that $T_1$ and $T_2$ are independent. The t-statistic $T_2$ has one degree of freedom less than $T_1$, since (1) has one more variable than (2). This difference compromises the permutation symmetry but is practically unimportant unless the number of observations is small, for instance, less than 30. Throughout the rest of the paper, we will, therefore, use the limiting normal distribution for the t-statistics.
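The invariance in Proposition 1 can be checked numerically. The sketch below is only an illustration under stated assumptions: the function name `mediation_t_stats`, the simulated parameter values, and the data-level transformation (rescaling M by $\sqrt{a_2}$ and replacing Y by $\sqrt{a_1}(Y + a_0 X)$, which induces the map on the estimators given above) are not taken from the source. The series are treated as already being in deviation from their means, and the residual variances use n − 1 and n − 2 degrees of freedom.

```python
import math
import random


def mediation_t_stats(x, m, y):
    """Return (T1, T2): OLS t-statistics for theta_1 in (2) and theta_2 in (1).

    Assumes all series are already in deviation from their means,
    so no intercepts are fitted (an assumption of this sketch).
    """
    n = len(x)
    sxx = sum(a * a for a in x)
    smm = sum(a * a for a in m)
    sxm = sum(a * b for a, b in zip(x, m))
    sxy = sum(a * b for a, b in zip(x, y))
    smy = sum(a * b for a, b in zip(m, y))

    # Equation (2): M = theta_1 X + v, one regressor.
    theta1 = sxm / sxx
    rss2 = sum((mi - theta1 * xi) ** 2 for xi, mi in zip(x, m))
    s22 = rss2 / (n - 1)
    t1 = theta1 / math.sqrt(s22 / sxx)

    # Equation (1): Y = beta X + theta_2 M + u, two regressors,
    # solved through the 2x2 normal equations.
    det = sxx * smm - sxm * sxm
    beta = (smm * sxy - sxm * smy) / det
    theta2 = (sxx * smy - sxm * sxy) / det
    rss1 = sum((yi - beta * xi - theta2 * mi) ** 2
               for xi, mi, yi in zip(x, m, y))
    s11 = rss1 / (n - 2)
    t2 = theta2 / math.sqrt(s11 * sxx / det)
    return t1, t2


# Simulated data with hypothetical parameter values.
rng = random.Random(0)
n = 200
x = [rng.gauss(0, 1) for _ in range(n)]
m = [1.0 * xi + rng.gauss(0, 1) for xi in x]
y = [0.5 * xi + 1.0 * mi + rng.gauss(0, 1) for xi, mi in zip(x, m)]
t = mediation_t_stats(x, m, y)

# Apply one element of the group: shift beta by a0, rescale both equations.
a0, a1, a2 = 0.7, 9.0, 4.0
m_new = [math.sqrt(a2) * mi for mi in m]
y_new = [math.sqrt(a1) * (yi + a0 * xi) for xi, yi in zip(x, y)]
t_new = mediation_t_stats(x, m_new, y_new)
print(t, t_new)  # the two pairs should agree up to rounding error
```

The estimates of β, the θ's and the variances all change under the transformation, but the pair of t-statistics does not, which is the content of the proposition.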
Hence, let $\mu = (\mu_1, \mu_2)' = (\theta_1^0/\sigma_{\theta_1}, \theta_2^0/\sigma_{\theta_2})'$, where $\theta_1^0, \theta_2^0$ denote the true parameter values and $\sigma_{\theta_1}, \sigma_{\theta_2}$ the standard deviations of the OLS estimators. Then asymptotically $(T - \mu) \xrightarrow{d} N(0, I_2)$, but we continue as if this is the exact distribution and state this explicitly in:

Assumption 1. $T \sim N(\mu, I_2)$.

Turning to symmetries (i) and (ii), let $G_1$ be the group of permutations and $G_2$ the group of reflections or sign changes. The two groups only have the identity element in common, and the full group $G = G_1 \times G_2$, generated by $G_1$ and $G_2$, has $2! \cdot 2^2 = 8$ elements. The density after a sign change in $T_k$ is obtained by a corresponding sign change in $\mu_k$, and the density for a permutation of T is obtained by permuting $\mu$ accordingly. Hence, for any element $g \in G$, we have $g \cdot T \sim N(g \cdot \mu, I_2)$, so the distribution is invariant; see Lehmann and Romano [11].

It is clear that the absolute order statistic, denoted $(|T|_{(1)}, |T|_{(2)})$ with $0 \le |T|_{(1)} \le |T|_{(2)}$, is invariant because its value does not change by permuting or changing signs in T. It is also maximal invariant because, for two T and $\tilde T$ such that $(|T|_{(1)}, |T|_{(2)}) = (|\tilde T|_{(1)}, |\tilde T|_{(2)})$, this can only occur if the two elements in $\tilde T$ are a permutation of the two elements in T with possible changes in the signs of the elements. There must therefore exist a transformation $g \in G$, composed of a permutation $g_1 \in G_1$ and a reflection $g_2 \in G_2$, such that $\tilde T = g \cdot T$. The same argument holds for the ordered absolute parameter $(|\mu|_{(1)}, |\mu|_{(2)})$, since the group of transformations is the same on the parameter and sample space. We have, therefore, established the following:

Proposition 2. The testing problem $H_0: \mu_1\mu_2 = 0$ is invariant under the group of transformations $G = G_1 \times G_2$ acting on T and $\mu$, given Assumption 1. The absolute order statistic $(|T|_{(1)}, |T|_{(2)})$ with $0 \le |T|_{(1)} \le |T|_{(2)}$ is a maximal invariant statistic, and the absolute order parameter $(|\mu|_{(1)}, |\mu|_{(2)})$ is a maximal invariant parameter. The distribution of $(|T|_{(1)}, |T|_{(2)})$ depends only on $(|\mu|_{(1)}, |\mu|_{(2)})$.
Note that $(|T|_{(1)}, |T|_{(2)})$ is a function of the complete sufficient statistics and, hence, a maximal invariant that exploits all the information in the data. A randomized test or another procedure that is not a function of $(|T|_{(1)}, |T|_{(2)})$ cannot do better in terms of power.
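The action of the group G on a pair of t-statistics can be enumerated directly. The following sketch (function names are illustrative, not from the source) lists the eight images of a point under the permutations and sign changes and confirms that the ordered absolute values are the same for all of them.

```python
from itertools import permutations, product


def group_orbit(t):
    """All images of t = (t1, t2) under the 8 elements of G = G1 x G2."""
    images = []
    for perm in permutations(range(2)):           # G1: 2! permutations
        for signs in product((1, -1), repeat=2):  # G2: 2^2 sign changes
            images.append(tuple(s * t[i] for s, i in zip(signs, perm)))
    return images


def maximal_invariant(t):
    """Ordered absolute values (|T|_(1), |T|_(2)) with |T|_(1) <= |T|_(2)."""
    lo, hi = sorted(abs(v) for v in t)
    return lo, hi


t = (-1.2, 3.4)
orbit = group_orbit(t)
invariants = {maximal_invariant(g_t) for g_t in orbit}
print(len(set(orbit)), invariants)
```

For a point with two distinct non-zero absolute values the orbit has eight distinct members, while the set of maximal invariants collapses to a single pair.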
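For deriving the joint distribution of this maximal invariant $(|T|_{(1)}, |T|_{(2)})$, first note that if a scalar $T \sim N(\mu, 1)$, then $|T|$ has the folded normal distribution (noncentral chi distribution with one degree of freedom) with density:

$$f_{|T|}(t; \mu) = \phi(t - \mu) + \phi(t + \mu), \qquad t \ge 0,$$

with $\phi(\cdot)$ the standard normal density function. Second, $|T_1|$ and $|T_2|$ are independent and, hence, by Equation (6) of Vaughan and Venables [12], we obtain after simplification: The density of the maximal invariant stated in Proposition 2 equals:

$$f(t_{(1)}, t_{(2)}; \mu) = f_{|T|}(t_{(1)}; \mu_1)\, f_{|T|}(t_{(2)}; \mu_2) + f_{|T|}(t_{(1)}; \mu_2)\, f_{|T|}(t_{(2)}; \mu_1), \qquad 0 \le t_{(1)} \le t_{(2)}.$$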
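As a sanity check on the density of the maximal invariant, the sketch below evaluates the standard expression for the joint density of the order statistics of two independent folded normal variables and integrates it numerically over the octant. The function names, the midpoint-rule grid, and the values of µ are choices made for this illustration only.

```python
import math


def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def folded_normal_pdf(t, mu):
    """Density of |T| for T ~ N(mu, 1), evaluated at t >= 0."""
    return phi(t - mu) + phi(t + mu)


def maximal_invariant_pdf(t_lo, t_hi, mu1, mu2):
    """Joint density of (|T|_(1), |T|_(2)) on 0 <= t_lo <= t_hi."""
    if not 0.0 <= t_lo <= t_hi:
        return 0.0
    return (folded_normal_pdf(t_lo, mu1) * folded_normal_pdf(t_hi, mu2)
            + folded_normal_pdf(t_lo, mu2) * folded_normal_pdf(t_hi, mu1))


def integrate_over_octant(mu1, mu2, upper=12.0, step=0.05):
    """Midpoint-rule integral over 0 <= t_lo <= t_hi <= upper."""
    k = int(round(upper / step))
    mids = [(i + 0.5) * step for i in range(k)]
    total = 0.0
    for i, a in enumerate(mids):
        # Grid cells on the diagonal lie half inside the octant.
        total += 0.5 * maximal_invariant_pdf(a, a, mu1, mu2)
        for b in mids[i + 1:]:
            total += maximal_invariant_pdf(a, b, mu1, mu2)
    return total * step * step


mass = integrate_over_octant(0.5, 2.0)
print(round(mass, 4))  # should be close to 1
```

Swapping `mu1` and `mu2`, or changing their signs, leaves the density unchanged, in line with the statement that the distribution depends only on the ordered absolute parameter.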

Test Statistics, Critical Regions, and Coherence
The previous section established that invariant test procedures should be based on the ordered absolute t-statistic. The classic test by Sobel [8] and the LR test are both functions of $(|T|_{(1)}, |T|_{(2)})$. The Sobel test statistic is the square root of the Wald statistic W based on $\hat\theta_1\hat\theta_2$ and its standard error derived by the delta method:

$$\sqrt{W} = \frac{|\hat\theta_1\hat\theta_2|}{\sqrt{\hat\theta_1^2 s_{\hat\theta_2}^2 + \hat\theta_2^2 s_{\hat\theta_1}^2}} = \frac{|T_1 T_2|}{\sqrt{T_1^2 + T_2^2}} = \frac{|T|_{(1)}\,|T|_{(2)}}{\sqrt{|T|_{(1)}^2 + |T|_{(2)}^2}}.$$

The LR test statistic is easily obtained by maximizing the log-likelihood with and without the restriction that either $\theta_1$ or $\theta_2$ is zero and is:

$$LR = \min\{|T_1|, |T_2|\} = |T|_{(1)}.$$

An equivalent LR test can also be based on the minimum of two F-statistics $F_1 = T_1^2$ and $F_2 = T_2^2$. In both cases, we reject when the test statistic is larger than the two-sided critical value from a standard normal distribution. The rejection probability for both tests is monotonically increasing in $|\mu|_{(2)}$ under the null with $|\mu|_{(1)} = 0$. The size, i.e., the maximum rejection probability under $H_0$, is attained when $|\mu|_{(2)} \to \infty$. The largest absolute t-value will diverge in that case, and the Sobel and LR tests simply reject when the smallest absolute t-value is larger than the usual critical value, 1.96 for a 5% level test.
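Both statistics are simple functions of the two t-statistics. The sketch below assumes the standard delta-method form of the Sobel statistic, $|T_1 T_2| / \sqrt{T_1^2 + T_2^2}$, and the LR statistic $\min\{|T_1|, |T_2|\}$; the function names and the example values are illustrative and not taken from the source.

```python
import math
from statistics import NormalDist


def sobel_stat(t1, t2):
    """Sobel statistic written in terms of the two t-statistics."""
    denom = math.hypot(t1, t2)
    return abs(t1 * t2) / denom if denom > 0 else 0.0


def lr_stat(t1, t2):
    """LR statistic: the smaller of the two absolute t-statistics."""
    return min(abs(t1), abs(t2))


def critical_value(alpha):
    """Two-sided standard normal critical value z_{alpha/2}."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


t1, t2 = 2.5, -2.5
z = critical_value(0.05)
print(lr_stat(t1, t2) > z, sobel_stat(t1, t2) > z)
```

In this example both t-statistics are individually significant at the 5% level, so the LR test rejects, whereas the Sobel statistic equals 2.5/√2 and falls below the critical value. Since the Sobel statistic never exceeds the LR statistic, the Sobel critical region is contained in the LR critical region when both use the same critical value.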
Rather than the one-dimensional test statistics, we can define tests in terms of their critical region (CR) in the sample space of dimension two. The sample space for $(|T_1|, |T_2|)$ is the quadrant $R^2_+$, and the sample space for the maximal invariant $(|T|_{(1)}, |T|_{(2)})$ is an octant in $R^2$:

$$\{ T \in R^2_+ : 0 \le |T|_{(1)} \le |T|_{(2)} \}, \qquad \text{with } T = (|T|_{(1)}, |T|_{(2)}).$$

The LR critical region is:

$$CR_{LR}(\alpha) = \{ T : \min\{|T_1|, |T_2|\} \ge z_{\alpha/2} \} = \{ T : |T|_{(1)} \ge z_{\alpha/2} \},$$

with $z_{\alpha/2}$ the (two-sided) critical value of a standard normal variate, i.e., $P[|Z| > z_{\alpha/2}] = \alpha$.
The boundary of the critical regions for the LR and Sobel test in the sample space for (|T 1 |, |T 2 |) are given in Figure 1. Both the LR and Wald CR have two desirable properties. First, if they reject at level α, then they also reject at any other level that is larger than α. Second, if evidence against the null is accumulating by increasing one or both of the t-statistics, the tests will continue to reject, and, if evidence is decreasing, they will continue to accept (not reject) the null hypothesis. The first property we call size or α coherence, with α added to the usual term to distinguish it from the second property we call information coherence. We formalize these concepts next. This allows us to prove the main theorem that the LR is the most powerful CR that respects both coherence properties.

Information and Size Coherence
In general, it is desirable for testing procedures to continue to reject if the evidence against the null hypothesis is increased and to continue to accept (i.e., not reject) when evidence against the null is decreasing. Thus, if a t-statistic for a single parameter in a one-sided alternative is increasing and we reject for t = 2, then we should also want to reject for t > 2, because this represents a value that is even less likely, more extreme, under the null and is commonly interpreted as more evidence against the null. In multivariate settings, this is less trivial because no uniformly most powerful test exists, and the separate test statistics might be correlated. Nevertheless, in the mediation case with $H_0: \theta_1 = 0 \vee \theta_2 = 0$, the two t-tests are independent, and it seems reasonable to require that one continues to reject when $t_1$ and/or $t_2$ is increasing in absolute value, since the information against the null is strengthening, and continues to accept when $t_1$ and/or $t_2$ is decreasing towards 0, since the information against the null is weakened. We define a class of size α critical regions that formalizes this requirement and show that the LR test is optimal in this class of tests that respects information coherence. Consider the CR and acceptance region (AR) for $(|T_1|, |T_2|) \in R^2_+$.

Definition 1. Information Coherence. $C_\alpha$ is the class of all information coherent critical regions of size α defined by the property that, with $\delta_1, \delta_2 \in R_+$: (i) for any $CR \in C_\alpha$, $(|T_1|, |T_2|) \in CR \Rightarrow (|T_1| + \delta_1, |T_2| + \delta_2) \in CR$, or, equivalently, (ii) for any $AR \in C_\alpha$, $(|T_1|, |T_2|) \in AR \Rightarrow (|T_1| - \delta_1, |T_2| - \delta_2) \in AR$.
Traditional (α) coherence is a property of a family of CRs when the size of the test varies, but information coherence considers a fixed α and varying values of the test statistics.
A smaller significance level requires more extreme observations and, hence, a smaller CR. Note that the definition of α coherence does not require the definition of the statistics involved, but information coherence uses explicit statistics.
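Both coherence notions can be checked numerically on a grid for a given rejection rule. The sketch below is a finite-grid illustration, not a proof: the grid, the increments, and the deliberately incoherent `band_rejects` rule are hypothetical constructions for this example, and the Sobel rule assumes the statistic $|T_1||T_2| / \sqrt{T_1^2 + T_2^2}$.

```python
import math
from statistics import NormalDist


def crit(alpha):
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def lr_rejects(a, b, alpha):
    """LR rule on absolute t-values a, b >= 0."""
    return min(a, b) >= crit(alpha)


def sobel_rejects(a, b, alpha):
    """Sobel rule on absolute t-values a, b >= 0."""
    denom = math.hypot(a, b)
    return denom > 0 and a * b / denom >= crit(alpha)


def band_rejects(a, b, alpha):
    """Hypothetical incoherent rule: stops rejecting for large values."""
    z = crit(alpha)
    return z <= min(a, b) <= z + 1.0


def grid(upper=6.0, step=0.25):
    return [i * step for i in range(int(round(upper / step)) + 1)]


def is_information_coherent(rejects, alpha, deltas=(0.0, 0.25, 1.0)):
    """Rejection at (a, b) must imply rejection at (a + d1, b + d2)."""
    for a in grid():
        for b in grid():
            if rejects(a, b, alpha):
                for d1 in deltas:
                    for d2 in deltas:
                        if not rejects(a + d1, b + d2, alpha):
                            return False
    return True


def is_size_coherent(rejects, alpha_small, alpha_large):
    """Rejection at the smaller level must imply rejection at the larger."""
    return all(rejects(a, b, alpha_large)
               for a in grid() for b in grid()
               if rejects(a, b, alpha_small))


print(is_information_coherent(lr_rejects, 0.05),
      is_information_coherent(sobel_rejects, 0.05),
      is_information_coherent(band_rejects, 0.05))
```

The LR and Sobel rules pass the grid check, while the band rule fails it because a point such as (2, 2) is rejected but (3, 3) is not, even though the latter carries more evidence against the null.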
We show in Appendix A that a family of information coherent CRs has the following properties, with $\partial CR$ the boundary of CR in $R^2_+$ that separates the CR from the AR, and $cv_1(t)$ and $cv_2(t)$ the critical values for $|T_1|$ and $|T_2|$, respectively, either simultaneously or conditional on the other:

Proposition 3. Any $CR \in C_\alpha$ and its boundary $\partial CR$ have the following topological and statistical properties:
(i) CR is simply connected.
(ii) $\partial CR$ is a continuous plane curve.
(iii) $\partial CR$ is monotonically weakly decreasing.
(iv) $\partial CR$ can be parametrically represented globally as $\partial CR(\tau) = (cv_1(\tau), cv_2(\tau))$ using a one-dimensional $\tau \in R$, and locally as $\partial CR(t_1) = (t_1, cv_2(t_1))$ if not vertical, and/or $\partial CR(t_2) = (t_2, cv_1(t_2))$ if not horizontal.
(v) The class $C_\alpha$ contains critical regions that may not be convex, but no critical regions that are strictly concave.
(vi) A test with $CR \in C_\alpha$ can only be size correct and admissible if $\lim_{t_1 \to \infty} cv_2(t_1) = z_{\alpha/2}$.

Proposition 3 is instrumental in proving the optimality of the LR test and showing that the Sobel and LR tests are information and size coherent.

Proposition 4. The Sobel (Wald) test and the LR test of size α both respect information coherence, $CR_{\sqrt{W}}(\alpha), CR_{LR}(\alpha) \in C_\alpha$, as well as size coherence: $CR_{\sqrt{W}}(\alpha_1) \subseteq CR_{\sqrt{W}}(\alpha_2)$ and $CR_{LR}(\alpha_1) \subseteq CR_{LR}(\alpha_2)$ for $\alpha_1 \le \alpha_2$.

This result only states that both tests share two desirable properties; it does not imply that both tests are equally good. The power of the LR test is much better than that of Sobel's test, which suffers from extremely low power when the mediation effect is small or is inaccurately estimated. In fact, the LR test is better than any other coherent test in $C_\alpha$, which we are now able to state and prove in the main theorem of the paper.

Theorem 1. The LR test of size α is the uniformly most powerful test in $C_\alpha$.
Proof. The critical region $CR_{LR}(\alpha) = \{\min\{|T_1|, |T_2|\} \ge z_{\alpha/2}\}$ sets the critical value $cv_1(t_2) = z_{\alpha/2}$ for all $t_2 \ge t_1 \ge 0$, which is the smallest value that $cv_1$ can take for all values of $t_2$ while still being size-correct according to (iii) and (iv) of Proposition 3. Analogously, $cv_2(t_1) = z_{\alpha/2}$ for all $t_1 \ge t_2$. So $CR_{LR}(\alpha)$ is the closure of $C_\alpha$ and any member $CR(\alpha) \subseteq CR_{LR}(\alpha)$. Hence, $P[CR_{LR}(\alpha) \mid H_a]$ is larger than for any other $CR(\alpha) \in C_\alpha$ that is not equal to $CR_{LR}(\alpha)$ a.s. This holds uniformly for all values of $\mu$ under $H_a$.
This optimality property of the LR test is derived under coherence requirements that are very weak: it seems more than reasonable to require that any test continues to reject if more extreme outcomes are observed or if the level of the test is increased.
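The power ordering between the LR and Sobel tests can be illustrated by simulation under the normal approximation $T \sim N(\mu, I_2)$. This is a Monte Carlo sketch with an arbitrary seed and arbitrary values of µ chosen for illustration; the Sobel rule again assumes the statistic $|T_1||T_2| / \sqrt{T_1^2 + T_2^2}$.

```python
import math
import random
from statistics import NormalDist


def rejection_rates(mu1, mu2, alpha=0.05, reps=20000, seed=0):
    """Simulated rejection frequencies (LR, Sobel) for T ~ N(mu, I_2)."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    lr_hits = 0
    sobel_hits = 0
    for _ in range(reps):
        a = abs(rng.gauss(mu1, 1.0))
        b = abs(rng.gauss(mu2, 1.0))
        denom = math.hypot(a, b)
        lr_hits += min(a, b) >= z
        sobel_hits += denom > 0 and a * b / denom >= z
    return lr_hits / reps, sobel_hits / reps


lr_alt, sobel_alt = rejection_rates(3.0, 3.0)    # a point under H_a
lr_null, sobel_null = rejection_rates(0.0, 3.0)  # a point under H_0
print(lr_alt, sobel_alt, lr_null, sobel_null)
```

Because every draw rejected by the Sobel rule is also rejected by the LR rule, the simulated LR frequency can never fall below the Sobel frequency, whatever the seed.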

Discussion and Conclusions
This paper has exploited the symmetry present in the mediation testing hypothesis $H_0: \theta_1\theta_2 = 0$ and used invariance arguments to reduce the sample space to an eighth of $R^2$. We have developed a coherence framework to formulate and analyze the requirement that increasing or decreasing information against the null leads to coherent decisions. We call tests or CRs with this property information coherent, which is distinct from the more standard (size) coherent property that tests may possess. The Sobel test is both information and size coherent, but has very poor null rejection and power properties.
The LR test is much better than the Sobel test, and this paper shows the LR test to be the best possible of all tests that satisfy the basic coherence requirements. The optimality lends support to Perlman and Wu [13] on their preference for the LR test.
Nevertheless, the LR test has some serious shortcomings, like the Sobel test, when detecting deviations from $H_0$. In particular, when both $\theta_1$ and $\theta_2$ are close to 0, or are estimated inaccurately, it is extremely conservative, and the power deteriorates and goes to $\alpha^2$ when $\mu \to (0, 0)$.
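The limiting values quoted above follow from the independence of the two t-statistics under the normal approximation: the LR rejection probability is the product of two tail probabilities, $P(|T_1| \ge z_{\alpha/2})\,P(|T_2| \ge z_{\alpha/2})$. A minimal sketch, assuming $T \sim N(\mu, I_2)$ exactly (the function name is illustrative):

```python
from statistics import NormalDist


def lr_rejection_probability(mu1, mu2, alpha=0.05):
    """P(min(|T1|, |T2|) >= z_{alpha/2}) for independent T_k ~ N(mu_k, 1)."""
    nd = NormalDist()
    z = nd.inv_cdf(1.0 - alpha / 2.0)

    def tail(mu):
        # P(|T| >= z) = P(T >= z) + P(T <= -z) for T ~ N(mu, 1).
        return nd.cdf(mu - z) + nd.cdf(-mu - z)

    return tail(mu1) * tail(mu2)


p_origin = lr_rejection_probability(0.0, 0.0)  # equals alpha squared
p_far = lr_rejection_probability(0.0, 10.0)    # approaches alpha
print(p_origin, p_far)
```

At the origin the rejection probability is α² (0.0025 for a 5% test), while along the null with one non-centrality parameter large it approaches the nominal level α, which is where the size is attained.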
The bootstrap does not provide an answer because of the (strong) dependency on the nuisance parameter. Of course, it may provide small-sample corrections and avoid asymptotic approximations and strong distributional assumptions. However, bootstrap versions of the LR and Sobel test still lack power and are neither size nor information coherent.
Van Garderen and Van Giersbergen [14] show there is an opportunity to gain power without violating the size condition by adding a small area near the diagonal, where the opportunity to detect deviations from $H_0$ is best. The resulting test is uniformly more powerful than the LR test, but its critical region is not in $C_\alpha$, since the boundary $\partial CR$ is monotonically increasing and hence violates property (iii) of Proposition 3. Therefore, the LR test remains the optimal coherent choice for mediation testing. It is simple and best.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Proofs
Throughout this Appendix we use $|t| = (|t_1|, |t_2|)$ for a point in $R^2_+$. In order to prove the properties of $CR \in C_\alpha$ and its boundary, we first note from the definition of $C_\alpha$ that if $|t| \in CR \subseteq R^2_+$, then CR also contains the top right quarter defined by the intersection of two half-planes:

$$Q(|t|) = \{ s \in R^2_+ : s_1 \ge |t_1| \} \cap \{ s \in R^2_+ : s_2 \ge |t_2| \},$$

i.e., a quarter with $|t|$ as its bottom left corner. The interior of $Q(|t|)$ consists of the points with $s_1 > |t_1|$ and $s_2 > |t_2|$.

Proposition A1. In Definition 1, the conditions on CR and AR, i.e., (i) and (ii), are equivalent.

Property (i).
Any CR ∈ C α is simply connected.

Properties (ii) and (iii).
The boundary $\partial CR$ of any $CR \in C_\alpha$ is a continuous plane curve that is monotonically (weakly) decreasing in the sense that, if $t, \tilde t \in \partial CR$, then $\tilde t_1 \ge t_1 \Leftrightarrow \tilde t_2 \le t_2$, but may be non-differentiable at infinitely many points.
Proof. If $|t| \in \partial CR$ of any $CR \in C_\alpha$, then any $|\tilde t|$ with $|\tilde t_1| \ge |t_1|$ and $|\tilde t_2| = |t_2|$ is either on the boundary $\partial CR$ or in the interior of CR. This implies that the boundary for this value of $|\tilde t_1|$ is equal to or below $|t_2|$ and, therefore, weakly decreasing. That the boundary is a plane curve and continuous follows from Lemma A1 below and the construction of the paths in the proof of the simple connectedness. The curve can be a step function with an infinite number of infinitesimal steps.
Property (iv). The boundary $\partial CR$ can be parametrically represented globally using a one-dimensional parameter $\tau$, and locally using $|t_1|$ if not vertical and/or $|t_2|$ if not horizontal.
Proof. By properties (i)-(iii) and their proofs, it follows immediately that we can parameterize the boundary as a continuous function of a one-dimensional parameter $\tau$. If the boundary is not vertical, we can locally use $|t_1|$ as parameter and, if the boundary is not horizontal, we can locally use $|t_2|$ as parameter. Where it is vertical, the boundary associates a given $|t_1|$ with a set of points for $|t_2|$, and, where it is horizontal, it associates a set of points $|t_1|$ with a given $|t_2|$.
Property (v). The class C α contains critical regions that may not be convex, but no critical regions that are strictly concave.
Proof. C α contains CRs with step function boundaries. They are not convex. Nonconcavity is obvious since ∂CR is monotonically (weakly) decreasing.
Lemma A1. Given four corner points of any regular square in R 2 + , for any real ε > 0, at least one corner is not on the boundary.