Indicators of Evidence for Bioequivalence

Some equivalence tests are based on two one-sided tests, where in many applications the test statistics are approximately normal. We define and find evidence for equivalence in Z-tests, then in one- and two-sample binomial tests, as well as in t-tests. Multivariate equivalence tests are typically based on statistics with non-central chi-squared or non-central F distributions in which the non-centrality parameter λ is a measure of heterogeneity of several groups. Classical tests of the null λ ≥ λ0 versus the equivalence alternative λ < λ0 are available, but simple formulae for power functions are not. In these tests, the equivalence limit λ0 is typically chosen by context. We provide extensions of classical variance stabilizing transformations for the non-central chi-squared and F distributions that are easy to implement and which lead to indicators of evidence for equivalence. Approximate power functions are also obtained via simple expressions for the expected evidence in these equivalence tests.


Introduction
Our purpose is to extend the concept of "evidence for the alternative hypothesis", already available in classical one-sided testing, to contexts where that alternative is "equivalence" of two or more distributions. We abbreviate the term "bioequivalence" to "equivalence" for simplicity and because these results are of much more general applicability.

Background and Summary
Why should we introduce another approach to equivalence testing? Because, even though some equivalence tests [1] are well established and embraced by the USA Food and Drug Administration (FDA) and the European Medicines Agency (EMA), there are substantial critiques [2][3][4], as well as novel, competing approaches to multivariate equivalence testing [4][5][6][7][8].
We endorse the proposal by [4] to define a hierarchy of bioequivalence models, "average bioequivalence" within "population bioequivalence" within "individual bioequivalence", in terms of the Kullback-Leibler symmetrized distance (KLD) between distributions arising in equivalence tests. Somewhat surprisingly, we advocate estimating these distances indirectly using variance stabilized test statistics (VSTs) rather than plug-in parameter estimates for parameters appearing in the KLD. The close tie between the mean of a VST and the KLD for non-central chi-squared and F distributions will be illustrated in Appendix A.
In the remainder of this introduction we describe the notion of "evidence for the alternative" in the context of Z-tests and show how it is connected to level and power of such tests; these ideas were first introduced in [9]. After these preliminaries, we extend the notion of evidence for equivalence to two one-sided Z-tests (TOSTs), and show how it is related to a one-sided test based on a non-central chi-squared statistic with one degree of freedom.
This notion of evidence is applicable to many situations because VSTs can carry many test statistics into normally distributed statistics Z. For exponential families, Reference [10] shows that the expected evidence of the variance stabilized statistic T is approximately equal to the signed square root of the Kullback-Leibler symmetrized divergence. Examples not from exponential families are in [11][12][13]. These results give support to calling T the "evidence for the alternative hypothesis".
Throughout the paper we employ standard notation and properties of the non-central t, χ² and F distributions, which are introduced and studied in depth in [14,15]. Below we first discuss specific examples of two one-sided tests (TOST) in Section 2, namely for binomial tests, two-by-two tables and t-tests; all methods are illustrated using data examples from the literature. In the multivariate setting, chi-squared tests for equivalence are given in Section 3 and F-tests for equivalence of K normal populations in Section 4. New methods for choosing λ0 and for variance stabilizing the test statistics are provided. Discussion follows in Section 5, and R scripts for implementing some procedures are found in Appendix B.

Properties of Evidence in One-Sided Z-Tests
Univariate equivalence tests are often based on statistics having the non-central t distribution or normal approximations to the binomial distribution, so we begin our introduction to evidence contained in such tests with the approximating, and simpler, Z-tests.
In the prototypical model X ∼ N(µ, 1), the mean µ is unknown and one tests the null hypothesis µ ≤ µ0 against the alternative µ > µ0 by rejecting the null if the p-value is sufficiently small. For an observed X = x, the p-value is Φ(µ0 − x), where Φ is the standard normal distribution function. The "evidence for the alternative" µ > µ0 is defined to be T ≡ X − µ0. It is normally distributed with mean µ − µ0, which is linearly increasing in µ. In addition, T estimates its mean with standard error 1, regardless of the value of µ. For reasons given in [9,16], values of T near 1.645, 3.3 and 5 are called "weak", "moderate" and "strong" evidence for the alternative. Note that for an observed T = t = x − µ0, the p-value can be recovered from 1 − Φ(t).
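Under this calibration, the evidence and the p-value are one-to-one transforms of each other. A minimal sketch (function names are ours, not from the paper):

```python
from statistics import NormalDist

def evidence_z(x, mu0):
    """Evidence T = x - mu0 for the alternative mu > mu0 in a one-sided Z-test."""
    return x - mu0

def p_value_from_evidence(t):
    """Recover the one-sided p-value 1 - Phi(t) from the observed evidence t."""
    return 1 - NormalDist().cdf(t)

t = evidence_z(2.645, 1.0)    # observed evidence 1.645: "weak"
p = p_value_from_evidence(t)  # about 0.05
```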
In Table 1 are shown two sets of numbers with very different interpretations, resulting from different assumptions. They are based on one observation T = t, where T ∼ N(µ, 1) and µ ≥ 0. If one assumes the boundary hypothesis µ = 0, the second row of p-values gives the correct "degree of surprise" at having observed T = t; the smaller the p-value, the more surprised one is with the outcome. However, if one only assumes µ ≥ 0, the first row gives an estimate of the expected evidence E[T] = µ for the alternative µ > 0. This estimate has an additive standard normal error. When one interprets p-values, one must be careful not to interpret the smallness of their magnitudes as though they were evidence for the alternative on a linear scale. The first row in Table 1 gives a much more reasonable estimate of "evidence for the alternative" together with an easily understood standard error. The compatibility of this calibration scale for evidence with Bayesian calibration scales for p-values and Bayes factors is discussed in [10] (Section 4.3).

Table 1. p-values for testing µ = 0 against µ > 0 and evidence estimates for alternatives based on one observation T = t, where T ∼ N(µ, 1). Keep in mind that the standard error of the observed value t is equal to 1.

The generality of this definition of evidence for the alternative stems from the fact that in many situations a natural test statistic X can be transformed to T, which has approximately a normal distribution with unit variance. It is also a more basic concept of the test than level and power.
For a normally distributed test statistic T with unit variance, the expected evidence for the one-sided alternative µ > µ0 is related to the level α and power 1 − β(µ) through the sum of the probits:

E[T] = µ − µ0 = z1−α + z1−β(µ). (1)

In the same testing problem, the sample size n required to detect an alternative µ1 with power 0.8 at level 0.05 is the solution of √n (µ1 − µ0) = z0.95 + z0.8 ≈ 2.5; the expected evidence in such an experiment therefore lies between weak and moderate.
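The probit relation and the resulting sample-size rule can be sketched in a few lines (function names are ours):

```python
from math import ceil
from statistics import NormalDist

z = NormalDist().inv_cdf  # the probit function

def expected_evidence(alpha, power):
    """E[T] = z_{1-alpha} + z_{power}: the sum of the probits of level and power."""
    return z(1 - alpha) + z(power)

def n_required(mu1, mu0, alpha=0.05, power=0.8):
    """Smallest n with sqrt(n)*(mu1 - mu0) >= z_{1-alpha} + z_{power}."""
    return ceil((expected_evidence(alpha, power) / (mu1 - mu0)) ** 2)

e = expected_evidence(0.05, 0.8)  # about 2.49: between weak and moderate
```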
For negative T, −T is interpreted as evidence for the null µ ≤ µ 0 . Because of the symmetry of the problem, if we had begun with the null hypothesis µ ≥ µ 0 against the alternative µ < µ 0 the evidence for the alternative would be defined as µ 0 − X.

Properties of Evidence in Two One-Sided Z-Tests (TOST)
Consider the simplest example with one observation X ∼ N(µ, 1) for testing H0: |µ| ≥ µ0, where µ0 > 0 defines the equivalence alternative H1: |µ| < µ0. The null hypothesis consists of two possibilities: µ ≤ −µ0 and µ ≥ µ0. The left-hand part is rejected at level α if X + µ0 ≥ z1−α, and the evidence for its alternative is T− = X + µ0, because T− ∼ N(µ + µ0, 1) and T− has an expected value that increases in µ and is 0 at the boundary µ = −µ0. The right-hand part is rejected at level α if X − µ0 ≤ zα, and the evidence for its alternative is T+ = µ0 − X, because T+ ∼ N(µ0 − µ, 1), whose expected value is increasing with decreasing µ and is 0 at its null boundary. The two one-sided testing procedure (equivalence test) rejects in favor of equivalence only if both of the one-sided tests reject their respective null hypotheses, and this has level α, because at most one of the two disjoint null hypotheses can hold.
The evidence for the alternative hypothesis of equivalence is logically the minimum of the evidences for the two one-sided tests:

T = min{T−, T+} = µ0 + T0, (2)

where T0 = min{X, −X} and X ∼ N(µ, 1). Now −T0 = |X| has a folded (to the right) normal distribution ([14] (p. 170), and [16]) with parameters (µ, 1), so T0 has a folded (to the left) normal distribution with the same parameters. The density of T0 is given in terms of the standard normal density ϕ for t < 0 by fT0(t; µ) = ϕ(t − µ) + ϕ(t + µ), so the density of T for t < µ0 is

fT(t; µ) = ϕ(t − µ0 − µ) + ϕ(t − µ0 + µ). (3)

In Figure 1 are shown in black lines some examples of fT(t) for µ0 = 4 and several choices of µ. When µ = 0 (exact equivalence), the density is negative half-normal with upper bound µ0 = 4, but as |µ| increases, the distribution rapidly approaches normality.
Also of interest are the mean and standard deviation of T as µ varies. The first two moments of T follow from those of the folded normal distribution:

E[T] = µ0 − √(2/π) e^(−µ²/2) − µ{2Φ(µ) − 1}, (4)

Var[T] = 1 + µ² − (µ0 − E[T])². (5)

The mean and standard deviation of T are shown in Figure 2 as black lines. The top left-hand plot reveals that even in the case of perfect equivalence µ = 0, the equivalence test with µ0 = 3 will only yield, on average, weak evidence for it. With µ0 = 4, this average evidence for equivalence when µ = 0 becomes moderate. The plots in Figure 2 suggest that, except for µ near 0, the distribution of T defined by (2) is approximately normal with mean near min{µ + µ0, µ0 − µ} and standard deviation near one. For µ near 0, which is of interest in the case of equivalence, it is not normal, but this is perhaps compensated for by having a smaller standard error. Its distribution approaches a negative half-normal as |µ| → 0, with mean and standard deviation from (4) and (5) converging to µ0 − √(2/π) and √(1 − 2/π), respectively.
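The moments above can be checked by simulation. A sketch assuming X ∼ N(µ, 1) and T = min{X + µ0, µ0 − X} = µ0 − |X|:

```python
import random
from math import exp, pi, sqrt
from statistics import NormalDist, mean, stdev

Phi = NormalDist().cdf

def tost_moments(mu, mu0):
    """Exact mean and sd of T = mu0 - |X|, X ~ N(mu, 1), via folded-normal moments."""
    e_abs = sqrt(2 / pi) * exp(-mu ** 2 / 2) + mu * (2 * Phi(mu) - 1)
    return mu0 - e_abs, sqrt(1 + mu ** 2 - e_abs ** 2)

random.seed(1)
mu, mu0 = 0.0, 4.0
draws = [mu0 - abs(random.gauss(mu, 1)) for _ in range(100_000)]
m_exact, s_exact = tost_moments(mu, mu0)  # mu0 - sqrt(2/pi), sqrt(1 - 2/pi)
```

At µ = 0 the exact values reduce to the limits µ0 − √(2/π) and √(1 − 2/π) quoted above.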

How Evidence Grows with Sample Size in Two One-Sided Z-Tests
The evidence for a one-sided Z-test grows with the square root of the sample size √n: if the sample mean X̄n ∼ N(µ, 1/n), then the evidence for the alternative µ > −µ0 to the null µ ≤ −µ0 is Tn,− = √n (X̄ + µ0) ∼ N(√n (µ + µ0), 1), which is increasing in µ, has variance 1, and has expected value 0 at the boundary µ = −µ0. Similarly, Tn,+ = √n (µ0 − X̄) ∼ N(√n (µ0 − µ), 1) is the evidence for the alternative µ < µ0 to the null µ ≥ µ0. Thus the evidence for equivalence |µ| < µ0 based on n observations is:

Tn = min{Tn,−, Tn,+} = √n µ0 + Tn,0, (6)

where Tn,0 = √n min{X̄, −X̄} and √n X̄ ∼ N(√n µ, 1). Now Tn,0 has a folded (to the left) normal distribution with parameters (√n µ, 1), and Tn is a shift by √n µ0 of Tn,0, so its mean and variance follow from (4) and (5) with µ and µ0 replaced by √n µ and √n µ0; in particular,

E[Tn] = √n µ0 − √(2/π) e^(−nµ²/2) − √n µ{2Φ(√n µ) − 1}. (7)

A plot of the densities of T4 compared to the density (3) of T is also shown in Figure 1 as red lines, and similarly for the mean and standard deviation of T4 in Figure 2.

Sample Size Determination
For a one-sided Z-test based on n observations one can obtain a given expected evidence, say 2.5, for an alternative at distance µ0 from the null by taking n to satisfy √n µ0 = 2.5. So n1-sided = ⌈(2.5/µ0)²⌉, where ⌈r⌉ is the smallest integer greater than or equal to r. For a TOST Z-test with equivalence alternative |µ| < µ0, to obtain the same expected evidence when in fact µ = 0 one needs by (7) to have √n µ0 − √(2/π) = 2.5, or nTOST = ⌈((2.5 + 0.8)/µ0)²⌉, which is 74% larger than n1-sided.
If one had asked for only weak expected evidence 1.645 instead of 2.5 in the above paragraph, the ratio of sample sizes required by the TOST Z-test to the one-sided test is 2.2, so the equivalence test would require 120% more observations than the one-sided test.
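These sample-size rules, with √(2/π) ≈ 0.8, can be sketched as follows (function names are ours):

```python
from math import ceil, pi, sqrt

def n_one_sided(mu0, target=2.5):
    """Smallest n with sqrt(n)*mu0 >= target for the one-sided Z-test."""
    return ceil((target / mu0) ** 2)

def n_tost(mu0, target=2.5):
    """Smallest n with sqrt(n)*mu0 - sqrt(2/pi) >= target (TOST at mu = 0)."""
    return ceil(((target + sqrt(2 / pi)) / mu0) ** 2)

ratio_moderate = ((2.5 + sqrt(2 / pi)) / 2.5) ** 2    # about 1.74: 74% more
ratio_weak = ((1.645 + sqrt(2 / pi)) / 1.645) ** 2    # about 2.2: 120% more
```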

Connection of Evidence in TOST with a One-Sided Test
The evidence T = TTOST defined by (2) is based on X ∼ N(µ, 1) for the null hypothesis H0: |µ| ≥ µ0 > 0, composed of two disjoint sets, with the equivalence alternative H1: |µ| < µ0. One could equally study the evidence in the equivalent experiment S = X² ∼ χ²1(λ), where λ = µ², for the hypotheses restated as H0: λ ≥ λ0 against the equivalence alternative H1: λ < λ0, where λ0 = µ0². The evidence for equivalence in an experiment with S ∼ χ²1(λ) is a special case of (13) found in Section 3. The top left plot of Figure 3 compares the graph of this T with that of TTOST when µ0 = 2. Further, its expected evidence is by (12) approximately √(λ0 + 1/2) − √(µ² + 1/2). The top right plot of Figure 3 shows the graph of this approximate expected evidence for equivalence as a function of µ for the case µ0 = 2 as a dashed line, to be compared with the previously obtained expected evidence in the two one-sided test experiment (4), whose graph is shown as a solid line. The bottom plots are for µ0 = 4.

Figure 3. On the left, the TOST evidence (2) as a function of the data X = x (solid lines), compared with the one-sided evidence in the corresponding chi-squared test (dashed lines). On the right, comparisons of two approximations for the expected TOST evidence for equivalence: the exact value (4) (solid lines) and the first-order approximation √(λ0 + 1/2) − √(µ² + 1/2) for the equivalent chi-squared test (dashed lines).
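The agreement between the exact expected TOST evidence (4) and the chi-squared first-order approximation can be checked numerically; for µ0 = 4 the two curves stay within about 0.2 of each other:

```python
from math import exp, pi, sqrt
from statistics import NormalDist

Phi = NormalDist().cdf

def expected_tost_evidence(mu, mu0):
    """Exact expected TOST evidence E[T] = mu0 - E|X|, X ~ N(mu, 1) (equation (4))."""
    return mu0 - (sqrt(2 / pi) * exp(-mu ** 2 / 2) + mu * (2 * Phi(mu) - 1))

def chi2_approx_evidence(mu, mu0):
    """First-order approximation sqrt(mu0^2 + 1/2) - sqrt(mu^2 + 1/2)."""
    return sqrt(mu0 ** 2 + 0.5) - sqrt(mu ** 2 + 0.5)

pairs = [(expected_tost_evidence(m, 4), chi2_approx_evidence(m, 4)) for m in (0, 1, 2, 3)]
```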

Evidence for Equivalence in Two One-Sided Binomial Tests
Given X ∼ Binomial(n, p), let p̂ = X/n, and let p1 < p < p2 define the region of equivalence. (Often this region will also be of the form |p − p0| < ∆0 for some p0, ∆0.) We want to test at level α the null H0: p ≤ p1 or p ≥ p2 against the equivalence alternative H1: p1 < p < p2. The null hypothesis is two-sided; its right-hand part is rejected at level α if p̂ ≤ p2 − z1−α √(p2(1 − p2)/n), whereas the left-hand part is rejected at level α if p̂ ≥ p1 + z1−α √(p1(1 − p1)/n). Only one of these tests can reject the null, so the level of the combined tests is α. We have assumed that n is large enough so that normal critical points give accurate levels.
The VST of p̂ is the well-known arc-sine transformation h(p̂) = 2√n arcsin(√p̂), which is asymptotically normal with variance 1 and asymptotic mean 2√n arcsin(√p); large values indicate evidence for large p. The evidence in the test of p ≤ p1 for the alternative p > p1 is therefore T− = 2√n {arcsin(√p̂) − arcsin(√p1)}, and similarly T+ = 2√n {arcsin(√p2) − arcsin(√p̂)} is the evidence for the alternative p < p2. For the combined two one-sided tests, the evidence for equivalence is the minimum evidence in these two one-sided tests; that is, T = min{T−, T+}. In an example analyzed by Wellek, 191 patients in the new treatment group survived a two-year progression-free period, and Wellek used two one-sided binomial tests to find that non-equivalence was rejected at the 0.05 level with estimated power 0.12. For these data T+ = 2.84 and T− = 0.793, so the evidence for equivalence of the treatment effect is T = 0.793, which is "weak". The standard error of T = 0.793 is known (see the comments at the end of Section 2) to satisfy 0.60 ≤ SE[T] ≤ 1, but is likely to be near the smaller bound, because p̂ = 191/273 is close to the center of the equivalence interval. This result is consistent with the analysis of [17], and it is much simpler.
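The arc-sine VST evidence is a one-liner per side. The equivalence limits below are illustrative only (the example's actual limits are not restated here), so the result does not reproduce the paper's T = 0.793:

```python
from math import asin, sqrt

def binom_evidence(x, n, p1, p2):
    """TOST evidence from the arc-sine VST h(p) = 2*sqrt(n)*arcsin(sqrt(p))."""
    h = lambda p: 2 * sqrt(n) * asin(sqrt(p))
    p_hat = x / n
    t_minus = h(p_hat) - h(p1)  # evidence for p > p1
    t_plus = h(p2) - h(p_hat)   # evidence for p < p2
    return min(t_minus, t_plus)

# illustrative equivalence region for the 191/273 data
t = binom_evidence(191, 273, 0.60, 0.80)
```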

Evidence for Equivalence of Risks
It is often the case that one wants to compare risks associated with new and standard treatments, with data often displayed in 2 by 2 tables; an example is given below after we introduce notation and explain how to find evidence for equivalence in this context.
Let X1 and X2 be two independent binomial random variables with parameters (n1, p1) and (n2, p2), respectively. Letting p̂i = (Xi + 0.5)/(ni + 1) for i = 1, 2, the unknown risk difference ∆ = p1 − p2 is estimated by ∆̂ = p̂1 − p̂2. We want the evidence for the equivalence hypothesis ∆1 < ∆ < ∆2, where ∆1, ∆2 are specified bounds, usually of the form −∆0, ∆0. This can be achieved by combining the results of two one-sided tests. To find the evidence for the alternative in the first test, we use the VST of ∆̂ derived in Kulinskaya et al. [11]. This is a family of VSTs indexed by a parameter 0 < A < 1; for the choice A = 1/2 the nuisance parameter is ψ = p̄ = (p1 + p2)/2, and the transformation also requires quantities v and w defined in [11]. Reference [11] shows that the statistic T(∆̂, p̄, ∆1), obtained by replacing p̄, v and w by their plug-in estimates, is for large n1, n2 normally distributed with mean that is monotone increasing in ∆ from 0 at the null ∆ = ∆1. Further, this statistic has variance 1 at the null, which allows them to derive large-sample confidence intervals for ∆; these intervals are shown to be quite competitive even for small to moderate sample sizes in [13]. Next define T− and T+ by applying this transformation to the two nulls: T− gives the putative evidence for the alternative ∆ > ∆1 to the null ∆ ≤ ∆1, while T+ gives the putative evidence for the alternative ∆ < ∆2 to the null ∆ ≥ ∆2. We say "putative" because, as [11] point out, the variances of these statistics can stray far from 1 if ∆ is not near the null. However, the evidence T = min{T−, T+} for equivalence ∆1 < ∆ < ∆2 is better behaved, and has standard error similar to that for the two one-sided Z-test evidence discussed in Section 2. R scripts for computing (9) are in Appendix B.

Example 2.
(Comparing methods of patient care.) As described in [18], the objective of a randomized trial was to determine whether a standard method of care for patients by doctors was comparable to nurse-practitioner care. For the first group, there were n1 = 225 patients and of these X1 = 148 were found to have adequate care. For the second group, of n2 = 167 patients, X2 = 115 were found to have adequate care. Letting p1, p2 be the probability of adequate care for the first and second methods and ∆ = p1 − p2, it was desired to test for "equivalence of treatments" defined by |∆| ≤ ∆0 = 0.1. For these data, ∆̂ = −0.03, T− = 1.461, T+ = 2.719 and T = 1.461, which is close to 1.65 with a standard error less than 1. That is, the evidence for equivalence is positive but weak. By way of comparison, Reference [18] found the p-value for the equivalence alternative |∆| ≤ ∆0 = 0.1 to be 0.005, but a later corrected analysis [19] calculated it to be 0.07.
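For intuition, a simple Wald-based TOST on the continuity-corrected estimates gives numbers close to those reported above. This is only a rough stand-in for the VST of Kulinskaya et al. [11], not the paper's method:

```python
from math import sqrt

def wald_tost_risk_difference(x1, n1, x2, n2, delta0):
    """Wald-based TOST for |p1 - p2| < delta0, using the continuity-corrected
    estimates p_i = (X_i + 0.5)/(n_i + 1) from the text; a rough stand-in
    for the VST of Kulinskaya et al. [11]."""
    p1 = (x1 + 0.5) / (n1 + 1)
    p2 = (x2 + 0.5) / (n2 + 1)
    d = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    t_minus = (d + delta0) / se  # putative evidence for Delta > -delta0
    t_plus = (delta0 - d) / se   # putative evidence for Delta < delta0
    return d, min(t_minus, t_plus)

d, t = wald_tost_risk_difference(148, 225, 115, 167, 0.1)
# d is about -0.03 and t about 1.45, close to the paper's T = 1.461
```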
In order to obtain expected moderate evidence 3.3 ± 1 for equivalence |∆| ≤ ∆ 0 = 0.1 in this setting when in fact there is near equivalence, one would need sample sizes in each group near 1000.

Evidence for Equivalence in Two One-Sided t-Tests
For the t-test, the equivalence test of H0: |µ| ≥ µ0 against the alternative H1: |µ| < µ0 is based on n measurements of the differential effect. The t-statistic is a function of the estimated mean ȳ and standard deviation s obtained from the sample. The null hypothesis is two-sided; its right-hand part µ ≥ µ0 is rejected if S+ = √n (ȳ − µ0)/s ≤ qt n−1,α, while its left-hand part µ ≤ −µ0 is rejected if S− = √n (ȳ + µ0)/s ≥ qt n−1,1−α. In both of these expressions, qt n−1,p denotes the p-quantile of the corresponding t-distribution. Both parts of the null hypothesis must be rejected in order to get significant evidence for equivalence. This holds if the confidence interval [ȳ ± qt n−1,1−α s/√n] is contained within [±µ0]. The t-statistic S+, if the true mean is µ, has a non-central t-distribution with n − 1 degrees of freedom and non-centrality parameter λ = √n (µ − µ0)/σ; we are interested in evidence in favor of small |µ|. For the left-hand part, the non-centrality parameter is λ = √n (µ + µ0)/σ and we are interested in evidence in favor of large µ. The VST h is derived in [9,20]; it is an increasing function and measures the evidence in favor of large µ. The evidence contained in the data in favor of equivalence for the right-hand part of the null hypothesis is thus −h(S+), while for the left-hand part it is h(S−). Both need to be sufficiently large in order to conclude in favor of equivalence; that is, the empirical evidence min{h(S−), −h(S+)} must be at least 2, and preferably 3. Negative values of the empirical evidence can occur and they have to be interpreted as evidence in favor of non-equivalence.
Example 3. Figure 4 shows a plot of the empirical evidence as a function of the average of the measurements. The evidence from the t-statistic is nearly linear in ȳ and largest if ȳ is exactly halfway between the equivalence limits. The difference from the usual statistical tests is striking: there, the evidence grows with the distance from the null hypothesis and can become arbitrarily large; here, the maximal size of the evidence is limited by the equivalence limits. The behavior of the evidence is as expected. If the sample size grows, so does the amount of evidence. If the standard deviation grows, there is less evidence, all other conditions being the same. The amount of evidence exceeds a desired amount (2 or 4, for example) if ȳ is within an interval centered at the halfway mark between the equivalence limits.

Approximate Normality of the Variance Stabilized t-Statistic
The VST is symmetric with regard to the origin, because h(−x) = −h(x). An expansion shows that for values of S up to order O(n^(1/3)) the deviation from the identity is small. Only for values of the t-statistic S further into the tail does the VST pull them towards zero, that is, h(S) < S.
For very large values of S the function h(S) is logarithmic. The tail of the t-density evaluated at x is O(x^(−n)) as x → ∞ and thus has a tail index of n − 1. The VST transforms this to an infinite tail index.

Evidence in Multivariate Equivalence Tests
Multivariate equivalence tests are often based on a test statistic S having an exact or approximate non-central chi-squared distribution, denoted S ∼ χ 2 ν (λ), where ν is the known degrees of freedom (df ), and the non-centrality parameter (ncp) λ ≥ 0 is unknown. Others are based on the non-central F distribution, see Section 4. The null hypothesis postulates non-equivalence between the samples, λ ≥ λ 0 , whereas the alternatives postulate practical equivalence λ < λ 0 . The limit λ 0 is a positive constant adapted to the context; examples are given in [17] and the following sections.
Wellek [8,20] looks at the case of possibly dependent measurements (K of them) that are done independently on n subjects. He then wants to test whether the K measurements have equal means. His first proposal is to pass to the K − 1 differences between the K measurements and to use Hotelling's T 2 test for 0 means. He then remarks on the elliptical shape of the equivalence region, which might be criticized as being arbitrary. Reference [20] then discusses rectangular regions, as we do, and comments on the difficulties of this approach. The material in Section 3.2 below proposes a possible compromise solution. A fully Bayesian approach to multivariate equivalence testing is found in [5].

A VST for the Non-Central Chi-Squared Statistic
Once the "equivalence limit" λ0 is chosen, one can carry out a Neyman-Pearson test which rejects non-equivalence λ ≥ λ0 at level α in favor of equivalence when the test statistic is sufficiently small, that is, S less than the α-quantile of the χ²ν(λ0) distribution (cα = χ²ν,α(λ0)). The power function of this test is the probability of deciding in favor of equivalence,

1 − β(λ) = Pλ[S ≤ cα], (10)

where β(λ) is the probability of falsely reaching a conclusion of non-equivalence. The testing approach underlying (10) is easier to understand when the test statistic is variance stabilized. In this context a VST is a monotone decreasing function h(S) of the test statistic S, which for all values of λ is approximately normal with variance one. Rather than summarizing the evidence by an accept/reject decision, by a p-value or by a confidence interval, we propose to use the statistic T = h(S) − h(Eλ0[S]), because it provides a more informed and interpretable measure of the evidence in favor of equivalence. The larger its value, the more evidence resides in the data in favor of equivalence; since its variance remains close to one for all λ, it is only the value of T that matters. The expected evidence Kλ0(λ) = Eλ[T] is a quantity that increases monotonically as λ decreases to 0. By construction, it is 0 at the equivalence limit λ0 and has a maximal value of Kλ0(0). The observed evidence for equivalence can be reported as T ± 1, indicating that the evidence T for equivalence has a standard normal error.
One can derive an approximation for Kλ0(λ) = Eλ[T(S)], the expected evidence in S ∼ χ²ν(λ) for testing λ ≥ λ0 against λ < λ0, as follows. The variance Varλ[S] = 2ν + 4λ is not constant (not stable), and in order to stabilize it asymptotically one can use the standard delta-method [9] (p. 242). Using the fact that Eλ[S] = ν + λ ≥ ν, one has Varλ[S] = g(Eλ[S]), where g(s) = 4s − 2ν is defined for s ≥ ν. The transformation h(s) that removes the dependence on λ is equal to an antiderivative of −1/√g(s), which leads to h(s) = −√(s − ν/2), where the negative sign was chosen in order to obtain a decreasing function of s. This standard procedure fails to define a VST for all s ≥ 0, because strictly speaking our function h(s) should only be applied in the range s ≥ ν, and even if we tried to extend it towards s = 0, its value is undefined for s < ν/2. This problem is created by the zero crossing of g(s) at s = ν/2, and various ideas for extending h(s) to the entire positive real line could be tried; see (13) below. For this or any other such choice h*(s) agreeing with h(s) for s > ν, the transformed statistic h*(S) has approximate expected value

Eλ[h*(S)] ≈ −√(λ + ν/2). (11)

After centering h*(S) at the limit λ0, one obtains the observed evidence T = h*(S) + √(λ0 + ν/2). The expected evidence for equivalence, to first order, is thus

Kλ0(λ) ≈ √(λ0 + ν/2) − √(λ + ν/2). (12)

The expected evidence (12) in the experiment has a maximum at λ = 0, namely Kλ0(0) = √(λ0 + ν/2) − √(ν/2). Figure 5 shows what the equivalence value λ0 must be as a function of ν, namely λ0 = K²λ0(0) + √(2ν) Kλ0(0), in order that the maximal expected evidence be of varying strengths. For example, when ν = 15, for moderate maximal expected evidence, λ0 must be at least 29. More frequently, the equivalence bound λ0 is determined by context, and the degrees of freedom are determined by ν ≥ (λ0 − K²λ0(0))²/{2K²λ0(0)}.
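The accuracy of (12) is easy to check by simulation. The sketch below extends h with a simple truncation at zero, which is just one possible extension and not the refined choice (13):

```python
import random
from math import sqrt
from statistics import mean, stdev

def evidence_chi2(s, nu, lam0):
    """T = h*(S) + sqrt(lam0 + nu/2), with h*(s) = -sqrt(max(s - nu/2, 0)):
    one simple extension of h(s) = -sqrt(s - nu/2) to all s >= 0."""
    return sqrt(lam0 + nu / 2) - sqrt(max(s - nu / 2, 0))

def rchisq_nc(nu, lam):
    """Non-central chi-squared draw: nu squared normals, one with mean sqrt(lam)."""
    z = sum(random.gauss(0, 1) ** 2 for _ in range(nu - 1))
    return z + random.gauss(sqrt(lam), 1) ** 2

random.seed(2)
nu, lam0 = 15, 29
t_null = [evidence_chi2(rchisq_nc(nu, lam0), nu, lam0) for _ in range(20_000)]
t_equiv = [evidence_chi2(rchisq_nc(nu, 0), nu, lam0) for _ in range(20_000)]
# mean(t_null) is near 0 with sd near 1, and mean(t_equiv) is near
# K(0) = sqrt(29 + 7.5) - sqrt(7.5), i.e. moderate evidence
```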
The evidence for λ < λ0 in S ∼ χ²ν(λ) can be defined for all s ≥ 0 by an extension of the centered VST depending on a constant M > Kλ0(0), given in (13). This T has a negative continuous derivative, and for the choice M = √λ0 it satisfies T ≈ N(0, 1) at the null λ = λ0. Further, at perfect equivalence λ = 0 it has E[T] ≈ Kλ0(0) for Kλ0 given by (12). These claims are based on simulation studies, using the R scripts in Appendix B.

Evidence for Equal Means
Given independent Xk ∼ N(µk, 1), k = 1, . . . , K, we want to find evidence for equivalence in the sense that all means are simultaneously close to µ = (∑k µk)/K. A test statistic is S = ∑k (Xk − X̄)², which has distribution S ∼ χ²K−1(λ), where λ = ∑k (µk − µ)². In practice, the µk are considered "equal" if maxk |µk − µ| ≤ ε for a given ε > 0. How can this last notion of equivalence be translated into a value for the limit λ0? To solve this problem, we note that {(µ1 − µ), . . . , (µK − µ)} lies in a K − 1 dimensional hyperplane of R^K, which when shifted to the origin is a K − 1 dimensional subspace of R^K, and distances between points in the hyperplane are preserved under translation. Thus it suffices to solve the following problem for arbitrary K ≥ 2: given µ1, . . . , µK with µ = (∑k µk)/K = 0 and maxk |µk| ≤ ε, find an appropriate choice of λ0 = ∑k µk² = r², which is the square of the Euclidean distance r of the point (µ1, . . . , µK) from the origin in R^K. After a solution for the equivalence boundary λ0 = λ0(ε, K) is found, it can be implemented in the case of unknown µ by replacing K by K − 1.
The largest ball contained in the K-dimensional cube with edge 2ε, both having the same center, has radius ε, whereas the smallest ball containing the same cube has radius √K ε. The latter choice would allow for one µk to be as large as √K ε (if all other µj = 0), and overall equivalence would be claimed even though it was violated in one case. To reduce this violation, the radius ought to be in-between the extremes. A "reasonable" compromise requires the volume of the L2-ball to equal (2ε)^K, the volume of the cube with side 2ε. The volume of a ball in R^K with radius r is given by:

VK(r) = π^(K/2) r^K / Γ(K/2 + 1). (14)

To ensure equal volumes for the hypersphere and the hypercube for all K, one needs the radius of the ball to be r0 = √(K bK) ε, where

bK = (4/π) Γ(K/2 + 1)^(2/K) / K. (15)

A good approximation to bK is given by cK = 2/(πe) + (1.7 · K)^(−5/6); see Table 2 and Figure 6. This makes it easier to choose the desired equivalence limit λ0 = r0² = bK K ε² ∼ 2Kε²/(πe) for moderate and large K.

Table 2. Values of bK defined by (15) to 3 decimal places, so that the K-dimensional ball of radius √(K bK) ε has the same volume as the K-dimensional cube of side 2ε. The approximate value is cK = 2/(πe) + (1.7 · K)^(−5/6).

Example 4. To illustrate the use of this table, suppose K = 4, µ is unknown, and we want each of the µk to satisfy |µk − µ| ≤ ε = 1/2, say, in order to claim "equivalence of all means". As discussed above, the corresponding problem with µ known to equal 0 in K − 1 = 3 dimensions is to take λ0 = 3 b3 ε² = 3 b3/4. Using (12), it then follows that for K = 4, ν = 3 and this λ0 the maximum expected evidence possible for equivalence of all 4 means is about 0.13, which is almost negligible.
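The constant bK of (15) and its approximation cK can be computed directly (function names are ours):

```python
from math import exp, lgamma, pi

def b(K):
    """b_K from (15): the K-ball of radius sqrt(K*b_K)*eps has volume (2*eps)^K."""
    return (4 / pi) * exp((2 / K) * lgamma(K / 2 + 1)) / K

def c(K):
    """The approximation c_K = 2/(pi*e) + (1.7*K)^(-5/6)."""
    return 2 / (pi * exp(1)) + (1.7 * K) ** (-5 / 6)

# b(1) = 1 exactly, and b(K) decreases towards 2/(pi*e) ~ 0.234
vals = [(K, round(b(K), 3), round(c(K), 3)) for K in (1, 2, 3, 4, 8, 16)]
```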
If we had begun this section with each Xk replaced by X̄k ∼ N(µk, 1/n), for some sample size n ≥ 2, then the test statistic would be Sn = n ∑k (X̄k − X̄)², which has distribution Sn ∼ χ²ν(λ), where ν = K − 1 and λ = n ∑k (µk − µ)². By imposing the same condition maxk |µk − µ| ≤ ε, where µ is unknown, the "equal-volume" solution becomes ∑k (µk − µ)² ≤ ν bν ε², so the appropriate equivalence hypothesis is λ < λ0 = n ν bν ε². The maximum expected evidence in Tn = T(Sn) is attained when all means are equal, and by (12) this maximum is √(n ν bν ε² + ν/2) − √(ν/2). Continuing Example 4, n = 20 will yield weak maximum expected evidence 1.65 for equivalence.
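A small solver for the required sample size can be sketched as follows. Note that the first-order formula (12) gives n = 18 for the ν = 3, ε = 1/2 example, slightly below the n = 20 reported above, since (12) is only a first-order approximation to the evidence in the refined statistic (13):

```python
from math import exp, lgamma, pi, sqrt

def b(K):
    """b_K from (15)."""
    return (4 / pi) * exp((2 / K) * lgamma(K / 2 + 1)) / K

def max_expected_evidence(n, nu, eps):
    """sqrt(lam0 + nu/2) - sqrt(nu/2) with lam0 = n*nu*b_nu*eps^2, by (12)."""
    lam0 = n * nu * b(nu) * eps ** 2
    return sqrt(lam0 + nu / 2) - sqrt(nu / 2)

def n_for_evidence(target, nu, eps):
    """Smallest n whose maximum expected evidence reaches the target."""
    n = 1
    while max_expected_evidence(n, nu, eps) < target:
        n += 1
    return n

n_weak = n_for_evidence(1.65, 3, 0.5)  # 18 under this first-order formula
```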

Application to the Between-Group Sum of Squares
Given independent observations from K groups with different means, Xki ∼ N(µk, 1), i = 1, . . . , nk, k = 1, . . . , K, we denote the total sample size by N = ∑k nk, and the group sample proportions by qk = nk/N. Let the kth sample mean be X̄k and the weighted group sample mean be X̄ = ∑k qk X̄k; it is an unbiased estimator of the weighted population mean µ = ∑k qk µk. Then the between-group sum of squares SSbetween = N ∑k qk (X̄k − X̄)² ∼ χ²ν(λ), where ν = K − 1 and λ = N ∑k qk (µk − µ)²; see, for example, [9] (p. 184). The evidence in S ∼ χ²ν(λ) for equivalence λ < λ0 was derived in Section 3.1. The practical problem is to choose λ0. Once λ0 is chosen one can compute the maximum evidence in the experiment; it is by (12) equal to √(λ0 + (K − 1)/2) − √((K − 1)/2). As explained in Section 3.1, this determines the maximum power for equivalence at any given level.

Testing for Equivalence of K Groups
Given independent observations from K groups with common unknown variance σ² > 0 and different means, Xki ∼ N(µk, σ²), i = 1, . . . , nk, k = 1, . . . , K, let N = ∑k nk, qk = nk/N for all k, and the overall mean µ = ∑k qk µk. We define equivalence of means, for a given ε > 0, by requiring that none of the standardized means differs by more than ε from the overall mean: maxk |µk − µ|/σ ≤ ε. Just as in Section 3.3, where σ was assumed known (and without loss of generality set equal to 1), we can define λ = N ∑k qk {(µk − µ)/σ}². Given λ0 we want evidence for the hypothesis of equivalence λ < λ0. The arguments for choosing λ0 = λ0,MS = n̄ ν1 bν1 ε², where ν1 = K − 1, carry over from Sections 3.2 and 3.3, provided sampling is nearly balanced so that all nk ≈ n̄.
By way of comparison, [17] uses traditional hypothesis testing with the smaller equivalence boundary λ0,Wellek = n̄ ε² = 3.125 and finds the non-central F-test significant at level 0.05 with an estimated power 0.18 for detecting perfect equivalence.

Summary and Discussion
For the test statistic S ∼ N(µ, 1) of a null hypothesis µ ≤ µ0 and alternative µ > µ0, Neyman-Pearson methods help one make a decision, while a p-value can provide a measure of surprise regarding the boundary hypothesis µ = µ0. But what one often wants from S is a measure of evidence for the alternative µ > µ0. In this simplest of statistical tests, T = S − µ0 is such a measure of evidence for the alternative, because T estimates the unknown expected evidence E[T] = µ − µ0, which is linearly increasing with µ and comes with an easily understood standard normal error. Values of T near 1.645, 3.3 and 5 are interpreted as weak, moderate and strong evidence for the alternative µ > µ0. In addition, the expected evidence can be written as the sum of the probits of level and power.
The vast majority of routine statistical tests can be transformed into the above setting through variance stabilization. And the mean of a variance stabilized test statistic T = VST(S), after centering at µ 0 , is very close to the signed square root of the Kullback-Leibler symmetrized divergence between the null and alternative distributions. This result gives more theoretical support for calling T the evidence for the alternative; references are listed in Section 1.
What we have done here is to extend the above ideas to the more complicated hypotheses of the form |µ| ≥ µ_0 versus the equivalence alternative |µ| < µ_0, with applications to TOST based on one- and two-sample binomial experiments and on two one-sided t-tests. Then, in the multivariate setting, we have found modifications of the classical VSTs for non-central chi-squared and non-central F-tests which make it practicable to find evidence for the equivalence alternative λ < λ_0 to the null hypothesis λ ≥ λ_0 of non-equivalence.
The practical choice of the equivalence limit λ_0 is also an important ingredient, and we have provided a new approach to assist in its choice. In particular, when testing for equivalence of means in K arms of a study, we found the radius required so that the K-ball has the same volume as the K-cube of edge 2ε. This leads to a proposal for converting the condition max_k |µ_k − µ| ≤ ε into an approximate equivalence condition λ < λ_0.
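The equal-volume radius is easy to compute; a sketch, where the function name and the symbol b_K are ours:

```python
from math import gamma, pi, sqrt

def ball_cube_radius_factor(K):
    """b_K such that the K-ball of radius b_K * eps has the same volume as
    the K-cube of edge 2 * eps, using
    vol(K-ball of radius r) = pi^{K/2} * r^K / Gamma(K/2 + 1)."""
    return 2.0 * gamma(K / 2 + 1) ** (1.0 / K) / sqrt(pi)
```

For K = 1 the "ball" and "cube" are both the interval [−ε, ε], so b_1 = 1; the factor grows slowly with K.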
The new extensions of classical VSTs for non-central chi-squared and F statistics require more work in the choice of M to center them properly at the null λ 0 . Simulation studies show that adjusting M so that the mean of the VST is 0 when λ = λ 0 automatically ensures that the mean of the VST statistic when λ = 0 is near its expected maximum value. The choice M ≈ √ λ 0 works adequately, but simple formulae for M that depend on the degree(s) of freedom as well as λ 0 would be useful.
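Such a calibration of M can be sketched by Monte Carlo. Here, purely for illustration, we use the plain square-root transform in place of the paper's modified VST, choosing M as the simulated mean of √S at λ = λ_0 so that √S − M is centered at 0 under the null:

```python
import numpy as np

def calibrate_M(nu, lam0, n_sim=200_000, seed=0):
    """Monte Carlo choice of the centering constant M so that
    sqrt(S) - M has mean 0 when S ~ chi^2_nu(lam0).
    Illustrative stand-in for the paper's actual VST."""
    rng = np.random.default_rng(seed)
    s = rng.noncentral_chisquare(nu, lam0, size=n_sim)
    return np.sqrt(s).mean()
```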
Finally, we note that finding simple expressions for the expected evidence in a VST greatly assists one in finding minimal sample sizes when planning an experiment for determining equivalence; by knowing the maximum expected evidence for equivalence, one also learns of the power to detect perfect equivalence at any given level.
Another application is in goodness-of-fit tests, where instead of "backing into" a model by not rejecting it at a liberal level such as 0.1, one could find the evidence for the model. Chapter 8 of [17] is a good starting point for solving this problem. Further research will also take into account recent results on bioequivalence defined in terms of Kullback-Leibler divergences; see [4,7].
Appendix A.1. Non-Central Chi-Squared Distribution
There is no simple analytic expression for J_{χ²}(λ_0; λ), so we compute it and its signed square root numerically. The graph of the latter (for ν = 1, λ_0 = 6) is shown in the top left plot of Figure A1 as a dashed line and is to be compared with the graph of K_6(λ), defined by (12), shown as a solid line. Note that they are very close over the range of λ of interest, and both pass through 0 at λ = λ_0. Our main point is to emphasize the quality of the approximation (A1), which is clear from the plot to its right: the absolute relative error is less than 1 in 16 over this range of λ. Figure A2 and others not shown demonstrate that quite generally the absolute relative error in the approximation (A1) is less than 1 in 20 over a wide range of equivalence experiments with non-central chi-squared distributed outcomes.

Figure A1. Non-central chi-squared with parameters ν, λ: the upper left plot shows the graph of the expected evidence K_6(λ) when ν = 1 as a solid line, to be compared with the signed square root of the KLD, shown as a dashed line. The absolute relative error in this approximation is shown on its right. The lower two plots are for λ_0 = 12 and ν = 1.

Figure A2. The same notation as in Figure A1, but for different parameters.
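A minimal numerical sketch of J_{χ²}(λ_0; λ) and its signed square root, with the integration cutoff chosen for the moderate parameter values used here:

```python
import numpy as np
from scipy.stats import ncx2
from scipy.integrate import quad

def signed_root_J(nu, lam0, lam, upper=200.0):
    """Signed square root of the symmetrized Kullback-Leibler distance
    J(lam0; lam) between chi^2_nu(lam) and chi^2_nu(lam0), taken
    positive on the equivalence side lam < lam0."""
    def integrand(x):
        f1 = ncx2.pdf(x, nu, lam)
        f0 = ncx2.pdf(x, nu, lam0)
        return (f1 - f0) * np.log(f1 / f0)
    J, _ = quad(integrand, 1e-12, upper, limit=200)
    return np.sign(lam0 - lam) * np.sqrt(max(J, 0.0))  # guard tiny negative J
```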

Appendix A.2. Non-Central F Distribution
If f_λ denotes the non-central F density with ν_1, ν_2 degrees of freedom and non-centrality parameter λ, then the expected evidence follows from (18), using E_λ[S] = (ν_1 + λ)ν_2/{ν_1(ν_2 − 2)}. As in the chi-squared case, there is no simple analytic expression for J_F(λ_0; λ), so we compute it and its signed square root numerically. Figures A3 and A4 show typical comparisons between the above expected evidence and the signed square root of the KLD between the null f_{λ_0} and alternative f_λ distributions. These and similar plots provide further numerical support for the approximation (A1).

Figure A3. Non-central F with parameters ν_1, ν_2 and λ: the upper left plot shows the graph of the expected evidence K_10(λ) when ν_1 = 1, ν_2 = 20 as a solid line, to be compared with the signed square root of the KLD, shown as a dashed line. The absolute relative error in this approximation is shown on its right. The lower two plots are also for λ_0 = 10, but now ν_1 = 4 and ν_2 = 20.

Figure A4. The same notation as in Figure A3, but for different parameters.
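The moment formula for the non-central F distribution is easy to verify against scipy's implementation:

```python
from scipy.stats import ncf

def ncf_mean(df1, df2, lam):
    """E[S] = (df1 + lam) * df2 / (df1 * (df2 - 2)) for
    S ~ noncentral F(df1, df2, lam), valid when df2 > 2."""
    return (df1 + lam) * df2 / (df1 * (df2 - 2))
```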