Article

Indicators of Evidence for Bioequivalence

by Stephan Morgenthaler 1,† and Robert Staudte 2,*,†

1 Département de Mathématiques, École Polytechnique Fédérale de Lausanne, Lausanne 1015, Switzerland
2 Department of Mathematics and Statistics, La Trobe University, Melbourne 3086, Australia
* Author to whom correspondence should be addressed.
† Both authors contributed equally to this work.
Entropy 2016, 18(8), 291; https://doi.org/10.3390/e18080291
Submission received: 29 May 2016 / Revised: 22 July 2016 / Accepted: 2 August 2016 / Published: 9 August 2016
(This article belongs to the Special Issue Statistical Significance and the Logic of Hypothesis Testing)

Abstract: Some equivalence tests are based on two one-sided tests, where in many applications the test statistics are approximately normal. We define and find evidence for equivalence in Z-tests, in one- and two-sample binomial tests, and in t-tests. Multivariate equivalence tests are typically based on statistics with non-central chi-squared or non-central F distributions, in which the non-centrality parameter λ is a measure of the heterogeneity of several groups. Classical tests of the null λ ≥ λ0 versus the equivalence alternative λ < λ0 are available, but simple formulae for their power functions are not. In these tests, the equivalence limit λ0 is typically chosen by context. We provide extensions of classical variance stabilizing transformations for the non-central chi-squared and F distributions that are easy to implement and that lead to indicators of evidence for equivalence. Approximate power functions are also obtained via simple expressions for the expected evidence in these equivalence tests.

Graphical Abstract

1. Introduction

Our purpose is to extend the concept of “evidence for the alternative hypothesis”, already available in classical one-sided testing, to contexts where that alternative is “equivalence” of two or more distributions. We abbreviate the term “bioequivalence” to “equivalence” for simplicity and because these results are of much more general applicability.

1.1. Background and Summary

Why should we introduce another approach to equivalence testing? Because, even though some equivalence tests [1] are well established and embraced by the USA Food and Drug Administration (FDA) and the European Medicines Agency (EMA), there are substantial critiques [2,3,4], as well as novel, competing approaches to multivariate equivalence testing [4,5,6,7,8].
We endorse the proposal by [4] to define a hierarchy of bioequivalence models, “average bioequivalence” within “population bioequivalence” within “individual bioequivalence”, in terms of the Kullback–Leibler symmetrized distance (KLD) between distributions arising in equivalence tests. Somewhat surprisingly, we advocate estimating these distances indirectly using variance stabilized test statistics (VSTs) rather than plug-in parameter estimates for parameters appearing in the KLD. The close tie between the mean of a VST and the KLD for non-central chi-squared and F distributions will be illustrated in Appendix A.
In the remainder of this introduction we describe the notion of “evidence for the alternative” in the context of Z-tests and show how it is connected to level and power of such tests; these ideas were first introduced in [9]. After these preliminaries, we extend the notion of evidence for equivalence to two one-sided Z-tests (TOSTs), and show how it is related to a one-sided test based on a non-central chi-squared statistic with one degree of freedom.
This notion of evidence is applicable in many situations, because VSTs can carry many test statistics into normally distributed statistics Z. For exponential families, Reference [10] shows that the expected evidence of the variance stabilized statistic T is approximately equal to the signed square root of the Kullback–Leibler symmetrized divergence. Examples not from exponential families are in [11,12,13]. These results give support to calling T the "evidence for the alternative hypothesis".
Throughout the paper we employ standard notation and properties of the non-central t, χ² and F distributions, which are introduced and studied in depth in [14,15]. Below we first discuss specific examples of two one-sided tests (TOSTs) in Section 2, namely for binomial tests, two-by-two tables and t-tests; all methods are illustrated using data examples from the literature. In the multivariate setting, chi-squared tests for equivalence are given in Section 3 and F-tests for equivalence of K normal populations in Section 4. New methods for choosing λ0 and for variance stabilizing the test statistics are provided. Discussion follows in Section 5, and R scripts for implementing some procedures are found in Appendix B.

1.2. Properties of Evidence in One-Sided Z-Tests

Univariate equivalence tests are often based on statistics having the non-central t distribution or normal approximations to the binomial distribution, so we begin our introduction to evidence contained in such tests with the approximating, and simpler, Z-tests.
In the prototypical model X ∼ N(μ, 1), where μ is unknown, one tests the null hypothesis μ ≤ μ0 against the alternative μ > μ0 by rejecting the null if the p-value is sufficiently small. For an observed X = x, the p-value is Φ(μ0 - x), where Φ is the standard normal distribution function. The "evidence for the alternative" μ > μ0 is defined to be T = X - μ0. It is normally distributed with mean μ - μ0, which is linearly increasing in μ. In addition, T estimates its mean with standard error 1, regardless of the value of μ. For reasons given in [9,16], values of T near 1.645, 3.3 and 5 are called "weak", "moderate" and "strong" evidence for the alternative. Note that for an observed T = t = x - μ0, the p-value can be recovered as 1 - Φ(t).
In Table 1 are shown two sets of numbers with very different interpretations, resulting from different assumptions. They are based on one observation T = t, where T ∼ N(μ, 1) and μ ≥ 0. If one assumes the boundary hypothesis μ = 0, the second row of p-values gives the correct "degree of surprise" at having observed T = t; the smaller the p-value, the more surprised one is by the outcome. However, if one only assumes μ ≥ 0, the first row gives an estimate of the expected evidence E[T] = μ for the alternative μ > 0. This estimate has an additive standard normal error. When one interprets p-values, one must be careful not to treat the smallness of their magnitudes as though it were evidence for the alternative on a linear scale. The first row in Table 1 gives a much more reasonable estimate of "evidence for the alternative", together with an easily understood standard error. The compatibility of this calibration scale for evidence with Bayesian calibration scales for p-values and Bayes factors is discussed in [10] (Section 4.3).
The generality of this definition of evidence for the alternative stems from the fact that in many situations the natural test statistic X can be transformed to T, which has approximately a normal distribution with unit variance. Moreover, evidence is a more basic concept than the level and power of the test. For a normally distributed test statistic T with unit variance, the expected evidence for the one-sided alternative μ > μ0 is related to the level α and the power 1 - β(μ) through the sum of the probits:
E_μ[T] = z_{1-α} + z_{1-β(μ)}.    (1)
In the same testing problem, the sample size n required to detect an alternative μ1 with power 0.8 at level 0.05 is the solution of √n (μ1 - μ0) = z_{0.95} + z_{0.8} ≈ 2.5; the expected evidence in such an experiment therefore lies between weak and moderate.
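This sum-of-probits relation is easy to check numerically. The following is a minimal Python sketch (the scripts accompanying the paper are in R, in Appendix B; the function names here are illustrative, not the authors'):

```python
import math
from scipy.stats import norm

def expected_evidence(alpha, power):
    """Expected evidence E[T] = z_{1-alpha} + z_{power} for a
    unit-variance normal test statistic (the sum-of-probits relation)."""
    return norm.ppf(1 - alpha) + norm.ppf(power)

def sample_size(mu1, mu0, alpha=0.05, power=0.8):
    """Smallest n with sqrt(n) * (mu1 - mu0) >= z_{1-alpha} + z_{power}."""
    z = expected_evidence(alpha, power)
    return math.ceil((z / (mu1 - mu0)) ** 2)

print(expected_evidence(0.05, 0.8))  # about 2.49, between weak and moderate
```

For example, detecting μ1 - μ0 = 0.5 with power 0.8 at level 0.05 requires n = 25 observations under this rule.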
For negative T, -T is interpreted as evidence for the null μ ≤ μ0. Because of the symmetry of the problem, if we had begun with the null hypothesis μ ≥ μ0 against the alternative μ < μ0, the evidence for the alternative would be defined as μ0 - X.

1.3. Properties of Evidence in Two One-Sided Z-Tests (TOST)

Consider the simplest example with one observation X ∼ N(μ, 1) for testing H0: |μ| ≥ μ0, where μ0 > 0 defines the equivalence alternative H1: |μ| < μ0. The null hypothesis consists of two possibilities: μ ≤ -μ0 and μ ≥ μ0. The left-hand part is rejected at level α if X + μ0 ≥ z_{1-α}, and the evidence for its alternative is T- = X + μ0, because T- ∼ N(μ + μ0, 1) has an expected value that increases in μ and is 0 at the boundary μ = -μ0. The right-hand part is rejected at level α if X - μ0 ≤ z_α, and the evidence for its alternative is T+ = μ0 - X, because T+ ∼ N(μ0 - μ, 1), whose expected value increases as μ decreases and is 0 at its null boundary. The two one-sided testing procedure (equivalence test) rejects in favor of equivalence only if both of the one-sided tests reject their respective null hypotheses, and this has level α, because at most one of the two null hypotheses can hold.
The evidence for the alternative hypothesis of equivalence is logically the minimum of the evidences for the two one-sided tests:
T = min{T-, T+} = μ0 + T0,    (2)
where T0 = min{X, -X} and X ∼ N(μ, 1). Now -T0 = |X| has a folded (to the right) normal distribution ([14] (p. 170) and [16]) with parameters (μ, 1), so T0 has a folded (to the left) normal distribution with the same parameters. The density of T0 is given in terms of the standard normal density φ, for t < 0, by f_{T0}(t; μ) = φ(t - μ) + φ(t + μ), so the density of T is, for t < μ0,
f_T(t; μ) = φ(t - μ - μ0) + φ(t + μ - μ0).    (3)
In Figure 1 are shown, as black lines, some examples of f_T(t) for μ0 = 4 and several choices of μ. When μ = 0 (exact equivalence), the density is negative half-normal with upper bound μ0 = 4, but as |μ| increases, the distribution rapidly approaches normality.
Also of interest are the mean and standard deviation of T as μ varies. The first two moments of T0 are E_μ[T0] = μ{1 - 2Φ(μ)} - 2φ(μ) and E_μ[T0²] = 1 + μ², so
E_μ[T] = μ0 + E_μ[T0] = μ0 + μ{1 - 2Φ(μ)} - 2φ(μ)    (4)
Var_μ[T] = 1 + μ² - [μ{1 - 2Φ(μ)} - 2φ(μ)]².    (5)
The mean and standard deviation of T are shown in Figure 2 as black lines. The top left-hand plot reveals that even in the case of perfect equivalence, μ = 0, an equivalence test with μ0 = 3 will only yield, on average, weak evidence for equivalence. With μ0 = 4, this average evidence for equivalence when μ = 0 becomes moderate.
The plots in Figure 2 suggest that, except for μ near 0, the distribution of T defined by (2) is approximately normal with mean near min{μ + μ0, μ0 - μ} and standard deviation near one. For μ near 0, which is of interest in the case of equivalence, it is not normal, but this is perhaps compensated for by a smaller standard error. Its distribution approaches a negative half-normal as |μ| → 0, with mean and standard deviation from (4) and (5) converging to μ0 - √(2/π) and √(1 - 2/π), respectively.
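The moment formulas (4) and (5) are easy to verify by simulation. The following Python sketch (illustrative only; the paper's own scripts are in R) compares them with a Monte Carlo sample:

```python
import numpy as np
from scipy.stats import norm

def tost_moments(mu, mu0):
    """Mean and variance of T = mu0 + min(X, -X), X ~ N(mu, 1),
    from the folded-normal formulas (4) and (5)."""
    m0 = mu * (1 - 2 * norm.cdf(mu)) - 2 * norm.pdf(mu)  # E[T0]
    return mu0 + m0, 1 + mu**2 - m0**2

# Monte Carlo check with mu = 1, mu0 = 4
rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=400_000)
t = 4.0 + np.minimum(x, -x)
mean, var = tost_moments(1.0, 4.0)
# t.mean() and t.var() agree with (mean, var) to Monte Carlo accuracy
```

At μ = 0 the formulas reduce to the limits quoted above, μ0 - √(2/π) and 1 - 2/π.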

1.3.1. How Evidence Grows with Sample Size in Two One-Sided Z-Tests

The evidence for a one-sided Z-test grows with the square root of the sample size n, for if the sample mean X̄ ∼ N(μ, 1/n), then the evidence for the alternative μ > -μ0 to the null μ ≤ -μ0 is T_{n,-} = √n (X̄ + μ0) ∼ N(√n (μ + μ0), 1), which is increasing in μ, has variance 1, and has expected value 0 at the boundary μ = -μ0. Similarly, T_{n,+} = √n (μ0 - X̄) ∼ N(√n (μ0 - μ), 1) is the evidence for the alternative μ < μ0 to the null μ ≥ μ0. Thus the evidence for equivalence |μ| < μ0 based on n observations is:
T_n = min{T_{n,-}, T_{n,+}} = √n μ0 + T_{n,0},    (6)
where T_{n,0} = √n min{X̄, -X̄} and √n X̄ ∼ N(√n μ, 1). Now T_{n,0} has a folded (to the left) normal distribution with parameters (√n μ, 1), and T_n is a shift of T_{n,0} by √n μ0, so its mean and variance are:
E_μ[T_n] = √n μ0 + √n μ{1 - 2Φ(√n μ)} - 2φ(√n μ)
Var_μ[T_n] = 1 + n μ² - [√n μ{1 - 2Φ(√n μ)} - 2φ(√n μ)]².    (7)
The density of T_4, compared to that of T defined by (3), is also shown in Figure 1 as red lines, and similarly for the mean and standard deviation of T_4 in Figure 2.

1.3.2. Sample Size Determination

For a one-sided Z-test based on n observations one can obtain a given expected evidence, 2.5 say, for an alternative at distance μ0 from the null by taking n to satisfy √n μ0 = 2.5. So n_{1-sided} = ⌈(2.5/μ0)²⌉, where ⌈r⌉ is the smallest integer greater than or equal to r. For a TOST Z-test with equivalence alternative |μ| < μ0, to obtain the same expected evidence when in fact μ = 0, one needs by (7) to have √n μ0 - √(2/π) = 2.5, or n_TOST = ⌈((2.5 + 0.8)/μ0)²⌉, which is 74% larger than n_{1-sided}.
If one had asked for only weak expected evidence 1.645 instead of 2.5 in the above paragraph, the ratio of the sample size required by the TOST Z-test to that of the one-sided test would be 2.2, so the equivalence test would require 120% more observations than the one-sided test.
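These sample-size rules are simple enough to code directly. A Python sketch (illustrative; it assumes the perfect-equivalence case μ = 0 for the TOST rule):

```python
import math

FOLD = math.sqrt(2 / math.pi)  # ~0.8, the folded-normal correction in (7)

def n_one_sided(evidence, mu0):
    """Smallest n with sqrt(n) * mu0 >= evidence."""
    return math.ceil((evidence / mu0) ** 2)

def n_tost(evidence, mu0):
    """Smallest n with sqrt(n) * mu0 - sqrt(2/pi) >= evidence (mu = 0)."""
    return math.ceil(((evidence + FOLD) / mu0) ** 2)

# the TOST test needs about 74% more observations for expected evidence 2.5
print(n_one_sided(2.5, 0.25), n_tost(2.5, 0.25))
```

Note that the ratio n_TOST/n_{1-sided} does not depend on μ0, apart from rounding.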

1.4. Connection of Evidence in TOST with a One-Sided Test

The evidence T = T_TOST defined by (2) is based on X ∼ N(μ, 1) for the null hypothesis H0: |μ| ≥ μ0 > 0, composed of two disjoint sets, with the equivalence alternative H1: |μ| < μ0. One could equally study the evidence in the equivalent experiment S = X² ∼ χ²_1(λ), where λ = μ², for the hypotheses restated as H0: λ ≥ λ0 against the equivalence alternative H1: λ < λ0, where λ0 = μ0². The evidence for equivalence in an experiment with S ∼ χ²_1(λ) is a special case of (13), found in Section 3:
T = T_{1,λ0}(S) = √λ0 - S/√2, for S < 1; √λ0 - √(S - 1/2), for S ≥ 1.    (8)
The top left plot of Figure 3 compares the graph of this T with that of T_TOST when μ0 = 2. Further, by (12), its expected evidence is approximately √(λ0 + 1/2) - √(λ + 1/2). The top right plot of Figure 3 shows the graph of this approximate expected evidence for equivalence as a function of μ for the case μ0 = 2 as a dashed line, to be compared with the previously obtained expected evidence (4) in the two one-sided test experiment, whose graph is shown as a solid line. The bottom plots are for μ0 = 4.
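A direct Python transcription of (8) follows (a sketch, not the authors' R code; the choice M = √λ0 follows the text, and the two pieces agree in value and slope at S = 1):

```python
import numpy as np

def evidence_chisq1(s, lam0):
    """Evidence for lambda < lam0 based on S ~ chi-squared_1(lambda):
    the nu = 1 special case (8) of the piecewise VST, with M = sqrt(lam0)."""
    s = np.asarray(s, dtype=float)
    M = np.sqrt(lam0)
    return np.where(s < 1.0,
                    M - s / np.sqrt(2.0),                      # linear piece, S < 1
                    M - np.sqrt(np.clip(s - 0.5, 0.0, None)))  # VST piece, S >= 1
```

The clipping only protects the unused branch from negative arguments; for S ≥ 1 the transformation is exactly √λ0 - √(S - 1/2).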

2. More Examples of Two One-Sided Tests (TOSTs)

2.1. Evidence for Equivalence in Two One-Sided Binomial Tests

Let X ∼ Binomial(n, p), and let p̂ = X/n. Let p1 < p < p2 define the region of equivalence. (Often this region will also be of the form |p - p0| < Δ0 for some p0, Δ0.) We want to test at level α the null H0: p ≤ p1 or p ≥ p2 against the equivalence alternative H1: p1 < p < p2. The null hypothesis is two-sided, and its right-hand part is rejected at level α if p̂ ≤ p2 - z_{1-α} √(p2(1 - p2)/n), whereas the left-hand part is rejected at level α if p̂ ≥ p1 + z_{1-α} √(p1(1 - p1)/n). Only one of the two null hypotheses can hold, so the level of the combined test is α. We have assumed that n is large enough so that normal critical points give accurate levels.
The VST of p̂ is the well-known arc-sine transformation h(p̂) = 2√n arcsin(√p̂), which is asymptotically normal with variance 1 and asymptotic mean 2√n arcsin(√p). Large values of h(p̂) indicate evidence for large p. The evidence in the test of p ≤ p1 for an alternative p > p1 is therefore T- = h(p̂) - h(p1), while the evidence in the test of p ≥ p2 for an alternative p < p2 is T+ = h(p2) - h(p̂). For the combined two one-sided tests, the evidence for equivalence is the minimum evidence in these two one-sided tests; that is, T = min{T-, T+}.
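In Python, this binomial equivalence evidence can be sketched as follows (an illustrative translation of the arc-sine VST above, not the authors' R scripts):

```python
import numpy as np

def binom_equiv_evidence(x, n, p1, p2):
    """Evidence for p1 < p < p2 from X = x successes in n trials,
    via the arc-sine VST h(p) = 2 * sqrt(n) * arcsin(sqrt(p))."""
    h = lambda p: 2 * np.sqrt(n) * np.arcsin(np.sqrt(p))
    phat = x / n
    t_minus = h(phat) - h(p1)   # evidence for p > p1
    t_plus = h(p2) - h(phat)    # evidence for p < p2
    return min(t_minus, t_plus)
```

The evidence is positive only when p̂ lies strictly inside the equivalence region, is zero on its boundary, and is negative outside it.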
Example 1. 
(Intervention success of new treatment.) In Example 4.2 of [17] (p. 59), a highly toxic drug used in chemotherapy treatment for a tumor led to a 73% two-year progression-free survival period. A new, combined and much more tolerable treatment was administered to 361 patients, and it was deemed equivalent to the previous treatment if the success rate fell in the interval [0.65, 0.75]. In the new treatment, 191 patients survived a two-year progression-free period, and Wellek used two one-sided binomial tests to find that non-equivalence was rejected at the 0.05 level with estimated power 0.12. For these data, T+ = 2.84 and T- = 0.793, so the evidence for equivalence of the treatment effect is T = 0.793, which is "weak". The standard error of T = 0.793 is known (see the comments at the end of Section 2) to satisfy 0.60 ≤ SE[T] ≤ 1, but it is likely to be near the smaller bound, because p̂ = 191/273 is close to the center of the equivalence interval. This result is consistent with the analysis of [17], and it is much simpler.

2.2. Evidence for Equivalence of Risks

It is often the case that one wants to compare risks associated with new and standard treatments, with data often displayed in 2 by 2 tables; an example is given below after we introduce notation and explain how to find evidence for equivalence in this context.
Let X1 and X2 be two independent binomial random variables with parameters (n1, p1) and (n2, p2), respectively. Letting p̂_i = (X_i + 0.5)/(n_i + 1) for i = 1, 2, the unknown risk difference Δ = p1 - p2 is estimated by Δ̂ = p̂1 - p̂2. We want the evidence for the equivalence hypothesis Δ1 < Δ < Δ2, where Δ1, Δ2 are specified bounds, usually of the form -Δ0, Δ0. This can be achieved by combining the results of two one-sided tests: Δ ≤ Δ1 versus Δ > Δ1 and Δ ≥ Δ2 versus Δ < Δ2. To find the evidence for the alternative in the first test, we use the VST of Δ̂ derived in Kulinskaya et al. [11]. This is a family of VSTs indexed by a parameter 0 < A < 1. For the choice A = 1/2, the nuisance parameter is ψ = p̄ = (p1 + p2)/2, and we also require N = n1 + n2, v = (1 - 2p̄)(1/2 - n2/N) and w = p̄(1 - p̄) + v². Then Equation 2.3 of [11] can be written:
T(Δ̂, p̄, Δ1) = √(4 n1 n2 / N) { arcsin[(Δ̂/2 + v)/√w] - arcsin[(Δ1/2 + v)/√w] }.    (9)
Reference [11] shows that the statistic T(Δ̂, p̄̂, Δ1), obtained by replacing p̄, v and w by their plug-in estimates, is for large n1, n2 normally distributed with a mean that is monotone increasing in Δ and equal to 0 at the null Δ = Δ1. Further, this statistic has variance 1 at the null, which allowed them to derive large-sample confidence intervals for Δ; these intervals are shown in [13] to be quite competitive even for small to moderate sample sizes. Next define
T- = T(Δ̂, p̄̂, Δ1), T+ = -T(Δ̂, p̄̂, Δ2), T = min{T-, T+}.
T- gives the putative evidence for the alternative Δ > Δ1 against the null Δ ≤ Δ1, while T+ gives the putative evidence for the alternative Δ < Δ2 against the null Δ ≥ Δ2. We say "putative" because, as [11] point out, the variances of these statistics can stray far from 1 if Δ is not near the null. However, the evidence T for equivalence Δ1 < Δ < Δ2 is better behaved, and has standard error similar to that for the two one-sided Z-test evidence discussed in Section 2. R scripts for computing (9) are in Appendix B.
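The following Python sketch implements (9) with plug-in nuisance estimates (an illustrative translation; the authors provide R scripts in Appendix B):

```python
import numpy as np

def vst_risk_diff(x1, n1, x2, n2, delta):
    """VST (9) for the risk difference (A = 1/2 case), evaluated with
    plug-in estimates of pbar, v and w."""
    p1 = (x1 + 0.5) / (n1 + 1)
    p2 = (x2 + 0.5) / (n2 + 1)
    dhat, pbar, N = p1 - p2, (p1 + p2) / 2, n1 + n2
    v = (1 - 2 * pbar) * (0.5 - n2 / N)
    w = pbar * (1 - pbar) + v**2
    c = np.sqrt(4 * n1 * n2 / N)
    return c * (np.arcsin((dhat / 2 + v) / np.sqrt(w))
                - np.arcsin((delta / 2 + v) / np.sqrt(w)))

def risk_diff_equiv_evidence(x1, n1, x2, n2, d0):
    """Evidence for |Delta| < d0: the minimum of the two one-sided evidences."""
    return min(vst_risk_diff(x1, n1, x2, n2, -d0),
               -vst_risk_diff(x1, n1, x2, n2, d0))
```

With the data of Example 2 below (x1 = 148, n1 = 225, x2 = 115, n2 = 167, Δ0 = 0.1), this returns T- ≈ 1.461 and T+ ≈ 2.719, matching the values reported there.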
Example 2. 
(Comparing methods of patient care.) As described in [18], the objective of a randomized trial was to determine whether a standard method of care for patients by doctors was comparable to nurse-practitioner care. For the first group, there were n1 = 225 patients, and of these X1 = 148 were found to have adequate care. For the second group, of n2 = 167 patients, X2 = 115 were found to have adequate care. Letting p1, p2 be the probabilities of adequate care for the first and second methods and Δ = p1 - p2, it was desired to test for "equivalence of treatments" defined by |Δ| < Δ0 = 0.1. For these data, Δ̂ = -0.03, T- = 1.461, T+ = 2.719 and T = 1.461, which is close to 1.65 with a standard error less than 1. That is, the evidence for equivalence is positive but weak. By way of comparison, Reference [18] found the p-value for the equivalence alternative |Δ| < Δ0 = 0.1 to be 0.005, but a later corrected analysis in [19] calculated it to be 0.07.
In order to obtain moderate expected evidence 3.3 ± 1 for equivalence |Δ| < Δ0 = 0.1 in this setting when in fact there is near equivalence, one would need sample sizes in each group near 1000.

2.3. Evidence for Equivalence in Two One-Sided t-Tests

For the t-test, the equivalence test of H0: |μ| ≥ μ0 against the alternative H1: |μ| < μ0 is based on n measurements of the differential effect. The t-statistic is a function of the estimated mean ȳ and standard deviation s obtained from the sample. The null hypothesis is two-sided, and its right-hand part is rejected if S+ = √n (ȳ - μ0)/s < qt_{n-1,α}, whereas the left-hand part is rejected if S- = √n (ȳ + μ0)/s > qt_{n-1,1-α}. In both of these expressions, qt_{n-1,p} denotes the p-quantile of the corresponding t-distribution. Both parts of the null hypothesis must be rejected in order to obtain significant evidence for equivalence. This holds if the confidence interval [ȳ ± qt_{n-1,1-α} s/√n] is contained within [-μ0, μ0].
If the true mean is μ, the t-statistic S+ has a non-central t-distribution with n - 1 degrees of freedom and non-centrality parameter λ = √n (μ - μ0)/σ. We are interested in evidence in favor of small |μ|. For the left-hand part, the non-centrality parameter of S- is λ = √n (μ + μ0)/σ, and we are interested in evidence in favor of large μ. The VST is derived in [9,20] and is defined by h(S) = √(2n) sinh⁻¹(S/√(2n)), where sinh⁻¹(x) = ln(x + √(x² + 1)). This is an increasing function and measures the evidence in favor of large μ. The evidence contained in the data in favor of equivalence is thus -h(S+) for the right-hand part of the null hypothesis, and h(S-) for the left-hand part. Both need to be sufficiently large in order to conclude in favor of equivalence; that is, the empirical evidence
Ê = min{ -√(2n) sinh⁻¹[(ȳ - μ0)/(√2 s)], √(2n) sinh⁻¹[(ȳ + μ0)/(√2 s)] }
must be at least 2, and preferably 3. Negative values of the empirical evidence can occur, and they are to be interpreted as evidence in favor of non-equivalence.
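A Python transcription of this empirical evidence (an illustrative sketch; the sample `y` is hypothetical data):

```python
import numpy as np

def t_equiv_evidence(y, mu0):
    """Empirical evidence for |mu| < mu0 based on the VST
    h(S) = sqrt(2n) * asinh(S / sqrt(2n)) of the two one-sided t-statistics."""
    y = np.asarray(y, dtype=float)
    n, ybar, s = y.size, y.mean(), y.std(ddof=1)
    c = np.sqrt(2 * n)
    upper = -c * np.arcsinh((ybar - mu0) / (np.sqrt(2) * s))  # right-hand part
    lower = c * np.arcsinh((ybar + mu0) / (np.sqrt(2) * s))   # left-hand part
    return min(lower, upper)
```

By the antisymmetry of sinh⁻¹, the evidence is unchanged when the signs of all observations are flipped, as it should be for the symmetric alternative |μ| < μ0.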
Example 3. 
Figure 4 shows a plot of the empirical evidence as a function of the average ȳ of the measurements. The evidence from the t-statistic is nearly linear in ȳ and is largest when ȳ is exactly halfway between the equivalence limits. The difference from the usual statistical tests is striking: there, the evidence grows with the distance from the null hypothesis and can become arbitrarily large, whereas here the maximal size of the evidence is limited by the equivalence limits.
The behavior of the evidence is as expected. If the sample size grows, so does the amount of evidence. If the standard deviation grows, there is less evidence, all other conditions being the same. The amount of evidence exceeds a desired amount (2 or 4, for example) if ȳ lies within an interval centered at the halfway mark between the equivalence limits.

Approximate Normality of the Variance Stabilized t-Statistic

The VST is symmetric with respect to the origin, because x + √(x² + 1) = 1/(-x + √(x² + 1)), that is, ln(x + √(x² + 1)) = -ln(-x + √(x² + 1)). The expansion
h(S) = S - (1/6) S³/(2n) + (3/40) S⁵/(4n²) + O(n⁻³)
shows that for values of S up to order O(n^{1/3}), the deviation from the identity is small. Only for values of the t-statistic S further into the tail does the VST pull them towards zero, that is, h(S) < S. For very large values of S, the function h(S) is logarithmic. The tail of the t-density evaluated at x is O(x^{-n}) as x → ∞ and thus has a tail index of n - 1. The VST transforms this to an infinite tail index.

3. Evidence in Multivariate Equivalence Tests

Multivariate equivalence tests are often based on a test statistic S having an exact or approximate non-central chi-squared distribution, denoted S ∼ χ²_ν(λ), where ν is the known degrees of freedom (df) and the non-centrality parameter (ncp) λ ≥ 0 is unknown. Others are based on the non-central F distribution; see Section 4. The null hypothesis postulates non-equivalence between the samples, λ ≥ λ0, whereas the alternative postulates practical equivalence, λ < λ0. The limit λ0 is a positive constant adapted to the context; examples are given in [17] and the following sections.
Wellek [8,20] considers the case of possibly dependent measurements (K of them) that are made independently on n subjects. He then wants to test whether the K measurements have equal means. His first proposal is to pass to the K - 1 differences between the K measurements and to use Hotelling's T² test for zero means. He then remarks on the elliptical shape of the equivalence region, which might be criticized as arbitrary. Reference [20] then discusses rectangular regions, as we do, and comments on the difficulties of this approach. The material in Section 3.2 below proposes a possible compromise solution. A fully Bayesian approach to multivariate equivalence testing is found in [5].

3.1. A VST for the Non-Central Chi-Squared Statistic

Once the "equivalence limit" λ0 is chosen, one can carry out a Neyman–Pearson test which rejects non-equivalence λ ≥ λ0 at level α in favor of equivalence when the test statistic is sufficiently small, that is, when S is less than the α-quantile c_α = χ²_{ν,α}(λ0) of the χ²_ν(λ0) distribution. The power function of this test is the probability of deciding in favor of equivalence
1 - β(λ) = P_λ(S ≤ c_α), 0 ≤ λ < λ0,    (10)
where β(λ) is the probability of falsely reaching a conclusion of non-equivalence.
The testing approach underlying (10) is easier to understand when the test statistic is variance stabilized. In this context a VST is a monotone decreasing function h(S) of the test statistic S which, for all values of λ, is approximately normal with variance one. Rather than summarizing the evidence by an accept/reject decision, a p-value or a confidence interval, we propose to use the statistic T = h(S) - h(E_{λ0}[S]), because it provides a more informative and interpretable measure of the evidence in favor of equivalence. The larger its value, the more evidence resides in the data in favor of equivalence. Since its variance remains close to one for all λ, only the value of T matters. The expected evidence E_λ[T] is
K_{λ0}(λ) = E_λ[T] ≈ h(E_λ[S]) - h(E_{λ0}[S]).    (11)
This is a quantity that increases monotonically as λ decreases to 0. By construction, it is 0 at the equivalence limit λ0 and has a maximal value of K_{λ0}(0). The observed evidence for equivalence can be reported as T ± 1, indicating that the evidence T for equivalence has a standard normal error.
One can derive an approximation for K_{λ0}(λ) = E_λ[T(S)], for the evidence in S ∼ χ²_ν(λ) for testing λ ≥ λ0 against λ < λ0, as follows. The variance Var_λ[S] = 2ν + 4λ is not constant (not stable), and in order to stabilize it asymptotically one can use the standard delta-method [9] (p. 242). Using the fact that E_λ[S] = ν + λ ≥ ν, one has Var_λ[S] = g(E_λ[S]), where g(s) = 4s - 2ν is defined for s ≥ ν. The transformation h(s) that removes the dependence on λ is an antiderivative of -1/√g(s), which leads to h(s) = -√(s - ν/2), where the negative sign was chosen in order to obtain a decreasing function of s. This standard procedure fails to define a VST for all s ≥ 0, because strictly speaking our function h(s) should only be applied in the range s ≥ ν, and even if we tried to extend it towards s = 0, its value is undefined for s < ν/2.
This problem is created by the zero crossing of g(s) at s = ν/2, and various ideas for extending h(s) to the entire positive real line could be tried; see (13) below. For this or any other such choice agreeing with h(s) for s > ν, the transformed statistic h*(S) has approximate expected value E_λ[h*(S)] ≈ h*(E_λ[S]) = h*(ν + λ) = h(ν + λ) = -√(λ + ν/2). After centering h*(S) at the limit λ0, one obtains the observed evidence T = h*(S) + √(λ0 + ν/2). The expected evidence for equivalence, to first order, is thus
K_{λ0}(λ) = E_λ[T] ≈ √(λ0 + ν/2) - √(λ + ν/2).    (12)
The expected evidence (12) in the experiment has a maximum at λ = 0, namely K_{λ0}(0) = √(λ0 + ν/2) - √(ν/2). The dotted line in Figure 5 shows what the equivalence limit λ0 must be as a function of ν, namely λ0 = K²_{λ0}(0) + √(2ν) K_{λ0}(0), when the maximal expected evidence is of varying strengths. For example, when ν = 15, for moderate maximal expected evidence, λ0 must be at least 29. More frequently, the equivalence bound λ0 is determined by context, and the degrees of freedom are then constrained by ν ≤ (λ0 - K²_{λ0}(0))²/{2 K²_{λ0}(0)}.
The evidence for λ < λ0 in S ∼ χ²_ν(λ) can be defined for certain M > K_{λ0}(0):
T = T_{ν,λ0}(S) = M - S/√(2ν), for S < ν; M - √(S - ν/2), for S ≥ ν.    (13)
This T has a negative continuous derivative, and for the choice M = √λ0 it has T ≈ N(0, 1) at the null λ = λ0. Further, at perfect equivalence λ = 0 it has E[T] ≈ K_{λ0}(0) for K_{λ0} given by (12). These claims are based on simulation studies, using R scripts in Appendix B.
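The piecewise transformation (13), together with a simulation check at the null, can be sketched in Python as follows (standing in for the R scripts of Appendix B; the values ν = 3, λ0 = 29 are illustrative):

```python
import numpy as np
from scipy.stats import ncx2

def evidence_chisq(s, nu, lam0):
    """Evidence (13) for lambda < lam0 when S ~ chi-squared_nu(lambda),
    with M = sqrt(lam0); the two pieces agree in value and slope at S = nu."""
    s = np.asarray(s, dtype=float)
    M = np.sqrt(lam0)
    return np.where(s < nu,
                    M - s / np.sqrt(2 * nu),
                    M - np.sqrt(np.clip(s - nu / 2, 0.0, None)))

# Simulation at the null lambda = lam0: T should be close to N(0, 1)
nu, lam0 = 3, 29.0
s = ncx2.rvs(nu, lam0, size=200_000, random_state=0)
t = evidence_chisq(s, nu, lam0)
print(round(t.mean(), 2), round(t.std(), 2))
```

A delta-method calculation shows why the standard deviation is close to one here: sd(S) = √(2ν + 4λ0) while the local slope of the VST piece is 1/(2√(λ0 + ν/2)), and the two balance exactly.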

3.2. Evidence for Equal Means

Given independent X_k ∼ N(μ_k, 1), k = 1, ..., K, we want to find evidence for equivalence in the sense that all means simultaneously equal μ̄ = (Σ_k μ_k)/K. A test statistic is S = Σ_k (X_k - X̄)², which has distribution S ∼ χ²_{K-1}(λ), where λ = Σ_k (μ_k - μ̄)². In practice, the μ_k are considered "equal" if max_k |μ_k - μ̄| ≤ ε for a given ε > 0. How can this last notion of equivalence be translated into a value for the limit λ0? To solve this problem, we note that {(μ1 - μ̄), ..., (μ_K - μ̄)} lies in a K - 1 dimensional hyperplane of R^K, which when shifted to the origin is a K - 1 dimensional subspace of R^K, and distances between points in the hyperplane are preserved under translation. Thus it suffices to solve the following problem for arbitrary K ≥ 2: given μ1, ..., μ_K with μ̄ = (Σ_k μ_k)/K = 0 and max_k |μ_k| ≤ ε, find an appropriate choice of λ0 = Σ_k μ_k² = r², which is the square of the Euclidean distance r of the point (μ1, ..., μ_K) from the origin in R^K. After a solution for the equivalence boundary λ0 = λ0(ε, K) is found, it can be implemented in the case of unknown μ̄ by replacing K by K - 1.
The largest ball contained in the K-dimensional cube with edge 2ε, both having the same center, has radius ε, whereas the smallest ball containing the same cube has radius √K ε. The latter choice would allow one μ_k to be as large as √K ε (if all other μ_j = 0), and overall equivalence would be claimed even though it was violated in one coordinate. To reduce this violation, the radius ought to be in between these extremes. A "reasonable" compromise requires the volume of the L2-ball to equal (2ε)^K, the volume of the cube with side 2ε. The volume of a ball in R^K with radius r is given by:
V_K(r) = π^m r^{2m} / Γ(m + 1), for K = 2m; V_K(r) = π^m (2r)^{2m+1} Γ(m + 1) / Γ(2m + 2), for K = 2m + 1.
To ensure equal volumes for the hypersphere and the hypercube for all K, one needs the radius of the ball to be r0 = √(K b_K) ε, where
b_K = 4 {Γ(1 + K/2)}^{2/K} / (π K), for K = 2m; b_K = {Γ(K + 1)/Γ((K + 1)/2)}^{2/K} / (π^{1-1/K} K), for K = 2m + 1.
A good approximation to b_K is given by c_K = 2/(π e) + (1.7 K)^{-5/6}; see Table 2 and Figure 6. This makes it easy to choose the desired equivalence limit λ0 = r0² = b_K K ε² ≈ 2 K ε²/(π e) for moderate and large K.
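The equal-volume constant and its approximation are straightforward to compute; a Python sketch:

```python
import math

def b_K(K):
    """Equal-volume constant: the ball of radius sqrt(K * b_K) * eps
    has the same volume as the cube [-eps, eps]^K."""
    if K % 2 == 0:
        return 4 * math.gamma(1 + K / 2) ** (2 / K) / (math.pi * K)
    return ((math.gamma(K + 1) / math.gamma((K + 1) / 2)) ** (2 / K)
            / (math.pi ** (1 - 1 / K) * K))

def c_K(K):
    """Approximation c_K = 2/(pi e) + (1.7 K)^(-5/6)."""
    return 2 / (math.pi * math.e) + (1.7 * K) ** (-5 / 6)
```

For K = 2 this reduces to b_2 = 2/π, and b_3 ≈ 0.513, the value used in Example 4 below.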
Example 4. 
To illustrate the use of this table, suppose K = 4, μ is unknown, and we want each of the μ_k to satisfy |μ_k − μ| ≤ ϵ = 1/2, say, in order to claim “equivalence of all means”. As discussed above, the corresponding problem with μ known to equal 0 is solved in K − 1 = 3 dimensions, using the Table 2 coefficient for K = 4 arms: λ_0 = r_0² = 3 b_4 ϵ² = 3 × 0.450 × 0.25 = 0.3375. From Equation (12) it then follows that for K = 4, ν = 3 and this λ_0, the maximum expected evidence possible for equivalence of all 4 means is about 0.13, which is almost negligible.
If we had begun this section with each X_k replaced by X̄_k ∼ N(μ_k, 1/n) for some sample size n ≥ 2, then the test statistic would be S_n = n Σ_k (X̄_k − X̄̄)², which has distribution S_n ∼ χ²_ν(λ), where ν = K − 1 and λ = n Σ_k (μ_k − μ)². Imposing the same condition max_k |μ_k − μ| ≤ ϵ, with μ unknown, the “equal-volume” solution is Σ_k (μ_k − μ)² = ν b_K ϵ², so the appropriate equivalence hypothesis is λ < λ_0 = n ν b_K ϵ². The maximum expected evidence in T_n = T(S_n) is attained when all means are equal, and this maximum is √(λ_0 + ν/2) − √(ν/2) = √(n ν b_K ϵ² + ν/2) − √(ν/2), which grows at rate √n. Continuing Example 4, n = 20 yields weak maximum expected evidence 1.65 for equivalence.
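The maximum-expected-evidence formula of Equation (12) is easy to evaluate. The sketch below (Python; it takes the paper's value λ_0 = 0.3375 from Example 4 as given) reproduces the quoted figures 0.13 and 1.65:

```python
import math

def max_expected_evidence(lambda0, nu):
    # sqrt(lambda0 + nu/2) - sqrt(nu/2): the expected evidence at perfect
    # equivalence (lambda = 0), per Equation (12) of the text.
    return math.sqrt(lambda0 + nu / 2) - math.sqrt(nu / 2)

nu, lambda0 = 3, 0.3375          # Example 4 (K = 4 arms, epsilon = 1/2)
print(round(max_expected_evidence(lambda0, nu), 2))        # 0.13
print(round(max_expected_evidence(20 * lambda0, nu), 2))   # 1.65 (n = 20 per arm)
```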

3.3. Application to between Group Sum of Squares

Given independent observations from K groups with different means, X_ki ∼ N(μ_k, 1), i = 1, …, n_k, k = 1, …, K, we denote the total sample size by N = Σ_k n_k and the group sample proportions by q_k = n_k/N. Let the kth sample mean be X̄_k and the weighted group sample mean be X̄ = Σ_k q_k X̄_k; it is an unbiased estimator of the weighted population mean μ = Σ_k q_k μ_k. Then the between group sum of squares SS_between = N Σ_k q_k (X̄_k − X̄)² ∼ χ²_ν(λ), where ν = K − 1 and λ = N Σ_k q_k (μ_k − μ)²; see, for example, [9] (p. 184). The evidence in S ∼ χ²_ν(λ) for equivalence λ < λ_0 was derived in Section 3.1. The practical problem is to choose λ_0. Once λ_0 is chosen, one can compute the maximum evidence in the experiment; by (12) it equals √(λ_0 + (K−1)/2) − √((K−1)/2). As explained in Section 3.1, this determines the maximum power for equivalence at any given level.
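The distributional claim for SS_between can be checked by simulation. The sketch below (Python with NumPy; the group sizes and means are illustrative choices of ours) compares the Monte Carlo mean of SS_between with ν + λ, the mean of a χ²_ν(λ) variate:

```python
import numpy as np

rng = np.random.default_rng(1)
n = np.array([10, 12, 8])            # group sizes (our illustrative choice)
mu = np.array([0.2, 0.0, -0.1])      # group means, sigma = 1
N, K = n.sum(), len(n)
q = n / N
mu_bar = q @ mu                                 # weighted population mean
lam = float((n * (mu - mu_bar) ** 2).sum())     # ncp: N * sum q_k (mu_k - mu)^2
nu = K - 1

reps = 200_000
xbar = rng.normal(mu, 1 / np.sqrt(n), size=(reps, K))   # group sample means
xbb = xbar @ q                                          # weighted grand means
ssb = (n * (xbar - xbb[:, None]) ** 2).sum(axis=1)      # SS_between
print(round(lam, 3))                                    # 0.432
print(abs(ssb.mean() - (nu + lam)) < 0.05)              # True: mean is nu + lambda
```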
First we consider the argument of [17] (p. 164) for choosing λ_0. Wellek introduces the parameter ψ² = Σ_k (n_k/n̄)(μ_k − μ)², where n̄ = N/K, which he calls a generalized squared Euclidean distance between (μ_1, …, μ_K) and (μ, …, μ). He proposes to define homogeneity in terms of ψ² and the equivalence hypothesis by ψ² ≤ ϵ², where ϵ is to be chosen. Note that our ncp is λ = n̄ψ², the same as his. For K = 2 the condition ψ² ≤ ϵ² reduces to |μ_2 − μ_1| ≤ √2 ϵ. This leads [17] (p. 164), using conditions for comparing two normal populations, to suggest that, in general, one take λ_0,Wellek = n̄ϵ², with ϵ ranging from 1/4 to 1/2, where ϵ = 1/4 yields a “strict” equivalence limit n̄/16 while ϵ = 1/2 leads to what he calls a “liberal” limit of n̄/4. However, this approach assumes that the requirement max_k |μ_k − μ| ≤ ϵ for a pre-specified ϵ holds for all K. From our point of view, the choice of λ_0 should grow with K, and ϵ should be determined by context.
In the balanced case n_k ≡ n, we can make direct comparisons between Wellek’s criterion and ours, derived near the end of Section 3.2, which yielded λ_0,MS = n ν b_K ϵ². Assuming the same choice of ϵ, the ratio λ_0,MS/λ_0,Wellek = ν b_K. Using Table 2, this ratio varies considerably with ν = K − 1 = 1, 2, 3, … and equals, respectively, 0.64, 1.03, 1.35, 1.65, 1.93, …; further, it grows with ν at rate 2ν/(πe). Only for ν = 2 are the two criteria nearly the same.

4. Testing for Equivalence of K Groups

Suppose we are given independent observations from K groups with common unknown variance σ² > 0 and possibly different means: X_ki ∼ N(μ_k, σ²), i = 1, …, n_k, k = 1, …, K. Let N = Σ_k n_k, q_k = n_k/N for all k, and the overall mean μ = Σ_k q_k μ_k. For a given ϵ > 0, we define the means to be equivalent if none of the standardized μ_k differs from zero by more than ϵ:
$$\max_k \frac{|\mu_k - \mu|}{\sigma} \le \epsilon.$$
Just as in Section 3.3, where σ was assumed known (and without loss of generality set equal to 1), we can define λ = N Σ_k q_k {(μ_k − μ)/σ}². Given λ_0, we want evidence for the hypothesis of equivalence λ < λ_0. The arguments for choosing λ_0 = λ_0,MS = n ν_1 b_K ϵ², where ν_1 = K − 1, carry over from Section 3.2 and Section 3.3, provided sampling is nearly balanced so that all n_k ≈ n̄ = n.
The within group sum of squares is defined by
$$SS_{\mathrm{within}} = \sum_{k=1}^{K} \sum_{i=1}^{n_k} (X_{ki} - \bar X_k)^2.$$
Standard theory [9] (p. 196) shows that S = (SS_between/ν_1)/(SS_within/ν_2) has a non-central F distribution with df ν_1 = K − 1, ν_2 = N − K and ncp λ. A VST for the statistic S has been derived by [21] and also [9] (p. 197); it assumes ν_2 > 4. Let a² = (ν_2 − 4)/2 and c² = ν_2²(ν_1 + ν_2 − 2)/{ν_1²(ν_2 − 2)}. The VST is h = h(S), defined by h(s) = −a cosh^{−1}{(s + ν_2/ν_1)/c}, where the inverse hyperbolic cosine function is cosh^{−1}(x) = ln(x + √(x² − 1)). Since cosh^{−1}(x) is only defined for x ≥ 1, the VST h(s) is only defined for s > c − ν_2/ν_1.
The evidence for λ < λ 0 is defined by T = h ( S ) - h ( E λ 0 [ S ] ) . The expected value of the non-central F-statistic S is E λ [ S ] = ( ν 1 + λ ) ν 2 / { ν 1 ( ν 2 - 2 ) } , so the expected evidence, to first order, is given by:
$$K_{\lambda_0}(\lambda) = E_\lambda[h(S)] - h(E_{\lambda_0}[S]).$$
Unfortunately, the VST itself, and hence the evidence T = h(S) − h(E_{λ_0}[S]) for equivalence, is undefined for informative small values of S. To make it useful, we extend the evidence function to small values via monotone linearization in (19). Strictly speaking, the F-statistic VST was only derived for s > b = ν_2/(ν_2 − 2), as mentioned in [9] (p. 197), for the same reason that the chi-squared VST derivation had a limited domain; see Section 3.1. For the same reasons given there, we can extend it to 0 ≤ s ≤ b without much changing the expectation (18). Evidence for equivalence λ < λ_0 based on S ∼ F_{ν_1,ν_2,λ} is defined, for a suitable M > K_{λ_0}(0), by
$$T = T_{\nu_1,\nu_2,\lambda_0}(S) = \begin{cases} M - a\,S\,\cosh^{-1}\{(b + \nu_2/\nu_1)/c\}/b, & S < b;\\[4pt] M - a\,\cosh^{-1}\{(S + \nu_2/\nu_1)/c\}, & S \ge b. \end{cases}$$
This T is continuous and increases as S decreases to 0. For the choice M = √(λ_0 − ν_1/(ν_1 + ν_2)), it has T ≈ N(0, 1) at the null λ = λ_0; and at perfect equivalence, λ = 0, it has E[T] ≈ K_{λ_0}(0) for the K_{λ_0} of (18).
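The extended evidence function (19) can be sketched in a few lines. This Python sketch uses the centering M = √(λ_0 − ν_1/(ν_1 + ν_2)) appearing in the appendix script:

```python
import math

def evid_F_equiv(S, nu1, nu2, lambda0):
    """Evidence for equivalence lambda < lambda0 based on S ~ F(nu1, nu2, lambda),
    per Equation (19) of the text; requires nu2 > 4."""
    a = math.sqrt((nu2 - 4) / 2)
    b = nu2 / (nu2 - 2)
    c = (nu2 / nu1) * math.sqrt((nu1 + nu2 - 2) / (nu2 - 2))
    M = math.sqrt(lambda0 - nu1 / (nu1 + nu2))
    if S >= b:
        # usual VST branch
        return M - a * math.acosh((S + nu2 / nu1) / c)
    # monotone linear extension for 0 <= S < b
    grad = -a * math.acosh((b + nu2 / nu1) / c) / b
    return M + grad * S
```

Applied to the Example 5 statistic S = 0.0926 with ν_1 = 3, ν_2 = 46 and λ_0 = 4.21875, this returns approximately 1.934, matching the text; the two branches agree at S = b, so T is continuous there.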
Example 5. 
Example 7.1 of [17] (p. 165) considers four treatments for hypertension, with measurements taken on diastolic blood pressure averaged over an interval. To test for equivalence of the treatments, the following summary data (sample size, mean, standard deviation) were recorded: n_1 = 10, x̄_1 = 99.8120, s_1 = 7.5639; n_2 = 12, x̄_2 = 99.2903, s_2 = 5.9968; n_3 = 13, x̄_3 = 100.0024, s_3 = 10.4809; and n_4 = 15, x̄_4 = 98.6407, s_4 = 4.5309. This yields N = 50, n̄ = 12.5 and x̄̄ = 99.3849. Continuing, SS_between = 15.196 and SS_within = 2516.135, so the F-test statistic is S = (SS_between/(K − 1))/(SS_within/(N − K)) = 0.0926. For ϵ = 0.5 and λ_0,MS = n̄ ν_1 b_K ϵ² = 4.219, we have by (19) the evidence for equivalence T = 1.934, which is slightly more than weak.
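For concreteness, the summary-data computations of this example can be reproduced as follows (a Python sketch of the standard one-way ANOVA sums of squares):

```python
# Summary data from Example 5 (four treatment arms)
ns    = [10, 12, 13, 15]
means = [99.8120, 99.2903, 100.0024, 98.6407]
sds   = [7.5639, 5.9968, 10.4809, 4.5309]

N, K = sum(ns), len(ns)
grand = sum(n * m for n, m in zip(ns, means)) / N            # weighted grand mean
ssb = sum(n * (m - grand) ** 2 for n, m in zip(ns, means))   # SS_between
ssw = sum((n - 1) * s ** 2 for n, s in zip(ns, sds))         # SS_within
S = (ssb / (K - 1)) / (ssw / (N - K))                        # F-statistic
print(round(grand, 4), round(S, 4))   # 99.3849 0.0926
```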
By way of comparison, [17] uses traditional hypothesis testing with the smaller equivalence boundary λ 0 , W e l l e k = n ¯ ϵ 2 = 3.125 and finds the non-central F-test significant at level 0.05 with an estimated power 0.18 for detecting perfect equivalence.

5. Summary and Discussion

For the test statistic S ∼ N(μ, 1) of a null hypothesis μ ≤ μ_0 against the alternative μ > μ_0, Neyman–Pearson methods help one make a decision, while a p-value can provide a measure of surprise regarding the boundary hypothesis μ = μ_0. But what one often wants from S is a measure of evidence for the alternative μ > μ_0. In this simplest of statistical tests, T = S − μ_0 is such a measure, because T estimates the unknown expected evidence E[T] = μ − μ_0, which increases linearly with μ and comes with an easily understood standard normal error. Values of T near 1.645, 3.3 and 5 are interpreted as weak, moderate and strong evidence for the alternative μ > μ_0. In addition, the expected evidence can be written as the sum of the probits of level and power.
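The probit decomposition mentioned here can be verified numerically with the Python standard library; the level α = 0.05 and expected evidence 3.29 below are illustrative choices of ours:

```python
from statistics import NormalDist

nd = NormalDist()
alpha, expected_evidence = 0.05, 3.29
# A level-alpha test rejects when T > z_{1-alpha}; since T ~ N(E[T], 1),
# its power is Phi(E[T] - z_{1-alpha}), so E[T] = probit(1-alpha) + probit(power).
power = nd.cdf(expected_evidence - nd.inv_cdf(1 - alpha))
print(round(power, 3))
print(round(nd.inv_cdf(1 - alpha) + nd.inv_cdf(power), 2))   # 3.29, recovering E[T]
```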
The vast majority of routine statistical tests can be transformed into the above setting through variance stabilization. And the mean of a variance stabilized test statistic T = VST ( S ) , after centering at μ 0 , is very close to the signed square root of the Kullback–Leibler symmetrized divergence between the null and alternative distributions. This result gives more theoretical support for calling T the evidence for the alternative; references are listed in Section 1.
What we have done here is to extend the above ideas to the more complicated hypotheses of the form | μ | μ 0 versus the equivalence alternative | μ | < μ 0 , with applications for TOST based on one- and two-sample binomial experiments and two one-sided t-tests. Then, in the multivariate setting, we have found modifications of the classical VSTs for non-central chi-squared and non-central F-tests which make it practicable to find evidence for the equivalence alternative λ < λ 0 to the null hypothesis λ λ 0 of non-equivalence.
The practical choice of equivalence limit λ 0 is also an important ingredient, and we have provided a new approach to assist in its choice. In particular, when testing for equivalence of means in K arms of a study, we found the value of the radius required so that the K-ball has the same volume as the K-cube of edge 2 ϵ . This leads to a proposal for converting the condition max k | μ k - μ | ϵ into an approximate equivalence condition λ < λ 0 .
The new extensions of classical VSTs for non-central chi-squared and F statistics require more work in the choice of M to center them properly at the null λ 0 . Simulation studies show that adjusting M so that the mean of the VST is 0 when λ = λ 0 automatically ensures that the mean of the VST statistic when λ = 0 is near its expected maximum value. The choice M λ 0 works adequately, but simple formulae for M that depend on the degree(s) of freedom as well as λ 0 would be useful.
Finally, we note that finding simple expressions for the expected evidence in a VST greatly assists one in finding minimal sample sizes when planning an experiment for determining equivalence; by knowing the maximum expected evidence for equivalence, one also learns of the power to detect perfect equivalence at any given level.
Another application is in goodness-of-fit tests, where instead of “backing into” a model by not rejecting it at a liberal level such as 0.1, one could find the evidence for the model. Chapter 8 of [17] is a good starting point for solving this problem. Further research will also take into account recent results on bio-equivalence defined in terms of Kullback–Leibler divergences; see [4,7].

Acknowledgments

The authors thank the Editors and referees for their many helpful comments and suggestions which have improved the breadth and clarity of the manuscript.

Author Contributions

The two authors collaborated on both the research for, and writing of this manuscript. Both authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Quality of KLD Approximations

Let S have density f_λ belonging to a family of densities indexed by a real parameter λ. Our purpose in this section is to compare the expected evidence K_{λ_0}(λ) of the VST for testing λ ≥ λ_0 against the equivalence alternative λ < λ_0 with the signed square root of the Kullback–Leibler [22] symmetrized divergence (KLD) between the null f_{λ_0} and alternative f_λ distributions; that is, to examine the approximation
$$K_{\lambda_0}(\lambda) \approx \operatorname{sgn}(\lambda_0 - \lambda)\,\sqrt{J(\lambda_0, \lambda)},$$
where the KLD is defined by
$$J(\lambda_0, \lambda) = E_{\lambda_0}[\log\{f_{\lambda_0}(S)/f_\lambda(S)\}] + E_\lambda[\log\{f_\lambda(S)/f_{\lambda_0}(S)\}].$$
References [10,12,13] discuss the theory behind such approximations, but here we consider numerical examples relevant to multivariate equivalence testing.

Appendix A.1. Non-Central Chi-Squared Distribution

Let f_λ be the noncentral χ² density with ν degrees of freedom, so that by (12), K_{λ_0}(λ) = E_λ[T] = √(λ_0 + ν/2) − √(λ + ν/2). There is no simple analytic expression for J_{χ²}(λ_0; λ), so we use numerical approximation to compute it and its signed square root. The graph of the latter (for ν = 1, λ_0 = 6) is shown in the top left plot of Figure A1 as a dashed line, and is to be compared with the graph of K_6(λ) defined by (12), shown as a solid line. Note that they are very close over the range of λ of interest, and both pass through 0 at λ = λ_0. Our main point is to emphasize the quality of the approximation (A1), which is clear from the plot to its right, showing that the absolute relative error is less than 1 in 16 over this range of λ.
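This comparison is easy to reproduce without special libraries: the noncentral χ² density is a Poisson mixture of central χ² densities, and J can then be approximated by quadrature. A sketch in Python follows; the parameter choice ν = 3, λ_0 = 6, λ = 1 is ours, and the assertion tolerance is deliberately looser than the error levels reported here:

```python
import math

def make_ncx2_pdf(df, lam, terms=60):
    # Noncentral chi-squared pdf as a Poisson(lam/2) mixture of central
    # chi-squared pdfs with df + 2j degrees of freedom.
    logw = [-lam / 2 + j * math.log(lam / 2) - math.lgamma(j + 1)
            for j in range(terms)]
    def pdf(x):
        tot = 0.0
        for j, lw in enumerate(logw):
            d = df + 2 * j
            tot += math.exp(lw + (d / 2 - 1) * math.log(x) - x / 2
                            - (d / 2) * math.log(2.0) - math.lgamma(d / 2))
        return tot
    return pdf

def sym_kld(df, lam0, lam, lo=1e-4, hi=60.0, n=3000):
    # Trapezoidal approximation to J = integral of (f0 - f1) * log(f0/f1)
    f0, f1 = make_ncx2_pdf(df, lam0), make_ncx2_pdf(df, lam)
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        x = lo + i * h
        a, b = f0(x), f1(x)
        w = 0.5 if i in (0, n) else 1.0
        total += w * (a - b) * math.log(a / b)
    return total * h

nu, lam0, lam = 3, 6.0, 1.0
K = math.sqrt(lam0 + nu / 2) - math.sqrt(lam + nu / 2)  # expected evidence, Eq. (12)
rootJ = math.sqrt(sym_kld(nu, lam0, lam))
print(round(K, 3))            # 1.157
print(round(rootJ, 3))        # close to K, per approximation (A1)
```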
Figure A2 and others not shown demonstrate that quite generally the absolute relative error in the approximation (A1) is less than 1 in 20 over a wide range of equivalence experiments with non-central chi-squared distributed outcomes.
Figure A1. Non-central Chi-squared with parameters ν , λ : The upper left plot shows the graph of the expected evidence K 6 ( λ ) when ν = 1 as a solid line, to be compared with the signed square root of the KLD, shown as a dashed line. The absolute relative error in this approximation is shown on its right. The lower two plots are for λ 0 = 12 and ν = 1 .
Figure A2. The same notation as in Figure A1, but for different parameters.

Appendix A.2. Non-Central F Distribution

If f_λ denotes the noncentral F density with ν_1, ν_2 degrees of freedom and ncp λ, then by (18), and using E_λ[S] = (ν_1 + λ)ν_2/{ν_1(ν_2 − 2)},
$$K_{\lambda_0}(\lambda) = E_\lambda[T] = a\,\cosh^{-1}\{(E_{\lambda_0}[S] + \nu_2/\nu_1)/c\} - a\,\cosh^{-1}\{(E_\lambda[S] + \nu_2/\nu_1)/c\}.$$
As in the chi-squared case, there is no simple analytic expression for J_F(λ_0; λ), so we use numerical approximation to compute it and its signed square root. Figure A3 and Figure A4 show typical comparative results between the above expected evidence and the signed square root of the KLD between the null f_{λ_0} and alternative f_λ distributions. These and similar plots provide more numerical support for the approximation (A1).
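As a small check of (A2): at λ = λ_0 the expected evidence is exactly zero, and at λ = 0 it gives the maximum expected evidence. The sketch below (Python) evaluates it, as an assumption of ours, at the Example 5 parameters ν_1 = 3, ν_2 = 46, λ_0 = 4.21875 rather than at the parameters plotted in Figure A3:

```python
import math

def expected_evidence_F(lam, nu1, nu2, lam0):
    # K_{lambda0}(lambda) per (A2), with E_lam[S] = (nu1+lam)*nu2/(nu1*(nu2-2))
    a = math.sqrt((nu2 - 4) / 2)
    c = (nu2 / nu1) * math.sqrt((nu1 + nu2 - 2) / (nu2 - 2))
    r = nu2 / nu1
    ES = lambda l: (nu1 + l) * nu2 / (nu1 * (nu2 - 2))
    return a * (math.acosh((ES(lam0) + r) / c) - math.acosh((ES(lam) + r) / c))

print(round(expected_evidence_F(0.0, 3, 46, 4.21875), 3))      # about 1.096 (maximum)
print(round(expected_evidence_F(4.21875, 3, 46, 4.21875), 3))  # 0.0 at the null
```

This matches the `maxexev` computation in the Appendix B script.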
Figure A3. Non-central F with parameters ν 1 , ν 2 and λ: The upper left plot shows the graph of the expected evidence K 10 ( λ ) when ν 1 = 1 , ν 2 = 20 as a solid line, to be compared with the signed square root of the KLD, shown as a dashed line. The absolute relative error in this approximation is shown on its right. The lower two plots are also for λ 0 = 10 but now ν 1 = 4 and ν 2 = 20 .
Figure A4. The same notation as in Figure A3, but for different parameters.

Appendix B. R Scripts for Computing VSTs

#############  Evidence for equivalence
#############  using TOST, and one-sample binomial data
vstbinom <- function(n,p)
{h <- 2*sqrt(n)*asin(sqrt(p))
return(h)}
 
############# Usually p1=p0-Delta0,p2=p0+Delta0
bioevid <- function(n,p1,p2,phat)
{Tminus <- vstbinom(n,phat)-vstbinom(n,p1)
Tplus  <- vstbinom(n,p2)-vstbinom(n,phat)
evidforequiv <- min(Tminus,Tplus)
out <- c(Tminus,Tplus,evidforequiv)
outrd <- round(out,digits=3)
return(outrd)}
 
#############  Example 1 of the text. (data from Example 4.2 of Wellek, page 59.)
n <- 273
phat <- 199/273
p1 <- 0.65
p2 <- 0.75
bioevid(n,p1,p2,phat)
 
############################################################################
############# Evidence for equivalence for risk difference
 
vstRD <- function(n1,p1hat,n2,p2hat,Delta0)
{Deltahat <- p1hat-p2hat
pbarhat <- (p1hat+p2hat)/2
N <- n1+n2
vhat <- (1-2*pbarhat)*(1/2-n2/N)
what <- sqrt(pbarhat*(1-pbarhat)+vhat^2)
vst <- sqrt(4*n1*n2/N)*(asin((Deltahat/2 +vhat)/what)-asin((Delta0/2 +vhat)/what))
return(vst)}
 
############## Usually Delta1= -Delta0 and Delta2= +Delta0
bioevidRD <- function(x1,n1,x2,n2, Delta1,Delta2)
{p1hat <- (x1+0.5)/(n1+1)
p2hat <- (x2+0.5)/(n2+1)
Deltahat <- p1hat-p2hat
Tminus <- vstRD(n1,p1hat,n2,p2hat,Delta1)
Tplus  <- -vstRD(n1,p1hat,n2,p2hat,Delta2)
T <- min(Tminus,Tplus)
out <- c(Deltahat,Tminus,Tplus,T)
outrd <- round(out,digits=3)
return(outrd)}
 
####### Example 2 of the text. (data from Dunnett and Gent(1977) Biometrics)
x1 <- 148
n1 <- 225
x2 <- 115
n2 <- 167
Delta1 <- -0.1
Delta2 <- 0.1
bioevidRD(x1,n1,x2,n2, Delta1,Delta2)
 
###########################################################################
## Linear extension of vst for chisq(nu,lambda) (Equation (13) of the text.)
 
evidchisqlin <- function(s,nu,lambda0,M)
{Tvst <- M-sqrt(s[s>nu]-nu/2)   ## usual vst, for s > nu
grad <- -sqrt(nu/2)/nu
Tlin <- M+grad*s[s<=nu]         ## linear extension for small s
T <- c(Tlin,Tvst)               ## small-s piece first, then large-s piece
return(T)}
 
Mfun <- function(nu,lambda0)    ## requires lambda0 > nu/2
{return(sqrt(lambda0))}
 
# Mfun2 <- function(nu,lambda0) ##  The user may want to supply their own M.
# {return(sqrt(lambda0+nu)-sqrt(nu/10))}
 
############### Plot evidence function (illustrative example)
nu = 3
lambda0 = 6
maxexev <- sqrt(lambda0+nu/2)-sqrt(nu/2)
maxexev
M <- Mfun(nu,lambda0)
M
s <- c(seq(0,lambda0+nu,.01))
T <- evidchisqlin(s,nu,lambda0,M)
plot(s,T,type="l",lwd=2,main="Evidence for equivalence")
abline(h=maxexev,lty=3)
abline(h=0)
 
##################### To examine properties of evidence function.
lambda = lambda0   ## Try this and other lambda, where 0 <= lambda < lambda0.
s <- rchisq(10000,nu,lambda)
T <- evidchisqlin(s,nu,lambda0,M)
mean(T)
sd(T)
hist(T)
 
###########################################################################
## Linear extension of VST for F(nu1,nu2,lambda) (Equation (19) of the text.)
 
evidFlin <- function(s,nu1,nu2,lambda0,M)   ### requires nu2 > 4
{c <- (nu2/nu1)*sqrt((nu1+nu2-2)/(nu2-2))
b <- nu2/(nu2-2)
a <- sqrt((nu2-4)/2)
Tvst <- M-a*acosh((s[s>b]+nu2/nu1)/c)   ## usual vst, for s > b
grad <- -a*acosh((b+nu2/nu1)/c)/b
Tlin <- M+grad*s[s<=b]                  ## linear extension for small s
T <- c(Tlin,Tvst)                       ## small-s piece first, then large-s piece
return(T)}
 
###################    Example 5 of the text. (data from Wellek, Example 7.2, p.165.)
 
n1 <- 10
x1bar <- 99.8120
s1 <- 7.5639
n2 <- 12
x2bar <- 99.2903
s2 <- 5.9968
n3 <- 13
x3bar <- 100.0024
s3 <- 10.4809
n4 <- 15
x4bar <- 98.6407
s4 <- 4.5309
N <- n1+n2+n3+n4
xbarbar <- (n1*x1bar+n2*x2bar+n3*x3bar+n4*x4bar)/N  ## 99.3849
K <- 4
nbar <- N/K
ssqW <- (n1-1)*s1^2+(n2-1)*s2^2+(n3-1)*s3^2+(n4-1)*s4^2
ssqB <- n1*(x1bar-xbarbar)^2+n2*(x2bar-xbarbar)^2+n3*(x3bar-xbarbar)^2
ssqB <- ssqB+n4*(x4bar-xbarbar)^2
 
epsilon <- 1/2                 ## Wellek's choice; want max_k |mu_k-mu|/sigma <= epsilon
nu1 <- K-1
nu2 <- N-K
lambda0 <- 12.5*0.25*1.35      ## nbar * eps^2 * nu1*bK = 12.5*0.25*3*0.45 = 4.21875
 
MFfun <- function(nu1,nu2,lambda0)
{return(sqrt(lambda0-nu1/(nu1+nu2)))}
 
M <- MFfun(nu1,nu2,lambda0)
S <- (ssqB/nu1)/(ssqW/nu2)       ## 0.0926 (value of the F-statistic)
 
evidFlin(S,nu1,nu2,lambda0,M)    ## T = 1.934
 
################  To compute maximum expected evidence, require:
 
exS <- function(lambda,nu1,nu2)   ##### expected value of S ~ F(nu1,nu2,lambda)
{return(nu2*(nu1+lambda)/(nu1*(nu2-2)))}
 
c <- (nu2/nu1)*sqrt((nu1+nu2-2)/(nu2-2))
b <- nu2/(nu2-2)
a <- sqrt((nu2-4)/2)
M <- sqrt(lambda0-nu1/(nu1+nu2))
M
maxexev <- -a*acosh((exS(0,nu1,nu2)+nu2/nu1)/c)+
a*acosh((exS(lambda0,nu1,nu2)+nu2/nu1)/c)
 
maxexev
 
############### Plot evidence function (illustrative example)
 
s <- c(seq(0,exS(lambda0,nu1,nu2)+1,.01))
T <- evidFlin(s,nu1,nu2,lambda0,M)
plot(s,T,type="l",lwd=2,ylim=c(-1,M),main="Evidence for equivalence")
abline(h=maxexev,lty=3)
abline(h=0)
abline(v=lambda0,lty=3)
 
################################# To examine properties of evidence function
lambda <- 0    ## Try this and other lambda, where 0 <= lambda <= lambda0.
 
s <- rf(10000,nu1,nu2,lambda)
T <- evidFlin(s,nu1,nu2,lambda0,M)
mean(T)
sd(T)
hist(T)
		

References

  1. Schuirmann, D.J. A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. J. Pharmacokinet. Biopharm. 1987, 15, 657–680. [Google Scholar] [CrossRef]
  2. Berger, R.; Hsu, J.C. Bioequivalence trials, intersection-union tests and equivalence confidence sets with discussion. Stat. Sci. 1996, 11, 283–302. [Google Scholar]
  3. Senn, S. Statistical issues in equivalence testing. Stat. Med. 2001, 20, 2785–2799. [Google Scholar] [CrossRef] [PubMed]
  4. Dragalin, V.; Fedorov, V.; Patterson, S.; Jones, B. Kullback–Leibler divergence for evaluating bioequivalence. Stat. Med. 2003, 22, 913–930. [Google Scholar] [CrossRef] [PubMed]
  5. Lauretto, M.; Pereira, C.A.B.; Stern, J.M.; Zacks, S. Full Bayesian Significance Test Applied to Multivariate Normal Structure Models. Braz. J. Probab. Stat. 2003, 17, 147–168. [Google Scholar]
  6. Chervoneva, I.; Hyslop, T.; Hauck, W.W. A multivariate test for population bioequivalence. Stat. Med. 2007, 26, 1208–1223. [Google Scholar] [CrossRef] [PubMed]
  7. Ocaña, J.; Sanchez, M.P.; Sanchez, A.; Carrasco, J.L. On equivalence and bioequivalence testing. Sort 2008, 32, 151–158. [Google Scholar]
  8. Tsai, C.A.; Huang, C.Y.; Liu, J.P. An approximate approach to sample size determination in bioequivalence testing with multiple pharmacokinetic responses. Stat. Med. 2014, 33, 3300–3317. [Google Scholar] [CrossRef] [PubMed]
  9. Kulinskaya, E.; Morgenthaler, S.; Staudte, R.G. Meta Analysis: A Guide to Calibrating and Combining Statistical Evidence; John Wiley & Sons: Hoboken, NJ, USA, 2008. [Google Scholar]
  10. Morgenthaler, S.; Staudte, R.G. Advantages of variance stabilization. Scand. J. Stat. 2012, 39, 714–728. [Google Scholar] [CrossRef]
  11. Kulinskaya, E.; Morgenthaler, S.; Staudte, R.G. Variance stabilizing the difference of two binomial proportions. Am. Stat. 2010, 64, 350–356. [Google Scholar] [CrossRef]
  12. Morgenthaler, S.; Staudte, R.G. Evidence for alternative hypotheses. In Robustness and Complex Data Structures; Becker, C., Fried, R., Kuhnt, S., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 315–329. [Google Scholar]
  13. Prendergast, L.A.; Staudte, R.G. Better than you think: Interval estimators of the difference of binomial proportions. J. Stat. Plan. Inference 2014, 148, 38–48. [Google Scholar] [CrossRef]
  14. Johnson, N.L.; Kotz, S.; Balakrishnan, N. Continuous Univariate Distributions, 2nd ed.; Wiley: New York, NY, USA, 1994; Volume 1. [Google Scholar]
  15. Johnson, N.L.; Kotz, S.; Balakrishnan, N. Continuous Univariate Distributions, 2nd ed.; Wiley: New York, NY, USA, 1995; Volume 2. [Google Scholar]
  16. Leone, F.C.; Nelson, L.S.; Nottingham, R.B. The folded normal distribution. Technometrics 1961, 3, 543–550. [Google Scholar] [CrossRef]
  17. Wellek, S. Testing Statistical Hypotheses of Equivalence; CRC Press: Boca Raton, FL, USA, 2003. [Google Scholar]
  18. Dunnett, C.W.; Gent, M. Significance testing to establish equivalence between treatments, with special reference to data in the form of 2 × 2 tables. Biometrics 1977, 33, 593–602. [Google Scholar] [CrossRef] [PubMed]
  19. Johnson, R.; Dunnett, C.W.; Gent, M. p-values in 2 × 2 tables. Biometrics 1988, 44, 907–910. [Google Scholar]
  20. Wellek, S. Testing Statistical Hypotheses of Equivalence and Noninferiority, 2nd ed.; CRC Press: Boca Raton, FL, USA, 2014. [Google Scholar]
  21. Laubscher, N.F. Normalizing the noncentral t and F distributions. Ann. Math. Stat. 1960, 31, 1105–1112. [Google Scholar] [CrossRef]
  22. Kullback, S. Information Theory and Statistics; Dover: Mineola, NY, USA, 1968. [Google Scholar]
Figure 1. The black lines show the densities f T ( t ) of the evidence for equivalence (3), for μ 0 = 4 and various μ. The red lines show the densities of evidence for equivalence (6) when n = 4 .
Figure 2. In the top row μ_0 = 3; in the bottom row μ_0 = 4. Exact values of the mean (left plot) and standard deviation (right plot) of TOST evidence T for equivalence are shown as black lines, while the red solid lines show the graphs based on n = 4 observations. The horizontal black dotted line is at height μ_0 − √(2/π), while the red dotted line is at 2μ_0 − √(2/π). The horizontal dotted lines in the right-hand plot are at √(1 − 2/π) and 1.
Figure 3. The top plots are for μ_0 = 2; the bottom for μ_0 = 4. On the left is a comparison of T_{μ_0} given by (2) as a function of the data X = x, plotted as a solid line, to be compared with the one-sided evidence in the corresponding chi-squared test (7), shown as a dashed line. On the right are comparisons of two approximations for the expected TOST evidence for equivalence: a solid line depicting the exact value (4), and the first-order approximation √(λ_0 + 1/2) − √(μ² + 1/2) for the equivalent chi-squared test, shown as a dashed line.
Figure 4. The solid curves show the evidence in favor of equivalence as a function of the average y ¯ . It is assumed that the equivalence limits are μ 0 = ± 0.12 and the empirical standard deviation is s = 0.1 . As the sample size grows from n = 20 to n = 40 and n = 100 , the evidence grows. The dotted lines are for n = 40 and show the decreasing evidence if s = 0.2 and s = 0.3 . The horizontal grey lines are at 0, 2 and 4.
Figure 5. Plot of λ 0 against degrees of freedom ν based on (12) with λ = 0 for weak maximum expected evidence (dotted line), moderate maximum expected evidence (dashed line) and strong maximum expected evidence (solid line). The vertical line marks the df ν = 15 .
Figure 6. Plot of b K (points) and approximation c K = 2 / ( π e ) + ( 1.7 K ) - 5 / 6 (continuous line) against K ranging from 0 to 21. The dotted horizontal line gives the asymptotic limit 2 / ( π e ) .
Table 1. p-values for testing μ = 0 against μ > 0 and evidence estimates for alternatives based on one observation T = t , where T N ( μ , 1 ) . Keep in mind that the standard error of the observed value of t is equal to 1.
t:        0      1.281   1.645   2.326   3.090   3.3     3.719   5
p-value:  0.5    0.10    0.05    0.01    0.001   0.0005  0.0001  0.0000003
Table 2. For selected values of K are shown the exact b_K coefficients (15) to 3 decimal places, so that the K-dimensional ball of radius √(K b_K)·ϵ has the same volume as the K-dimensional cube of side 2ϵ. The approximate value is c_K = 2/(πe) + (1.7·K)^{−5/6}.
K:    2      3      4      5      6      7      8      10     50     100    → ∞
b_K:  2/π    0.513  0.450  0.412  0.386  0.367  0.352  0.332  0.259  0.248  → 0.2342…
c_K:  0.595  0.492  0.437  0.402  0.379  0.361  0.348  0.329  0.259  0.248  → 2/(πe)

Morgenthaler, S.; Staudte, R. Indicators of Evidence for Bioequivalence. Entropy 2016, 18, 291. https://doi.org/10.3390/e18080291