Combined Permutation Tests for Pairwise Comparison of Scale Parameters Using Deviances

Abstract: Nonparametric combinations of permutation tests for pairwise comparison of scale parameters, based on deviances, are examined. Permutation tests for comparing two or more groups based on the ratio of deviances have been investigated, and a procedure based on Higgins' RMD statistic was found to perform well, but two other tests were sometimes more powerful. Thus, combinations of these tests are investigated. A simulation study shows that a combined test can be more powerful than any single test.


Introduction
Tests for homogeneity of scale are of interest in many areas of application, including industrial quality assurance, agricultural production and education [1]. Parametric tests for comparing scale (e.g., [2][3][4]) are generally not robust to nonnormality (see [5]). Consequently, more robust alternatives are of interest.
An approximate test using the ANOVA F-test on the absolute deviations from the mean was proposed [6]. Using absolute deviations from the median instead (these deviations are referred to as deviances in the remainder of this paper) was later suggested [7]; the resulting procedure is referred to as the W50 test. However, no uniformly best test for scale has been demonstrated in the literature. In fact, without more stringent distributional assumptions, the minimal sufficient statistic would generally be the n-dimensional vector of order statistics. Thus, no single statistic exists that summarizes the information contained in the data, and a uniformly best test statistic does not generally exist. In spite of this, the W50 test has been recommended in several comparative studies ([8][9][10]) as a computationally simple test showing good overall performance with respect to power and robustness to nonnormality. More recently, a study [5] compared 25 omnibus tests for homogeneity of variance and recommended the W50 test as "superior". A modification of Levene's test [6] (referred to as OB) was proposed [11], and it has been recommended over the W50 test for light-tailed distributions [12]. The W50 and OB tests, as well as permutation versions of these tests, were evaluated [5], and the permutation versions tended to be more robust and have higher power. The W50 test was recommended as a computationally simple robust test, as was the permutation version of the OB test for symmetric and lighter-tailed skewed distributions.
Another test for scale utilizing deviances, based on the ratio of the mean deviances, was also proposed [13]. This test will be referred to as the RMD test. The RMD test was found [14] to be generally superior to W50 and OB, although there were still cases where each of W50 and OB had higher power. Since no test has been found to be uniformly superior, it is of interest to develop a test that combines these three tests. A combined test of scale parameters based on the IQR was studied [15], and the combined test was found to be more powerful than its constituent tests in some scenarios. Similarly, we will investigate nonparametric combinations of the RMD, W50 and OB tests to determine whether combining the tests can provide increased power compared to the individual tests.

Methods for Comparing Scale Parameters
Consider a one-way layout with t treatments and $n_i$ observations in treatment i. We assume a location-scale model, $y_{ij} = \mu_i + \sigma_i \varepsilon_{ij}$, $i = 1, \ldots, t$, $j = 1, \ldots, n_i$, where $\mu_i$ and $\sigma_i$ are the location and scale parameters, respectively, of treatment i, and the $\varepsilon_{ij}$ are independent and identically distributed with median 0. It is desired to test $H_0: \sigma_1 = \sigma_2 = \cdots = \sigma_t$ against $H_a: \sigma_i \neq \sigma_j$ for some i and j.

Brown-Forsythe (W50) Test
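First, compute the deviances, $\tilde{z}_{ij} = |y_{ij} - \tilde{y}_i|$, where $\tilde{y}_i$ is the sample median of treatment i. The ANOVA F test is performed on these scores, and the p-value is based on the F distribution with t − 1 and n − t degrees of freedom [7], where $n = \sum_i n_i$ is the total sample size.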
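The W50 computation can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' implementation; the function names `deviances`, `anova_f` and `w50_statistic` are hypothetical, and the F-distribution p-value is omitted because it would need a statistics package.

```python
from statistics import mean, median


def deviances(groups):
    """Absolute deviations of each observation from its own treatment median."""
    result = []
    for g in groups:
        med = median(g)
        result.append([abs(y - med) for y in g])
    return result


def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of scores."""
    t = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [mean(g) for g in groups]
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )
    # Mean squares use t - 1 and n - t degrees of freedom.
    return (ss_between / (t - 1)) / (ss_within / (n - t))


def w50_statistic(groups):
    """ANOVA F statistic computed on the deviances (the W50 statistic)."""
    return anova_f(deviances(groups))


example = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
print(w50_statistic(example))  # approximately 0.8
```

In the example, the deviances are (1, 0, 1) and (2, 0, 2), so the second treatment has twice the mean deviance of the first.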

Higgins' (RMD) Test
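The statistic is defined as $\mathrm{RMD} = \max_i(\bar{\tilde{z}}_i) / \min_i(\bar{\tilde{z}}_i)$, where $\bar{\tilde{z}}_i$ is the mean of the deviances, $\tilde{z}_{ij}$, for treatment i. The deviances $\tilde{z}_{ij} = |y_{ij} - \tilde{y}_i|$ are the same as those used by the W50 test. The permutation distribution of the RMD statistic was used to calculate a p-value [13].

OB Test
The OB test [11] replaces each observation with the score $r_{ij}(w) = \frac{(w + n_i - 2)\, n_i (y_{ij} - \bar{y}_i)^2 - w s_i^2 (n_i - 1)}{(n_i - 1)(n_i - 2)}$, where $\bar{y}_i$ and $s_i^2$ are the sample mean and variance of treatment i and $0 \le w \le 1$ is a weight. At one extreme, when w = 0, the statistic reduces to $r_{ij}(0) = n_i (y_{ij} - \bar{y}_i)^2 / (n_i - 1)$, which is a slight modification of Levene's test, which uses $z_{ij}^2 = (y_{ij} - \bar{y}_i)^2$. At the other extreme, when w = 1, the statistic reduces to $q_{ij} = r_{ij}(1) = [n_i (y_{ij} - \bar{y}_i)^2 - s_i^2] / (n_i - 2)$, which was referred to as a "jackknife pseudovalue of $s_i^2$" [11]. The ANOVA F test is performed on these scores, and the p-value is based on the F distribution with t − 1 and n − t degrees of freedom. Tests based on $z_{ij}^2$ have been shown to have inflated Type I error rates, while those based on $q_{ij}$ tend to have low power. Since $r_{ij}(w)$ is a weighted average of the two sets of scores, it provides a way to balance the drawbacks of the two tests. A "utility" value of w = 0.5 was suggested for most situations [11], and this is the value employed in this study.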
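The two statistics in this section can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the RMD statistic is assumed to be the largest mean deviance divided by the smallest, the OB score is assumed to follow O'Brien's weighted form $r_{ij}(w) = [(w + n_i - 2) n_i (y_{ij} - \bar{y}_i)^2 - w s_i^2 (n_i - 1)] / [(n_i - 1)(n_i - 2)]$, and the function names are hypothetical.

```python
from statistics import mean, median, variance


def rmd_statistic(groups):
    """Ratio of the largest to the smallest mean deviance (assumed max/min form).

    Raises ZeroDivisionError if some treatment has all values equal to its median.
    """
    mean_devs = []
    for g in groups:
        med = median(g)
        mean_devs.append(mean(abs(y - med) for y in g))
    return max(mean_devs) / min(mean_devs)


def ob_scores(group, w=0.5):
    """OB scores r_ij(w) for one treatment (requires at least 3 observations)."""
    n = len(group)
    ybar = mean(group)
    s2 = variance(group)  # sample variance with divisor n - 1
    return [
        ((w + n - 2) * n * (y - ybar) ** 2 - w * s2 * (n - 1))
        / ((n - 1) * (n - 2))
        for y in group
    ]


group = [2.0, 4.0, 6.0, 9.0]
scores = ob_scores(group, w=0.5)
# For any w, the scores of a treatment average to its sample variance.
print(mean(scores), variance(group))
```

The check at the end reflects a known property of this score: its within-treatment mean equals $s_i^2$ regardless of w, so an ANOVA on the scores compares the sample variances.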

Permutation Tests
While the permutation test using the RMD statistic was suggested [13], the W50 and OB tests described previously were proposed as approximate tests based on the F distribution. However, p-values for W50 and OB can also be calculated using permutation distributions. A simulation study [1] found for the two-treatment case that the permutation versions tended to be more robust and have greater power than the approximate tests. Thus, we will consider only the permutation versions of these tests. Test statistics will be computed for a large number of random reassignments of observations to treatments, and the p-value will be calculated as the proportion of values in the permutation distribution that are at least as extreme as the observed test statistic value.
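A random-permutation p-value of this kind can be sketched as below. This is a minimal illustration, not the authors' code: the names `permutation_p_value` and `mean_deviance_ratio` are hypothetical, the values are shuffled exactly as passed in (any centering is left to the caller), and "at least as extreme" is taken to mean "at least as large".

```python
import random
from statistics import mean, median


def mean_deviance_ratio(groups):
    """Illustrative statistic: largest mean deviance over smallest mean deviance."""
    mean_devs = []
    for g in groups:
        med = median(g)
        mean_devs.append(mean(abs(y - med) for y in g))
    return max(mean_devs) / min(mean_devs)


def permutation_p_value(groups, statistic, n_perm=1999, seed=0):
    """Proportion of the permutation distribution at least as large as observed.

    The observed statistic is counted as one member of the distribution,
    so the p-value is based on n_perm + 1 values.
    """
    rng = random.Random(seed)
    sizes = [len(g) for g in groups]
    pooled = [y for g in groups for y in g]
    observed = statistic(groups)
    count = 1  # the observed value itself
    for _ in range(n_perm):
        rng.shuffle(pooled)
        permuted, start = [], 0
        for size in sizes:
            permuted.append(pooled[start:start + size])
            start += size
        if statistic(permuted) >= observed:
            count += 1
    return count / (n_perm + 1)


data = [[1.0, 2.0, 3.0, 4.5], [0.0, 5.0, 11.0, 17.0]]
print(permutation_p_value(data, mean_deviance_ratio, n_perm=199, seed=1))
```

Counting the observed value in the distribution keeps the estimated p-value strictly positive, with smallest possible value 1/(n_perm + 1).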

Combined Tests
A two-step approach to create a nonparametric combination of dependent tests was proposed [16] and described as follows: Step 1. Analyze the data using the tests of interest, referred to as partial tests; Step 2. Combine the partial tests to assess the global hypothesis.
Several different combining functions have been developed that satisfy the properties required for a suitable combining function [16]. Since the relative power of different combining functions can vary across conditions, we consider combined tests using three of the best-known combining functions: the Fisher, Liptak and Tippett combining functions [15].
Let $\lambda_i$ be the p-value associated with the ith of the k tests to be combined. Then, the test statistics for the Fisher, Liptak and Tippett functions are as follows.
1. The Fisher combining function is $T_F = -2 \sum_{i=1}^{k} \log(\lambda_i)$.
2. The Liptak combining function is $T_L = \sum_{i=1}^{k} \Phi^{-1}(1 - \lambda_i)$, where $\Phi$ is the standard normal cumulative distribution function.
3. The Tippett combining function is $T_T = \max_{1 \le i \le k} (1 - \lambda_i)$.
For all three functions, large values are evidence against the null hypothesis. The Tippett function tends to have the highest power when one or a few, but not all, of the constituent tests reject the null hypothesis; the Liptak function tends to have the highest power when all tests reject the null hypothesis; the power of the Fisher function will tend to lie between the other two, making it the more general option and thus probably the most popular [16]. The combined tests are carried out as follows [16].

1. Compute the observed test statistic value ($T_F$, $T_L$, $T_T$) according to the above definitions, using the permutation p-values of RMD, W50 and OB.
2. Compute the permutation test p-value associated with each combined statistic:
   i. For each statistic value in the permutation distributions constructed for RMD, W50 and OB, compute the corresponding partial p-value as the proportion of test statistic values in that distribution at least as large as that value.
   ii. Using the partial p-values for RMD, W50 and OB, use the respective combining function to compute a test statistic value ($T_F$, $T_L$, $T_T$) for each permutation. This results in a permutation distribution for each of the combined statistics.
   iii. For each combined test, the permutation p-value is then the proportion of values in the permutation distribution at least as large as the observed statistic value.
Note that all tests are based on the same set of randomly generated permutations. Since the RMD, W50 and OB tests were each most powerful for at least some scenarios in past simulations (e.g., [5]), combinations of these three tests will be examined. In addition, since RMD and W50 were usually more powerful than OB, a combination of only RMD and W50 will also be considered. The p-values for each of the constituent tests in each combination will be estimated using the permutation distribution of the statistic. The power and Type I error rates of the Fisher, Liptak and Tippett combining functions will be estimated and compared, and these will also be compared to those of the individual tests.
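The combination procedure can be sketched as follows, assuming the usual forms of the three combining functions: Fisher, $-2 \sum \log \lambda_i$; Liptak, $\sum \Phi^{-1}(1 - \lambda_i)$; Tippett, $\max(1 - \lambda_i)$. The input is a hypothetical table `stat_rows` whose first row holds the observed partial statistics and whose remaining rows hold the statistics from the shared permutations. The clamping inside `liptak` is an added assumption, needed because a partial p-value of exactly 1 would otherwise give an infinite normal quantile.

```python
from bisect import bisect_left
from math import log
from statistics import NormalDist


def fisher(pvals):
    return -2.0 * sum(log(p) for p in pvals)


def liptak(pvals, eps=1e-12):
    # Clamp 1 - p into (0, 1) so the normal quantile is finite (assumption).
    nd = NormalDist()
    return sum(nd.inv_cdf(min(max(1.0 - p, eps), 1.0 - eps)) for p in pvals)


def tippett(pvals):
    return max(1.0 - p for p in pvals)


def upper_tail_proportions(values):
    """For each value, the proportion of all values at least as large as it."""
    ordered = sorted(values)
    n = len(values)
    return [(n - bisect_left(ordered, v)) / n for v in values]


def combined_p_value(stat_rows, combine):
    """Permutation p-value of a combined test.

    stat_rows[0] holds the observed partial statistics (one per test);
    the other rows hold the statistics from the same random permutations.
    """
    columns = list(zip(*stat_rows))
    # Step i: partial p-values for every row, one column per partial test.
    partial = list(zip(*(upper_tail_proportions(c) for c in columns)))
    # Step ii: combined statistic for every row.
    combined = [combine(row) for row in partial]
    # Step iii: proportion of combined values at least as large as observed.
    return upper_tail_proportions(combined)[0]


rows = [[5.0, 5.0], [1.0, 2.0], [2.0, 1.0], [3.0, 3.0]]
print(combined_p_value(rows, fisher))  # 0.25
```

In the toy table the observed row is the largest in both columns, so each combining function ranks it first among the four rows and the combined p-value is 1/4.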

Strong Familywise Error Rate Control for Pairwise Comparisons
The familywise error rate (FWER) will be controlled using the restricted permutation method of Richter and McCann [17], which was proposed to provide strong control of the FWER for pairwise comparison of location parameters. This method will be extended to the present case of comparing scale parameters as follows. First, the two-sample test statistic for a given method will be calculated for each of the possible t(t − 1)/2 pairs of treatments. Then, the maximum value of the test statistic across all pairs will be calculated. Next, observations will be reassigned at random to treatments within each pair of treatments, a test statistic calculated for each pair of treatments, and the maximum value determined. This will be repeated many times to build the permutation distribution, and the p-value for comparing each pair of treatments will be calculated as the proportion of values in the permutation distribution that are at least as extreme as the observed value.
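The restricted permutation scheme described above might be sketched as follows. This is an illustrative reading of the text, not the implementation in [17]: the names `pairwise_p_values` and `mean_deviance_ratio` are hypothetical, each pair of treatments is shuffled separately, and the observed maximum is assumed to count as one member of the permutation distribution.

```python
import random
from itertools import combinations
from statistics import mean, median


def mean_deviance_ratio(a, b):
    """Illustrative two-sample statistic: larger mean deviance over smaller."""
    devs = []
    for g in (a, b):
        med = median(g)
        devs.append(mean(abs(y - med) for y in g))
    return max(devs) / min(devs) if min(devs) > 0 else float("inf")


def pairwise_p_values(groups, statistic, n_perm=999, seed=0):
    """Adjusted p-value for each pair, from the distribution of the maximum."""
    rng = random.Random(seed)
    pairs = list(combinations(range(len(groups)), 2))  # t(t - 1)/2 pairs
    observed = {(i, j): statistic(groups[i], groups[j]) for i, j in pairs}
    max_dist = [max(observed.values())]  # observed maximum (assumption)
    for _ in range(n_perm):
        perm_stats = []
        for i, j in pairs:
            # Reassign observations at random within this pair only.
            pooled = list(groups[i]) + list(groups[j])
            rng.shuffle(pooled)
            cut = len(groups[i])
            perm_stats.append(statistic(pooled[:cut], pooled[cut:]))
        max_dist.append(max(perm_stats))
    return {
        pair: sum(m >= obs for m in max_dist) / len(max_dist)
        for pair, obs in observed.items()
    }


data = [[1.0, 2.0, 3.0, 4.5], [0.0, 5.0, 11.0, 17.0], [2.5, 3.5, 4.0, 6.0]]
for pair, p in pairwise_p_values(data, mean_deviance_ratio, n_perm=199).items():
    print(pair, p)
```

Because every pair is compared against the same distribution of maxima, a pair with a larger observed statistic can never receive a larger adjusted p-value than a pair with a smaller one.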

Procedures Studied
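A simulation study estimated and compared the familywise Type I error rate and "any-pair" power (probability of detecting at least one true difference) of the methods described above: the permutation versions of the RMD, W50 and OB tests, and the Fisher, Liptak and Tippett combinations of (i) RMD and W50 and (ii) RMD, W50 and OB. The Fisher and Liptak combinations of RMD and W50 are denoted $F_2$ and $L_2$, respectively.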

Distributions
Several different g-and-h distributions [18] were used to simulate data from distributions with different characteristics. g-and-h random variables are monotonic transformations of standard normal random variables and allow investigation of nonnormal distributions with specific characteristics. The g-and-h random variable is defined as $Y_{g,h}(Z) = \frac{e^{gZ} - 1}{g} \exp(hZ^2/2)$ for $g \neq 0$, and $Y_{0,h}(Z) = Z \exp(hZ^2/2)$ for g = 0, where Z ∼ N(0, 1). When g = h = 0, $Y_{g,h}(Z) \sim N(0, 1)$. Nonzero values of g increase the skewness and positive values of h increase the elongation (tail heaviness) of the distribution.
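Simulating from a g-and-h distribution only requires transforming standard normal draws. The sketch below assumes the usual Tukey form, $Y = \frac{e^{gZ} - 1}{g} \exp(hZ^2/2)$ with the limit $Z \exp(hZ^2/2)$ at g = 0; the names `g_and_h` and `simulate_treatment` and the `scale` multiplier (standing in for $\sigma_i$, with location 0) are hypothetical.

```python
import math
import random


def g_and_h(z, g=0.0, h=0.0):
    """Transform a standard normal value z into a g-and-h value."""
    core = z if g == 0 else (math.exp(g * z) - 1.0) / g
    return core * math.exp(h * z * z / 2.0)


def simulate_treatment(n, g, h, scale=1.0, seed=None):
    """Draw n observations scale * Y_{g,h}(Z) with Z ~ N(0, 1)."""
    rng = random.Random(seed)
    return [scale * g_and_h(rng.gauss(0.0, 1.0), g, h) for _ in range(n)]


# Skewed, heavy-tailed example (g = 0.8, h = 0.4) with scale 2.
sample = simulate_treatment(10, g=0.8, h=0.4, scale=2.0, seed=42)
print(sample)
```

Since the transformation maps 0 to 0 and is increasing for h ≥ 0, the median of $Y_{g,h}(Z)$ is 0, which matches the location-scale model's requirement that the errors have median 0.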
Type I error rate and power were estimated based on 1000 randomly selected data sets from each distribution, for each setting of sample sizes and scale parameter patterns. It has been suggested [19] that only 253 random permutations are necessary with 1000 random data sets if the goal of the simulation is to estimate the power of a test and only a "rough" estimate of the permutation p-value is required, while a random sample of at least 1600 permutations was recommended [20] to estimate the exact p-value for a permutation test. Since precise estimation of the permutation test p-values was considered important, a conservative sample of 1999 random permutations was utilized, and thus the permutation distribution for each test was based on 2000 values: the observed test statistic value plus 1999 values based on random permutations of the observed data.

Familywise Type I Error
All tests were robust in the sense that estimated rates of Type I error were close to the nominal level of 0.05 (see Tables 1-6), with only one exceeding 0.075 (0.084 for RMD in the equal-sample $n_i = 30$ case, g = 0.8, h = 0.4). Note that in the tables, the first row of each distribution represents the equal scale case, and thus the value given is the estimated Type I error rate.

Any-Pair Power
When sample sizes were equal (Tables 1 and 2), RMD tended to have the highest power, although in some cases the Fisher or Liptak combined test was most powerful.
When sample sizes were small and unequal and the larger scales were associated with the smaller samples (Table 3), the $F_2$ and $L_2$ combined tests were most powerful for all scale configurations, with $L_2$ usually having the higher power. The lone exception was when the distribution was symmetric with very heavy tails (g = 0, h = 0.8), where RMD had similar power to $F_2$ and $L_2$. When the sample sizes increased to $n_i$ = 10, 10, 20, 30, 30, however, the power advantage of the combined tests over RMD tended to diminish, except for the skewed, light-tailed distributions, where the combined tests were still more powerful (see Table 4).
Neither of the Tippett combined tests was as powerful as the Liptak and Fisher versions.
When the sample sizes were small and unequal but the larger scales were associated with the larger samples (Tables 5 and 6), $L_2$ and $F_2$ had the highest power for normal and moderately skewed-only distributions. Meanwhile, RMD had the highest power for all distributions with heavy tails (h = 0.4, 0.8). As before, as sample sizes increased, the power advantages of the combined tests diminished, while RMD maintained power advantages for heavier-tailed distributions.

Figure 1. Example boxplots of the simulated distributions. Note that the "Value" axis has been truncated to omit extreme values from distributions 3 and 7.


Table 1. Proportion of at least one rejection at α = 0.05, five treatments, equal samples of size $n_i$ = 10.

Table 2. Proportion of at least one rejection at α = 0.05, five treatments, equal samples of size $n_i$ = 30. Cases that were uninformative for comparing methods were omitted.

Table 3. Proportion of at least one rejection at α = 0.05, five treatments, unequal samples of size $n_i$ = 5, 5, 10, 15, 15, larger scale associated with smaller sample size. Cases that were uninformative for comparing methods were omitted.