1. Introduction
The concept of p-value is generally credited to Pearson [1], although it was implicitly used much earlier by Arbuthnot [2] in 1710. Defined as the probability of obtaining, under a null hypothesis, a result as extreme as or more extreme than the one observed, it was considered an informal index to assess the discrepancy between the data and the hypothesis under investigation. The use of p-values gained popularity with Sir Ronald Fisher [3,4], and about their use, Fisher [5] states that “A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this [P = 0.05] level of significance”. Therefore, the question of reproducibility of results was naturally raised (cf. Greenwald et al. [6] or Colquhoun [7]), which in turn demanded that the p-values collected from replicated experiments be summarized into a combined p-value. In 1931, Tippett [8], a co-worker of Fisher, performed the first meta-analysis of p-values, and in 1932, Fisher himself [9] suggested a method for combining p-values.
The classical combined test procedures assume that the observed p-values, $p_1, \ldots, p_n$, are, under the null hypotheses $H_{0i}$, $i = 1, \ldots, n$, of no difference or no effect, observations from independent random variables $P_i \sim \mathrm{Uniform}(0,1)$, which is an immediate consequence of the probability integral transform theorem. It is then said that a $p_i$ from $P_i \sim \mathrm{Uniform}(0,1)$ is a genuine (or a true) p-value.
Section 2 describes some classical methods for combining p-values, using either their values directly, for example through order statistics or Pythagorean means, or using basic transformations of standard uniform random variables, such as $-\ln P$ and $\Phi^{-1}(P)$, where $\Phi^{-1}$ is the inverse of the standard Gaussian cumulative distribution function, or the logit function $\ln\{P/(1-P)\}$. For additional p-value combinations, see Brilhante et al. [10].
Although today there is an intense debate on whether significance testing, and therefore the use of p-values, is an acceptable scientific research tool (see, for instance, the editorials on the topic in The American Statistician, vol. 70 (Wasserstein and Lazar [11]) and vol. 73 (Wasserstein et al. [12])), traditionally low p-values were a valid passport for being published. This has created a so-called file drawer problem due to publication bias. As with other techniques used in meta-analysis, publication bias can easily lead to false conclusions. In fact, the set of available p-values comes mainly from studies considered worthy of publication because the observed p-values were small, presumably indicating significant results. Thus, the assumption that the $p_i$’s are observations from independent $\mathrm{Uniform}(0,1)$ random variables is quite questionable, since generally they are a set of low order statistics, given that p-values greater than 0.05 have a lower chance of being published.
One way of assessing publication bias is by computing the number of non-significant p-values that would be needed to reverse the decision to reject the overall hypothesis based on a set of available p-values. Jin et al. [13] and Lin and Chu [14] give interesting overviews of how to deal with publication bias, and Givens et al. [15] provide deep insight into publication bias in meta-analysis, namely using data-augmentation techniques.
Publication bias is also the cause of poor scientific practices, in some cases even fraud, especially when the replication of experiments is carried out with the intent of, hopefully, obtaining more favorable p-values to increase the chances of publishing. While replicating experiments is legitimate and recommended to establish consistent results, replicating with the purpose of reporting the smallest of the observed p-values is an unacceptable scientific practice. If this is indeed the case, the reported “fake” p-value, being the minimum of $k$ independent standard uniform random variables, is $\mathrm{Beta}(1,k)$-distributed. However, replicating experiments has a cost, either monetary or timewise, and if an experiment is replicated only once and both p-values obtained are greater than 0.05, then the wisest decision appears to be not to continue replicating the experiment; in that case, either the smaller of the two p-values is reported, or none at all. In fact, what seems realistic to consider is either $k = 2$, and therefore a nuisance “fake two p-value” is reported, or $k = 1$, i.e., a “genuine”, not-the-minimum-of-two, p-value is disclosed.
In Fisher’s [16] comments about Mendel’s work, he conjectured that “the data of most, if not all, of the experiments have been falsified to agree closely with Mendel’s expectations”. Fisher made it quite clear that he suspected that Mendel’s “too good to be true” results were carefully chosen to support the hereditary theory that Mendel wanted to prove. Due to this historical background, in Section 3 we shall call Mendel distribution the model that is a mixture of a $\mathrm{Beta}(1,2)$ (or $\min(U_1,U_2)$, with $U_1, U_2$ independent standard uniforms) distribution and a $\mathrm{Uniform}(0,1)$ distribution, thus representing a mixture of “fake two p-values” and “genuine not two p-values”. We briefly explain how an extension of Deng and George’s [17] characterization of the standard uniform distribution, using a Mendel random variable instead of a uniform random variable, can be considered to test the uniformity of a set of p-values or determine if it is contaminated with fake p-values.
In Section 4, an example is given to illustrate how to use the critical values from the tables in Brilhante et al.’s [10] supplementary materials for jointly combining genuine and fake p-values using classical combining methods. The example shows that a thorough comparison should always be made, since most likely there is no reliable information that rules out the existence of fake p-values resulting from bad scientific practices, and therefore it is important to acknowledge their potential effects when performing a meta-analysis of p-values.
In Section 5, further developments for combining p-values are reviewed, with a very brief reference to the recent research field on e-values. Finally, Section 6 reinforces the recommendation that when extending the usual combined tests to include genuine and fake p-values, they should be compared with each other in terms of the conclusions drawn for an informed final decision.
2. An Overview of Classical Combined Tests for p-Values
Let us assume that the p-values $p_1, \ldots, p_n$ are known for testing $H_{0i}$ versus $H_{1i}$, $i = 1, \ldots, n$, in n independent studies on some common issue, and that the objective is to decide on the overall hypothesis $H_0$: all the $H_{0i}$ are true versus $H_1$: some of the $H_{1i}$ are true. As there are many different ways in which $H_0$ can be false, selecting the right test is generally unfeasible. On the other hand, combining the available $p_i$’s so that a function $T(p_1, \ldots, p_n)$ is the observed value of a random variable with a known sampling distribution under $H_0$ is a simple problem, since under $H_0$, $(p_1, \ldots, p_n)$ is the observed value of a random sample from a $\mathrm{Uniform}(0,1)$ distribution. In fact, several different and reasonable combined testing procedures are often used with suitable functions of the $P_i$’s. Moreover, it should be guaranteed that a combined procedure is monotone, in the sense that if one set of p-values $(p_1, \ldots, p_n)$ leads to the rejection of the overall null hypothesis $H_0$, then any set of component-wise smaller p-values $(p_1^*, \ldots, p_n^*)$, i.e., $p_i^* \le p_i$, $i = 1, \ldots, n$, must also lead to its rejection.
Tippett [8] used the statistic

$$T = P_{1:n} = \min(P_1, \ldots, P_n).$$

From the fact that $P_{1:n} \sim \mathrm{Beta}(1,n)$, the criterion for rejecting $H_0$ at a significance level $\alpha$ is $p_{1:n} < 1-(1-\alpha)^{1/n}$. Tippett’s method is a special case of Wilkinson’s method [18], which recommends that $H_0$ should be rejected when some observed order statistic $p_{k:n} < c$. As $P_{k:n} \sim \mathrm{Beta}(k, n-k+1)$, the cut-off point c to reject $H_0$ is the solution of

$$\alpha = \int_0^{c} \frac{t^{k-1}(1-t)^{n-k}}{B(k,\, n-k+1)}\, \mathrm{d}t,$$

where $B(a,b) = \int_0^1 t^{a-1}(1-t)^{b-1}\,\mathrm{d}t$, $a, b > 0$, is the Beta function.
Simes [19], on the other hand, gives an interesting development of Wilkinson’s method: let $p_{1:n} \le p_{2:n} \le \cdots \le p_{n:n}$ be the ordered p-values for testing the overall hypothesis $H_0$, which should be rejected at a significance level $\alpha$ if $p_{k:n} \le k\alpha/n$ for any $k = 1, \ldots, n$.
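To make the three rejection rules concrete, here is a minimal Python sketch (not from the paper; the function names and the p-values in the usage line are illustrative assumptions) implementing Tippett’s, Wilkinson’s and Simes’ criteria exactly as stated above.

```python
import numpy as np
from scipy import stats

def tippett_reject(pvals, alpha=0.05):
    """Reject H0 if min(p) < 1 - (1 - alpha)^(1/n)."""
    n = len(pvals)
    return min(pvals) < 1.0 - (1.0 - alpha) ** (1.0 / n)

def wilkinson_cutoff(n, k, alpha=0.05):
    """Cut-off c for the k-th order statistic: P(Beta(k, n-k+1) <= c) = alpha."""
    return stats.beta.ppf(alpha, k, n - k + 1)

def simes_reject(pvals, alpha=0.05):
    """Reject H0 if p_{k:n} <= k*alpha/n for some k."""
    p_sorted = np.sort(pvals)
    n = p_sorted.size
    k = np.arange(1, n + 1)
    return np.any(p_sorted <= k * alpha / n)

p = [0.01, 0.04, 0.20, 0.50, 0.70]  # made-up p-values for illustration
print(tippett_reject(p), wilkinson_cutoff(5, 2), simes_reject(p))
```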
Another way of constructing combined p-values is to use functions of standard uniform random variables. Fisher [9] suggested the use of the statistic

$$T = -2\sum_{i=1}^{n} \ln P_i,$$

since $-2\ln P_i \sim \chi^2_{2}$ when $P_i \sim \mathrm{Uniform}(0,1)$, $i = 1, \ldots, n$. As $T \sim \chi^2_{2n}$ under $H_0$, the criterion for rejecting $H_0$ at a significance level $\alpha$ is $t > \chi^2_{2n,\,1-\alpha}$, with $\chi^2_{m,\,p}$ denoting the p-th quantile of the chi-square distribution with m degrees of freedom.
Tippett’s method illustrates the direct use of standard uniform random variables, while Fisher’s method shows the use of transformed standard uniform random variables. Moreover, Fisher’s method is often the most efficient way of making use of all the information available, whereas Tippett’s method disregards almost all available information. Therefore, these two methods can be viewed as two extreme cases.
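As a concrete illustration of the transformation approach, the following sketch implements Fisher’s combined test as just described; the helper name and the sample p-values are ours, not the paper’s.

```python
import numpy as np
from scipy import stats

def fisher_combined(pvals, alpha=0.05):
    """Fisher's method: T = -2 * sum(ln p_i) ~ chi^2_{2n} under H0."""
    pvals = np.asarray(pvals, dtype=float)
    n = pvals.size
    t = -2.0 * np.sum(np.log(pvals))           # observed statistic
    crit = stats.chi2.ppf(1.0 - alpha, 2 * n)  # chi^2_{2n, 1-alpha}
    combined_p = stats.chi2.sf(t, 2 * n)       # survival function = combined p-value
    return t, crit, combined_p

t, crit, p_comb = fisher_combined([0.02, 0.10, 0.30, 0.45])
print(f"T = {t:.3f}, critical value = {crit:.3f}, combined p = {p_comb:.4f}")
```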
Combining p-values using functions of their sums or products, namely their arithmetic mean or their geometric mean, is also feasible but less appealing than Fisher’s chi-square transformation method. Edgington [20] suggested the use of the arithmetic mean as a test statistic, i.e.,

$$\overline{P} = \frac{1}{n}\sum_{i=1}^{n} P_i,$$

but it has a very cumbersome probability density function, defined as

$$f_{\overline{P}}(x) = \frac{n^n}{\Gamma(n)} \sum_{k=0}^{\lfloor nx \rfloor} (-1)^k \binom{n}{k} \left(x - \frac{k}{n}\right)^{n-1}, \quad 0 < x < 1,$$

with $\lfloor x \rfloor$ being the largest integer not greater than x and $\Gamma(t) = \int_0^{\infty} u^{t-1} e^{-u}\,\mathrm{d}u$, $t > 0$, Euler’s Gamma function. However, if n is large, an approximation based on the central limit theorem can be used to perform an overall test of $H_0$ versus $H_1$, but it is not consistent, in the sense that it can fail to reject the overall test’s null hypothesis even though the results of some of the individual tests are extremely significant.
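A small sketch of the central limit theorem approximation mentioned above, using the $H_0$ moments of a $\mathrm{Uniform}(0,1)$ variable (mean 1/2, variance 1/12); the function name and its defaults are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def edgington_approx(pvals, alpha=0.05):
    """Normal approximation: mean(P) ~ N(1/2, 1/(12 n)) under H0, for large n."""
    pvals = np.asarray(pvals, dtype=float)
    n = pvals.size
    z = (pvals.mean() - 0.5) * np.sqrt(12.0 * n)  # approximately N(0,1) under H0
    return z < stats.norm.ppf(alpha)               # reject H0 for small means
```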
Pearson’s [21] proposal for combining p-values is based on their product, i.e., on the statistic

$$W = \prod_{i=1}^{n} P_i,$$

which under $H_0$ has a probability density function

$$f_W(x) = \frac{(-\ln x)^{n-1}}{\Gamma(n)}, \quad 0 < x < 1.$$

In other words, $W \sim \mathrm{BetaBoop}(1,1,1,n)$ (see Brilhante et al. [22] for more details on BetaBoop random variables). Consequently, the geometric mean $G = \big(\prod_{i=1}^{n} P_i\big)^{1/n}$ has a cumulative distribution function

$$F_G(x) = \frac{\Gamma(n, -n\ln x)}{\Gamma(n)}, \quad 0 < x < 1,$$

where $\Gamma(a,z) = \int_z^{\infty} t^{a-1} e^{-t}\,\mathrm{d}t$, $a > 0$, is the upper incomplete Gamma function. The critical quantiles $g_{\alpha}$ of $G$ can easily be computed from the critical quantiles $\chi^2_{2n,\,1-\alpha}$ of Fisher’s statistic $T = -2\sum_{i=1}^{n}\ln P_i$, since $g_{\alpha} = \exp\{-\chi^2_{2n,\,1-\alpha}/(2n)\}$.
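The quantile relation just stated is easy to verify numerically; the sketch below computes $g_\alpha$ from the chi-square quantile and checks it against the incomplete-Gamma form of the cumulative distribution function, using SciPy’s regularized upper incomplete Gamma function.

```python
import numpy as np
from scipy import stats
from scipy.special import gammaincc  # regularized upper incomplete Gamma ratio

def geometric_mean_quantile(n, alpha=0.05):
    """Lower critical quantile of G: g_alpha = exp(-chi2_{2n,1-alpha} / (2n))."""
    return np.exp(-stats.chi2.ppf(1.0 - alpha, 2 * n) / (2.0 * n))

n, alpha = 13, 0.05
g = geometric_mean_quantile(n, alpha)
# F_G(g) = Gamma(n, -n ln g)/Gamma(n); the second printed value should be ~alpha
print(g, gammaincc(n, -n * np.log(g)))
```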
Note, however, that using products of standard uniform random variables or summing their logarithms provides essentially the same information, as recognized by Pearson [21] in his final remark, and hence it is more convenient to use Fisher’s statistic.
In 1934, Pearson [23] considered that in a bilateral framework it would be more appropriate to use the statistic

$$T = \min\left\{\prod_{i=1}^{n} P_i,\ \prod_{i=1}^{n}(1-P_i)\right\}.$$

Owen [24] suggested a simple modified version of this statistic, for which he recommends a Bonferroni correction to establish lower and upper bounds for the computation of probabilities. Another alternative is Pearson’s [23] minimum of geometric means statistic,

$$T^{*} = \min\left\{\left(\prod_{i=1}^{n} P_i\right)^{1/n},\ \left(\prod_{i=1}^{n}(1-P_i)\right)^{1/n}\right\}.$$
Also concerning the use of transformed p-values, Stouffer et al. [25] used as a test statistic

$$T = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \Phi^{-1}(1-P_i).$$

Since $T \sim \mathcal{N}(0,1)$ under $H_0$, the criterion for rejecting $H_0$ at a significance level $\alpha$ is $t > z_{1-\alpha}$, with $z_p$ denoting the p-th quantile of the standard Gaussian distribution.
A further simple transformation based on the standard uniform random variables $P_i$ and $1-P_i$ is the logit transformation $\ln\{P_i/(1-P_i)\}$, which was used by Mudholkar and George [26] to construct the combined test statistic

$$T = -\sum_{i=1}^{n} \ln\frac{P_i}{1-P_i}.$$

Using the approximation $T\sqrt{\frac{3(5n+4)}{\pi^2 n(5n+2)}} \stackrel{a}{\sim} t_{5n+4}$, $H_0$ should be rejected at a significance level $\alpha$ if

$$t\,\sqrt{\frac{3(5n+4)}{\pi^2 n(5n+2)}} > t_{5n+4,\,1-\alpha},$$

with $t_{m,\,p}$ denoting the p-th quantile of Student’s t distribution with m degrees of freedom.
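The following sketch implements Stouffer’s statistic and Mudholkar and George’s Student’s t approximation as given above; function names and the example p-values are illustrative.

```python
import numpy as np
from scipy import stats

def stouffer(pvals):
    """Stouffer's method: sum of probit-transformed p-values, N(0,1) under H0."""
    pvals = np.asarray(pvals, dtype=float)
    z = np.sum(stats.norm.ppf(1.0 - pvals)) / np.sqrt(pvals.size)
    return z, stats.norm.sf(z)  # statistic and combined p-value

def mudholkar_george(pvals):
    """Logit method with the Student's t approximation on 5n+4 degrees of freedom."""
    pvals = np.asarray(pvals, dtype=float)
    n = pvals.size
    t = -np.sum(np.log(pvals / (1.0 - pvals)))
    scale = np.sqrt(3.0 * (5 * n + 4) / (np.pi ** 2 * n * (5 * n + 2)))
    return t * scale, stats.t.sf(t * scale, 5 * n + 4)

p = [0.02, 0.10, 0.30, 0.45]  # made-up p-values for illustration
print(stouffer(p), mudholkar_george(p))
```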
On the other hand, Birnbaum [27] has shown that every monotone combined test procedure is admissible, i.e., provides a most powerful test against some alternative hypothesis for combining a collection of tests, and therefore is optimal for a combined testing situation whose goal is to harmonize possibly conflicting evidence or to pool inconclusive evidence. In the context of the social sciences, Mosteller and Bush [28] recommend Stouffer’s method, but Littell and Folks [29,30] have shown that under mild conditions, Fisher’s method is optimal for combining independent tests.
The thorough comparison performed by Loughin [31] shows that the normal combining function performs quite well in problems where the evidence against the combined null hypothesis is spread among more than a small fraction of the individual tests. Fisher’s method is the best choice when the total evidence is weak, and also when the evidence is at least moderately strong but concentrated in a relatively small fraction of the individual tests. Mudholkar and George’s [26] logistic combined test manages to provide a compromise between the two previous cases. Additionally, when the total evidence against the combined null hypothesis is concentrated on one or on a few of the tests to be combined, Tippett’s combining function is useful.
3. Fake p-Values and Mendel Random Variables
An important issue that should be addressed before combining p-values is whether they are genuine or not. The overall alternative hypothesis $H_1$ states that some of the individual $H_{1i}$ are true, and so a meta-decision on $H_1$ implicitly assumes that some of the $P_i$’s may not have a uniform distribution, cf. Hartung et al. [32] (pp. 81–84) and Kulinskaya et al. [33] (pp. 117–119). In fact, the uniformity of the $P_i$’s is solely the consequence of assuming that the null hypothesis is true, but this questionable assumption led Tsui and Weerahandi [34] to introduce the concept of generalized p-values. See Weerahandi [35], Hung et al. [36] and Brilhante [37], and references therein, on the concepts of generalized and random p-values.
Moreover, the assumption $P_i \sim \mathrm{Uniform}(0,1)$, $i = 1, \ldots, n$, can be unrealistic. As a matter of fact, when an observed p-value is not highly significant or significant, there is a possibility that the experiment will be repeated in the hope of obtaining a “better” p-value to increase the likelihood of the research being published. However, the scientific malpractice of trying to obtain better p-values to comply with research teams’ expectations, which in some cases can be labeled as a fraudulent practice, can lead to disclosing results that are “too good to be true”, as Fisher [16] observed in his appraisal of Mendel’s work. Consult Pires and Branco [38] and Franklin [39] for more information on the famous Mendel-Fisher controversy.
If a reported $p_i$ is the “best” of $n_i$ observed p-values from $n_i$ independent replications of an experiment, i.e., $p_i$ is the minimum of $n_i$ independent $\mathrm{Uniform}(0,1)$ random variables, then $P_i \sim \mathrm{Beta}(1, n_i)$, which has a probability density function $f_{P_i}(x) = n_i(1-x)^{n_i-1}$, $0 < x < 1$. Therefore, $1-(1-P_i)^{n_i} \sim \mathrm{Uniform}(0,1)$. This also holds true for the case $n_i = 1$, i.e., for genuine p-values, since $P_i \sim \mathrm{Uniform}(0,1)$ when $n_i = 1$. So, the changes needed in Fisher’s statistic are

$$T = -2\sum_{i=1}^{n} \ln\left(1-(1-P_i)^{n_i}\right),$$

which under $H_0$ is also $\chi^2_{2n}$-distributed. However, the main problem here is that there is no information on whether some of the p-values are “fake ones”, and if they do exist, which ones are and what are the corresponding values of $n_i$.
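Assuming the replication counts $n_i$ were known (which, as just noted, they are not in practice), the corrected Fisher statistic could be computed as in this sketch; names and data are illustrative.

```python
import numpy as np
from scipy import stats

def fisher_with_replications(pvals, n_rep):
    """Map each p_i = min of n_i uniforms back to a uniform, then apply Fisher."""
    pvals = np.asarray(pvals, dtype=float)
    n_rep = np.asarray(n_rep)
    u = 1.0 - (1.0 - pvals) ** n_rep   # genuine Uniform(0,1) under H0
    t = -2.0 * np.sum(np.log(u))
    return t, stats.chi2.sf(t, 2 * pvals.size)

# e.g., the third p-value is the minimum of two replications (n_3 = 2)
print(fisher_with_replications([0.03, 0.20, 0.01, 0.40], [1, 1, 2, 1]))
```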
Please note that what makes the most sense is to consider either $n_i = 1$ or $n_i = 2$, because it would be a complete waste of time and resources to continue replicating an experiment if non-significant p-values keep showing up, especially if there is the (wrong) belief that a p-value is only “a good one” if it is significant. It is therefore assumed that $n_i = 1$ when a genuine p-value is reported, regardless of whether it is significant or not. However, when some researchers are dissatisfied with obtaining non-significant p-values for their (first) results, they may decide not to report them and abandon their research, or repeat the experiment once ($n_i = 2$). In the latter case, one of the following scenarios takes place:
- (a)
the second p-value is significant, and hence it is the one reported (fake p-value);
- (b)
the second p-value is also non-significant, and consequently either the smaller of the two observed p-values is reported (fake p-value), or none is reported and the research stops.
From the above, if $n_i = 2$, then clearly the right model for $P_i$ is a mixture of the minimum of two independent $\mathrm{Uniform}(0,1)$ random variables (or a $\mathrm{Beta}(1,2)$ random variable) and a $\mathrm{Uniform}(0,1)$ random variable, i.e., a model with probability density function

$$f(x) = 2\varepsilon(1-x) + (1-\varepsilon), \quad 0 \le x \le 1,$$

where $\varepsilon \in [0,1]$, and which can be reparameterized as

$$f_{M_\theta}(x) = 1 + \theta(1-2x), \quad 0 \le x \le 1, \qquad (1)$$

with $\theta = \varepsilon$, $\theta \in [0,1]$. Therefore, in Equation (1), $\theta$ is the probability of a p-value being a fake p-value.
What is interesting to notice is that if the probability density function of the standard uniform distribution is tilted using the point $(1/2, 1)$ as a pole, then for $\theta \in [-1,1]$, the right-hand side of Equation (1) is still a probability density function, more specifically, the probability density function of a Mendel random variable $M_\theta$.
From Equation (1), it is straightforward to see that $M_0 \sim \mathrm{Uniform}(0,1)$, $M_1 \sim \mathrm{Beta}(1,2)$, i.e., the minimum of two independent standard uniform random variables, and $M_{-1} \sim \mathrm{Beta}(2,1)$, i.e., the maximum of two independent standard uniform random variables. Moreover, if $\theta \in (0,1]$, then the Mendel distribution is a mixture of the standard uniform distribution, with weight $1-\theta$, and a $\mathrm{Beta}(1,2)$ distribution, while if $\theta \in [-1,0)$, it is a mixture of the standard uniform distribution, with weight $1-|\theta|$, and a $\mathrm{Beta}(2,1)$ distribution. So, the probability density function of $M_\theta$, $\theta \in [-1,1]$, can be expressed in the form

$$f_{M_\theta}(x) = |\theta|\, f_{U_{k:2}}(x) + (1-|\theta|), \quad 0 \le x \le 1,$$

with $k = 1$ if $\theta \ge 0$, or $k = 2$ if $\theta < 0$, and where $U_{1:2}$ and $U_{2:2}$ denote, respectively, the minimum and maximum of two independent standard uniform random variables, with $f_{U_{1:2}}(x) = 2(1-x)$ and $f_{U_{2:2}}(x) = 2x$.
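A short sketch of the Mendel density and a sampler built directly from the mixture representation above, which draws the Beta component with probability $|\theta|$; this follows our reconstruction of Equation (1), and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def mendel_pdf(x, theta):
    """Density f(x) = 1 + theta*(1 - 2x) on [0, 1], theta in [-1, 1]."""
    return 1.0 + theta * (1.0 - 2.0 * np.asarray(x))

def mendel_rvs(theta, size, rng=rng):
    """Sample via the mixture: Beta component with probability |theta|."""
    u = rng.uniform(size=size)
    extreme = rng.uniform(size=size) < abs(theta)
    v, w = rng.uniform(size=size), rng.uniform(size=size)
    if theta >= 0:
        beta_part = np.minimum(v, w)  # min of two uniforms ~ Beta(1,2)
    else:
        beta_part = np.maximum(v, w)  # max of two uniforms ~ Beta(2,1)
    return np.where(extreme, beta_part, u)

sample = mendel_rvs(0.3, 10_000)
print(sample.mean())  # should be close to 1/2 - theta/6 = 0.45
```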
An interesting fact related to the Mendel distribution is that if X and Y are independent random variables, both with support $[0,1]$, and with $X \sim M_\theta$, then

$$V = \frac{X}{Y}\,\mathbb{I}_{\{X \le Y\}} + \frac{1-X}{1-Y}\,\mathbb{I}_{\{X > Y\}} \sim M_{\theta\,(2E(Y)-1)}, \qquad (2)$$

which generalizes Deng and George’s [17] characterization of the standard uniform distribution when $\theta = 0$ (see Theorem 1 in Brilhante et al. [10]). Furthermore, if $\theta = 0$, then V and Y are independent random variables.
In particular, if X and Y are independent such that $X \sim M_\theta$ and $Y \sim \mathrm{Uniform}(0,1)$, then $V \sim \mathrm{Uniform}(0,1)$. On the other hand, if $X \sim M_\theta$ and $Y \sim \mathrm{Beta}(2,1)$ are independent, then

$$V \sim M_{\theta/3}, \qquad (3)$$

while if $X \sim M_\theta$ and $Y \sim \mathrm{Beta}(1,2)$ are independent, then $V \sim M_{-\theta/3}$.
Please note that Equations (2) or (3) can be used to test whether a sample of p-values $(p_1, \ldots, p_n)$ consists of observations from a $\mathrm{Uniform}(0,1)$, a $\mathrm{Beta}(1,2)$, or a $\mathrm{Beta}(2,1)$ distribution, being very useful to increase the test’s power when the sample size is small (see Gomes et al. [40] and Brilhante et al. [41] for more details). For this purpose, setting $X_i = p_i$ and generating independent $Y_i \sim \mathrm{Beta}(2,1)$, the pseudo-random values $v_i$, $i = 1, \ldots, n$, are obtained, and therefore to test, for instance, the uniformity of the sample $(p_1, \ldots, p_n)$, one tests the uniformity of the pseudo-random sample $(v_1, \ldots, v_n)$.
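The following sketch illustrates the testing idea under the reconstruction above; the choice $Y_i \sim \mathrm{Beta}(2,1)$ and the use of a Kolmogorov-Smirnov test are assumptions of this illustration, not prescriptions from Gomes et al. [40] or Brilhante et al. [41].

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def pseudo_sample(pvals, rng=rng):
    """v = x/y if x <= y, else (1-x)/(1-y), with auxiliary y ~ Beta(2,1)."""
    x = np.asarray(pvals, dtype=float)
    y = rng.beta(2, 1, size=x.size)
    return np.where(x <= y, x / y, (1.0 - x) / (1.0 - y))

p = rng.uniform(size=13)          # a "genuine" sample for illustration
v = pseudo_sample(p)
print(stats.kstest(v, "uniform"))  # Kolmogorov-Smirnov test of uniformity
```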
4. Combining Genuine and Fake p-Values
It is generally impossible to know whether or not there are fake p-values among the set of p-values to be combined. Therefore, a realistic approach is to examine possible scenarios and assess how the probable existence of fake p-values in a sample can affect the decision on the overall hypothesis $H_0$. For this purpose, tables with critical quantiles for p-value combination methods that take into account the existence of fake p-values in a sample, most likely in a very small number, can be useful to give an overall picture.
Such tables are given in Brilhante et al.’s [10] supplementary materials for the most commonly used combined test statistics, where it is assumed that among the n p-values to be combined there is at most a small number k of fake ones. The usefulness of the tables is illustrated with Example 1.
Example 1. Consider the set of n = 13 p-values obtained in studies on the depressive effects of a weekly 1 mg dose of semaglutide, and the corresponding observed values of the combined test statistics. The quantiles for n = 13 are extracted from the tables in [10] (without the standard errors) for the following methods: Fisher (Table 1), Stouffer (Table 2), Mudholkar and George (Table 3), Pearson’s geometric mean (Table 4), Pearson’s minimum of geometric means (Table 5), Edgington’s arithmetic mean (Table 6) and Tippett (Table 7). The quantiles that lead to the rejection of $H_0$ are highlighted for each method, thus showing for which significance level this happens.
Table 1. Estimated quantiles of Fisher’s statistic with k fake p-values.

| n | k | 0.900 | 0.950 | 0.975 | 0.990 | 0.995 |
|---|---|-------|-------|-------|-------|-------|
| 13 | 0 | 35.5632 | 38.8851 | 41.9232 | 45.6417 | 48.2899 |
| 13 | 1 | 36.6548 | 40.0294 | 43.0852 | 46.7821 | 49.4752 |
| 13 | 2 | 37.7241 | 41.1053 | 44.1240 | 47.9461 | 50.6022 |
| 13 | 3 | 38.8119 | 42.1576 | 45.2735 | 49.0533 | 51.7268 |
| 13 | 4 | 39.9069 | 43.2759 | 46.2994 | 50.1729 | 52.9071 |
Table 2. Estimated quantiles of Stouffer’s statistic with k fake p-values.

| n | k | 0.900 | 0.950 | 0.975 | 0.990 | 0.995 |
|---|---|-------|-------|-------|-------|-------|
| 13 | 0 | 1.2815 | 1.6448 | 1.9600 | 2.3264 | 2.5758 |
| 13 | 1 | 1.1087 | 1.4720 | 1.7844 | 2.1391 | 2.3767 |
| 13 | 2 | 0.9350 | 1.2924 | 1.6009 | 1.9524 | 2.1995 |
| 13 | 3 | 0.7620 | 1.1079 | 1.4117 | 1.7670 | 2.0049 |
| 13 | 4 | 0.5908 | 0.9312 | 1.2345 | 1.5756 | 1.8255 |
Table 3. Estimated quantiles of Mudholkar and George’s statistic with k fake p-values.

| n | k | 0.900 | 0.950 | 0.975 | 0.990 | 0.995 |
|---|---|-------|-------|-------|-------|-------|
| 13 | 0 | 8.3859 | 10.7850 | 12.8627 | 15.3892 | 17.1365 |
| 13 | 1 | 9.2840 | 11.6589 | 13.7337 | 16.2667 | 17.9027 |
| 13 | 2 | 10.1682 | 12.5187 | 14.5952 | 17.1134 | 18.8075 |
| 13 | 3 | 11.0523 | 13.3512 | 15.4344 | 17.9587 | 19.5983 |
| 13 | 4 | 11.9532 | 14.2848 | 16.2954 | 18.7587 | 20.4252 |
Table 4. Estimated quantiles of Pearson’s geometric mean statistic with k fake p-values.

| n | k | 0.005 | 0.010 | 0.025 | 0.050 | 0.100 |
|---|---|-------|-------|-------|-------|-------|
| 13 | 0 | 0.15609 | 0.17283 | 0.19940 | 0.22412 | 0.25466 |
| 13 | 1 | 0.14919 | 0.16544 | 0.19070 | 0.21448 | 0.24420 |
| 13 | 2 | 0.14287 | 0.15820 | 0.18323 | 0.20578 | 0.23436 |
| 13 | 3 | 0.13684 | 0.15162 | 0.17531 | 0.19762 | 0.22476 |
| 13 | 4 | 0.13075 | 0.14522 | 0.16853 | 0.18930 | 0.21549 |
Table 5. Estimated quantiles of Pearson’s minimum of geometric means statistic with k fake p-values.

| n | k | 0.005 | 0.010 | 0.025 | 0.050 | 0.100 |
|---|---|-------|-------|-------|-------|-------|
| 13 | 0 | 0.14144 | 0.15578 | 0.17882 | 0.19940 | 0.22388 |
| 13 | 1 | 0.14177 | 0.15608 | 0.17876 | 0.19939 | 0.22400 |
| 13 | 2 | 0.13916 | 0.15370 | 0.17667 | 0.19710 | 0.22212 |
| 13 | 3 | 0.13536 | 0.14960 | 0.17235 | 0.19326 | 0.21799 |
| 13 | 4 | 0.13019 | 0.14438 | 0.16717 | 0.18746 | 0.21216 |
Table 6. Estimated quantiles of Edgington’s arithmetic mean statistic with k fake p-values.

| n | k | 0.005 | 0.010 | 0.025 | 0.050 | 0.100 |
|---|---|-------|-------|-------|-------|-------|
| 13 | 0 | | | 0.34333 | 0.36774 | 0.39629 |
| 13 | 1 | | | 0.33258 | 0.35682 | 0.38486 |
| 13 | 2 | | | | 0.34616 | 0.37356 |
| 13 | 3 | | | | 0.33557 | 0.36253 |
| 13 | 4 | | | | | 0.35112 |
Table 7. Quantiles of Tippett’s statistic with k fake p-values.

| n | k | 0.005 | 0.010 | 0.025 | 0.050 | 0.100 |
|---|---|-------|-------|-------|-------|-------|
| 13 | 0 | | | | | |
| 13 | 1 | | | | | |
| 13 | 2 | | | | | |
| 13 | 3 | | | | | |
| 13 | 4 | | | | | |
For this example, Fisher’s method shows some stability when it comes to deciding on $H_0$, even when a small number of fake p-values may exist in the sample, and thus it seems robust to the prior choice of a significance level (usually 0.05). The same can be said of Pearson’s geometric mean method, which is, in fact, equivalent to Fisher’s method. The runner-up is Mudholkar and George’s method, which in traditional contexts has been shown to be a compromise between Fisher’s and Stouffer’s methods. Please note that Stouffer’s method, recommended in the social sciences, looks less reliable in this case. Clearly, Tippett’s method should be avoided, despite being the simplest of them all and having a very uncomplicated sampling distribution for its statistic even when fake p-values exist, since the minimum of n reported p-values, of which k are minima of two independent standard uniforms, is the minimum of n + k independent standard uniform random variables, i.e., $\mathrm{Beta}(1, n+k)$-distributed.
This example reinforces, to some extent, the general belief that Fisher’s combined test (or Pearson’s equivalent geometric mean test) should be used, even in a wider context of jointly combining genuine and fake p-values. However, a more in-depth study is needed to support such a conclusion, but this is beyond the scope of this review paper.
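To make the intended use of the tables concrete, the sketch below checks an observed Fisher statistic against the 0.950 column of Table 1 for k = 0 to 4 assumed fake p-values; the 13 p-values used here are hypothetical, for illustration only.

```python
import numpy as np

# 0.950 quantiles of Fisher's statistic for n = 13, k fake p-values (Table 1)
table1_95 = {0: 38.8851, 1: 40.0294, 2: 41.1053, 3: 42.1576, 4: 43.2759}

p = np.array([0.001, 0.004, 0.01, 0.02, 0.03, 0.03, 0.04,
              0.05, 0.06, 0.08, 0.10, 0.15, 0.20])  # hypothetical p-values
t = -2.0 * np.sum(np.log(p))
for k, q in table1_95.items():
    print(f"k = {k} fake p-values: reject H0 at 5%? {t > q}")
```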
5. Further Developments in Combining p-Values
There are many other modifications and generalizations of the classical test statistics for combining genuine p-values beyond those discussed in Section 2.
Fisher’s statistic is the most widely used for combining p-values and has therefore been the subject of several generalizations, namely weighted versions. The discussion of the conceptual advantages of weighting p-values, for instance, to improve the power of the combination method, goes back as far as Good [42]. In regard to the weighted combination of independent probabilities, see also Bhoj [43]. The combination of dependent and of weighted p-values are intertwined topics; aside from the references Chuang and Shih [44], Hou [45], Makambi [46], and Yang [47], cf. for instance Alves and Yu [48].
Lancaster [49] generalized Fisher’s method by transforming the p-values using the chi-squared distribution with $d_i$ degrees of freedom,

$$T_L = \sum_{i=1}^{n} F^{-1}_{\chi^2_{d_i}}(1-P_i),$$

where $F^{-1}_{\chi^2_{d_i}}$ is the inverse of the chi-square cumulative distribution function with $d_i$ degrees of freedom, so that in an independent setup, $T_L \sim \chi^2_{\sum_{i=1}^{n} d_i}$. Chen’s [50] numerical comparisons indicate that Lancaster’s statistic $T_L$ has a higher power than the traditional combination rules described in Section 2. Dai et al. [51] combined dependent p-values using approximations to the distribution of $T_L$, obtaining higher Bahadur efficiency than with a weighted version of the z-test.
Hou and Yang [52] developed a weighted version of Lancaster’s statistic, namely

$$T_w = \sum_{i=1}^{n} w_i\, F^{-1}_{\chi^2_{d_i}}(1-P_i).$$

Regardless of whether $P_1, \ldots, P_n$ are independent or not, the distribution of $T_w$ can be approximated by that of $c\,\chi^2_f$, and by equating expectations and variances, i.e., $E(T_w) = cf$ and $\mathrm{Var}(T_w) = 2c^2 f$, the parameter c can be estimated considering that

$$c = \frac{\mathrm{Var}(T_w)}{2E(T_w)},$$

and the parameter f by considering

$$f = \frac{2\,[E(T_w)]^2}{\mathrm{Var}(T_w)}.$$

It then follows that the $(1-\alpha)$-th percentile of the distribution of $T_w$ can be approximated by $c\,\chi^2_{f,\,1-\alpha}$.
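A sketch of this moment-matching recipe; the mean and variance of $T_w$ must be supplied by the user (they depend on the weights and on the dependence structure of the p-values), and the numeric values in the usage line are arbitrary.

```python
from scipy import stats

def matched_quantile(mean_tw, var_tw, alpha=0.05):
    """Approximate the (1-alpha) percentile of T_w by c * chi2_{f,1-alpha}."""
    c = var_tw / (2.0 * mean_tw)        # from E(T_w) = c f and Var(T_w) = 2 c^2 f
    f = 2.0 * mean_tw ** 2 / var_tw
    return c * stats.chi2.ppf(1.0 - alpha, f)  # non-integer df is allowed

print(matched_quantile(mean_tw=26.0, var_tw=60.0))
```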
Zhang and Wu [53] investigated a general family of Fisher-type statistics, referred to as the GFisher, which covers many classical statistics. Systematic simulations show that new p-value calculation methods based on moment-ratio matching and joint distribution surrogating are more accurate under the multivariate Gaussian model and more robust under the generalized linear model and the multivariate t distribution. The relevant computation has been implemented in the R package GFisher, which is available in the Comprehensive R Archive Network.
The poolr package (Cinar and Viechtbauer [54]) provides an implementation of a variety of methods for combining p-values, including the inverse chi-square method (Liu [55]), a binomial test (Wilkinson [18]) and a Bonferroni/Holm method [56], which is an alternative to Simes’ test [19]. Using an empirically derived null distribution based on pseudo-replicates that mimics a proper permutation test, an adjustment to account for dependence among the tests from which the p-values have been derived is made, assuming multivariate normality among the test statistics. The poolr package has been compared with several other packages that can be used to combine p-values. Dewey’s [57] metap v1.9 package provides an implementation of a wide variety of methods for combining independent p-values described in Becker [58].
Liu and Xie [59] suggested a statistic defined as a weighted sum of the Cauchy transformation of the individual p-values, whose null distribution has a tail that can be well approximated by a Cauchy distribution under arbitrary dependency structures. The p-value calculation for the test is accurate and as simple as that of the classical z-test or t-test, making it well suited for analyzing massive data. On the other hand, Ham and Park [60] showed that the Cauchy combination test provides the best combined p-value, in the sense that it had the best performance among the examined methods while controlling type I error rates.
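A sketch of the Cauchy combination idea described above; the equal default weights are an assumption of this illustration.

```python
import numpy as np
from scipy import stats

def cauchy_combination(pvals, weights=None):
    """Weighted sum of Cauchy-transformed p-values; null tail ~ standard Cauchy."""
    pvals = np.asarray(pvals, dtype=float)
    w = np.full(pvals.size, 1.0 / pvals.size) if weights is None else np.asarray(weights)
    t = np.sum(w * np.tan((0.5 - pvals) * np.pi))  # Cauchy transformation
    return stats.cauchy.sf(t)                       # approximate combined p-value

print(cauchy_combination([0.01, 0.20, 0.40, 0.60]))
```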
As the independence assumption is clearly a strong limitation when it comes to combining p-values, in 1975, Brown [61] discussed a method for combining non-independent tests of significance. The combination of p-values in correlated setups, for instance, in genome research requiring the analysis of Big Data, is currently a very active field of research, cf. Makambi [46], Hou [45], Yang [62], and Chuang and Shih [44]. In 2002, Kost and McDermott [63] derived an approximation to the null distribution of Fisher’s statistic for combining p-values when the underlying test statistics are jointly distributed as a multivariate t with a common denominator.
As already mentioned, Fisher’s statistic is the most used for combining p-values, and generalizing it to dependence contexts has also been a constantly revisited research topic (see, for instance, Yang [47], Dai et al. [51] or Li et al. [64]). Chen [65] investigated a new Gamma-based combination of p-values, based on the test statistic

$$T = \sum_{i=1}^{n} F^{-1}_{\mathrm{Gamma}(a,b)}(1-P_i),$$

where $F^{-1}_{\mathrm{Gamma}(a,b)}$ denotes the inverse of the Gamma cumulative distribution function with shape parameter a and scale parameter b, and showed that in many situations it provides an asymptotically Uniformly Most Powerful test.
Wilson [66] recommends the use of the harmonic mean p-value, i.e.,

$$\mathring{p} = \frac{\sum_{i=1}^{n} w_i}{\sum_{i=1}^{n} w_i/p_i},$$

with weights $w_i$, for combining dependent p-values, since it controls the overall type I error, i.e., the probability of falsely rejecting the overall null hypothesis $H_0$ in favor of at least one alternative hypothesis $H_{1i}$. It is a method complementary to Fisher’s, averaging only valid p-values when these are mutually exclusive but not necessarily independent. The sampling distribution of $\mathring{P}$ is known to be in the domain of attraction of the heavy-tailed Landau skewed additive (1,1)-stable law, is robust to positive dependency between p-values, and is also robust to the distribution of the weights $w_i$ used in its computation. Furthermore, it is insensitive to the number of tests and is mainly influenced by the smallest p-values.
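A sketch of the weighted harmonic mean p-value; the optional validity multiplier of order $\ln n$ reflects the Vovk and Wang result discussed below, and the exact constant used here is an assumption of this illustration.

```python
import numpy as np

def harmonic_mean_p(pvals, weights=None, scale_for_validity=False):
    """Weighted harmonic mean p-value; equal weights reduce to n / sum(1/p_i)."""
    p = np.asarray(pvals, dtype=float)
    w = np.full(p.size, 1.0 / p.size) if weights is None else np.asarray(weights)
    hmp = np.sum(w) / np.sum(w / p)
    if scale_for_validity:
        # assumed multiplier of order ln(n), cf. Vovk and Wang [68]
        hmp = min(1.0, np.e * np.log(p.size) * hmp)
    return hmp

print(harmonic_mean_p([0.01, 0.20, 0.40, 0.60]))
```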
Chien [67] compared the performances of Wilson’s [66] harmonic mean method and of Kost and McDermott’s [63] method with the performance of an empirical method based on the gamma distribution for combining dependent p-values from multiple hypothesis testing, which robustly controls the type I error and keeps a good power rate.
Based on recent developments in robust risk aggregation techniques, Vovk and Wang [68], combining a number of p-values without making any assumption about their dependence structure, extended those results to generalized means, and showed that n p-values can be combined by scaling up their harmonic mean by a factor of the order of $\ln n$.
E-values, defined as expectations, in contrast to p-values, defined as probabilities, are nonnegative random variables whose expected values under the null hypothesis are bounded by 1 (Shafer et al. [69]), as are Bayes factors and likelihood ratios in the case of a simple null hypothesis (Grünwald et al. [70]; Shafer et al. [69]). The combination of e-values via e-merging functions is a more recent and active field of research (cf. Grünwald et al. [70], Shafer [71], Vovk et al. [72,73], and Vuursteen et al. [74]). For instance, the product of independent e-values is clearly an e-value. However, so far, little is known about the power of these combination procedures, although this is now the main focus of research in this field.
6. Conclusions
The meta-analysis of p-values poses some challenges, especially in today’s world, in which academic and scientific achievements are largely measured (and funded) by the number of papers published, thus putting much pressure on researchers. For this reason, possibly some, but almost certainly very few, of the $p_i$’s, $i = 1, \ldots, n$, to be used in a statistic $T(p_1, \ldots, p_n)$ are fake p-values (the minimum of two p-values), when in an honest world, they should all be genuine p-values (not the minimum of two). Therefore, it is a good idea to perform a comparison between the conclusions drawn from different combined tests, assuming that among the observed $p_i$’s there are fake p-values, to ensure a more informed decision on the overall hypothesis.
The tables with quantiles of the most used methods for combining p-values that take into consideration the existence of a small number of fake p-values in a sample, obtained by the authors and provided in Brilhante et al. [10], can be a useful tool for assessing the reliability of the conclusions drawn from meta-analyses of p-values in the event of their unknown presence.