Asymptotic Properties of MSE Estimate for the False Discovery Rate Controlling Procedures in Multiple Hypothesis Testing

Abstract: Problems with analyzing and processing high-dimensional random vectors arise in a wide variety of areas. Important practical tasks are economical representation, searching for significant features, and removal of insignificant (noise) features. These tasks are fundamentally important for a wide class of practical applications, such as genetic chain analysis, encephalography, spectrography, video and audio processing, and a number of others. Current research in this area includes a wide range of papers devoted to various filtering methods based on the sparse representation of the obtained experimental data and statistical procedures for their processing. One of the most popular approaches to constructing statistical estimates of regularities in experimental data is the procedure of multiple testing of hypotheses about the significance of observations. In this paper, we consider a procedure based on the false discovery rate (FDR) measure that controls the expected percentage of false rejections of the null hypothesis. We analyze the asymptotic properties of the mean-square error estimate for this procedure and prove statements about the asymptotic normality of this estimate. The obtained results make it possible to construct asymptotic confidence intervals for the mean-square error of the FDR method using only the observed data.


Introduction
The problems involved in testing statistical hypotheses occupy an important place in applied statistics and arise in such areas as genetics, biology, astronomy, radar, computer graphics, etc. The classical methods for solving these problems are based on a single hypothesis test. There is a sample X of size m, and the null hypothesis H_0 is tested against the general alternative H_1. The hypothesis is tested using a statistic T, a function of the sample with a known distribution under the null hypothesis (the null distribution). For a given null distribution, the attained p-values are calculated, and the decision to reject the null hypothesis is made on their basis. Errors arising from the application of this single hypothesis test are divided into two types, and the probability of falsely rejecting the correct null hypothesis (the probability of a type I error) is bounded by a given significance level α:

P(type I error) = P(T ≥ t | H_0) ≤ α,

where t is the critical threshold value.
With this approach, we can often not only find the region for which the α-constraint on the probability of a type I error is satisfied, but also minimize the probability of a type II error, i.e., maximize the statistical power.
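As a small illustration (not from the paper), the single-test rule above can be sketched in a few lines of Python; the sample values and the known-variance z-test setting are hypothetical.

```python
from statistics import NormalDist

def z_test_p_value(sample, sigma=1.0):
    """Two-sided p-value for H0: mean = 0 against a general alternative,
    for i.i.d. N(mean, sigma^2) observations with known sigma."""
    n = len(sample)
    t = sum(sample) / (sigma * n ** 0.5)    # test statistic, N(0, 1) under H0
    return 2.0 * NormalDist().cdf(-abs(t))  # attained p-value

alpha = 0.05                                # bound on P(type I error)
sample = [0.1, -0.2, 0.05, 0.3, -0.1]       # hypothetical observations
p = z_test_p_value(sample)
reject_h0 = (p <= alpha)
```

Here the rejection region {T ≥ t} is expressed equivalently through the p-value: H_0 is rejected exactly when p ≤ α.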
When considering the problem of multiple hypothesis testing, the task becomes more complicated: now we are dealing with n different null hypotheses {H_0^i, i = 1, . . . , n} and the alternatives {H_1^i, i = 1, . . . , n}. These hypotheses are tested by statistics T_i with given null distributions. Thus, for each hypothesis, the attained p-values {p_i, i = 1, . . . , n} can be calculated, as well as the type II error probabilities. Let us introduce the notation: M_0 is the set of indices of the true null hypotheses, and R is the set of indices of the rejected hypotheses. Then V = |M_0 ∩ R| is the number of type I errors. The task is to keep V under control through the choice of the rejection set R.
There are many statistical procedures that offer different ways to solve the multiple hypothesis testing problem. One of the first measures proposed to generalize the type I error was the family-wise error rate (FWER) [1]. This value is defined as the probability of making at least one type I error, i.e., instead of controlling the probability of a type I error at the level α for each test, the overall FWER is controlled: FWER = P(V ≥ 1) ≤ α. However, such a strict criterion significantly increases the type II error for a large number of tested hypotheses.
In [2], an alternative measure called the false discovery rate (FDR) was proposed. This measure controls the expected proportion of false rejections:

FDR = E[V / max(|R|, 1)].

This approach is widely used in situations where the number of tested hypotheses is so large that it is preferable to allow a certain number of type I errors in order to increase the statistical power.
To control the FDR, the Benjamini-Hochberg [2] multiple hypothesis testing algorithm is often used, which under the condition of independence of the test statistics allows the FDR value to be bounded by the parameter α, i.e.,

FDR ≤ α.

In this procedure, the significance levels change linearly: α_k = α · k/n, k = 1, . . . , n. To apply the Benjamini-Hochberg method, a variational series is constructed from the attained p-values:

p_(1) ≤ p_(2) ≤ . . . ≤ p_(n).

The hypotheses corresponding to p_(1), . . . , p_(k) are rejected, where k, k ∈ {1, . . . , n}, is the maximum index for which

p_(k) ≤ α · k/n.

There are other measures to control the total number of type I errors. In [1], a q-value is considered that provides control of the positive false discovery rate (pFDR). Controlling the false coverage rate (FCR) involves solving the problem of multiple hypothesis testing in terms of confidence intervals [3]. The papers [4,5] are devoted to the harmonic mean p-value (HMP) method. However, in this paper we focus on the properties of the FDR method. It is believed that the widespread use of the FDR measure is due to the development of technologies that allow collecting and analyzing large amounts of data. Computing power makes it easy to perform hundreds or thousands of statistical tests on a given data set, and therefore the use of FWER loses its relevance.
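The Benjamini-Hochberg step-up rule described above is easy to state in code. A minimal Python sketch (the p-values below are hypothetical):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: reject the hypotheses with the
    k smallest p-values, where k is the largest index with p_(k) <= alpha*k/n."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices of p_(1) <= ... <= p_(n)
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / n:
            k = rank                                     # remember the last crossing
    return set(order[:k])                                # indices of rejected hypotheses

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(pvals, alpha=0.05)         # -> {0, 1}
```

Note that the rule is a step-up one: k is the *largest* crossing index, so a p-value above its own level α·k/n can still be rejected if a later order statistic crosses.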
In this paper, we study the asymptotic properties of the mean-square risk estimate for the FDR method in the problem of multiple hypothesis testing for the mathematical expectation of a Gaussian vector with independent components. The consistency of this estimate was proved in [6]; here, we prove its asymptotic normality.
The paper is organized as follows. Section 2 provides some basic information about the statement of the problem and the considered vector classes. In Section 3 we define the mean-square risk of the thresholding method and describe the properties of the FDR-threshold. Section 4 considers the asymptotic properties of the mean-square risk estimate, and Section 5 contains some concluding remarks.

Preliminaries
Consider the problem of estimating the mathematical expectation of a Gaussian vector

X_i = µ_i + W_i,  i = 1, . . . , n,   (1)

where the W_i are independent normally distributed random variables with zero expectation and known variance σ², and µ = (µ_1, . . . , µ_n) is an unknown vector belonging to some given set (class). The key assumption adopted in this paper is the "sparsity" of the vector µ, i.e., it is assumed that only a relatively small number of its components are significantly large. A similar problem statement arises, for example, in the analysis and processing of signals containing noise. In this case, the sparsity or "economical" representation of the signal is achieved using some special preprocessing, for example, a discrete wavelet transform of the signal vector.
In this paper, we consider the following definitions of sparsity. Let ‖µ‖_0 denote the number of nonzero components of µ. Fixing η_n, define the class

L_0(η_n) = {µ ∈ R^n : ‖µ‖_0 ≤ η_n n}.

For small values of η_n, only a small number of vector components are nonzero. Another possible way to define sparsity is to limit the absolute values of µ_i. To do this, consider the sorted absolute values |µ|_(1) ≥ . . . ≥ |µ|_(n) and for 0 < p < 2 define the class

L_p(η_n) = {µ ∈ R^n : |µ|_(k) ≤ η_n n^{1/p} k^{−1/p} for all k = 1, . . . , n}.
In addition, sparsity can be modeled using the ℓ_p-norm

‖µ‖_p^p = Σ_{i=1}^n |µ_i|^p.

In this case, the sparse class is defined as

M_p(η_n) = {µ ∈ R^n : n^{−1} ‖µ‖_p^p ≤ η_n^p}.

There are important relationships between these classes. As p → 0, the ℓ_p-norm approaches the ℓ_0-"norm":

‖µ‖_p^p → ‖µ‖_0 as p → 0.
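As a quick sanity check of these definitions (a sketch, not from the paper; the vector below is hypothetical), class membership can be tested directly:

```python
def in_l0_class(mu, eta_n):
    """Membership in L_0(eta_n): at most eta_n * n nonzero components."""
    return sum(1 for m in mu if m != 0.0) <= eta_n * len(mu)

def in_lp_class(mu, eta_n, p):
    """Membership in L_p(eta_n): |mu|_(k) <= eta_n * n^(1/p) * k^(-1/p) for all k."""
    n = len(mu)
    mags = sorted((abs(m) for m in mu), reverse=True)    # |mu|_(1) >= ... >= |mu|_(n)
    return all(mags[k - 1] <= eta_n * n ** (1.0 / p) * k ** (-1.0 / p)
               for k in range(1, n + 1))

mu = [5.0, -2.0] + [0.0] * 98                            # n = 100, two nonzero components
```

For this vector, `in_l0_class(mu, 0.05)` holds (2 ≤ 5 nonzero components allowed), while `in_l0_class(mu, 0.01)` fails.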

Mean-Square Risk and Properties of the FDR Threshold
In the considered problem, one of the widespread and well-proven methods for constructing an estimate of µ is the method of (hard) thresholding of each vector component:

µ̂_i = X_i · 1(|X_i| > T),   (2)

i.e., the vector component is zeroed if its absolute value does not exceed the critical threshold T. This procedure is equivalent to testing the hypothesis of zero mathematical expectation for each component of the vector, and when using the FDR method, the threshold value T is selected according to the following rule. The initial sample is used to construct a variational series of decreasing absolute values |X|_(1) ≥ . . . ≥ |X|_(n), and the |X|_(k) are compared with the right-tail Gaussian quantiles t_k = σ z((α/2) · (k/n)), where z(q) denotes the upper q-quantile of the standard normal law. Let k_F be the largest index k for which |X|_(k) ≥ t_k; then the threshold T_F = t_{k_F} is chosen.
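A Python sketch of this threshold selection rule (assuming known σ; the data are synthetic, and the fallback when no order statistic crosses its quantile is a convention chosen here, not prescribed by the paper):

```python
from statistics import NormalDist

def fdr_threshold(x, sigma=1.0, alpha=0.05):
    """FDR threshold T_F = t_{k_F}: k_F is the largest k with |X|_(k) >= t_k,
    where t_k = sigma * z((alpha/2)*(k/n)) and z(q) is the upper q-quantile
    of N(0, 1).  If no order statistic crosses, t_1 is returned (a convention)."""
    n = len(x)
    mags = sorted((abs(v) for v in x), reverse=True)     # |X|_(1) >= ... >= |X|_(n)
    t = [sigma * NormalDist().inv_cdf(1.0 - alpha * k / (2.0 * n))
         for k in range(1, n + 1)]
    k_f = 0
    for k in range(1, n + 1):
        if mags[k - 1] >= t[k - 1]:
            k_f = k                                      # last crossing index
    return t[k_f - 1] if k_f else t[0]

def hard_threshold(x, T):
    """Hard thresholding (2): zero out the components with |X_i| <= T."""
    return [v if abs(v) > T else 0.0 for v in x]

x = [6.0, 5.0, 0.2, -0.1, 0.05, -0.3, 0.15, 0.1, -0.05, 0.2]  # two strong signals
T_f = fdr_threshold(x)
mu_hat = hard_threshold(x, T_f)
```

With two clearly significant components among ten, the last crossing occurs at k = 2, and the resulting threshold separates the signals from the noise.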
In combination with hypothesis testing methods, the penalty method is also widely used, in which the target loss function is minimized with the addition of a penalty term [7-9]. In a particular case, this method leads to the so-called soft thresholding: the estimates of the vector components are calculated according to the rule

µ̂_i = ρ_S(X_i, T) = sgn(X_i)(|X_i| − T)_+.   (3)

This approach is in some cases more adequate than (2), since the function ρ_S in (3) is continuous in T.
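The soft thresholding rule (3), sketched in Python for comparison with the hard rule (the observations are hypothetical):

```python
def soft_threshold(x, T):
    """Soft thresholding (3): mu_hat_i = sgn(X_i) * (|X_i| - T)_+ .
    Unlike hard thresholding, the result is continuous in T."""
    return [max(abs(v) - T, 0.0) * (1.0 if v >= 0 else -1.0) for v in x]

obs = [3.0, -2.5, 0.4, -0.1]        # hypothetical observations
est = soft_threshold(obs, T=1.0)    # components below T are zeroed, the rest shrunk by T
```

Components exceeding T in absolute value are shrunk toward zero by exactly T, which is what makes the estimate continuous as a function of the threshold.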
The mean-square error (or risk) of the considered procedures is defined as

R(T) = Σ_{i=1}^n E(µ̂_i − µ_i)²,   (4)

and methods for selecting the threshold value T are usually focused on minimizing the risk (4) provided that the vector µ belongs to a given class. A "perfect" value of the threshold is

T_min = arg min_T R(T).

Note that the expression (4) contains the unknown values µ_i, and it is impossible to calculate R(T) and T_min in practice. Therefore, a minimax approach is used. The threshold T_F is calculated based on the observed values of X_i and has the property of adaptive minimax optimality in the considered sparse classes [7]. In addition, T_F has the following important property [7], which we will use later in proving the asymptotic normality of the risk estimate.

Theorem 1. [7] Suppose that µ ∈ L_0(η_n) or µ ∈ L_p(η_n), 0 < p < 2, where η_n ∈ [n^{−1}(log n)^5, n^{−γ}] for L_0(η_n) and η_n^p ∈ [n^{−1}(log n)^5, n^{−γ}] for L_p(η_n), 0 < γ < 1. Then there exists c > 0 such that for the FDR-threshold T_F with a controlling parameter α_n → 0 and large n,

P(T_F ≤ T_1) ≤ exp(−c α_n κ_n γ_n² log n),   (5)

where the quantities T_1, κ_n and γ_n take different explicit forms for L_0(η_n) and for L_p(η_n) (see [7]).
Thus, if α_n is chosen so that α_n κ_n γ_n² log n → ∞, the value T_1 is an asymptotic lower bound for the threshold T_F. Note also that the so-called universal threshold T_U = σ√(2 log n) is popular as well. This threshold is, in a certain sense, the maximal useful one (it was shown in [10,11] that thresholds T > T_U can be ignored). Based on this, we will assume everywhere that T ≤ T_U.
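Since R(T) in (4) depends on the unknown µ_i, the value T_min can only be located when µ is known, e.g., on synthetic data. A Monte Carlo sketch (soft thresholding, a hypothetical sparse µ, σ = 1; the grid and replication count are arbitrary choices):

```python
import random

def mc_risk_soft(mu, sigma, T, n_rep=2000, seed=0):
    """Monte Carlo approximation of the risk (4) for soft thresholding
    when the mean vector mu is known (possible only in simulation)."""
    rng = random.Random(seed)                # common random numbers across T values
    total = 0.0
    for _ in range(n_rep):
        for m in mu:
            x = m + rng.gauss(0.0, sigma)
            est = max(abs(x) - T, 0.0) * (1.0 if x >= 0 else -1.0)
            total += (est - m) ** 2
    return total / n_rep

mu = [4.0, 3.0] + [0.0] * 48                 # sparse mean vector, n = 50
grid = [0.2 * j for j in range(1, 16)]       # candidate thresholds 0.2 .. 3.0
risks = {T: mc_risk_soft(mu, 1.0, T) for T in grid}
T_min_hat = min(risks, key=risks.get)        # empirical stand-in for T_min
```

Reusing the same seed for every candidate T (common random numbers) makes the comparison across thresholds much more stable than independent simulations would be.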

Asymptotic Properties of the Risk Estimate
As already mentioned, since the expression (4) explicitly depends on the unknown values µ_i, it cannot be calculated in practice. However, it is possible to construct its estimate, which is calculated using only the observed data. This estimate is determined by the expression

R̂(T) = Σ_{i=1}^n F[X_i, T],   (7)

where

F[X_i, T] = (X_i² − σ²) 1(|X_i| ≤ T) + (σ² + T²) 1(|X_i| > T)

for the soft thresholding [12]. In [6] it is proved that the estimate (7) is consistent.
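The estimate (7) is straightforward to compute from the data alone. A sketch of the soft-thresholding version of F[X_i, T]:

```python
def risk_estimate_soft(x, sigma, T):
    """Risk estimate (7): R_hat(T) = sum_i F[X_i, T], where
    F[X_i, T] = (X_i^2 - sigma^2) 1(|X_i| <= T) + (sigma^2 + T^2) 1(|X_i| > T)."""
    s2 = sigma * sigma
    return sum((v * v - s2) if abs(v) <= T else (s2 + T * T) for v in x)

# Small worked example: X = (0.5, 3.0), sigma = 1, T = 1:
# F[0.5, 1] = 0.25 - 1 = -0.75 and F[3.0, 1] = 1 + 1 = 2, so R_hat = 1.25.
r_hat = risk_estimate_soft([0.5, 3.0], 1.0, 1.0)
```

Note that individual terms can be negative (as for X_i = 0.5 above); it is the sum that estimates the nonnegative risk (4).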

Theorem 2.
[6] Let the conditions of Theorem 1 be satisfied and α_n → 0 as n → ∞ so that α_n κ_n γ_n² log n → ∞. Then

(R̂(T_F) − R(T_F))/n → 0 in probability as n → ∞.

Let us prove a statement about the asymptotic normality of the estimate (7), which, in particular, allows constructing asymptotic confidence intervals for the mean-square risk (4). In the proof, we will use the same notation C for different positive constants that may depend on the parameters of the classes and methods under consideration, but do not depend on n.

Theorem 3.
Let µ ∈ L_0(η_n), η_n ∈ [n^{−1}(log n)^5, n^{−γ}], 1/2 < γ < 1. Let T_F be the FDR-threshold with a controlling parameter α_n → 0 and α_n κ_n γ_n² log n → ∞ as n → ∞, where κ_n and γ_n are defined in (5). Then

(R̂(T_F) − R(T_F)) / (σ²√(2n)) ⇒ N(0, 1) as n → ∞,

where ⇒ denotes convergence in distribution.

Proof. Let us prove the theorem for the soft thresholding method. In the case of hard thresholding, the proof is similar. Write

(R̂(T_F) − R(T_F)) / (σ²√(2n)) = (R̂(T_min) − R(T_min)) / (σ²√(2n)) + (R̂(T_F) − R̂(T_min)) / (σ²√(2n)) + (R(T_min) − R(T_F)) / (σ²√(2n)).

Let us show that

(R̂(T_min) − R(T_min)) / (σ²√(2n)) ⇒ N(0, 1).   (8)

With soft thresholding, R̂(T_min) is an unbiased estimate of R(T_min), and with hard thresholding, under the conditions of the theorem the bias tends to zero when divided by √n [12]. For the variance of the numerator [13],

D R̂(T_min) = Σ_{i=1}^n D F[X_i, T_min].   (9)

Moreover, since the X_i are independent, D X_i² = 2σ⁴ + 4σ²µ_i², and the number of nonzero µ_i does not exceed η_n n, we obtain

D R̂(T_min) / (2σ⁴ n) → 1 as n → ∞.   (10)

Finally, the Lindeberg condition is met: for any ε > 0, as n → ∞,

B_n^{−2} Σ_{i=1}^n E[(F[X_i, T_min] − E F[X_i, T_min])² 1(|F[X_i, T_min] − E F[X_i, T_min]| > ε B_n)] → 0,   (11)

where B_n² = D R̂(T_min). Indeed, due to (9) and (10), and since the summands in R̂(T_min) are modulo bounded by the value T_U² + σ², starting from some n all indicators in (11) vanish. Therefore, (8) holds, and to prove the theorem it remains to show that

(R̂(T_F) − R̂(T_min)) / √n → 0 and (R(T_min) − R(T_F)) / √n → 0 in probability.   (12)

Repeating the reasoning from [14-16], it can be shown that T_min ≥ T_1 − α_n, where |α_n| ≤ C log log n / √(log n).
To shorten the notation without compromising the proof, we can omit α_n and assume that T_min ≥ T_1. Denote H_i(T, T_min) = F[X_i, T] − F[X_i, T_min] and U(T) = Σ_{i=1}^n H_i(T, T_min) = R̂(T) − R̂(T_min). For any ε > 0,

P(|R̂(T_F) − R̂(T_min)| > ε√n) ≤ P(sup_{T ∈ [T_1, T_U]} |U(T)| > ε√n) + P(T_F < T_1).

Let U(T) = S_1(T) + S_2(T), T ∈ [T_1, T_U], where the sum S_2(T) contains the terms with µ_i = 0, and S_1(T) contains all other terms. By the definition of the class L_0(η_n), the number of terms in S_1(T) does not exceed n_1 ≈ η_n n. Moreover, the absolute value of each term is bounded by T_U² + σ². For convenience, we will assume that S_1(T) contains the terms with indices from 1 to n_1, i.e., S_1(T) = Σ_{i=1}^{n_1} H_i(T, T_min). Given the definition of the class L_0(η_n) and the form of T_1, it can be shown that for the terms of S_2 the estimate |E H_i(T, T_min)| ≤ C log n / n holds, so that

|E U(T)| ≤ C log n · n^{1−γ},

and for γ > 1/2, sup_{T ∈ [T_1, T_U]} |E U(T)| / √n → 0 as n → ∞.
Let us now consider the sum S_2(T). For large n, the number of terms in this sum is n − n_1 ≈ n. We divide the segment [T_1, T_U] into equal parts: T_j = T_1 + jδ_n ∈ [T_1, T_U], j = 1, . . . , n − 1, δ_n = (T_U − T_1)/n. Then

sup_{T ∈ [T_1, T_U]} |S_2(T) − E S_2(T)| ≤ D_n + max_j sup_{T ∈ [T_j, T_j + δ_n]} |S_2(T) − S_2(T_j) − E(S_2(T) − S_2(T_j))|,

where D_n = max_j |S_2(T_j) − E S_2(T_j)|. Taking into account the definition of the class L_0(η_n) and the form of T_1, we can bound the variance of the terms in S_2 (and hence in D_n): D H_i(T, T_min) ≤ C log n ((log n)^5)^{3/2} n^{−γ}. Then, applying Bernstein's inequality [18] for D_n, we obtain

P(D_n > ε√n) → 0 as n → ∞.

Next, the oscillation of S_2 on [T_j, T_j + δ_n] is controlled by N_2(T_j, T_j + δ_n), the number of terms in S_2 whose indicators can change on this interval, i.e., those i with |X_i| ∈ (T_j, T_j + δ_n]. The variance of the terms in N_2(T_j, T_j + δ_n) is bounded by C log n ((log n)^5)^{1/2} n^{−γ}. Applying Bernstein's inequality once again, we obtain

P(max_j sup_{T ∈ [T_j, T_j + δ_n]} |S_2(T) − S_2(T_j) − E(S_2(T) − S_2(T_j))| > ε√n) → 0.

Hence, for an arbitrary ε > 0,

P(sup_{T ∈ [T_1, T_U]} |U(T) − E U(T)| > ε√n) → 0 as n → ∞.

Together with the bound on E U(T), Theorem 1, and the choice of α_n, this completes the proof. □
A similar statement is true for the class L_p(η_n).

Theorem 4.
Let µ ∈ L_p(η_n), 0 < p < 2, η_n^p ∈ [n^{−1}(log n)^5, n^{−γ}], 1/2 < γ < 1, and let T_F be the FDR-threshold with a controlling parameter α_n → 0 such that α_n κ_n γ_n² log n → ∞ as n → ∞, where κ_n and γ_n are defined in (5). Then

(R̂(T_F) − R(T_F)) / (σ²√(2n)) ⇒ N(0, 1) as n → ∞.
Proof. The main steps of the proof repeat the proof of Theorem 3. We again write

(R̂(T_F) − R(T_F)) / (σ²√(2n)) = (R̂(T_min) − R(T_min)) / (σ²√(2n)) + (R̂(T_F) − R̂(T_min)) / (σ²√(2n)) + (R(T_min) − R(T_F)) / (σ²√(2n)).

The statement

(R̂(T_min) − R(T_min)) / (σ²√(2n)) ⇒ N(0, 1)

is proved exactly the same as the statement (8). Let U(T) = S_1(T) + S_2(T), T ∈ [T_1, T_U], where the sum S_2(T) contains the terms with |µ_i| ≤ C/T_1, and S_1(T) contains all other terms. By the definition of the class L_p(η_n), the number of terms in S_1(T) does not exceed n_1 ≈ Cη_n^p n, and each term is modulo bounded by T_U² + σ². Considering the form of T_1, it can be shown that the mathematical expectations of the terms in S_2 do not exceed C (log n)^{1/2} n^{−γ}, and their variances do not exceed C log n ((log n)^5)^{3/2} n^{−γ}. Next, arguing as in Theorem 3, we see that for an arbitrary ε > 0,

P(sup_{T ∈ [T_1, T_U]} |U(T) − E U(T)| > ε√n) → 0 as n → ∞.

Thus, since P(T_F ≤ T_1) → 0 by Theorem 1, we obtain

(R̂(T_F) − R(T_F)) / (σ²√(2n)) ⇒ N(0, 1) as n → ∞. □
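The normalization σ²√(2n) appearing in the theorems can be checked empirically. The sketch below is an illustration, not a reproduction of the theorems: it fixes µ = 0 and the nonrandom threshold T = T_U instead of T_F (so that R(T) is computable in closed form and R̂(T) is exactly unbiased), and verifies that the normalized deviation behaves like a standard normal variable.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def risk_estimate_soft(x, sigma, T):
    """Risk estimate (7), soft-thresholding version of F[X_i, T]."""
    s2 = sigma * sigma
    return sum((v * v - s2) if abs(v) <= T else (s2 + T * T) for v in x)

nd = NormalDist()
n, sigma = 4000, 1.0
T = sigma * math.sqrt(2.0 * math.log(n))          # universal threshold T_U
# Exact risk R(T) for mu = 0, sigma = 1 (per component):
# E F[X, T] = -2*T*phi(T) + 2*(1 + T^2)*Phi(-T).
risk = n * (-2.0 * T * nd.pdf(T) + 2.0 * (1.0 + T * T) * nd.cdf(-T))

rng = random.Random(1)
z = []
for _ in range(300):                              # independent replications
    x = [rng.gauss(0.0, sigma) for _ in range(n)]
    z.append((risk_estimate_soft(x, sigma, T) - risk)
             / (sigma ** 2 * math.sqrt(2.0 * n)))
# If the normalization is right, mean(z) is near 0 and stdev(z) is near 1.
```

With µ = 0 and a large threshold, almost every summand equals X_i² − σ², whose variance is 2σ⁴, which is exactly where the factor σ²√(2n) comes from.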
The above statements demonstrate that the considered method for constructing estimates in the model (1) has very similar properties to the method based on minimizing the estimate (7) in the parameter T (see [19]).

Conclusions
In this paper, we considered a method of estimating the mean of a Gaussian vector based on the procedure of multiple hypothesis testing. The estimation is based on the false discovery rate measure, which controls the expected percentage of false rejections of the null hypothesis. It is common to use the mean-square risk for evaluating the performance of this approach. Its value cannot be calculated in practice, so its estimate must be considered instead. We analyzed the asymptotic properties of this estimate and proved that it is asymptotically normal for the classes of sparse vectors. This result justifies the use of the mean-square risk estimate for practical purposes and allows constructing asymptotic confidence intervals for a theoretical mean-square risk. For more accurate analysis it is desirable to have guaranteed confidence intervals. These intervals could be constructed based on the estimates of the convergence rate in Theorems 3 and 4. Guaranteed confidence intervals would help to understand how the results of Theorems 3 and 4 affect the risk estimation for a finite sample size. We therefore leave the problem of estimating the rate of convergence and numerical simulation for future work.

Conflicts of Interest:
The authors declare no conflicts of interest.