Composite Tests under Corrupted Data

Broniatowski, Michel; Jurečková, Jana; Moses, Ashok Kumar; Miranda, Emilie

doi:10.3390/e21010063

Open AccessArticle

Composite Tests under Corrupted Data

by

Michel Broniatowski

^1,*,

Jana Jurečková

^2,3,

Ashok Kumar Moses

⁴

and

Emilie Miranda

^1,5

¹

Laboratoire de Probabilités, Statistique et Modélisation, Sorbonne Université, 75005 Paris, France

²

Institute of Information Theory and Automation, The Czech Academy of Sciences, 18208 Prague, Czech Republic

³

Faculty of Mathematics and Physics, Charles University, 18207 Prague, Czech Republic

⁴

Department of ECE, Indian Institute of Technology, Palakkad 560012, India

⁵

Safran Aircraft Engines, 77550 Moissy-Cramayel, France

^*

Author to whom correspondence should be addressed.

Entropy 2019, 21(1), 63; https://doi.org/10.3390/e21010063

Submission received: 11 November 2018 / Revised: 8 January 2019 / Accepted: 10 January 2019 / Published: 14 January 2019

(This article belongs to the Special Issue New Developments in Statistical Information Theory Based on Entropy and Divergence Measures)

Download

Browse Figures

Versions Notes

Abstract

:

This paper focuses on test procedures under corrupted data. We assume that the observations

Z_{i}

are mismeasured, due to the presence of measurement errors. Thus, instead of

Z_{i}

for

i = 1, \dots, n

, we observe

X_{i} = Z_{i} + \sqrt{δ} V_{i}

, with an unknown parameter

δ

and an unobservable random variable

V_{i}

. It is assumed that the random variables

Z_{i}

are i.i.d., as are the

X_{i}

and the

V_{i}

. The test procedure aims at deciding between two simple hyptheses pertaining to the density of the variable

Z_{i}

, namely

f_{0}

and

g_{0}

. In this setting, the density of the

V_{i}

is supposed to be known. The procedure which we propose aggregates likelihood ratios for a collection of values of

δ

. A new definition of least-favorable hypotheses for the aggregate family of tests is presented, and a relation with the Kullback-Leibler divergence between the sets

{(f_{δ})}_{δ}

and

{(g_{δ})}_{δ}

is presented. Finite-sample lower bounds for the power of these tests are presented, both through analytical inequalities and through simulation under the least-favorable hypotheses. Since no optimality holds for the aggregation of likelihood ratio tests, a similar procedure is proposed, replacing the individual likelihood ratio by some divergence based test statistics. It is shown and discussed that the resulting aggregated test may perform better than the aggregate likelihood ratio procedure.

Keywords:

composite hypotheses; corrupted data; least-favorable hypotheses; Neyman Pearson test; divergence based testing; Chernoff Stein lemma

1. Introduction

A situation which is commonly met in quality control is the following: Some characteristic Z of an item is supposed to be random, and a decision about its distribution has to be made based on a sample of such items, each with the same distribution

F_{0}

(with density

f_{0}

) or

G_{0}

(with density

g_{0}

). The measurement device adds a random noise

V_{δ}

to each measurement, mutually independent and independent of the item, with a common distribution function

H_{δ}

and density

h_{δ}

, where

δ

is an unknown scaling parameter. Therefore the density of the measurement

X : = Z + V_{δ}

is either

f_{δ} : = f_{0} * h_{δ}

or

g_{δ} : = g_{0} * h_{δ}

, where ∗ denotes the convolution operation. We denote

F_{δ}

(respectively

G_{δ}

) to be the distribution function with density

f_{δ}

(respectively

g_{δ}

).

The problem of interest, studied in [1], is how the measurement errors can affect the conclusion of the likelihood ratio test with statistics

L_{n} : = \frac{1}{n} \sum log \frac{g_{0}}{f_{0}} (X_{i}) .

For small

δ

, the result of [2] enables us to estimate the true log-likelihood ratio (true Kullback-Leibler divergence) even when we only dispose of locally perturbed data by additive measurement errors. The distribution function

H_{0}

of the measurement errors is considered unknown, up to zero expectation and unit variance. When we use the likelihood ratio test, while ignoring the possible measurement errors, we can incur a loss in both errors of the first and second kind. However, it is shown, in [1], that for small

δ

the original likelihood ratio test (LRT) is still the most powerful, only on a slightly changed significance level. The test problem leads to a composite of null and alternative classes

H_{0}

or

H_{1}

of distributions of random variables

Z + V_{δ}

with

V_{δ} : = \sqrt{δ} V

, where V has distribution

H_{1} .

If those families are bounded by alternating Choquet capacities of order 2, then the minimax test is based on the likelihood ratio of the pair of the least-favorable distributions of

H_{0}

and

H_{1}

, respectively (see Huber and Strassen [3]). Moreover, Eguchi and Copas [4] showed that the overall loss of power caused by a misspecified alternative equals the Kullback-Leibler divergence between the original and the corrupted alternatives. Surprisingly, the value of the overall loss is independent of the choice of null hypothesis. The arguments of [2] and of [5] enable us to approximate the loss of power locally, for a broad set of alternatives. The asymptotic behavior of the loss of power of the test based on sampled data is considered in [1], and is supplemented with numerical illustration.

Statement of the Test Problem

Our aim is to propose a class of statistics for testing the composite hypotheses

H_{0}

and

H_{1}

, extending the optimal Neyman-Pearson LRT between

f_{0}

and

g_{0}

. Unlike in [1], the scaling parameter

δ

is not supposed to be small, but merely to belong to some interval bounded away from 0.

We assume that the distribution H of the random variable (r.v.) V is known; indeed, in the tuning of the offset of a measurement device, it is customary to perform a large number of observations on the noise under a controlled environment.

Therefore, this first step produces a good basis for the modelling of the distribution of the density h. Although the distribution of V is known, under operational conditions the distribution of the noise is modified: For a given

δ

in

[δ_{min}, δ_{max}]

with

δ_{min} > 0

, denote by

V_{δ}

a r.v. whose distribution is obtained through some transformation from the distribution of V, which quantifies the level of the random noise. A classical example is when

V_{δ} = \sqrt{δ} V

, but at times we have a weaker assumption, which amounts to some decomposability property with respect to

δ

: For instance, in the Gaussian case, we assume that for all

δ, η

, there exists some r.v.

W_{δ, η}

such that

V_{δ + η} =_{d} V_{δ} + W_{δ, η}

, where

V_{δ}

and

W_{δ, η}

are independent.

The test problem can be stated as follows: A batch of i.i.d. measurements

X_{i} : = Z_{i} + V_{δ, i}

is performed, where

δ > 0

is unknown, and we consider the family of tests of

H_{0}

(

δ

)

: =

[X has density

f_{δ}

] vs.

H_{1}

(

δ

)

: =

[X has density

g_{δ}

], with

δ \in Δ = [δ_{min}, δ_{max}] .

Only the

X_{i}

are observed. A class of combined tests of

H_{0}

vs.

H_{1}

is proposed, in the spirit of [6,7,8,9].

Under every fixed n, we assume that

δ

is allowed to run over a finite set

p_{n}

of components of the vector

Δ_{n} : = [δ_{min} = δ_{0, n}, \dots, δ_{p_{n}, n} = δ_{max}]

. The present construction is essentially non-asymptotic, neither on n nor on

δ

, in contrast with [1], where

δ

was supposed to lie in a small neighborhood of 0. However, with increasing n, it would be useful to consider that the array

{(δ_{j, n})}_{j = 1}^{p_{n}}

is getting dense in

Δ = [δ_{min}, δ_{max}]

and that

lim_{n \to \infty} \frac{log p_{n}}{n} = 0 .

(1)

For the sake of notational brevity, we denote by

Δ

the above grid

Δ_{n}

, and all suprema or infima over

Δ

are supposed to be over

Δ_{n}

. For any event B and any

δ

in

Δ

,

F_{δ} (B)

(respectively

G_{δ} (B)

) designates the probability of B under the distribution

F_{δ}

(respectively

G_{δ}

). Given a sequence of levels

α_{n}

, we consider a sequence of test criteria

T_{n} : = T_{n} (X_{1}, \dots, X_{n})

of

H_{0} (δ)

, and the pertaining critical regions

T_{n} (X_{1}, \dots, X_{n}) > A_{n},

(2)

such that

F_{δ} (T_{n} (X_{1}, \dots, X_{n}) > A_{n}) \leq α_{n} \forall δ \in Δ,

leading to rejection of

H_{0}

(

δ

) for at least some

δ \in Δ .

In an asymptotic context, it is natural to assume that

α_{n}

converges to 0 as n increases, since an increase in the sample size allows for a smaller first kind risk. For example, in [8],

α_{n}

takes the form

α_{n} : = exp {- n a_{n}}

for some sequence

a_{n} \to \infty

.

In the sequel, the Kullback-Leibler discrepancy between probability measures Q and P, with respective densities p and q (with respect to the Lebesgue measure on

R

), is denoted

K (Q, P) : = \int log \frac{q (x)}{p (x)} q (x) d x

whenever defined, and takes value

+ \infty

otherwise.

The present paper handles some issues with respect to this context. In Section 2, we consider some test procedures based on the supremum of Likelihood Ratios (LR) for various values of

δ

, and define

T_{n}

. The threshold for such a test is obtained for any level

α_{n}

, and a lower bound for its power is provided. In Section 3, we develop an asymptotic approach to the Least Favorable Hypotheses (LFH) for these tests. We prove that asymptotically least-favorable hypotheses are obtained through minimization of the Kullback-Leibler divergence between the two composite classes

H 0

and

H 1

independently upon the level of the test.

We next consider, in Section 3.3, the performance of the test numerically; indeed, under the least-favorable pair of hypotheses we compare the power of the test (as obtained through simulation) with the theoretical lower bound, as obtained in Section 2. We show that the minimal power, as measured under the LFH, is indeed larger than the theoretical lower bound—this result shows that the simulation results overperform on theoretical bounds. These results are developed in a number of examples.

Since no argument plays in favor of any type of optimality for the test based on the supremum of likelihood ratios for composite testing, we consider substituting those ratios with other kinds of scores in the family of divergence-based concepts, extending the likelihood ratio in a natural way. Such an approach has a long history, stemming from the seminal book by Liese and Vajda [10]. Extensions of the Kullback-Leibler based criterions (such as the likelihood ratio) to power-type criterions have been proposed for many applications in Physics and in Statistics (see, e.g., [11]). We explore the properties of those new tests under the pair of hypotheses minimizing the Kullback-Leibler divergence between the two composite classes

H 0

and

H 1

. We show that, in some cases, we can build a test procedure whose properties overperform the above supremum of the LRTs, and we provide an explanation for this fact. This is the scope of Section 4.

2. An Extension of the Likelihood Ratio Test

For any

δ

in

Δ

, let

T_{n, δ} : = \frac{1}{n} \sum_{i = 1}^{n} log \frac{g_{δ}}{f_{δ}} (X_{i}),

(3)

and define

T_{n} : = sup_{δ \in Δ} T_{n, δ} .

Consider, for fixed

δ

, the Likelihood Ratio Test with statistics

T_{n, δ}

which is uniformly most powerful (UMP) within all tests of

H 0 (δ) : = p_{T} = f_{δ}

vs.

H 1 (δ) : = p_{T} = g_{δ}

, where

p_{T}

designates the distribution of the generic r.v. X. The test procedure to be discussed aims at solving the question: Does there exist some

δ

, for which

H 0

(

δ

) would be rejected vs.

H 1

(

δ

), for some prescribed value of the first kind risk?

Whenever

H 0

(

δ

) is rejected in favor of

H 1

(

δ

), for some

δ

, we reject

H 0 : = f_{0} = g_{0}

in favor of

H 1 : = f_{0} \neq g_{0}

. A critical region for this test with level

α_{n}

is defined by

T_{n} > A_{n},

with

\begin{array}{l} P_{H 0} (H 1) & = & sup_{δ \in Δ} F_{δ} (T_{n} > A_{n}) \\ = & sup_{δ \in Δ} F_{δ} (⋃_{δ^{'}} T_{n, δ^{'}} > A_{n}) \leq α_{n} . \end{array}

Since, for any sequence of events

B_{1}, \dots, B_{p_{n}}

,

F_{δ} (⋃_{k = 1}^{p_{n}} B_{k}) \leq p_{n} max_{1 \leq k \leq p_{n}} F_{δ} (B_{k}),

it holds that

P_{H 0} (H 1) \leq p_{n} max_{δ \in Δ} max_{δ^{'} \in Δ} F_{δ} (T_{n, δ^{'}} > A_{n}) .

(4)

An upper bound for

P_{H 0} (H 1)

can be obtained, making use of the Chernoff inequality for the right side of (4), providing an upper bound for the risk of first kind for a given

A_{n}

. The correspondence between

A_{n}

and this risk allows us to define the threshold

A_{n}

accordingly.

Turning to the power of this test, we define the risk of second kind by

\begin{array}{l} P_{H_{1}} (H 0) & : = sup_{η \in Δ} G_{η} (T_{n} \leq A_{n}) \\ = sup_{η \in Δ} G_{η} (sup_{δ \in Δ} T_{n, δ} \leq A_{n}) \\ = sup_{η \in Δ} G_{η} (⋂_{δ \in Δ} T_{n, δ} \leq A_{n}) \\ \leq sup_{η \in Δ} G_{η} (T_{n, η} \leq A_{n}), \end{array}

(5)

a crude bound which, in turn, can be bounded from above through the Chernoff inequality, which yields a lower bound for the power of the test under any hypothesis

g_{η}

in

H 1

.

Let

α_{n}

denote a sequence of levels, such that

lim sup_{n \to \infty} α_{n} < 1 .

We make use of the following hypothesis:

sup_{δ \in Δ} sup_{δ^{'} \in Δ} \int log \frac{f_{δ^{'}}}{g_{δ^{'}}} f_{δ} < 0 .

(6)

Remark 1.

Since

\int log \frac{f_{δ^{'}}}{g_{δ^{'}}} f_{δ} = K (F_{δ}, G_{δ^{'}}) - K (F_{δ}, F_{δ^{'}}),

making use of the Chernoff-Stein Lemma (see Theorem A1 in the Appendix A), Hypothesis (6) means that any LRT with H0:

p_{T} = f_{δ}

vs. H1:

p_{T} = g_{δ^{'}}

is asymptotically more powerful than any LRT with H0:

p_{T} = f_{δ}

vs. H1:

p_{T} = f_{δ^{'}} .

Both hypotheses (7) and (8), which are defined below, are used to provide the critical region and the power of the test.

For all

δ, δ^{'}

define

Z_{δ^{'}} : = log \frac{g_{δ^{'}}}{f_{δ^{'}}} (X),

and let

φ_{δ, δ^{'}} (t) : = log E_{F_{δ}} (exp (t Z_{δ^{'}})) = log \int {(\frac{g_{δ^{'}} (x)}{f_{δ^{'}} (x)})}^{t} f_{δ} (x) d x .

With

N_{δ, δ^{'}}

, the set of all t such that

φ_{δ, δ^{'}} (t)

is finite, we assume

N_{δ, δ^{'}} is a non void open neighborhood of 0 .

(7)

Define, further,

J_{δ, δ^{'}} (x) : = sup_{t} t x - φ_{δ, δ^{'}} (t),

and let

J (x) : = min_{(δ, δ^{'}) \in Δ \times Δ} J_{δ, δ^{'}} (x) .

For any

η

, let

W_{η} : = - log \frac{g_{η}}{f_{η}} (X),

and let

ψ_{η} (t) : = log E_{G_{η}} (exp (t W_{η})) .

Let

M_{η}

be the set of all t such that

ψ_{η} (t)

is finite. Assume

M_{η} is a non void neighborhood of 0 .

(8)

Let

I_{η} (x) : = sup_{t} t x - log E_{G_{η}} (exp (t W_{η})),

(9)

and

I (x) : = inf_{η} I_{η} (x) .

We also assume an accessory condition on the support of

Z_{δ^{'}}

and

W_{η}

, respectively under

F_{δ}

and under

G_{η}

(see (A2) and (A5) in the proof of Theorem A1). Suppose the regularity assumptions (7) and (8) are fulfilled for all

δ, δ^{'}

and

η

. Assume, further, that

p_{n}

fulfills (1).

The following result holds:

Proposition 2.

Whenever (6) holds, for any sequence of levels

α_{n}

bounded away from 1, defining

A_{n} : = J^{- 1} (- \frac{1}{n} log \frac{α_{n}}{p_{n}}),

it holds, for large n, that

P_{H 0} (H 1) = sup_{δ \in Δ} F_{δ} (T_{n} > A_{n}) \leq α_{n}

and

P_{H 1} (H 1) = sup_{δ \in Δ} G_{δ} (T_{n} > A_{n}) \geq 1 - exp (- n I (A_{n})) .

3. Minimax Tests under Noisy Data, Least-Favorable Hypotheses

3.1. An Asymptotic Definition for the Least-Favorable Hypotheses

We prove that the above procedure is asymptotically minimax for testing the composite hypothesis

H_{0}

against the composite alternative

H_{1}

. Indeed, we identify the least-favorable hypotheses, say

F_{δ_{*}} \in H_{0}

and

G_{δ_{*}} \in H_{1}

, which lead to minimal power and maximal first kind risk for these tests. This requires a discussion of the definition and existence of such a least-favourable pair of hypotheses in an asymptotic context; indeed, for a fixed sample size, the usual definition only leads to an explicit definition in very specific cases. Unlike in [1], the minimax tests will not be in the sense of Huber and Strassen. Indeed, on one hand, hypotheses

H_{0}

and

H_{1}

are not defined in topological neighbourhoods of

F_{0}

and

G_{0}

, but rather through a convolution under a parametric setting. On the other hand, the specific test of

\{H_{0} (δ), δ \in Δ\}

against

\{H_{1} (δ), δ \in Δ\}

does not require capacities dominating the corresponding probability measures.

Throughout the subsequent text, we shall assume that there exists

δ_{*}

such that

min_{δ \in Δ} K (F_{δ}, G_{δ}) = K (F_{δ_{*}}, G_{δ_{*}}) .

(10)

We shall call the pair of distributions

(F_{\underset{̲}{δ}}, G_{\underset{̲}{δ}})

least-favorable for the sequence of tests

1 \{T_{n} > A_{n}\}

if they satisfy

\begin{matrix} F_{δ} (T_{n} \leq A_{n}) \geq F_{\underset{̲}{δ}} (T_{n} \leq A_{n}) \\ \geq G_{\underset{̲}{δ}} (T_{n} \leq A_{n}) \geq G_{δ} (T_{n} \leq A_{n}) \end{matrix}

(11)

for all

δ \in Δ .

The condition of unbiasedness of the test is captured by the central inequality in (11).

Because, for finite n, such a pair can be constructed only in few cases, we should take a recourse of (11) to the asymptotics

n \to \infty .

We shall show that any pair of distributions

(F_{δ_{*}} G_{δ_{*}})

achieving (10) are least-favorable. Indeed, it satisfies the inequality (11) asymptotically on the logarithmic scale.

Specifically, we say that

(F_{\underset{̲}{δ}}, G_{\underset{̲}{δ}})

is a least-favorable pair of distributions when, for any

δ \in Δ

,

\begin{matrix} lim inf_{n \to \infty} \frac{1}{n} log F_{\underset{̲}{δ}} (T_{n} \leq A_{n}) & \geq lim_{n \to \infty} \frac{1}{n} log G_{\underset{̲}{δ}} (T_{n} \leq A_{n}) \\ \geq lim_{n \to \infty} sup \frac{1}{n} log G_{δ} (T_{n} \leq A_{n}) . \end{matrix}

(12)

Define the total variation distance

d_{T V} (F_{δ}, G_{δ}) : = sup_{B} |F_{δ} (B) - G_{δ} (B)|,

where the supremum is over all Borel sets B of

R

. We will assume that, for all n,

α_{n} < 1 - sup_{δ \in Δ} d_{T V} (F_{δ}, G_{δ}) .

(13)

We state our main result, whose proof is deferred to the Appendix B.

Theorem 3.

For any level

α_{n}

satisfying (13), the pair

(F_{δ_{*}}, G_{δ_{*}})

is a least-favorable pair of hypotheses for the family of tests

1 \{T_{n} \geq A_{n}\}

, in the sense of (12).

3.2. Identifying the Least-Favorable Hypotheses

We now concentrate on (10).

For given

δ \in [δ_{min}, δ_{max}]

with

δ_{min} > 0

, the distribution of the r.v.

V_{δ}

is obtained through some transformation from the known distribution of

V .

The classical example is

V_{δ} = \sqrt{δ} V

, which is a scaling, and where

\sqrt{δ}

is the signal to noise ratio. The following results state that the Kullback-Leibler discrepancy

K (F_{δ}, G_{δ})

reaches its minimal value when the noise

V_{δ}

is “maximal”, under some additivity property with respect to

δ .

This result is not surprising: Adding noise deteriorates the ability to discriminate between the two distributions

F_{0}

and

G_{0}

—this effect is captured in

K (F_{δ}, G_{δ})

, which takes its minimal value for the maximal

δ .

Proposition 4.

Assume that, for all

δ, η

, there exists some r.v.

W_{δ, η}

such that

V_{δ + η} =_{d} V_{δ} + W_{δ, η}

where

V_{δ}

and

W_{δ, η}

are independent. Then

δ_{*} = δ_{max} .

This result holds as a consequence of Lemma A5 in the Appendix C.

In the Gaussian case, when h is the standard normal density, Proposition 4 holds, since

h_{δ + η} = h_{δ} * h_{η - δ}

with

h_{ε} (x) : = (1 / \sqrt{ε}) h (x / \sqrt{ε}) .

In order to model symmetric noise, we may consider a symmetrized Gamma density as follows: Set

h_{δ} (x) : = (1 / 2) γ^{+} (1, δ) (x) + (1 / 2) γ^{-} (1, δ) (x)

, where

γ^{+} (1, δ)

designates the Gamma density with scale parameter 1 and shape parameter

δ

, and

γ^{-} (1, δ)

the Gamma density on

R^{-}

with same parameter. Hence a r.v. with density

h_{δ}

is symmetrically distributed and has variance

2 δ .

Clearly,

h_{δ + η} (x) = h_{δ} * h_{η} (x)

, which shows that Proposition 4 also holds in this case. Note that, except for values of

δ

less than or equal to 1, the density

h_{δ}

is bimodal, which does not play in favour of such densities for modelling the uncertainty, due to the noise. In contrast with the Gaussian case,

h_{δ}

cannot be obtained from

h_{1}

by any scaling. The centred Cauchy distribution may help as a description of heavy tailed symmetric noise, and keeps uni-modality through convolution; it satisfies the requirements of Proposition 4 since

f_{δ} * f_{η} (x) = f_{δ + η} (x)

where

f_{ε} (x) : = ε / π (x^{2} + ε^{2})

. In this case,

δ

acts as a scaling, since

f_{δ}

is the density of

δ X

where X has density

f_{1} .

In practice, the interesting case is when

δ

is the variance of the noise and corresponds to a scaling of a generic density, as occurs for the Gaussian case or for the Cauchy case. In the examples, which will be used below, we also consider symmetric, exponentially distributed densities (Laplace densities) or symmetric Weibull densities with a given shape parameter. The Weibull distribution also fulfills the condition in Proposition 4, being infinitely divisible (see [12]).

3.3. Numerical Performances of the Minimax Test

As frequently observed, numerical results deduced from theoretical bounds are of poor interest due to the sub-optimality of the involved inequalities. They may be sharpened on specific cases. This motivates the need for simulation. We study two cases, which can be considered as benchmarks.

In the first case, $f_{0}$ is a normal density with expectation 0 and variance 1, whereas $g_{0}$ is a normal density with expectation $0.3$ and variance $1$ .
The second case handles a situation where $f_{0}$ and $g_{0}$ belong to different models: $f_{0}$ is a log-normal density with location parameter 1 and scale parameter $0.2$ , whereas $g_{0}$ is a Weibull density on $R^{+}$ with shape parameter 5 and scale parameter $3$ . Those two densities differ strongly, in terms of asymptotic decay. They are, however, very close to one another in terms of their symmetrized Kullback-Leibler divergence (the so-called Jeffrey distance). Indeed, centering on the log-normal distribution $f_{0}$ , the closest among all Weibull densities is at distance $0.10$ —the density $g_{0}$ is at distance $0.12$ from $f_{0} .$

Both cases are treated, considering four types of distribution for the noise:

The noise $h_{δ}$ is a centered normal density with variance $δ^{2}$ ;
the noise $h_{δ}$ is a centered Laplace density with parameter $λ (δ)$ ;
the noise $h_{δ}$ is a symmetrized Weibull density with shape parameter $1.5$ and variable scale parameter $β (δ)$ ; and
the noise $h_{δ}$ is Cauchy with density $h_{δ} (x) = γ (δ) / π (γ {(δ)}^{2} + x^{2})$ .

In order to compare the performances of the test under those four distributions, we have adopted the following rule: The parameter of the distribution of the noise is tuned such that, for each value

\underset{̲}{δ}

, it holds that

P (|V_{\underset{̲}{δ}}| > \underset{̲}{δ}) = Φ (1) - Φ (- 1) \sim 0.65

, where

Φ

stands for the standard Gaussian cumulative function. Thus, distributions b to d are scaled with respect to the Gaussian noise with variance

δ^{2}

.

In both cases A and B, the range of

δ

is

Δ = (δ_{min} = 0.1, δ_{max})

, and we have selected a number of possibilities for

δ_{max}

, ranging from

0.2

to

0.7

.

In case A, we selected

= δ_{max}^{2} = 0.5

, which has a signal-to-noise ratio equal to

0.7

, a commonly chosen bound in quality control tests.

In case B, the variance of

f_{0}

is roughly

0.6

and the variance of

g_{0}

is roughly

0.4 .

The maximal value of

δ_{max}^{2}

is roughly

0.5 .

This is thus a maximal upper bound for a practical modeling.

We present some power functions, making use of the theoretical bounds together with the corresponding ones based on simulation runs. As seen, the performances in the theoretical approach is weak. We have focused on simulation, after some comparison with the theoretical bounds.

3.3.1. Case A: The Shift Problem

In this subsection, we evaluate the quality of the theoretical power bound, defined in the previous sections. Thus, we compare the theoretical formula to the empirical lower performances obtained through simulations under the least-favorable hypotheses.

Theoretical Power Bound

While supposedly valid for finite n, the theoretical power bound given by (A8) still assumes some sort of asymptotics, since a good approximation of the bound implies a fine discretization of

Δ

to compute

I (A_{n}) = {inf}_{η \in Δ_{n}} I_{η} (A_{n})

. Thus, by condition (1), n has to be large. Therefore, in the following, we will compute this lower bound for n sufficiently large (that is, at least 100 observations), which is also consistent with industrial applications.

Numerical Power Bound

In order to obtain a minimal bound for the power of the composite test, we compute the power of the test

H_{0} (δ_{*})

against

H_{1} (δ_{*})

, where

δ_{*}

defines the LFH pair

(F_{δ_{*}}, G_{δ_{*}})

.

Following Proposition 4, the LFH for the test defined by

T_{n}

when the noise follows a Gaussian, a Cauchy, or a symmetrized Weibull distribution is achieved for

(F_{δ_{max}}, G_{δ_{max}})

.

When the noise follows a Laplace distribution, the pair of LFH is the one that satisfies:

(F_{δ_{*}}, G_{δ_{*}}) = arg min_{(F_{δ}, G_{δ}), δ \in Δ_{n}} K (F_{δ}, G_{δ}) .

(14)

In both of the cases A and B, this condition is also satisfied for

δ^{*} = δ_{max}

.

Comparison of the Two Power Curves

As expected, Figure 1, Figure 2 and Figure 3 show that the theoretical lower bound is always below the empirical lower bound, when n is high enough to provide a good approximation of

I (A_{n})

. This is also true when the noise follows a Cauchy distribution, but for a bigger sample size than in the figures above (

n > 250

).

In most cases, the theoretical bound tends to largely underestimate the power of the test, when compared to its minimal performance given by simulations under the least-favorable hypotheses. The gap between the two also tends to increase as n grows. This result may be explained by the large bound provided by (5), while the numerical performances are obtained with respect to the least-favorable hypotheses.

From a computational perspective, the computational cost of the theoretical bound is far higher than its numeric counterpart.

3.3.2. Case B: The Tail Thickness Problem

The calculation of the moment-generating function, appearing in the formula of

I_{η} (x)

in (9), is numerically unstable, which renders the computation of the theoretical bound impossible. Thus, in the following sections, the performances of the test will be evaluated numerically, through Monte Carlo replications.

4. Some Alternative Statistics for Testing

4.1. A Family of Composite Tests Based on Divergence Distances

This section provides a similar treatment as above, dealing now with some extensions of the LRT test to the same composite setting. The class of tests is related to the divergence-based approach to testing, and it includes the cases considered so far. For reasons developed in Section 3.3, we argue through simulation and do not develop the corresponding large deviation approach.

The statistics

T_{n}

can be generalized in a natural way, by defining a family of tests depending on some parameter

γ .

For

γ \neq 0, 1

, let

ϕ_{γ} (x) : = \frac{x^{γ} - γ x + γ - 1}{γ (γ - 1)}

be a function defined on

(0, \infty)

with values in

(0, \infty)

, setting

ϕ_{0} (x) : = - log x + x - 1

and

ϕ_{1} (x) : = x log x - x + 1 .

For

γ \leq 2

, this class of functions is instrumental in order to define the so-called power divergences between probability measures, a class of pseudo-distances widely used in statistical inference (see, for example, [13]).

Associated to this class, consider the function

\begin{matrix} φ_{γ} (x) & : = - \frac{d}{d x} ϕ_{γ} (x) \\ = \frac{1 - x^{γ - 1}}{γ - 1} for γ \neq 0, 1 . \end{matrix}

We also consider

\begin{array}{l} φ_{1} (x) : = - log x \\ φ_{0} (x) : = \frac{1}{x} - 1, \end{array}

from which the statistics

T_{n, δ}^{γ} : = \frac{1}{n} \sum_{i = 1}^{n} φ_{γ} (X_{i})

and

T_{n}^{γ} : = sup_{δ} T_{n, δ}^{γ}

are well defined, for all

γ \leq 2

. Figure 4 illustrates the functions

φ_{γ}

, according to

γ

.

Fix a risk of first kind

α

, and the corresponding power of the LRT pertaining to

H 0 (δ_{*})

vs.

H 1 (δ_{*})

by

1 - β : = G_{δ_{*}} (T_{n, δ_{*}}^{1} > s_{α}),

with

s_{α} : = inf \{s : F_{δ_{*}} (T_{n, δ_{*}}^{1} > s) \leq α\} .

Define, accordingly, the power of the test, based on

T_{n}^{γ}

under the same hypotheses, by

s_{α}^{γ} : = inf \{s : F_{δ_{*}} (T_{n}^{γ} > s) \leq α\}

and

1 - β^{'} : = G_{δ_{*}} (T_{n}^{γ} > s_{α}^{γ}) .

First,

δ_{*}

defines the pair of hypotheses

(F_{δ_{*}}, G_{δ_{*}})

, such that the LRT with statistics

T_{n, δ_{*}}^{1}

has maximal power among all tests

H 0 (δ_{*})

vs.

H 1 (δ_{*}) .

Furthermore, by Theorem A1, it has minimal power on the logarithmic scale among all tests

H 0 (δ)

vs.

H 1 (δ) .

On the other hand,

(F_{δ_{*}}, G_{δ_{*}})

is the LF pair for the test with statistics

T_{n}^{1}

among all pairs

(F_{δ}, G_{δ}) .

These two facts allow for the definition of the loss of power, making use of

T_{n}^{1}

instead of

T_{n, δ_{*}}^{1}

for testing

H 0 (δ_{*})

vs.

H 1 (δ_{*}) .

This amounts to considering the price of aggregating the local tests

T_{n, δ}^{1}

, a necessity since the true value of

δ

is unknown. A natural indicator for this loss consists in the difference

Δ_{n}^{1} : = G_{δ_{*}} (T_{n, δ_{*}}^{1} > s_{α}) - G_{δ_{*}} (T_{n}^{1} > s_{α}^{1}) \geq 0 .

Consider, now, aggregated test statistics

T_{n}^{γ} .

We do not have at hand a similar result, as in Proposition 2. We, thus, consider the behavior of the test

H 0 (δ_{*})

vs.

H 1 (δ_{*})

, although

(F_{δ_{*}}, G_{δ_{*}})

may not be a LFH for the test statistics

T_{n}^{γ} .

The heuristics, which we propose, makes use of the corresponding loss of power with respect to the LRT, through

Δ_{n}^{γ} : = G_{δ_{*}} (T_{n, δ_{*}}^{1} > s_{α}) - G_{δ_{*}} (T_{n}^{γ} > s_{α}^{γ}) .

We will see that it may happen that

Δ_{n}^{γ}

improves over

Δ_{n}^{1} .

We define the optimal value of

γ

, say

γ^{*}

, such that

Δ_{n}^{γ^{*}} \leq Δ_{n}^{γ},

for all

γ .

In the various figures hereafter, NP corresponds to the LRT defined between the LFH’s

(F_{δ_{*}}, G_{δ_{*}}),

KL to the test with statistics

T_{n}^{1}

(hence, as presented Section 2), HELL corresponds to

T_{n}^{1 / 2}

, which is associated to the Hellinger power divergence, and G = 2 corresponds to

γ = 2 .

4.2. A Practical Choice for Composite Tests Based on Simulation

We consider the same cases A and B, as described in Section 3.3.

As stated in the previous section, the performances of the different test statistics are compared, considering the test of

H_{0} (δ_{*})

against

H_{1} (δ_{*})

where

δ^{*}

is defined, as explained in Section 3.3 as the LFH for the test

T_{n}^{1}

. In both cases A and B, this corresponds to

δ^{*} = δ_{max}

.

4.2.1. Case A: The Shift Problem

Overall, the aggregated tests perform well, when the problem consists in identifying a shift in a distribution. Indeed, for the three values of

γ

(0.5, 1, and 2), the power remains above 0.7 for any kind of noise and any value of

δ_{*}

. Moreover, the power curves associated to

T_{n}^{γ}

mainly overlap with the optimal test

T_{n, δ_{*}}^{1}

.

Under Gaussian noise, the power remains mostly stable over the values of $δ_{*}$ , as shown by Figure 5. The tests with statistics $T_{n}^{1}$ and $T_{n}^{2}$ are equivalently powerful for large values of $δ_{*}$ , while the first one achieves higher power when $δ_{*}$ is small.
When the noise follows a Laplace distribution, the three power curves overlap the NP power curve, and the different test statistics can be indifferently used. Under such a noise, the alternative hypotheses are extremely well distinguished by the class of tests considered, and this remains true as $δ_{*}$ increases (cf. Figure 6).
Under the Weibull hypothesis, $T_{n}^{1}$ and $T_{n}^{2}$ perform similarly well, and almost always as well as $T_{n, δ_{*}}^{1}$ , while the power curve associated to $T_{n}^{1 / 2}$ remains below. Figure 7 illustrates that, as $δ_{max}$ increases, the power does not decrease much.
Under a Cauchy assumption, the alternate hypotheses are less distinguishable than under any other parametric hypothesis on the noise, since the maximal power is about 0.84, while it exceeds 0.9 in cases a, b, and c (cf. Figure 5, Figure 6, Figure 7 and Figure 8). The capacity of the tests to discriminate between $H 0 (δ_{max})$ and $H 1 (δ_{max})$ is almost independent of the value of $δ_{max}$ , and the power curves are mainly flat.

4.2.2. Case B: The Tail Thickness Problem

With the noise defined by case A (Gaussian noise), for KL ( $γ = 1$ ), $δ_{*} = δ_{max}$ due to Proposition 4 and statistics $T_{n}^{1}$ provides the best power uniformly upon $δ_{max} .$ Figure 9 shows a net decrease of the power as $δ_{max}$ increases (recall that the power is evaluated under the least favorable alternative $G_{δ_{max}}$ ).
When the noise follows a Laplace distribution, the situation is quite peculiar. For any value of $δ$ in $Δ$ , the modes $M_{G_{δ_{max}}}$ and $M_{F_{δ_{max}}}$ of the distributions of $(f_{δ} / g_{δ}) (X)$ under $G_{δ_{max}}$ and under $F_{δ_{max}}$ are quite separated; both larger than $1 .$ Also, for $δ$ all the values of $| ϕ_{γ} (M_{G_{δ_{max}}}) - ϕ_{γ} (M_{F_{δ_{max}}}) |$ are quite large for large values of $γ .$ We may infer that the distributions of $ϕ_{γ} ((f_{δ} / g_{δ}) (X))$ under $G_{δ_{max}}$ and under $F_{δ_{max}}$ are quite distinct for all $δ$ , which in turn implies that the same fact holds for the distributions of $T_{n}^{γ}$ for large $γ .$ Indeed, simulations presented in Figure 10 show that the maximal power of the test tends to be achieved when $γ = 2 .$
When the noise follows a symmetric Weibull distribution, the power function when $γ = 1$ is very close to the power of the LRT between $F_{δ_{max}}$ and $G_{δ_{max}}$ (cf. Figure 11). Indeed, uniformly on $δ$ , and on x, the ratio $(f_{δ} / g_{δ}) (x)$ is close to 1. Therefore, the distribution of $T_{n}^{1}$ is close to that of $T_{n, δ_{max}}^{1}$ , which plays in favor of the KL composite test.
Under a Cauchy distribution, similarly to case A, Figure 12 shows that $T_{n}^{γ}$ achieves the maximal power for $γ = 1$ and 2, closely followed by $γ = 0.5$ .

5. Conclusions

We have considered a composite testing problem, where simple hypotheses in either

H 0

and

H 1

were paired, due to corruption in the data. The test statistics were defined through aggregation of simple likelihood ratio tests. The critical region for this test and a lower bound of its power was produced. We have shown that this test is minimax, evidencing the least-favorable hypotheses. We have considered the minimal power of the test under such a least favorable hypothesis, both theoretically and by simulation, and for a number of cases (including corruption by Gaussian, Laplacian, symmetrized Weibull, and Cauchy noise). Whatever this noise, the actual minimal power, as measured through simulation, was quite higher than obtained through analytic developments. Least-favorable hypotheses were defined in an asymptotic sense, and were proved to be the pair of simple hypotheses in

H 0

and

H 1

which are closest, in terms of the Kullback-Leibler divergence; this holds as a consequence of the Chernoff-Stein Lemma. We, next, considered aggregation of tests where the likelihood ratio was substituted by a divergence-based statistics. This choice extended the former one, and may produce aggregate tests with higher power than obtained through aggregation of the LRTs, as examplified and analysed. Open questions are related to possible extensions of the Chernoff-Stein Lemma for divergence-based statistics.

Author Contributions

Conceptualization, M.B. and J.J.; Methodology, M.B., J.J., A.K.M. and E.M.; Software, E.M.; Validation, M.B., A.K.M. and E.M.; Formal Analysis, M.B. and J.J.; Writing Original Draft Preparation, M.B., J.J., E.M. and A.K.M.; Supervision, M.B.

Funding

The research of Jana Jurečková was supported by the Grant 18-01137S of the Czech Science Foundation. Michel Broniatowski and M. Ashok Kumar would like to thank the Science and Engineering Research Board of the government of India for the financial support for their collaboration through the VAJRA scheme.

Acknowledgments

The authors are thankful to Jan Kalina for discussion; they also thank two anonymous referees for comments which helped to improve on a former version of this paper.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of Proposition 2

Appendix A.1. The Critical Region of the Test

Define

Z_{δ^{'}} : = log \frac{g_{δ^{'}}}{f_{δ^{'}}} (X),

which satisfies

\begin{matrix} E_{F_{δ}} (Z_{δ^{'}}) & = \int log \frac{g_{δ^{'}}}{f_{δ^{'}}} (x) f_{δ} (x) d x \\ = \int log \frac{g_{δ^{'}}}{f_{δ}} (x) f_{δ} (x) d x + \int log \frac{f_{δ}}{f_{δ^{'}}} (x) f_{δ} (x) d x \\ = K (F_{δ}, G_{δ^{'}}) - K (F_{δ}, G_{δ^{'}}) . \end{matrix}

Note that, for all

δ

,

K (F_{δ}, G_{δ^{'}}) - K (F_{δ}, G_{δ^{'}}) = \int log \frac{g_{δ^{'}}}{f_{δ^{'}}} f_{δ}

is negative for

δ^{'}

close to

δ

, assuming that

δ^{'} \mapsto \int log \frac{g_{δ^{'}}}{f_{δ^{'}}} f_{δ}

is a continuous mapping. Assume, therefore, that (6) holds, which means that the classes of distributions

(G_{δ})

and

(F_{δ})

are somehow well separated. This implies that

E_{F_{δ}} (Z_{δ^{'}}) < 0

, for all

δ

and

δ^{'} .

In order to obtain an upper bound for

F_{δ} (T_{n, δ^{'}} (X_{n}) > A_{n})

, for all

δ, δ^{'}

in

Δ

, through the Chernoff Inequality, consider

φ_{δ, δ^{'}} (t) : = log E_{F_{δ}} (exp (t Z_{δ^{'}})) = log \int {(\frac{g_{δ^{'}} (x)}{f_{δ^{'}} (x)})}^{t} g_{δ} (x) d x .

Let

t^{+} (N_{δ, δ^{'}}) : = sup \{t \in N_{δ, δ^{'}} : φ_{δ, δ^{'}} (t) < \infty\} .

The function

(δ, δ^{'}, x) \mapsto J_{δ, δ^{'}} (x)

is continuous on its domain, and since

t \mapsto φ_{δ, δ^{'}} (t)

is a strictly convex function which tends to infinity as t tends to

t^{+} (N_{δ, δ^{'}})

, it holds that

lim_{x \to \infty} J_{δ, δ^{'}} (x) = + \infty

for all

δ, δ^{'}

in

Δ_{n} .

We now consider an upper bound for the risk of first kind on a logarithmic scale.

We consider

A_{n} > E_{F_{δ}} (Z_{δ^{'}}),

(A1)

for all

δ, δ^{'}

. Then, by the Chernoff inequality

\frac{1}{n} log F_{δ} (T_{n, δ^{'}} (X_{n}) > A_{n}) \leq - J_{δ, δ^{'}} (A_{n}) .

Since

A_{n}

should satisfy

exp (- n J_{δ, δ^{'}} (A_{n})) \leq α_{n},

with

α_{n}

bounded away from 1,

A_{n}

surely satisfies (A1) for large

n .

The mapping

m_{δ, δ^{'}} (t) : = (d / d t) φ_{δ, δ^{'}} (t)

is a homeomorphism from

N_{δ, δ^{'}}

onto the closure of the convex hull of the support of the distribution of

Z_{δ^{'}}

under

F_{δ}

(see, e.g., [14]). Denote

\underset{δ}{ess sup} Z_{δ^{'}} : = sup \{x : for all ϵ > 0, F_{δ} (Z_{δ^{'}} \in (x - ϵ, x) > 0)\} .

We assume that

\underset{δ}{ess sup} Z_{δ^{'}} = + \infty,

(A2)

which is convenient for our task, and quite common in practical industrial modelling. This assumption may be weakened, at notational cost mostly. It follows that

lim_{t \to t^{+} (N_{δ, δ^{'}})} m_{δ, δ^{'}} (t) = + \infty .

It holds that

J_{δ, δ^{'}} (E_{F_{δ}} (Z_{δ^{'}})) = 0,

and, as seen previously

lim_{x \to \infty} J_{δ, δ^{'}} (x) = + \infty .

On the other hand,

m_{δ, δ^{'}} (0) = E_{F_{δ}} (Z_{δ^{'}}) = K (F_{δ}, F_{δ^{'}}) - K (F_{δ}, G_{δ^{'}}) < 0 .

Let

\begin{matrix} I & : = (sup_{δ, δ^{'}} E_{F_{δ}} (Z_{δ^{'}}), \infty) \\ = (sup_{δ, δ^{'}} K (F_{δ}, F_{δ^{'}}) - K (F_{δ}, G_{δ^{'}}), \infty) . \end{matrix}

By (A2), the interval

I

is not void.

We now define

A_{n}

such that (4) holds, namely

P_{H 0} (H 1) \leq p_{n} max_{δ} max_{δ^{'}} F_{δ} (T_{n, δ^{'}} > A_{n}) \leq α_{n}

holds for any

α_{n}

in

(0, 1) .

Note that

A_{n} \geq max_{δ, δ^{'}} E_{F_{δ}} (Z_{δ^{'}}) = max_{(δ, δ^{'}) \in Δ \times Δ} K (F_{δ}, F_{δ^{'}}) - K (F_{δ}, G_{δ^{'}}),

(A3)

for all n large enough, since

α_{n}

is bounded away from

1 .

The function

J (x) : = min_{(δ, δ^{'}) \in Δ \times Δ} J_{δ, δ^{'}} (x)

is continuous and increasing, as it is the infimum of a finite collection of continuous increasing functions defined on

I

.

Since

P_{H 0} (H 1) \leq p_{n} exp (- n J (A_{n})),

given

α_{n}

, define

A_{n} : = J^{- 1} (- \frac{1}{n} log \frac{α_{n}}{p_{n}}) .

(A4)

This is well defined for

α_{n} \in (0, 1)

, as

{sup}_{(δ, δ^{'}) \in Δ \times Δ} E_{F_{δ}} (Z_{δ^{'}}) < 0

and

- (1 / n) log (α_{n} / p_{n}) > 0 .

Appendix A.2. The Power Function

We now evaluate a lower bound for the power of this test, making use of the Chernoff inequality to get an upper bound for the second risk.

Starting from (5),

P_{H_{1}} (H 0) \leq sup_{η \in Δ} G_{η} (T_{n, η} \leq A_{n}),

and define

W_{η} : = - log \frac{g_{η}}{f_{η}} (x) .

It holds that

E_{G_{η}} (W_{η}) = \int log \frac{f_{η} (x)}{g_{η} (x)} g_{η} (x) d x = - K (G_{η}, F_{η}),

and

m_{η} (t) : = (d / d t) log E_{G_{η}} (exp t W_{η}),

which is an increasing homeomorphism from

M_{η}

onto the closure of the convex hull of the support of

W_{η}

under

G_{η} .

For any

η

, the mapping

x \mapsto I_{η} (x)

is a strictly increasing function of

K_{η} : = (E_{G_{η}} (W_{η}), \infty)

onto

(0, + \infty)

, where the same notation as above holds for

{ess sup}_{η} W_{η}

(here under

G_{η}

), and where we assumed

\underset{η}{ess sup} W_{η} = \infty

(A5)

for all

η .

Assume that

A_{n}

satisfies

A_{n} \in K : = ⋂_{η \in Δ} K_{η}

(A6)

namely

A_{n} \geq sup_{η \in Δ} E_{G_{η}} (W_{η}) = - inf_{η \in Δ} K (G_{η}, F_{η}) .

(A7)

Making use of the Chernoff inequality, we get

P_{H_{1}} (H 0) \leq exp (- n inf_{η \in Δ} I_{η} (A_{n})) .

Each function

x \mapsto I_{η} (x)

is increasing on

(E_{G_{η}} (W_{η}), \infty) .

Therefore the function

x \mapsto I (x) : = inf_{η \in Δ} I_{η} (x)

is continuous and increasing, as it is the infimum of a finite number of continuous increasing functions on the same interval

K

, which is not void due to (A5).

We have proven that, whenever (A7) holds, a lower bound tor the test of

H 0

vs.

H 1

is given by

\begin{matrix} P_{H_{1}} (H 1) & \geq 1 - exp (- n I (A_{n})) \\ = 1 - exp (- n I (J^{- 1} (- \frac{1}{n} log \frac{α_{n}}{p_{n}}))) . \end{matrix}

(A8)

We now collect the above discussion, in order to complete the proof.

Appendix A.3. A Synthetic Result

The function J is one-to-one from I onto

K : = (J ({sup}_{(δ, δ^{'}) \in Δ \times Δ} E_{δ} (Z_{δ^{'}})), \infty) .

Since

F_{δ}, J_{δ, δ^{'}} (E_{δ} (Z_{δ^{'}}))

= 0, it follows that

J ({sup}_{(δ, δ^{'}) \in Δ \times Δ} E_{δ} (Z_{δ^{'}})) \geq 0 .

Since

E_{F_{δ}} (Z_{δ^{'}}) = K (F_{δ}, F_{δ^{'}}) - K (F_{δ}, G_{δ^{'}}) < 0

, whenever

α_{n}

in

(0, 1)

there exists a unique

A_{n} \in (- {inf}_{(δ, δ^{'}) \in Δ \times Δ} (K (F_{δ}, G_{δ^{'}}) - K (F_{δ}, F_{δ^{'}})), \infty)

which defines the critical region with level

α_{n} .

For the lower bound on the power of the test, we have assumed

A_{n} \in K = ({sup}_{η \in Δ} E_{η} (W_{η}), \infty) = (- {inf}_{η \in Δ} K (G_{η}, F_{η}), \infty)

.

In order to collect our results in a unified setting, it is useful to state some connection between

{inf}_{(δ, δ^{'}) \in Δ \times Δ} [K (F_{δ}, G_{δ^{'}}) - K (F_{δ}, F_{δ^{'}})]

and

{inf}_{η \in Δ} K (G_{η}, F_{η}) .

See (A3) and (A7).

Since

K (G_{δ}, F_{δ})

is positive, it follows from (6) that

sup_{(δ, δ^{'}) \in Δ \times Δ} \int log \frac{f_{δ^{'}}}{g_{δ^{'}}} f_{δ} < sup_{δ \in Δ} K (G_{δ}, F_{δ}),

(A9)

which implies the following fact:

Let

α_{n}

be bounded away from

1 .

Then (A3) is fulfilled for large n, and therefore there exists

A_{n}

such that

sup_{δ \in Δ} F_{δ} (T_{n} > A_{n}) \leq α_{n} .

Furthermore, by (A9), Condition (A7) holds, which yields the lower bound for the power of this test, as stated in (A8).

Appendix B. Proof of Theorem 3

We will repeatedly make use of the following result (Theorem 3 in [15]), which is an extension of the Chernoff-Stein Lemma (see [16]).

Theorem A1.

[Krafft and Plachky] Let

x_{n}

, such that

F_{δ} (T_{n, δ} > x_{n}) \leq α_{n}

with

l i m s u p_{n \to \infty} α_{n} < 1 .

Then

lim_{n \to \infty} \frac{1}{n} log G_{δ} (T_{n, δ} \leq x_{n}) = - K (F_{δ}, G_{δ}) .

Remark A2.

The above result indicates that the power of the Neyman-Pearson test only depends on its level on the second order on the logarithmic scale.

Define

A_{n, δ_{*}}

such that

F_{δ_{*}} (T_{n} \leq A_{n}) = F_{δ_{*}} (T_{n, δ_{*}} \leq A_{n, δ_{*}}) .

This exists and is uniquely defined, due to the regularity of the distribution of

T_{n, δ_{*}}

under

F_{δ_{*}} .

Since

1 [T_{n, δ_{*}} > A_{n}]

is the likelihood ratio test of

H_{0} (δ_{*})

against

H_{1} (δ_{*})

of the size

α_{n},

it follows, by unbiasedness of the LRT, that

F_{δ_{*}} (T_{n} \leq A_{n}) = F_{δ_{*}} (T_{n, δ_{*}} \leq A_{n, δ_{*}}) \geq G_{δ_{*}} (T_{n, δ_{*}} \leq A_{n, δ_{*}}) .

We shall later verify the validity of the conditions of Theorem A1; namely, that

lim sup_{n \to \infty} F_{δ_{*}} (T_{n, δ_{*}} \leq A_{n, δ_{*}}) < 1 .

(A10)

Assuming (A10) we get, by Theorem A1,

lim sup_{n \to \infty} \frac{1}{n} log F_{δ_{*}} (T_{n} \leq A_{n}) \geq lim_{n \to \infty} \frac{1}{n} log G_{δ_{*}} (T_{n, δ_{*}} \leq A_{n, δ_{*}}) = - K (F_{δ_{*}}, G_{δ_{*}}) .

We shall now prove that

lim_{n \to \infty} \frac{1}{n} log G_{δ_{*}} (T_{n, δ_{*}} \leq A_{n, δ_{*}}) = lim_{n \to \infty} \frac{1}{n} log G_{δ_{*}} (T_{n} \leq A_{n}) .

Let

B_{n, δ_{*}}

, such that

G_{δ_{*}} (T_{n, δ_{*}} \leq B_{n, δ_{*}}) = G_{δ_{*}} (T_{n} \leq A_{n}) .

By regularity of the distribution of

T_{n, δ_{*}}

under

G_{δ_{*}}

, such a

B_{n, δ_{*}}

is defined in a unique way. We will prove that the condition in Theorem A1 holds, namely

lim sup_{n \to \infty} F_{δ_{*}} (T_{n, δ_{*}} \leq B_{n, δ_{*}}) < 1 .

(A11)

lim_{n \to \infty} \frac{1}{n} log G_{δ_{*}} (T_{n, δ_{*}} \leq A_{n, δ_{*}}) = lim_{n \to \infty} \frac{1}{n} log G_{δ_{*}} (T_{n} \leq A_{n}) = - K (F_{δ_{*}}, G_{δ_{*}}) .

Incidentally, we have obtained that

{lim}_{n \to \infty} \frac{1}{n} log G_{δ_{*}} (T_{n} \leq A_{n})

exists. Therefore we have proven that

lim sup_{n \to \infty} \frac{1}{n} log F_{δ_{*}} (T_{n} \leq A_{n}) \geq lim_{n \to \infty} \frac{1}{n} log G_{δ_{*}} (T_{n} \leq A_{n}),

which is a form of unbiasedness. For

δ \neq δ_{*}

, let

B_{n, δ}

be defined by

G_{δ} (T_{n, δ} \leq B_{n, δ}) = G_{δ} (T_{n} \leq A_{n}) .

As above,

B_{n, δ}

is well-defined. Assuming

lim sup_{n \to \infty} F_{δ} (T_{n, δ} \leq B_{n, δ}) < 1,

(A12)

it follows, from Theorem A1, that

lim_{n \to \infty} \frac{1}{n} log G_{δ} (T_{n} \leq A_{n}) = lim_{n \to \infty} \frac{1}{n} log G_{δ} (T_{n, δ} \leq B_{n, δ}) = - K (F_{δ}, G_{δ}) .

Since

K (F_{δ_{*}}, G_{δ_{*}}) \leq K (F_{δ}, G_{δ}),

we have proven

lim sup_{n \to \infty} \frac{1}{n} log F_{δ_{*}} (T_{n} \leq A_{n}) \geq lim_{n \to \infty} \frac{1}{n} log G_{δ_{*}} (T_{n} \leq A_{n}) \geq lim_{n \to \infty} \frac{1}{n} log G_{δ} (T_{n} \leq A_{n}) .

It remains to verify the conditions (A10)–(A12). We will only verify (A12), as the two other conditions differ only by notation. We have

\begin{matrix} G_{δ} (T_{n, δ} > B_{n, δ}) & = G_{δ} (T_{n} > A_{n}) \leq F_{δ} (T_{n} > A_{n}) + d_{T V} (F_{δ}, G_{δ}) \\ \leq α_{n} + d_{T V} (F_{δ}, G_{δ}) < 1, \end{matrix}

by hypothesis (13). By the law of large numbers, under

G_{δ}

lim_{n \to \infty} T_{n, δ} = K (G_{δ}, F_{δ}) [G_{δ} - a.s.] .

Therefore, for large n,

lim inf_{n \to \infty} B_{n, δ} \geq K (G_{δ}, F_{δ}) [G_{δ} - a.s.] .

Since, under

F_{δ}

,

lim_{n \to \infty} T_{n, δ} = - K (F_{δ}, G_{δ}) [F_{δ} - a.s.],

this implies that

lim_{n \to \infty} F_{δ} (T_{n, δ} > B_{n, δ}) < 1 .

Appendix C. Proof of Proposition 4

We now prove the three lemmas that we used.

Lemma A3.

Let P, Q, and R denote three distributions with respective continuous and bounded densities p, q, and

r .

Then

K (P * R, Q * R) \leq K (P, Q) .

(A13)

Proof.

Let

P : = (A_{1}, \dots, A_{K})

be a partition of

R

and

p : = (p_{1}, \dots, p_{K})

denote the probabilities of

A_{1}, \dots, A_{K}

under P. Set the same definition for

q_{1}, \dots, q_{K}

and for

r_{1}, \dots, r_{K} .

Recall that the log-sum inequality writes

(\sum a_{i}) log \frac{\sum b_{i}}{\sum c_{i}} \leq \sum a_{i} log \frac{b_{i}}{c_{i}}

for positive vectors

{(a_{i})}_{i}

,

{(b_{i})}_{i}

and

{(c_{i})}_{i} .

By the above inequality, for any

i \in {1, \dots, K}

, denoting

(p * r)

to be the convolution of p and r,

{(p * r)}_{j} log \frac{{(p * r)}_{j}}{{(q * r)}_{j}} \leq \sum_{i = 1}^{K} p_{j} r_{i - j} log \frac{p_{j} r_{i - j}}{q_{j} r_{i - j}} .

Summing over

j \in {1, \dots, K}

yields

\sum_{j = 1}^{K} {(p * r)}_{j} log \frac{{(p * r)}_{j}}{{(q * r)}_{j}} \leq \sum_{j = 1}^{K} p_{j} log \frac{p_{j}}{q_{j}},

which is equivalent to

K_{P} (P * R, Q * R) \leq K_{P} (P, Q),

where

K_{P}

designates the Kullback-Leibler divergence defined on

P .

Refine the partition and go to the limit (Riemann Integrals), to obtain (A13). □

We now set a classical general result which states that, when

R_{δ}

denotes a family of distributions with some decomposability property, then the Kullback-Leibler divergence between

P * R_{δ}

and

Q * R_{δ}

is a decreasing function of

δ .

Lemma A4.

Let P and Q satisfy the hypotheses of Lemma A3 and let

{(R_{δ})}_{δ > 0}

denote a family of p.m.’s on

R

, and denote accordingly

V_{δ}

to be a r.v. with distribution

R_{δ} .

Assume that, for all δ and η, there exists a r.v.

W_{δ, η}

, independent upon

V_{δ}

, such that

V_{δ + η} =_{d} V_{δ} + W_{δ, η} .

Then the function

δ \mapsto K (P * R_{δ}, Q * R_{δ})

is non-increasing.

Proof.

Using Lemma A3, it holds that, for positive η,

\begin{matrix} K (P * R_{δ + η}, Q * R_{δ + η}) & = K ((P * R_{δ}) * W_{δ, η}, (Q * R_{δ}) * W_{δ, η}) \\ \leq K (P * R_{δ}, Q * R_{δ}), \end{matrix}

which proves the claim. □

Lemma A5.

Let P, Q, and R be three probability distributions with respective continuous and bounded densities

p, q,

and r.Assume that

K (P, Q) \leq K (Q, P),

where all involved quantities are assumed to be finite. Then

K (P * R, Q * R) \leq K (Q * R, P * R) .

Proof.

We proceed as in Lemma A3, using partitions and denoting by

p_{1}, \dots, p_{K}

the induced probability of P on

P

. Then,

\begin{matrix} K_{P} (P * R, Q * R) - K_{P} (Q * R, P * R) & = \sum_{i} \sum_{j} (p_{j} r_{i - j} + q_{j} r_{i - j}) log \frac{\sum_{j} p_{j} r_{i - j}}{\sum_{j} q_{j} r_{i - j}} \\ \leq \sum_{j} \sum_{i} (p_{j} r_{i - j} + q_{j} r_{i - j}) log \frac{p_{j}}{q_{j}} \\ = \sum_{j} (p_{j} + q_{j}) log \frac{p_{j}}{q_{j}} \\ = K_{P} (P, Q) - K_{P} (Q, P) \leq 0, \end{matrix}

where we used the log-sum inequality and the fact that

K (P, Q) \leq K (Q, P)

implies

K_{P} (P, Q) \leq K_{P} (Q, P)

, by the data-processing inequality. □

References

Broniatowski, M.; Jurečková, J.; Kalina, J. Likelihood ratio testing under measurement errors. Entropy 2018, 20, 966. [Google Scholar] [CrossRef]
Guo, D. Relative entropy and score function: New information-estimation relationships through arbitrary additive perturbation. In Proceedings of the IEEE International Symposium on Information Theory (ISIT 2009), Seoul, Korea, 28 June–3 July 2009; pp. 814–818. [Google Scholar]
Huber, P.; Strassen, V. Minimax tests and the Neyman-Pearson lemma for capacities. Ann. Stat. 1973, 2, 251–273. [Google Scholar] [CrossRef]
Eguchi, S.; Copas, J. Interpreting Kullback-Leibler divergence with the Neyman-Pearson lemma. J. Multivar. Anal. 2006, 97, 2034–2040. [Google Scholar] [CrossRef]
Narayanan, K.R.; Srinivasa, A.R. On the thermodynamic temperature of a general distribution. arXiv, 2007; arXiv:0711.1460. [Google Scholar]
Bahadur, R.R. Stochastic comparison of tests. Ann. Math. Stat. 1960, 31, 276–295. [Google Scholar] [CrossRef]
Bahadur, R.R. Some Limit Theorems in Statistics; Society for Industrial and Applied Mathematics: Philadelpha, PA, USA, 1971. [Google Scholar]
Birgé, L. Vitesses maximales de décroissance des erreurs et tests optimaux associés. Z. Wahrsch. Verw. Gebiete 1981, 55, 261–273. [Google Scholar] [CrossRef]
Tusnády, G. On asymptotically optimal tests. Ann. Stat. 1987, 5, 385–393. [Google Scholar] [CrossRef]
Liese, F.; Vajda, I. Convex Statistical Distances; Teubner: Leipzig, Germany, 1987. [Google Scholar]
Tsallis, C. Possible generalization of BG statistics. J. Stat. Phys. 1987, 52, 479–485. [Google Scholar] [CrossRef]
Goldie, C. A class of infinitely divisible random variables. Proc. Camb. Philos. Soc. 1967, 63, 1141–1143. [Google Scholar] [CrossRef]
Basu, A.; Shioya, H.; Park, C. Statistical Inference: The Minimum Distance Approach; CRC Press: Boca Raton, FL, USA, 2011. [Google Scholar]
Barndorff-Nielsen, O. Information and Exponential Families in Statistical Theory; John Wiley & Sons: New York, NY, USA, 1978. [Google Scholar]
Krafft, O.; Plachky, D. Bounds for the power of likelihood ratio tests and their asymptotic properties. Ann. Math. Stat. 1970, 41, 1646–1654. [Google Scholar] [CrossRef]
Chernoff, H. Large-sample theory: Parametric case. Ann. Math. Stat. 1956, 27, 1–22. [Google Scholar] [CrossRef]

Figure 1. Theoretical and numerical power bound of the test of case A under Gaussian noise (with respect to n), for the first kind risk

α = 0.05

.

Figure 1. Theoretical and numerical power bound of the test of case A under Gaussian noise (with respect to n), for the first kind risk

α = 0.05

.

Figure 2. Theoretical and numerical power bound of the test of case A under symmetrized Weibull noise (with respect to n), for the first kind risk

α = 0.05

.

Figure 2. Theoretical and numerical power bound of the test of case A under symmetrized Weibull noise (with respect to n), for the first kind risk

α = 0.05

.

Figure 3. Theoretical and numerical power bound of the test of case A under a symmetrized Laplacian noise (with respect to n), for the first kind risk

α = 0.05

.

Figure 3. Theoretical and numerical power bound of the test of case A under a symmetrized Laplacian noise (with respect to n), for the first kind risk

α = 0.05

.

Figure 4.

φ_{γ}

for

γ = 0.5, 1,

and 2.

Figure 4.

φ_{γ}

for

γ = 0.5, 1,

and 2.

Figure 5. Power of the test of case A under Gaussian noise (with respect to

δ_{max}

), for the first kind risk

α = 0.05

and sample size

n = 100

.

Figure 5. Power of the test of case A under Gaussian noise (with respect to

δ_{max}

), for the first kind risk

α = 0.05

and sample size

n = 100

.

Figure 6. Power of the test of case A under Laplacian noise (with respect to

δ_{max}

), for the first kind risk

α = 0.05

and sample size

n = 100

.

Figure 6. Power of the test of case A under Laplacian noise (with respect to

δ_{max}

), for the first kind risk

α = 0.05

and sample size

n = 100

.

Figure 7. Power of the test of case A under symmetrized Weibull noise (with respect to

δ_{max}

), for the first kind risk

α = 0.05

and sample size

n = 100

.

Figure 7. Power of the test of case A under symmetrized Weibull noise (with respect to

δ_{max}

), for the first kind risk

α = 0.05

and sample size

n = 100

.

Figure 8. Power of the test of case A under noise following a Cauchy distribution (with respect to

δ_{max}

), for the first kind risk

α = 0.05

and sample size

n = 100

.

Figure 8. Power of the test of case A under noise following a Cauchy distribution (with respect to

δ_{max}

), for the first kind risk

α = 0.05

and sample size

n = 100

.

Figure 9. Power of the test of case B under Gaussian noise (with respect to

δ_{max}

), for the first kind risk

α = 0.05

and sample size

n = 100

. The NP curve corresponds to the optimal Neyman Pearson test under

δ_{max}

. The KL, Hellinger, and

G = 2

curves stand respectively for

γ = 1, γ = 0.5

, and

γ = 2

cases.

Figure 9. Power of the test of case B under Gaussian noise (with respect to

δ_{max}

), for the first kind risk

α = 0.05

and sample size

n = 100

. The NP curve corresponds to the optimal Neyman Pearson test under

δ_{max}

. The KL, Hellinger, and

G = 2

curves stand respectively for

γ = 1, γ = 0.5

, and

γ = 2

cases.

Figure 10. Power of the test of case B under Laplacian noise (with respect to

δ_{max}

), for the first kind risk

α = 0.05

and sample size

n = 100

.

Figure 10. Power of the test of case B under Laplacian noise (with respect to

δ_{max}

), for the first kind risk

α = 0.05

and sample size

n = 100

.

Figure 11. Power of the test of case B under symmetrized Weibull noise (with respect to

δ_{max}

), for the first kind risk

α = 0.05

and sample size

n = 100

.

Figure 11. Power of the test of case B under symmetrized Weibull noise (with respect to

δ_{max}

), for the first kind risk

α = 0.05

and sample size

n = 100

.

Figure 12. Power of the test of case B under a noise following a Cauchy distribution (with respect to

δ_{max}

), for the first kind risk

α = 0.05

and sample size

n = 100

.

Figure 12. Power of the test of case B under a noise following a Cauchy distribution (with respect to

δ_{max}

), for the first kind risk

α = 0.05

and sample size

n = 100

.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Broniatowski, M.; Jurečková, J.; Moses, A.K.; Miranda, E. Composite Tests under Corrupted Data. Entropy 2019, 21, 63. https://doi.org/10.3390/e21010063

AMA Style

Broniatowski M, Jurečková J, Moses AK, Miranda E. Composite Tests under Corrupted Data. Entropy. 2019; 21(1):63. https://doi.org/10.3390/e21010063

Chicago/Turabian Style

Broniatowski, Michel, Jana Jurečková, Ashok Kumar Moses, and Emilie Miranda. 2019. "Composite Tests under Corrupted Data" Entropy 21, no. 1: 63. https://doi.org/10.3390/e21010063

APA Style

Broniatowski, M., Jurečková, J., Moses, A. K., & Miranda, E. (2019). Composite Tests under Corrupted Data. Entropy, 21(1), 63. https://doi.org/10.3390/e21010063

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Composite Tests under Corrupted Data

Abstract

1. Introduction

Statement of the Test Problem

2. An Extension of the Likelihood Ratio Test

3. Minimax Tests under Noisy Data, Least-Favorable Hypotheses

3.1. An Asymptotic Definition for the Least-Favorable Hypotheses

3.2. Identifying the Least-Favorable Hypotheses

3.3. Numerical Performances of the Minimax Test

3.3.1. Case A: The Shift Problem

Theoretical Power Bound

Numerical Power Bound

Comparison of the Two Power Curves

3.3.2. Case B: The Tail Thickness Problem

4. Some Alternative Statistics for Testing

4.1. A Family of Composite Tests Based on Divergence Distances

4.2. A Practical Choice for Composite Tests Based on Simulation

4.2.1. Case A: The Shift Problem

4.2.2. Case B: The Tail Thickness Problem

5. Conclusions

Author Contributions

Funding

Acknowledgments

Conflicts of Interest

Appendix A. Proof of Proposition 2

Appendix A.1. The Critical Region of the Test

Appendix A.2. The Power Function

Appendix A.3. A Synthetic Result

Appendix B. Proof of Theorem 3

Appendix C. Proof of Proposition 4

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI