Rare Event Analysis for Minimum Hellinger Distance Estimators via Large Deviation Theory

Vidyashankar, Anand N.; Collamore, Jeffrey F.

doi:10.3390/e23040386

Open AccessArticle

Rare Event Analysis for Minimum Hellinger Distance Estimators via Large Deviation Theory

by

Anand N. Vidyashankar

^1,*

and

Jeffrey F. Collamore

²

¹

Department of Statistics, George Mason University, Fairfax, VA 22030, USA

²

Department of Mathematical Sciences, University of Copenhagen, Universitetsparken 5, DK-2100 Copenhagen Ø, Denmark

^*

Author to whom correspondence should be addressed.

Entropy 2021, 23(4), 386; https://doi.org/10.3390/e23040386

Submission received: 22 February 2021 / Revised: 14 March 2021 / Accepted: 15 March 2021 / Published: 24 March 2021

(This article belongs to the Special Issue Robust Procedures for Estimating and Testing in the Framework of Divergence Measures)

Download Versions Notes

Abstract

:

Hellinger distance has been widely used to derive objective functions that are alternatives to maximum likelihood methods. While the asymptotic distributions of these estimators have been well investigated, the probabilities of rare events induced by them are largely unknown. In this article, we analyze these rare event probabilities using large deviation theory under a potential model misspecification, in both one and higher dimensions. We show that these probabilities decay exponentially, characterizing their decay via a “rate function” which is expressed as a convex conjugate of a limiting cumulant generating function. In the analysis of the lower bound, in particular, certain geometric considerations arise that facilitate an explicit representation, also in the case when the limiting generating function is nondifferentiable. Our analysis involves the modulus of continuity properties of the affinity, which may be of independent interest.

Keywords:

Hellinger distance; large deviations; divergence measures; rare event probabilities

1. Introduction

In a variety of applications, the use of divergence-based inferential methods is gaining momentum, as these methods provide robust alternatives to traditional maximum likelihood-based procedures. Since the work of [1,2], divergence-based methods have been developed for various classes of statistical models. A comprehensive treatment of these ideas is available, for instance, in [3,4]. The objective of this paper is to study the large deviation tail behavior of the minimum divergence estimators and, more specifically, the minimum Hellinger distance estimators (MHDE).

To describe the general problem, suppose

Θ \subset R^{d}

, and let

F = {f_{θ} (\cdot) : θ \in Θ}

denote a family of densities indexed by

θ

. Let

{X_{n} : n \geq 1}

denote a class of i.i.d. random variables, postulated to have a continuous density with respect to Lebesgue measure and belonging to the family

F

, and let X be a generic element of this class. We denote by

g (\cdot)

the true density of X.

Before providing an informal description of our results, we begin by recalling that the square of the Hellinger distance (SHD) between two densities

h_{1} (\cdot)

and

h_{2} (\cdot)

on

R

is given by

\begin{matrix} {HD}^{2} (h_{1}, h_{2}) = {∥h_{1}^{\frac{1}{2}} - h_{2}^{\frac{1}{2}}∥}_{2}^{2} = 2 - 2 \int_{R} {(h_{1} (x) h_{2} (x))}^{\frac{1}{2}} d x . \end{matrix}

The quantity

\int_{R} {(h_{1} (x) h_{2} (x))}^{\frac{1}{2}} d x

is referred to as the affinity between

h_{1} (\cdot)

and

h_{2} (\cdot)

and denoted by

A (h_{1}, h_{2})

. Hence, the SHD between the postulated density and the true density is given by

SHD (θ) = {HD}^{2} (f_{θ}, g) .

When

Θ

is compact, it is known that there exists a unique

θ_{g} \in Θ

minimizing the

SHD (θ)

. Furthermore, when

g (\cdot) = f_{θ_{0}} (\cdot)

and

F

satisfies an identifiability condition, it is well known that

θ_{g}

coincides with

θ_{0}

; cf. [1]. Turning to the sample version, we replace

g (\cdot)

by

g_{n} (\cdot)

in the definition of SHD, obtaining the objective function

{SHD}_{n} (θ) = {HD}^{2} (f_{θ}, g_{n})

and

\begin{matrix} g_{n} (x) = \frac{1}{n b_{n}} \sum_{i = 1}^{n} K (\frac{x - X_{i}}{b_{n}}), \end{matrix}

(1)

where the kernel

K (\cdot)

is a probability density function and

b_{n} ↘ 0

and

n b_{n} ↗ \infty

as

n \to \infty

.

It is known that when the parameter space

Θ

is compact, there exists a unique

{\hat{θ}}_{n} \in Θ

minimizing

{SHD}_{n} (θ)

, and that

{\hat{θ}}_{n}

converges almost surely to

θ_{g}

as

n \to \infty

; cf. [1]. Furthermore, under some natural assumptions,

\begin{matrix} n^{\frac{1}{2}} ({\hat{θ}}_{n} - θ_{g}) \overset{d}{\to} G, \end{matrix}

(2)

where, under the probability measure associated with

g (\cdot)

, G is a Gaussian random vector with mean vector

0

and covariance matrix

Σ_{g}

. If

g (\cdot) = f_{θ_{0}} (\cdot)

, then the variance of G coincides with the inverse of the Fisher information matrix

I (θ_{0})

, yielding statistical efficiency. When the true distribution

g (\cdot)

does not belong to

F

, we will call this the “model misspecifed case,” while when

g \in F

, we will say that the “postulated model” holds.

In this paper, we focus on the large deviation behavior of

{{\hat{θ}}_{n} : n \geq 1}

; namely, the asymptotic probability that the estimate

{\hat{θ}}_{n}

will achieve values within a set away from the central tendency described in (2). We establish results of the form

\begin{matrix} log P_{g} ({\hat{θ}}_{n} \in B) \approx - n inf_{θ \in B} I (θ), \end{matrix}

(3)

for some “rate function” I and given Borel subset

B \subset Θ

. Similar large deviation estimates for maximum likelihood estimators (MLE) have been investigated in [5,6,7], and for general M-estimators in [8,9]. These results allow for a precise description of the probabilities of Type I and Type II error in both the Neymann–Pearson and likelihood ratio test frameworks. Furthermore, large deviation bounds allow one to identify the best exponential rate of decrease of Type II error amongst all tests that satisfy a bound on the Type I error, as in Stein’s lemma (cf. [10]). Additional evidence of the importance of large deviation results for statistical inference has been described in [11] and in the book [12].

One of our initial goals was to derive sharp probability bounds for Type I and Type II error in the context of robust hypothesis testing using Hellinger deviance tests. This article is a first step towards this endeavor. A key issue that distinguishes our work from earlier works is that, in our case, the objective function is a nonlinear function of the smoothed empirical measure, and the analysis of this case requires more involved methods compared with those currently existing in the statistical literature on large deviations. Consistent with large deviation analysis more generally, we identify the rate function I as the convex conjugate of a certain limiting cumulant generating function, although in our problem, we uncover a subtle asymmetry between the upper and lower bounds when our limiting generating function is nondifferentiable. In the classical large deviation literature, similar asymmetries have been studied in other one-dimensional contexts (e.g. [13]), although the statistical problem is still quite different, as the dependence on the parameter

θ

arises explicitly—inhibiting the use of convexity methods typically exploited in the large deviation literature—and hence requiring novel techniques.

1.1. Large Deviations

In this subsection we provide relevant definitions and properties from large deviation theory required in the sequel. In the following,

R_{+}

will denote the set of non-negative real numbers.

Definition 1.

A collection of probability distributions

{P_{n} : n \geq 1}

on a topological space

(X, B)

is said to satisfy the weak large deviation principle if

\begin{matrix} \underset{n \to \infty}{lim sup} \frac{1}{n} log P_{n} (F) \leq - inf_{x \in F} I (x), for all closed F \in B, \end{matrix}

and

\begin{matrix} \underset{n \to \infty}{lim inf} \frac{1}{n} log P_{n} (G) \geq - inf_{x \in G} I (x) for all open sets F \in B \end{matrix}

for some lower semicontinuous function

I : X \to [0, \infty]

. The function I is called the rate function. If the level sets of I are compact, we call I a good rate function and we say that

{P_{n}}

satisfies the large deviation principle (LDP).

We begin with a brief review of large deviation results for i.i.d. random variables and empirical measures. Let

{X_{n}} \subset R

be an i.i.d. sequence of real-valued random variables, and let

P_{n}

denote the distribution of the sample mean

{\bar{X}}_{n}

. If the moment generating function of

X_{1}

is finite in a neighborhood of the origin, then Cramér’s theorem states that

{P_{n}}

satisfies the LDP with good rate function

Λ^{*}

, where

Λ^{*}

is the convex conjugate (or Legendre–Fenchel transform) of

Λ

, and where

Λ (α) = log E [e^{α X_{1}}]

is the cumulant generating function of

X_{1}

(cf. [10], Section 2.2).

Next, consider the empirical measures

{μ_{n}}

, defined by

\begin{matrix} μ_{n} (B) = \frac{1}{n} \sum_{i = 1}^{n} I_{{X_{i} \in B}}, B \in B, \end{matrix}

(4)

where

B

denotes the collection of Borel subsets of

R

. It is well known (cf. [14]) that

{μ_{n}}

converges weakly to P, namely to the distribution of

X_{1}

. Then Sanov’s theorem asserts that

{μ_{n}}

satisfies a large deviation principle with rate function

I_{P}

given by

\begin{matrix} I_{P} (ν) = \{\begin{matrix} KL (ν, P) & if ν ≪ P, \\ \infty & otherwise, \end{matrix} \end{matrix}

(5)

where

KL (ν, P)

is the Kullback–Leibler information between the probability measures

ν

and P. When

ν

and P each possesses a density with respect to Lebesgue measure (say p and g, respectively), the above expression becomes

\begin{matrix} KL (p, g) : = \{\begin{matrix} \int_{S} p (x) log (\frac{p (x)}{g (x)}) d μ (x) & if p ≪ g, \\ \infty & otherwise . \end{matrix} \end{matrix}

(6)

In Sanov’s theorem, the rate function

I_{P}

is defined on the space of probability measures, which is a metric space with the open sets induced by weak convergence. Extensions of Sanov’s theorem to strong topologies have been investigated in the literature; cf., e.g., [15].

We now turn to a general result, which will play a central role in this paper, namely Varadhan’s integral lemma (cf. [10], Theorem 4.3.1). This result will allow us to infer the scaled limit of a sequence of generating functions from the existence of the large deviation principle.

Lemma 1

(Varadhan). Let

{Y_{n}}

be a sequence of random variables taking values in a regular topological space

(X, B)

, and assume that the probability law of

{Y_{n}}

satisfies the LDP with good rate function I. Then for any bounded continuous function

F : X \to R

,

\begin{matrix} lim_{n \to \infty} \frac{1}{n} log E [exp (n F (Y_{n}))] = sup_{x \in X} \{F (x) - I (x)\} . \end{matrix}

(7)

1.2. Minimum Hellinger Distance Estimator and Large Deviations

We first observe that the MHDE is obtained by maximizing

\begin{matrix} A_{n} (θ) \equiv A_{n} (θ, g_{n}) : = \int_{R} f_{θ}^{\frac{1}{2}} (x) g_{n}^{\frac{1}{2}} (x) d μ (x), \end{matrix}

(8)

which involves solving the equation

\nabla A_{n} (θ) = 0 .

The idea behind the large deviation analysis is to observe that the large deviation behavior of the maximizer can be extracted from that of the objective function

\nabla A_{n} (θ)

near

0

. By the Gärtner–Ellis theorem (cf. [10], Section 2.3), this amounts to investigating the asymptotic behavior as

n \to \infty

of

\begin{matrix} \frac{1}{a_{n}} log E_{g} [exp \{a_{n} 〈 α, {\nabla A}_{n} (θ) 〉\}], α \in R^{d}, \end{matrix}

(9)

where

a_{n} ↗ \infty

as

n \to \infty

. In the case of maximum likelihood estimation (MLE) or minimum contrast estimation (MCE), the objective function can be expressed as

\begin{matrix} \sum_{i = 1}^{n} h_{θ} (X_{i}) = n \int_{R} h_{θ} (x) d μ_{n} (x), \end{matrix}

(10)

where

{μ_{n} : n \geq 1}

is the empirical measure associated with

{X_{k} : 1 \leq k \leq n}

. Thus, while the objective functions associated with the MLE and MCE are linear functions of the empirical measure, the affinity is a nonlinear function of the empirical measure. This creates certain complications in identifying the rate function

I (\cdot)

alluded to in (3). Of course, in the case of likelihood and minimum contrast estimator analysis, an explicit formula for

I (\cdot)

ensues as the Legendre–Fenchel transform of the cumulant generating function of

h_{θ} (X_{1})

, viz.

log E_{θ_{0}} [exp (α h_{θ} (X_{1}))]

. One approach to evaluating the limiting generating function is to apply Varadhan’s lemma as given above in (7). In the context of our problem, that requires an investigation into the large deviation principle for the density estimators

g_{n} (\cdot)

viewed as elements of

L_{1} (S)

, viz. the space of integrable functions on S. Equivalently, we require a version of Sanov’s theorem in

L_{1}

-space, which leads to certain topological considerations. The main issue here is that, when

L_{1}

is equipped with a norm topology, the sequence of kernel density estimates

{g_{n} (\cdot)}

possesses large deviation bounds, but the associated rate function may not have compact level sets, as is required for a typical application of Varadhan’s lemma. Nonetheless, one obtains a full LDP when

L_{1} (S)

is equipped with the weak topology.

The asymptotic properties of MHDE, such as consistency and asymptotic normality, are established using the norm convergence of

g_{n} (\cdot)

to

g (\cdot)

. For this reason, we focus on a subclass of densities

G

(see Proposition 1 below) possessing certain equicontinuity properties where norm convergence prevails. These issues are handled in Section 2, where the precise statements of our main results can also be found. Section 3 is devoted to the proofs of the main results. Section 4 contains some concluding remarks.

2. Notation, Assumptions, and Main Results

Let

f_{θ} (\cdot)

denote the postulated density of

{X_{n}}

, defined on a measure space

(Ω, F)

. Let

S \subset R

denote the support of X and

s_{θ} (\cdot) = f_{θ}^{\frac{1}{2}} (\cdot)

. Let the true density of

{X_{n}}

be given by

g (\cdot)

. Throughout the paper, we assume that the following regularity conditions hold.

Hypothesis 1.

Θ

is a compact and convex subset of

R^{d}

.

Hypothesis 2.

The family

F

is identifiable; namely, if

θ_{1} \neq θ_{2}

,

f_{θ_{1}} (\cdot) \neq f_{θ_{2}} (\cdot)

on a set of positive Lebesgue measure.

Hypothesis 3.

For every

θ \in Θ

,

s_{θ}

is three times continuously differentiable with respect to all components of

θ

. Denote by

\nabla s_{θ}

the gradient of

s_{θ}

and its components by

{\dot{s}}_{θ}^{i} (\cdot)

. Let

H_{θ}

denote the matrix of second partial derivatives of

s_{θ} (\cdot)

with respect to

θ

and

{\ddot{s}}_{θ}^{i j}

the

{(i, j)}^{t h}

element of

H_{θ}

.

Hypothesis 4.

Let the matrix of second partial derivatives of

A_{n} (θ)

and

A (θ)

be denoted by

H_{A_{n}} (θ)

and

H_{A} (θ)

, respectively. Assume that

H_{A_{n}} (θ)

and

H_{A} (θ)

are continuous in

θ

and that

H_{A} (θ)

is positive definite for every

θ \in Θ

. For

p \in G

and

θ \in Θ

, let

λ_{θ} (p)

denote the smallest eigenvalue of the matrix

\int_{S} H_{θ} (x) p^{\frac{1}{2}} (x) d x .

Assume that

inf {λ_{θ} (p) : p \in G} \geq c > 0

, where c is independent of

θ

.

These hypotheses on the family

F

are generally standard and are used to establish the asymptotic properties of the MHDE. Sufficient conditions on

F

for the validity of these hypotheses are described in [3,16], and [17]. A remark on Hypothesis 4 is warranted here. When

p = g

, this assumption is related to the positive definiteness of the Fisher information matrix. If one assumes

G = F

, then this hypothesis reduces to the condition that

inf {λ_{θ} : θ \in Θ} \geq c > 0

, which is standard. Finally, we remark that we have not attempted to provide the weakest regularity conditions, and we do believe some of these conditions can possibly be relaxed.

Recall that the MHDE of

θ

can be obtained by solving the equation

\begin{matrix} {\nabla A}_{n} (θ) : = \nabla_{θ} A (f_{θ}, g_{n}) = \frac{1}{2} \int_{R} u_{θ} (x) s_{θ} (x) g_{n}^{\frac{1}{2}} (x) d x = 0, \end{matrix}

(11)

where

u_{θ} (x) = \nabla_{θ} f_{θ} (x) {(f_{θ} (x))}^{- 1}

is the score function, which is obtained using

\nabla_{θ} s (x; θ) = \frac{1}{2} u (x; θ) s (x; θ)

.

We begin by providing some heuristics for the case

d = 1

. Let

{\dot{A}}_{n} (θ)

denote the derivative of

A_{n} (θ)

when

d = 1

. Let

{\hat{θ}}_{n}

denote the argzero of the function

{\dot{A}}_{n} (θ)

obtained from (11) above. Let

{\hat{θ}}_{n, l} = inf {θ \in Θ : {\dot{A}}_{n} (θ) \leq 0}

and

{\hat{θ}}_{n, u} = sup {θ \in Θ : {\dot{A}}_{n} (θ) \geq 0}

. Since

{\hat{θ}}_{n, l} \leq {\hat{θ}}_{n} \leq {\hat{θ}}_{n, u}

, we obtain using Markov’s inequality that for any

ϵ > 0

,

\begin{matrix} P_{g} ({\hat{θ}}_{n, l} \geq θ_{g} + ϵ) \leq P_{g} ({\dot{A}}_{n} (θ_{g} + ϵ) \geq 0) \leq E_{g} [exp (n α {\dot{A}}_{n} (θ_{g} + ϵ)], \end{matrix}

(12)

where

α > 0

. Similarly, for

α < 0

, it can be seen that

\begin{matrix} P_{g} ({\hat{θ}}_{n, u} \leq θ_{g} - ϵ) \leq P_{g} ({\dot{A}}_{n} (θ_{g} - ϵ) \leq 0) \leq E_{g} [exp (n α {\dot{A}}_{n} (θ_{g} - ϵ)] . \end{matrix}

(13)

Thus, an evaluation of (9) will allow us to obtain the logarithmic upper bound for

{\hat{θ}}_{n, l}

and

{\hat{θ}}_{n, u}

. Next, using the inequalities

\begin{matrix} P_{g} ({\hat{θ}}_{n, l} \geq θ_{g} + ϵ) & \leq & P_{g} ({\dot{A}}_{n} (θ_{g} + ϵ) \geq 0) \leq P_{g} ({\hat{θ}}_{n, u} \geq θ_{g} + ϵ), \end{matrix}

(14)

\begin{matrix} P_{g} ({\hat{θ}}_{n, u} \leq θ_{g} - ϵ) & \leq & P_{g} ({\dot{A}}_{n} (θ_{g} - ϵ) \leq 0) \leq P_{g} ({\hat{θ}}_{n, l} \leq θ_{g} - ϵ), \end{matrix}

(15)

under additional hypotheses, one can derive large deviation lower bounds for

{\hat{θ}}_{n}

. Deriving these bounds for MLE and MCE is rather standard, since the objective functions and their derivatives are linear functionals of the empirical distribution, as stated in (10), but this is not the case for the Hellinger distance.

Observe that the probabilities in (12) and (13) represent rare-event probabilities since, under the hypotheses described previously,

{\hat{θ}}_{n}

converges to

θ_{g}

almost surely as

n \to \infty

. The distributional results concerning

{\hat{θ}}_{n}

rely on the continuity and differentiability properties of

\nabla A_{n} (θ)

, which depend nonlinearly on

g_{n}

, and the norm convergence of

g_{n}

to g.

Let

G

denote the collection of all probability densities with support S. By Scheffe’s theorem, the pointwise convergence of

g_{n}

to g implies

g_{n} \overset{L_{1}}{\to} g

as

n \to \infty

. Additionally, when

g_{n} (\cdot)

is the kernel density estimator, then Glick’s Theorem guarantees that

g_{n} \overset{L_{1}}{\to} g

almost surely as

n \to \infty

when

b_{n} ↘ 0

and

n ↗ \infty

; cf. [18]. Since the MHDE are functionals of density estimators, it is natural to expect that the large deviations of density estimators will play a significant role in our analysis. For this reason, one is forced to consider the topological issues that arise in the large deviation analysis of density estimators. Interestingly, it turns out that the weak topology on

L_{1} (S)

plays a prominent role. This, in turn, leads to the question of whether certain continuity properties, which were part of the traditional theory of MHD analysis, continue to hold if

G

were viewed as a subset of

L_{1} (S)

equipped with weak topology. Expectedly, while the answer in general is no (cf. [19]), Proposition 1 provides sufficient conditions on the family

G

under which one additionally obtains norm convergence.

Before proceeding, we now introduce some further regularity conditions, as follows.

Hypothesis 5.

u_{θ} s_{θ} \in L_{2} (S)

and is an

L_{2} (S)

-continuous function of

θ

.

Hypothesis 6.

The family

F

consists of bounded equicontinuous densities.

Hypothesis 7.

The family

G

consists of bounded and equicontinuous densities.

Hypothesis 8.

u_{θ} g \in L_{2} (S)

and is an

L_{2} (S)

-continuous function of

θ

.

Here, we note that Hypotheses 6 and 7 are related. Furthermore, if one is willing to assume that

G = F

, then one does not need Hypothesis 7. On the other hand, if one believes that parametric distributions are approximations to

G

, then one needs to work with Hypothesis 7. For this reason, we have maintained both of these hypotheses in our main results. Hypotheses 5 and 8 are related to finiteness of the Fisher information and are standard in the statistical literature.

Before we state the first proposition, we recall the definition of weak topology on

L_{1}

(cf. [19]). A sequence

{h_{n} : n \geq 1}

is said to converge weakly in

L_{1}

if

\int_{S} h_{n} (x) b (x) d x \to \int_{S} h (x) b (x) d x

as

n \to \infty

for every

b \in L_{\infty} (S)

, where

L_{\infty} (S)

is a class of essentially bounded functions. We assume throughout the paper that the topology on

Θ

is the standard topology generated by the Euclidean metric.

Proposition 1.

Let

G

denote the class of densities, equipped with the weak topology. Further assume that Hypotheses 1–7 hold. Let

Θ \otimes G

be equipped with the product topology. Then the mapping

\nabla A : Θ \otimes G \to R^{d}

defined by

\begin{matrix} \nabla A (θ, g) \int_{R} u_{θ} (x) s_{θ} (x) g^{\frac{1}{2}} (x) d x \end{matrix}

(16)

is jointly continuous in

(θ, g)

. Furthermore, if

g_{n} \overset{w}{\to} g

, then

\begin{matrix} lim_{n \to \infty} sup_{θ \in Θ} | | \nabla A (θ, g_{n}) - \nabla A (θ, g) | | = 0 . \end{matrix}

(17)

Finally, under Hypothesis 7, the family

G

is a weakly sequentially closed subset of

L_{1} (S)

.

Our next result is concerned with the limit behavior of the generating function of

{\nabla A}_{n} (θ)

. In the following we use the notation

p ≪ g

to mean the probability measures associated with

p (\cdot)

and

g (\cdot)

are absolutely continuous.

Theorem 1.

Assume that Hypotheses 1–7 hold, and set

\begin{matrix} Λ_{n, θ} (α) : = \frac{1}{n} log E_{g} [exp (n 〈 α, {\nabla A}_{n} (θ) 〉], α \in R^{d} . \end{matrix}

(18)

Then

Λ_{θ} (α) : = {lim}_{n \to \infty} Λ_{n, θ} (α)

exists and is a convex function given by

\begin{matrix} Λ_{θ} (α) = sup_{p \in G} \{\int_{S} 〈 α, u_{θ} (x) 〉 s_{θ} (x) p^{\frac{1}{2}} (x) d x - KL (p, g)\}, \end{matrix}

(19)

where

\begin{matrix} KL (p, g) = \{\begin{matrix} \int_{S} p (x) log (\frac{p (x)}{g (x)}) d x & if p ≪ g, \\ \infty & otherwise . \end{matrix} \end{matrix}

(20)

Remark 1.

Since

Λ_{θ}

is defined via a limiting operation, it is hard to extract its qualitative properties. However, we can obtain a simple lower bound by observing that

KL (p, g) = 0

if and only if

p = g

, and an upper bound using that the Kullback–Leibler information is nonnegative. This results in the following bounds:

\begin{matrix} \int_{S} 〈 α, u_{θ} (x) 〉 s_{θ} (x) g^{\frac{1}{2}} (x) d x \leq Λ_{θ} (α) \leq sup_{p \in G} [\int_{S} 〈 α, u_{θ} (x) 〉 s_{θ} (x) p^{\frac{1}{2}} (x) d x] . \end{matrix}

(21)

Furthermore, if all densities in

G

are bounded by one, then

p^{\frac{1}{2}} (\cdot) \geq p (\cdot)

implies

\begin{matrix} Λ_{θ} (α) & \geq & sup_{p \in G} \{\int_{S} 〈 α, u_{θ} (x) 〉 s_{θ} (x) p (x) d x - KL (p, g)\} . \end{matrix}

(22)

Using a variational argument, it can be shown that the supremum on the right-hand side is attained at

p^{*}

given by

\begin{matrix} p^{*} (x) : = \frac{exp (〈 α, u_{θ} (x) 〉) s_{θ} (x)}{\int_{S} 〈 α, u_{θ} (x) 〉 s_{θ} (x) g (x) d x}; \end{matrix}

(23)

cf. [20]. Furthermore, the maximum that results from this choice of

p^{*} (\cdot)

is

log \int_{S} exp (〈 α, u_{θ} (x) 〉) s_{θ} (x) g (x) d x,

yielding yet another lower bound for

Λ_{θ} (α)

, although the comparison of these two lower bounds is not immediate.

Returning to our main discussion, recall from [21] that the convex conjugate of the function

Λ_{θ}

is defined by

\begin{matrix} Λ_{θ}^{*} (x) = sup_{α \in R^{d}} \{〈 α, x 〉 - Λ (α)\}, x \in R^{d} . \end{matrix}

(24)

Let

D_{θ}

denote the domain of

Λ_{θ}

; namely,

\begin{matrix} D_{θ} = {α \in R^{d} : Λ_{θ} (α) < \infty}; \end{matrix}

(25)

and let

R_{θ}

denote the range of the gradient map

\nabla Λ_{θ}

; that is,

R_{θ} = \{x \in R^{d} : \nabla Λ_{θ} (α) = x, some α \in R^{d}\} .

We begin with the discussion of the case

d = 1

. In this case, the generating function

Λ_{θ}

reduces to

\begin{matrix} Λ_{θ} (α) = sup_{p \in G} \{α \int_{S} exp (n α {\dot{A}}_{n} (θ) s (x; θ) p^{\frac{1}{2}} (x) d x - KL (p, g)\} . \end{matrix}

(26)

By the convexity of

Λ_{θ} (\cdot)

, this function is differentiable almost everywhere (cf. [21]), and in the proof, we would like to exploit the differentiability of this function at the point

α_{θ}^{*}

where it attains its minimum value. If

Λ_{θ}

is not differentiable at this point, it is helpful to consider the directional derivatives of

Λ_{θ}

. Specifically, let

Λ_{θ, +}^{'} (\cdot)

and

Λ_{θ, -}^{'} (\cdot)

denote the right and left derivatives of

Λ_{θ} (\cdot)

, respectively. When

x \in (Λ_{θ, -}^{'} (α), Λ_{θ, +}^{'} (α))

, then it is well known that

Λ_{θ}^{*} (x) = α x - Λ_{θ} (α)

, but this observation will not be sufficient to obtain a proper lower bound. For that to hold, we need a stronger condition, namely that

0 \in R_{θ}

, which will only be true if

Λ_{θ}

is differentiable at its point of minimum,

α_{θ}^{*}

. Otherwise, the expected lower bound turns out to be

Λ_{θ}^{*} (x)

, where

x = Λ_{θ, +}^{'} (α_{θ}^{*})

; cf. [13].

We now turn to our large deviation theorem in

R^{1}

, where we study the rare-event probabilities

P_{g} ({\hat{θ}}_{n} \in C)

for sets C that are away from the true value

θ_{g}

. Specifically, we establish an analogue of the LDP, but where a subtle difference arises in the lower bound in the absence of differentiability of

Λ_{θ}

.

We recall that

{\hat{θ}}_{n}

is defined using the kernel density estimator

g_{n} (\cdot)

defined in (1), whose behavior is dictated by the bandwidth sequence

{b_{n}}

.

Theorem 2.

Assume

d = 1

, Hypotheses 1–8 are satisfied, and

{\hat{θ}}_{n}

is the unique zero of

{\dot{A}}_{n} (θ) = 0

. Further assume that

b_{n} ↘ 0

and

n b_{n} ↗ \infty

as

n \to \infty

. Then for any closed set F not containing

θ_{g}

,

\begin{matrix} \underset{n \to \infty}{lim sup} \frac{1}{n} log P_{g} ({\hat{θ}}_{n} \in F) \leq - inf_{θ \in F} Λ_{θ}^{*} (0) . \end{matrix}

(27)

Moreover, for any open set G not including

θ_{g}

,

\begin{matrix} \underset{n \to \infty}{lim inf} \frac{1}{n} log P_{g} ({\hat{θ}}_{n} \in G) \geq - inf_{θ \in G} I (θ), \end{matrix}

(28)

where

\begin{matrix} I (θ) = inf \{Λ_{θ}^{*} (x) : x \in R_{θ} \cap [0, \infty)\}, \end{matrix}

(29)

and the infimum is taken to be infinity if the set

R_{θ} \cap [0, \infty)

is empty.

Remark 2.

If

F = [θ, \infty)

where

θ > θ_{g}

, then in both the upper and lower bounds, it is sufficient to evaluate the infimum at the boundary point

θ

. That is,

\begin{matrix} \underset{n \to \infty}{lim sup} \frac{1}{n} log P_{g} ({\hat{θ}}_{n} \in [θ, \infty)) \leq - Λ_{θ}^{*} (0) . \end{matrix}

Similarly, if

G = (θ, \infty)

where

θ > θ_{g}

, then

\begin{matrix} \underset{n \to \infty}{lim inf} \frac{1}{n} log P_{g} ({\hat{θ}}_{n} \in (θ, \infty)) \geq - I (θ) . \end{matrix}

Furthermore, if

{inf}_{α} Λ_{θ} (α)

is achieved at a unique point

α_{θ}^{*}

and

Λ_{θ}

is differentiable at

α_{θ}^{*}

, then the right-hand side of (28) reduces to

Λ_{θ}^{*} (0)

, i.e., the upper and lower bounds coincide and the limits exist. Since the rate function appearing in the upper and lower bounds coincide in this case, we obtain a proper LDP if the resulting rate function has the required regularity properties, in particular,

I (θ) = Λ_{θ}^{*} (0)

is lower semicontinuous and has compact level sets.

The proof of the above theorem relies on (14) and (15) combined with Theorem 1, together with a change of measure argument characteristic of large deviation analysis. The comparison inequalities in (14) and (15) are critical to obtaining the characterizations in the above theorem, but these are essentially one-dimensional results and their analogues in higher dimensions (

d \geq 2

) are not immediate. Consequently, when

Λ_{θ}

is not differentiable, new complications arise, which lead to a slightly different, and less explicit, representation of the lower bound.

Next we establish a large deviation theorem for

R^{d}

, generalizing the previous theorem to higher dimensions. In the following, let

dist (x, G) = {inf}_{y \in G} | | x - y | |

denote the distance between a point

x \in R^{d}

and a set

G \subset R^{d}

.

Theorem 3.

Assume Hypotheses 1–8 are satisfied, and assume that

b_{n} ↘ 0

and

n b_{n} ↗ \infty

as

n \to \infty .

Then for any closed set F not containing

θ_{g}

,

\begin{matrix} \underset{n \to \infty}{lim sup} \frac{1}{n} log P_{g} ({\hat{θ}}_{n} \in F) \leq - inf_{θ \in F} Λ_{θ}^{★} (0) . \end{matrix}

(30)

Moreover, for any open set G not including

θ_{g}

,

\begin{matrix} \underset{n \to \infty}{lim inf} \frac{1}{n} log P_{g} ({\hat{θ}}_{n} \in G) \geq - inf_{θ \in G} I (θ), \end{matrix}

(31)

where

I (θ) = inf \{Λ_{θ}^{*} (x) : x \in R_{θ} \cap B (0; c_{θ})\}

and

c_{θ} = b dist (θ, Θ - G)

for some universal constant

b \in (0, \infty),

and the infimum is taken to be infinity if the set

R_{θ} \cap B (0; c_{θ})

is empty.

Remark 3.

As we noted for the one-dimensional case in Remark 2, under a differentiability assumption on

Λ_{θ}

, the function

I (θ)

can be identified as

Λ_{θ}^{*} (0)

, but in full generality, it is not immediately known that

I (θ)

is even nontrivial. Moreover, without differentiability, the infimum in the definition of

I (θ)

is more restrictive than what we encountered in the one-dimensional problem. However, if one assumes additional geometry on G, such as a translated cone structure, then one obtains improved estimates in the sense that one can take unbounded regions in the definition of

I (θ)

, just as we saw in Theorem 2.2. For further remarks in this direction, see the discussion given after the proof of the theorem.

3. Proofs

We turn first to Proposition 1.

Proof of Proposition 1.

Since

Θ \otimes G

is equipped with product topology, it is sufficient to show that if

θ_{n} \to θ

and

g_{n} \overset{w}{\to} g

, then

{\nabla A}_{n} (θ)

converges to

\nabla A (θ)

, where

\begin{matrix} \nabla A (θ) = \int_{S} u_{θ} (x) s_{θ} (x) g^{\frac{1}{2}} (x) d x . \end{matrix}

(32)

Let

r_{θ} (x) = u_{θ} (x) s_{θ} (x)

, and observe that

\begin{matrix} |\nabla A (θ_{n}, g_{n}) - \nabla A (θ, g)| & \leq & \int_{S} | r_{θ_{n}} (x) | | g_{n}^{\frac{1}{2}} (x) - g^{\frac{1}{2}} (x) | d x + \int_{S} | r_{θ_{n}} (x) - r_{θ} (x) | g^{\frac{1}{2}} (x) d x \\ \leq & | | r_{θ} {| |}_{2} HD (g_{n}, g) + \int_{S} | r_{θ_{n}} (x) - r_{θ} (x) | g^{\frac{1}{2}} (x) d x \\ = & T_{n, 1} + T_{n, 2}, \end{matrix}

(33)

where the penultimate equation follows by applying the Cauchy–Schwarz inequality. Then by the Cauchy–Schwarz inequality and Hypothesis 5,

T_{n, 2} \to 0

. Since Hellinger distance is dominated by the

L_{1}

-distance, in order to complete the proof, it is sufficient to show that

| | g_{n} - g {| |}_{1} \to 0

. Now since

g_{n} \overset{w}{\to} g

, it follows that as

n \to \infty

,

\begin{matrix} G_{n} (x) : = \int_{S} g_{n} (y) I_{{y \leq x}} d y \to \int_{S} g (y) I_{{y \leq x}} d y G (x) . \end{matrix}

(34)

Evidently,

G_{n} (\cdot)

and

G (\cdot)

are nondecreasing and right continuous. Furthermore, if

x_{*} = inf {x : x \in S}

and

x^{*} = sup {x : x \in S}

, then

G_{n} (x_{*}) \to G (x_{*})

and

G_{n} (x^{*}) \to G (x^{*})

, where

G_{n} (x_{*}) = {lim}_{x \to x_{*}} G_{n} (x)

,

G_{n} (x^{*}) = {lim}_{x \to x *} G_{n} (x)

,

G (x_{*}) = {lim}_{x \to x_{*}} G (x)

,

G (x^{*}) = {lim}_{x \to x *} G (x)

. Thus

G_{n}

converges to G, which is a proper distribution function. Then by Lemma 1 of Boos [22],

g_{n} (\cdot)

converges to

g (\cdot)

uniformly on compact sets. This, in turn, implies the

L_{1}

convergence of

g_{n} (\cdot)

to

g (\cdot)

(by Scheffe’s lemma), which establishes the convergence of

T_{n, 1}

to 0, thus completing the proof of the joint continuity of

\nabla A (θ, g)

.

Next, the uniform convergence (17) follows by Hypothesis 5, since

\begin{matrix} sup_{θ \in Θ} |\nabla A (θ, g_{n}) - \nabla A (θ, g)| & \leq & \int_{S} | r_{θ} (x) | | g_{n}^{\frac{1}{2}} (x) - g^{\frac{1}{2}} (x) | d x \\ \leq & sup_{θ \in Θ} | | r_{θ} {| |}_{2} HD (g_{n}, g) \to 0 . \end{matrix}

Finally, to prove that

G

is weakly sequentially closed, note that convergence in weak topology implies pointwise convergence, yielding

g (\cdot) \geq 0

. Noting that

\begin{matrix} \int_{S} g (x) d μ (x) = 1 + \int_{S} (g (x) - g_{n} (x)) d x, \end{matrix}

(35)

it follows that

g (\cdot)

integrates to one, using

L_{1}

convergence, thus completing the proof of the proposition. □

We now turn to the proof of Theorem 1. The proof relies on the large deviation theorem for the kernel density estimator

g_{n} (\cdot)

in the weak topology of

G

. The next proposition is concerned with the LDP for

{g_{n}}

in

G

, equipped with the inherited weak topology from

L_{1} (S)

. This issue has received considerable attention recently (cf. [23,24]), where it is established that the full LDP may not hold for

{g_{n}}

in norm topology, but does hold under the weak topology.

Proposition 2.

Assume Hypotheses 1–8 and that

b_{n} ↘ 0

and

n b_{n} ↗ \infty

as

n \to \infty

. Then

{g_{n}}

satisfies the LDP in the weak topology of

L_{1} (S)

with good rate function I given by

\begin{matrix} I (p) = \{\begin{matrix} \int_{S} p (x) log (\frac{p (x)}{g (x)}) d x & if g ≪ p, \\ \infty & otherwise . \end{matrix} \end{matrix}

(36)

Proof of Theorem 1.

As before, let

G

be equipped with the weak topology. Set

r_{θ} (x) = u_{θ} (x) s_{θ} (x)

, and define

F : G \to R

as follows:

\begin{matrix} F (h) = \int_{S} 〈 α, r_{θ} (x) 〉 h^{\frac{1}{2}} (x) d x . \end{matrix}

(37)

By Hypothesis 5,

r_{θ} \in L_{2} (S)

. To show that

F (\cdot)

is continuous, let

h_{n} \overset{w}{\to} h

as

n \to \infty

. Then

\begin{matrix} | F (h_{n}) - F (h) | & \leq & \int_{S} r_{θ} (x) | h_{n}^{\frac{1}{2}} (x) - h^{\frac{1}{2}} (x) | d μ (x) \\ \leq & | | r_{θ} {| |}_{2} HD (h_{n}, h) \leq | | r_{θ} {| |}_{2} | | h_{n} - h {| |}_{1} \to 0 as n \to \infty, \end{matrix}

(38)

where we have used the Cauchy–Schwarz inequality that the

L_{1}

distance dominates the Hellinger distance in (38). Now by Hypothesis 7, as in the proof of Proposition 1, we have that

| | h_{n} - h {| |}_{1} \to 0

as

n \to \infty

, establishing the continuity of

F (\cdot)

. Next, to show that

F (\cdot)

is bounded, note that

sup {F (p) : p \in G} \leq | | r_{θ} {| |}_{2}

by the Cauchy–Schwarz inequality. Then by Proposition 2, it follows by Varadhan’s integral lemma (see [10], Theorem 4.3.1) that

\begin{matrix} lim_{n \to \infty} \frac{1}{n} log E [exp (n F (g_{n} (x))] & = & lim_{n \to \infty} \frac{1}{n} log E [exp (n \int_{S} 〈 α, u_{θ} (x) s_{θ} (x) 〉 g_{n}^{\frac{1}{2}} (x) d x)] \\ = & sup_{p \in G} \{\int_{S} 〈 α, u_{θ} (x) 〉 s_{θ} (x) p^{\frac{1}{2}} (x) d x - KL (p, g)\} \\ : = & Λ_{θ} (α) . \end{matrix}

(39)

This completes the proof of the theorem. □

The proofs of our main results will involve probability bounds on the modulus of continuity of

A_{n} (θ)

and

{\nabla A}_{n} (θ)

, respectively. Recall that the modulus of continuity

ω (h; r)

of a function

h : R^{d} \to R

is given by

\begin{matrix} ω (h; r) : = sup_{| | x_{1} - x_{2} | | \leq r} | h (x_{1}) - h (x_{2}) |, r > 0 . \end{matrix}

(40)

Observe that when

h (\cdot)

is replaced by

A_{n} (θ)

or

{\nabla A}_{n} (θ)

, the modulus of continuity becomes a random quantity. Our next proposition summarizes the continuity properties of

A_{n} (θ)

and

{\nabla A}_{n} (θ)

via their modulus of continuity as real-valued functionals from

G

equipped with the weak topology.

Proposition 3.

Assume that Hypotheses 1–8 hold and that

b_{n} ↘ 0

and

n b_{n} ↗ \infty

as

n \to \infty

. Then, with respect to

{A_{n}}

and

A

, the modulus of continuity satisfies the following relations, each with probability one:

\begin{matrix} (i) lim_{n \to \infty} ω (A_{n}; r) = ω (A, r); (ii) lim_{r \to 0} ω (A_{n}; r) = 0; and (iii) lim_{r \to 0} ω (A; r) = 0 . \end{matrix}

Similarly, the sequence

{\nabla A_{n}}

and

\nabla A

satisfy the analogous relations with probability one; namely,

\begin{matrix} (iv) lim_{n \to \infty} ω (\nabla A_{n}; r) = ω (\nabla A; r); (v) lim_{r \to 0} ω (\nabla A_{n}; r) = 0; and (vi) lim_{r \to 0} ω (\nabla A; r) = 0 . \end{matrix}

Proof.

First observe that

A_{n} (θ)

converges uniformly to

A (θ)

. To see this, note that if

g_{n} \overset{w}{\to} g

, then by Proposition 1, it converges in

L_{1}

. Hence

\begin{matrix} sup_{θ \in Θ} | A_{n} (θ) - A (θ) | & \leq & sup_{θ \in Θ} \int_{R} s_{θ} (x) | g_{n}^{\frac{1}{2}} (x) - g^{\frac{1}{2}} (x) | d x \\ \leq & | | g_{n}^{\frac{1}{2}} (x) - g^{\frac{1}{2}} {(x) | |}_{2} \leq | | g_{n} - g {| |}_{1} \to 0, \end{matrix}

(41)

where the last inequality follows using that the Hellinger distance is dominated by the

L_{1}

-distance. We now prove (i). For this we invoke the properties of the modulus of continuity. Observe that

\begin{matrix} ω (A_{n}; r) = ω (A_{n} - A + A; r) \leq ω (A_{n} - A; r) + ω (A; r), \end{matrix}

(42)

which yields

\begin{matrix} | ω (A_{n}; r) - ω (A; r) | \leq ω (A_{n} - A; r) . \end{matrix}

(43)

Next observe that

\begin{matrix} ω (A_{n} - A; r) & = & sup_{| | θ_{1} - θ_{2} | | \leq r} | (A_{n} - A) (θ_{1}) - (A_{n} - A) (θ_{2}) | \\ \leq & 2 sup_{θ \in Θ} | A_{n} (θ) - A (θ) | \to 0, \end{matrix}

(44)

where the last convergence follows from the uniform convergence of

(A_{n} - A) (θ)

to 0 as shown in (42). The proof of (iv) is similar, and specifically is obtained by using that

\begin{matrix} ω (\nabla (A_{n} - A); r) \leq 2 sup_{θ \in Θ} | | \nabla A_{n} (θ) - \nabla A (θ) | | \to 0, \end{matrix}

(45)

where the above convergence follows from (17).

We now turn to the proof of (ii). Using the Cauchy–Schwarz inequality and the definition of Hellinger distance,

\begin{matrix} ω (A_{n}; r) & = & sup_{| | θ_{1} - θ_{2} | | \leq r} | A_{n} (θ_{1}) - A_{n} (θ_{2}) | \\ = & sup_{| | θ_{1} - θ_{2} | | \leq r} |\int_{R} (s_{θ_{1}} (x) - s_{θ_{2}} (x)) g_{n}^{\frac{1}{2}} (x) d x| \leq HD (f_{θ_{1}}, f_{θ_{2}}) \\ \leq & sup_{| | θ_{1} - θ_{2} | | \leq r} | | f_{θ_{1}} - f_{θ_{2}} {| |}_{1} : = ω (H; r), \end{matrix}

(46)

where

H : (θ_{1}, θ_{2}) \to | | f_{θ_{1}} - f_{θ_{2}} {| |}_{1}

is continuous since

F

is continuous in

θ

. Also, since

Θ \times Θ

is compact,

H (\cdot, \cdot)

is uniformly continuous. Since the modulus of continuity converges to 0 if and only if

H (\cdot, \cdot)

is uniformly continuous, (ii) follows. Turning to (v), notice that, as before,

\begin{matrix} ω (\nabla A_{n}; r) \leq sup_{| | θ_{1} - θ_{2} | | \leq r} | | u_{θ_{1}} s_{θ_{1}} - u_{θ_{2}} s_{θ_{2}} {| |}_{2} . \end{matrix}

(47)

Now, since

u_{θ} s_{θ}

is

L_{2}

continuous, by Hypothesis 5, the proof follows as in (ii) due to to the compactness of

Θ

. The proofs of (iii) and (vi) are similar to (ii) and (v), respectively, and are therefore omitted. □

Proposition 4.

For any

0 < M < \infty

and

δ > 0

, there exists a positive number

r (M, δ)

such that

\begin{matrix} P_{g} (ω (A_{n}; r) \geq δ) \leq e^{- M n} a n d P_{g} (ω (\nabla A_{n}; r) \geq δ) \leq e^{- M n} . \end{matrix}

(48)

Proof.

By Markov’s inequality and (46), it follows that for any

β > 0

,

\begin{matrix} P_{g} (ω (A_{n}; r) \geq δ) \leq E_{g} [e^{n β ω (A_{n}; r)}] e^{- n β δ} \leq e^{- n β (δ - ω (H; r))} . \end{matrix}

(49)

Since

ω (H; r) \to 0

as

r ↘ 0

, there exists an

r_{0}

such that for all

r \leq r_{0}

,

(δ - ω (H; r)) > 0

. Since

β > 0

is arbitrary, the proposition follows by taking

β = M {(δ - ω (H; r))}^{- 1}

, for some

r \leq r_{0}

. The proof of the second inequality is similar, using (47). □

Proof of Theorem 2.

We begin with the proof of the upper bound. Since we assume that the equation

{\dot{A}}_{n} (θ) = 0

has a unique solution, it follows from the inequality in (12) that for any

α > 0

and

θ > θ_{g}

,

\begin{matrix} \underset{n \to \infty}{lim sup} \frac{1}{n} log P_{g} ({\hat{θ}}_{n} \geq θ) \leq \underset{n \to \infty}{lim sup} \frac{1}{n} log E_{g} [exp (n α {\dot{A}}_{n} (θ))] = Λ_{θ} (α), \end{matrix}

(50)

where the last equality follows by applying Theorem 1 with

d = 1

. Since the inequality holds for every

α > 0

,

\begin{matrix} \underset{n \to \infty}{lim sup} \frac{1}{n} log P_{g} ({\hat{θ}}_{n} \geq θ) & \leq & sup_{α > 0} Λ_{θ} (α) \leq sup_{α \in R} Λ_{θ} (α) . \end{matrix}

(51)

Now, noticing that

{sup}_{α \in R} Λ_{θ} (α) = - {inf}_{α \in R} - Λ_{θ} (α) = - Λ_{θ}^{*} (0)

, we then obtain

\begin{matrix} \underset{n \to \infty}{lim sup} \frac{1}{n} log P_{g} ({\hat{θ}}_{n} \geq θ) \leq - Λ_{θ}^{*} (0) . \end{matrix}

(52)

Similarly, for

θ < θ_{g}

, using (13), one can show by an analogous calculation that

\begin{matrix} \underset{n \to \infty}{lim sup} \frac{1}{n} log P_{g} ({\hat{θ}}_{n} \leq θ) & \leq & - Λ_{θ}^{*} (0) . \end{matrix}

(53)

Now let

θ_{1} = inf {θ > θ_{g} : θ \in F}

and

θ_{2} = sup {θ < θ_{g} : θ \in F}

. Then

\begin{matrix} P_{g} ({\hat{θ}}_{n} \in F) \leq P_{g} ({\hat{θ}}_{n} \geq θ_{1}) + P_{g} ({\hat{θ}}_{n} \leq θ_{2}), \end{matrix}

(54)

and so by (52) and (53), it follows that

\begin{matrix} \underset{n \to \infty}{lim sup} \frac{1}{n} log P_{g} ({\hat{θ}}_{n} \in F) \leq - min_{θ \in {θ_{1}, θ_{2}}} Λ_{θ}^{*} (0) \leq - inf_{θ \in F} Λ_{θ}^{*} (0), \end{matrix}

(55)

where the last step follows since F closed implies

{θ_{1}, θ_{2}} \subset F

.

Next we turn now to the proof of the lower bound. Let G be an open set, and let

θ \in G

. Then there exists an

ϵ > 0

(to be chosen) such that

I_{ϵ} : = (θ - ϵ, θ + ϵ) ⊊ G

. Note that

\begin{matrix} {{\hat{θ}}_{n} \in I_{ϵ}} & = & {{\dot{A}}_{n} ({\hat{θ}}_{n}) = 0, {\hat{θ}}_{n} \in I_{ϵ}} \\ \supset & {{\dot{A}}_{n} (θ) - {\dot{A}}_{n} ({\hat{θ}}_{n}) \geq δ} \cup {{\hat{θ}}_{n} \in I_{ϵ}, sup_{θ_{1}, θ_{2} \in I_{ϵ}} | {\dot{A}}_{n} (θ_{1}) - {\dot{A}}_{n} (θ_{2}) | \leq δ} . \end{matrix}

Thus,

\begin{matrix} P_{g} ({\hat{θ}}_{n} \in I_{ϵ}) & \geq & P_{g} ({\dot{A}}_{n} (θ) - {\dot{A}}_{n} ({\hat{θ}}_{n}) \geq δ) - P_{g} ({\hat{θ}}_{n} \notin I_{ϵ}, sup_{θ_{1}, θ_{2} \in I_{ϵ}} | {\dot{A}}_{n} (θ_{1}) - {\dot{A}}_{n} (θ_{2} | > δ)) \\ \geq & P_{g} ({\dot{A}}_{n} (θ) - {\dot{A}}_{n} ({\hat{θ}}_{n}) \geq δ) - P_{g} (sup_{θ_{1}, θ_{2} \in I_{ϵ}} | {\dot{A}}_{n} (θ_{1}) - {\dot{A}}_{n} (θ_{2}) | > δ) \\ = & P_{g} ({\dot{A}}_{n} (θ) \geq δ) - P_{g} (sup_{θ_{1}, θ_{2} \in I_{ϵ}} | {\dot{A}}_{n} (θ_{1}) - {\dot{A}}_{n} (θ_{2}) | > δ) \\ = & P_{g} ({\dot{A}}_{n} (θ) \geq δ) - P_{g} (ω ({\dot{A}}_{n}; ϵ) > δ) . \end{matrix}

(56)

We now investigate

P_{g} ({\dot{A}}_{n} (θ) \geq δ)

. Let

Q_{n}

denote the distribution of

{\dot{A}}_{n} (θ)

, and define

Q_{n, α}

as follows:

\begin{matrix} Q_{n, α} (B) = \frac{1}{Λ_{n, θ} (α)} \int_{B} e^{- n α y} d Q_{n} (y), B \in B . \end{matrix}

(57)

Let

B = (x - η, x + η)

, for some

η > 0

, where

B \subset (δ, \infty)

and

x \in R_{θ}

. Then

\begin{matrix} Q_{n} (B) \geq exp \{- n α x - n η | α | + n Λ_{n, θ} (α)\} Q_{n, α} (B) . \end{matrix}

(58)

Taking the logarithm, dividing by n, and then taking the limit as

n \to \infty

, we obtain

\begin{matrix} \underset{n \to \infty}{lim inf} \frac{1}{n} log Q_{n} (B) \geq - α x - η | α | - Λ_{θ} (α) + \underset{n \to \infty}{lim inf} \frac{1}{n} log Q_{n, α} (B) . \end{matrix}

(59)

Now since

x \in R_{θ}

, we can apply Theorem IV.1 of [25] to obtain that the last term on the right-hand side of the previous equation converges to zero. Upon letting

η \to 0

, it follows that

\begin{matrix} \underset{n \to \infty}{lim inf} \frac{1}{n} log Q_{n} (B) \geq - Λ_{θ}^{*} (x) . \end{matrix}

(60)

Since the above inequality holds for all

x \in R_{θ} \cap (δ, \infty)

, we conclude that

\begin{matrix} lim_{n \to \infty} \frac{1}{n} log P_{g} ({\dot{A}}_{n} (θ) \geq δ) \geq - I_{δ} (θ), \end{matrix}

(61)

where

I_{δ} (θ) = {inf}_{x \in R_{θ} \cap (δ, \infty)} Λ_{θ}^{*} (x)

.

By Proposition 4, choosing

M > I_{δ} (θ)

, one can find

ϵ > 0

such that

\begin{matrix} P_{g} (ω ({\dot{A}}_{n}; ϵ) > δ) \leq e^{- M n} . \end{matrix}

(62)

Since

\begin{matrix} P_{g} ({\hat{θ}}_{n} \in G) \geq P_{g} ({\dot{A}}_{n} (θ) \geq δ) (1 - \frac{P_{g} (ω ({\dot{A}}_{n}; ϵ))}{P_{g} ({\dot{A}}_{n} (θ) \geq δ)}), \end{matrix}

(63)

by the choice of M, it follows from (61) that

\begin{matrix} \underset{n \to \infty}{lim inf} \frac{1}{n} log P_{g} ({\hat{θ}}_{n} \in G) \geq - I_{δ} (θ) . \end{matrix}

(64)

Taking the supremum on left- and right-hand side over all

δ > 0

yields the required lower bound. □

Turning to the higher dimensional case, we first need the following result, which provides a uniform bound on the Hessian of the objective function

A_{n} (θ)

.

Lemma 2.

Under Hypotheses 1–8, there exists a finite constant

0 < C < \infty

such that with probability one,

\begin{matrix} sup_{n \geq 1} sup_{θ \in Θ} | | H_{A_{n}} (θ) {| |}_{2} \leq C . \end{matrix}

(65)

Proof.

This is standard. Specifically, note that the

{(i, j)}^{th}

element of the matrix

H_{A_{n}} (θ)

is given by

\begin{matrix} h_{n, i j} = \int_{S} {\ddot{s}}_{θ}^{i j} (x) g_{n}^{\frac{1}{2}} (x) d x . \end{matrix}

(66)

Next, writing down the expression for

{\ddot{s}}_{θ}^{i j}

in terms of the derivatives of the score function

u_{θ}

, using the Cauchy–Schwarz inequality along with Hypotheses 3, 4, 6, and 8, and the definition of the matrix norm, the lemma follows. □

In the proof of the lower bound, we will take a somewhat different approach, involving the analysis of k constraints, and our strategy will be to reduce this to a problem involving a single constraint. Specifically, in (67) below, we establish that, instead of studying k constraints on a quantity

D_{n}

(which we are about to define), we can cast the problem in terms of a d-dimensional vector

Y_{n}

(defined in (70) below) belonging to a ball centered at

0

and of appropriate radius.

To be more precise, let

G \subset R^{d}

be open, and consider the probability that we obtain an estimated value

θ \in G

. Let

{θ_{1}, \dots, θ_{k}} \subset Θ - G

, and for any

δ > 0

, set

d_{n} (j) = A_{n} (θ) - A_{n} (θ_{j}) - δ, j = 1, \dots, k

and

D_{n} (θ) = (d_{n} (1), \dots d_{n} (k))

. If

θ

is chosen as the estimate, then we must have

A_{n} (θ) - A_{n} (θ_{j}) \geq 0

for all j, so, in particular,

\begin{matrix} P_{g} ({\hat{θ}}_{n} \in G) \geq P_{g} (D_{n} (θ) \geq 0) \end{matrix}

(67)

(by which we mean that

d_{n} (j) \geq 0

for all j in this last probability).

To evaluate the latter probability, observe that by a second-order Taylor expansion,

\begin{matrix} s_{θ} (x) - s_{θ_{j}} (x) = 〈 θ - θ_{j}, \nabla s_{θ} (x) 〉 + \frac{1}{2} (θ - θ_{j}) H (x; θ_{j}^{*}) {(θ - θ_{j})}^{'} . \end{matrix}

(68)

Using the positive definiteness and uniform boundedness of the matrix

\int_{R} H (x; θ) p^{\frac{1}{2}} (x) d x

, by Hypothesis 4, we have that for any unit vector

v \in R^{d}

,

sup_{p \in G} inf_{η \in Θ} \{v (\int_{R} H (x; η) p^{\frac{1}{2}} (x) d x) v^{'}\} \geq c,

where c is a positive constant independent of

v

. Thus, for each j,

\begin{matrix} sup_{p \in G} inf_{η \in Θ} \{(θ - θ_{j}) (\int_{R} H (x; η) p^{\frac{1}{2}} (x) d x) {(θ - θ_{j})}^{'}\} \geq c {∥ θ - θ_{j} ∥}^{2} . \end{matrix}

(69)

Integrating with respect to

g_{n}^{\frac{1}{2}} (\cdot)

and using the definition of

A_{n} (\cdot)

, we then obtain that

\begin{matrix} d_{n} (j) = \int_{R} [〈 θ - θ_{j}, \nabla s (x, θ) 〉] g_{n}^{\frac{1}{2}} (x) d x + R (θ, θ_{j}), \end{matrix}

(70)

where

R (θ, θ_{j}) \geq c {∥ θ - θ_{j} ∥}^{2} - δ .

Let

Y_{n} (θ) = (Y_{n, 1}, \dots, Y_{n, d})

, where for

s (x; θ) : = s_{θ} (x)

:

\begin{matrix} Y_{n, j} = \int_{S} \frac{\partial}{\partial θ_{j}} s (x; θ) g_{n}^{\frac{1}{2}} (x) d x, 1 \leq j \leq k . \end{matrix}

(71)

(We have suppressed

θ

in the notation for

Y_{n, j}

.) Then the inequality

d_{n} (j) \geq 0

corresponds to an event

E_{n, j}

described by the occurrence of the inequality

〈\frac{θ - θ_{j}}{| | θ - θ_{j} | |}, Y_{n}〉 \geq - c | | θ - θ_{j} | | + δ (| | θ - θ_{j} {| |)}^{- 1},

(72)

where the right-hand side is always negative for small

δ

(since

dist (θ, Θ - G) > 0

) and behaves like a constant multiple of

dist (θ, Θ - G)

as this distance tends to infinity. Thus, we can choose a positive constant

a_{δ}

such that

a_{δ} dist (θ, Θ - G) \leq c | | θ - θ_{j} | | - δ (| | θ - θ_{j} {| |)}^{- 1}, j = 1, \dots, k,

and set

c_{θ} (δ) : = a_{δ} dist (θ, Θ - G)

. Finally, let

{\tilde{E}}_{n}

denote the event that

\begin{matrix} 〈\frac{θ - θ_{j}}{| | θ - θ_{j} | |}, Y_{n}〉 \geq - c_{θ} (δ) . \end{matrix}

(73)

Then for all j,

E_{n, j} \supset {\tilde{E}}_{n}

, where we recall that

E_{n, j}

was defined via (72). Now, since the definition of the event

{\tilde{E}}_{n}

does not depend on any specific vector

θ_{j}

, one can replace the vector

(θ - θ_{j}) (| | θ - θ_{j} {| |)}^{- 1}

by any unit vector

v

in

R^{d}

. Hence

\begin{matrix} P_{g} (D_{n} \geq 0) \geq P_{g} (〈v, Y_{n}〉 \geq - c_{θ} (δ), for all unit vectors v) = P_{g} (Y_{n} \in \bar{B} (0; c_{θ} (δ))), \end{matrix}

(74)

and we now derive a large deviation lower bound for the probability on the right-hand side.

Proposition 5.

Assume that Hypotheses 1–8 hold, and suppose that G is an open subset of

R^{d}

. Assume that

b_{n} ↘ 0

and

n b_{n} ↗ \infty

as

n \to \infty

. Then for any

θ \in G

and

r > 0

,

\begin{matrix} lim_{n \to \infty} \frac{1}{n} log P_{g} (Y_{n} \in B (0; r)) \geq - I_{r} (θ), \end{matrix}

(75)

where

I_{r} (θ) = inf \{Λ_{θ}^{*} (x) : x \in R_{θ} \cap B (0; r)\}

and the infimum is taken to be infinity if the set

R_{θ} \cap B (0; r)

is empty.

Proof.

We begin by studying the limiting generating function of

Y_{n}

. By Varadhan’s integral lemma, it follows that

\begin{matrix} lim_{n \to \infty} Λ_{n, θ} (α) : = lim_{n \to \infty} \frac{1}{n} log E_{g} [exp (n 〈 α, Y_{n} 〉] = Λ_{θ} (α), \end{matrix}

(76)

where

\begin{matrix} Λ_{θ} (α) = sup_{p \in G} [\int_{S} 〈 α, \nabla s_{θ} (x) 〉 p^{\frac{1}{2}} (x) d x - KL (p, g)] . \end{matrix}

(77)

Define the

α

-shifted distribution by

\begin{matrix} Q_{n, α} (B) = \frac{1}{Λ_{n, θ} (α)} \int_{B} e^{n 〈 α, y 〉} d Q_{n} (y), \end{matrix}

(78)

where

Q_{n}

denotes the distribution of

Y_{n}

. Note by the convexity of

Λ_{θ} (α)

that it is almost everywhere differentiable. Fix

x \in R_{θ} \cap B (0; r)

and choose

α

such that

\nabla Λ_{θ} (α) = x

. Let

δ > 0

be such that

B (x; δ) ⊊ B (0; r)

. Then

\begin{matrix} Q_{n} (B (x; δ)) & = & exp (n Λ_{n, θ} (α)) \int_{B (x; δ)} exp (- n 〈 α, y 〉) d Q_{n, α} (y) \\ \geq & exp (n (- 〈 α, x 〉 + Λ_{n, θ} (α) + | | α | | δ)) Q_{n, α} (B (x; δ)), \end{matrix}

(79)

implying

\begin{matrix} \underset{n \to \infty}{lim inf} \frac{1}{n} log Q_{n} (B (x; δ)) \geq - 〈 α, x 〉 + Λ_{θ} (α) - | | α | | δ + \underset{n \to \infty}{lim inf} \frac{1}{n} log Q_{n, α} (B (x; δ)) . \end{matrix}

(80)

Now, notice that the limiting cumulant generating function of

Y_{n}

under the measure

Q_{n, α}

is given by

\begin{matrix} {\tilde{Λ}}_{θ} (β) = Λ_{θ} (α + β) - Λ_{θ} (β) . \end{matrix}

(81)

Since

{\tilde{Λ}}_{θ}

is a proper convex function, it is continuous since

Λ_{θ} (α)

is finite in the

R^{d}

, and moreover, by the choice of

x

, it is differentiable at

0

. Hence Condition II.1 of [25] is satisfied. Now, using Theorem IV.1 of [25], it follows that

\begin{matrix} \underset{n \to \infty}{lim inf} \frac{1}{n} log Q_{n, α} (B (x; δ)) = 0 . \end{matrix}

(82)

Substituting the above into (80), we obtain

\begin{matrix} \underset{n \to \infty}{lim inf} \frac{1}{n} log P_{g} (Y_{n} \in B (0; r)) \geq - Λ_{θ}^{*} (x) . \end{matrix}

(83)

Taking the supremum in

x \in R_{θ} \cap B (0; r)

, the proposition follows. □

Proof of Theorem 3: Upper Bound.

Let F be a closed subset of

Θ

. Note

Θ

compact implies that F is compact. Let

{B (θ; r) : θ \in Θ}

denote an open cover of F, and let

{B (θ_{1}; r), \dots, B (θ_{k}; r)}

denote the finite subcover. Using that

\nabla A_{n} ({\hat{θ}}_{n}) = 0

, we then obtain that for any

α \in R^{d}

,

\begin{matrix} P_{g} ({\hat{θ}}_{n} \in F) & \leq & \sum_{j = 1}^{k} P_{g} ({\hat{θ}}_{n} \in B (θ_{k}; r)) \\ = & \sum_{j = 1}^{k} E_{g} [exp (n 〈 α, {\dot{A}}_{n} ({\hat{θ}}_{n}) 〉) I_{\{{\hat{θ}}_{n} \in B (θ_{j}; r)\}}] : = \sum_{j = 1}^{k} T_{n} (j) . \end{matrix}

(84)

Adding and subtracting

\nabla A_{n} (θ_{j})

to

\nabla A_{n} (θ)

and then applying Hölder’s inequality yields

T_{n} (j) \leq T_{n} (1, j, p) T_{n} (2, j, q)

, where

\begin{matrix} log T_{n} (1, j, p) & = & \frac{1}{p} log E_{g} [exp (n p {〈 α, \nabla A}_{n} (θ_{j}) 〉) I_{\{θ \in B (θ_{j}; r)\}}], \\ log T_{n} (2, j, q) & = & \frac{1}{q} log E_{g} [exp (n q 〈 α, {\nabla (A}_{n} ({\hat{θ}}_{n}) - A_{n} (θ_{j})) 〉) I_{{{\hat{θ}}_{n} \in B (θ_{j}; r)}}] . \end{matrix}

First we study

T_{n} (2, j, q)

. For

{\hat{θ}}_{n} \in B (θ_{j}, r_{j})

and

θ_{1}, θ_{2} \in Θ

, the Cauchy–Schwarz inequality gives

\begin{matrix} | 〈 α, \nabla A_{n} ({\hat{θ}}_{n}) - \nabla A_{n} (θ_{j})) 〉 | & \leq & {| | α | |}_{2} sup_{θ_{1}, θ_{2} \in B (θ_{j}, r)} | | \nabla A_{n} (θ_{1}) - \nabla A_{n} (θ_{2}) {) | |}_{2} \\ \leq & {| | α | |}_{2} | r | sup_{θ \in B (θ_{j}, r)} | | H_{A_{n}} (θ) {| |}_{2} \\ \leq & {| | α | |}_{2} | r | max_{1 \leq j \leq k} [sup_{θ \in B (θ_{j}, r_{j})} | | H_{A_{n}} (θ) {| |}_{2}], \end{matrix}

where

H_{A_{n}} (θ)

is the Hessian matrix consisting of the second partial derivatives of

A_{n} (θ)

. Hence we obtain for any

1 \leq j \leq k

that

\begin{matrix} \frac{1}{n} log T_{n} (2, j, q) & \leq & r \frac{1}{n q} {(n q | | α | |}_{2}) max_{1 \leq j \leq k} \{sup_{θ \in B (θ_{j}, r)} | | H_{A_{n}} (θ) {| |}_{2}\} \\ = & {r | | α | |}_{2} max_{1 \leq j \leq k} \{sup_{θ \in B (θ_{j}, r)} | | H_{A_{n}} (θ) {| |}_{2}\} . \end{matrix}

(85)

Now by Lemma 2,

\begin{matrix} \underset{n \to \infty}{lim sup} \frac{1}{n} log T_{n} (2, j, q) \leq C r . \end{matrix}

(86)

Also, for each

1 \leq j \leq k

, Theorem 1 provides that

\begin{matrix} \underset{n \to \infty}{lim sup} \frac{1}{n} log T_{n} (1, j, p) \leq \frac{1}{p} Λ_{θ_{j}} (p α) . \end{matrix}

(87)

Thus

\begin{matrix} \underset{n \to \infty}{lim sup} \frac{1}{n} P_{g} ({\hat{θ}}_{n} \in F) & \leq & max_{1 \leq j \leq k} \underset{n \to \infty}{lim sup} \frac{1}{n} log T_{n} (1, j, p) + max_{1 \leq j \leq k} \underset{n \to \infty}{lim sup} \frac{1}{n} log T_{n} (2, j, p) \\ \leq & max_{1 \leq j \leq k} \frac{1}{p} Λ_{θ_{j}} (p α) + C r . \end{matrix}

(88)

Since the last inequality holds for all

p > 1

,

\begin{matrix} \underset{n \to \infty}{lim sup} \frac{1}{n} P_{g} ({\hat{θ}}_{n} \in F) & \leq & max_{1 \leq j \leq k} \frac{1}{p} Λ_{θ_{j}} (p α) + C r \\ \to & max_{1 \leq j \leq k} Λ_{θ_{j}} (α) + C r as p ↘ 0 . \end{matrix}

(89)

Moreover, for each j,

\begin{matrix} Λ_{θ_{j}} (α) \leq sup_{α \in R^{d}} Λ_{θ_{j}} (α) : = - Λ_{θ_{j}} (0) . \end{matrix}

Hence

\begin{matrix} \underset{n \to \infty}{lim sup} \frac{1}{n} P_{g} ({\hat{θ}}_{n} \in F) & \leq & max_{1 \leq j \leq k} - Λ_{θ_{j}}^{*} (0) + C r \\ \leq & - inf_{θ \in F} Λ_{θ}^{*} (0) + C r . \end{matrix}

(90)

The upper bound follows by letting

r ↘ 0

. □

Proof of Theorem 3: Lower Bound.

Let G be an open subset of

Θ

, and let

θ \in G

. Then

G^{c} = Θ - G

is compact, and there exists a collection

T = {θ_{1}, \dots, θ_{k}} \subset G^{c}

such that

B (θ_{1}; ϵ), \dots, B (θ_{k}; ϵ)

forms a finite subcover of

Θ - G

, where

ϵ > 0 .

Since

\begin{matrix} \{A_{n} (θ) \geq sup_{t \in T} A_{n} (t)\} & \supset & \{A_{n} (θ) \geq max_{1 \leq j \leq k} A_{n} (θ_{j}) + max_{1 \leq j \leq k} sup_{t \in B (θ_{j}; ϵ)} [A_{n} (t) - A_{n} (θ_{j})]\} \\ \supset & \{A_{n} (θ) \geq max_{1 \leq j \leq k} A_{n} (θ_{j}) + sup_{| | θ_{1} - θ_{2} | | < ϵ} [A_{n} (θ_{1}) - A_{n} (θ_{2})]\} \end{matrix}

(91)

it follows that

\begin{matrix} P_{g} {\hat{θ}}_{n} \in G & \geq & P_{g} (A_{n} (θ) > max_{1 \leq j \leq k} A_{n} (θ_{j}) + δ, sup_{| | θ_{1} - θ_{2} | | < ϵ} [A_{n} (θ_{1}) - A_{n} (θ_{2})] \leq δ) \\ \geq & J_{n, 1} - J_{n, 2}, \end{matrix}

(92)

where

\begin{matrix} J_{n, 1} & : = & P_{g} (A_{n} (θ) > max_{1 \leq j \leq k} A_{n} (θ_{j}) + δ), \\ J_{n, 2} & : = & P_{g} (sup_{| | θ_{1} - θ_{2} | | \leq r} [A_{n} (θ_{1}) - A_{n} (θ_{2})] \geq δ) : = P_{g} (ω (A_{n}; ϵ) \geq δ) . \end{matrix}

We now investigate the behavior of

J_{n, 1}

and

J_{n, 2}

. Starting with

J_{n, 1}

, note that

\begin{matrix} J_{n, 1} \geq P_{g} (min_{1 \leq j \leq k} (A_{n} (θ) - A_{n} (θ_{j}) - δ) \geq 0) = P_{g} (D_{n} \geq 0) . \end{matrix}

(93)

Now by (74), it follows that

\begin{matrix} J_{n, 1} \geq P_{g} (Y_{n} \in B (0; r)), \end{matrix}

(94)

where

Y_{n}

is as in (71) and

r = c_{θ} (δ)

. Applying Proposition 3.4, we obtain

\begin{matrix} lim_{n \to \infty} \frac{1}{n} log J_{n, 1} \geq - I_{r} (θ), \end{matrix}

(95)

where

I_{r} (θ) = inf \{Λ_{θ}^{*} (x) : x \in R_{θ} \cap B (0; r)\}

, and we now observe that r may be chosen to be

c_{θ} : = {lim}_{δ ↓ 0} c_{θ} (δ) > 0

, where

c_{θ} (δ)

is given as in (73). Hence we may replace

I_{r} (\cdot)

with

I (\cdot)

on the right-hand side of the previous equation. Next, using Proposition 4 yields that

\underset{n \to \infty}{lim inf} \frac{1}{n} log P_{g} ({\hat{θ}}_{n} \in G) \geq \underset{n \to \infty}{lim inf} \frac{1}{n} log J_{n, 1} + lim_{n \to \infty} log (1 - \frac{J_{n, 2}}{J_{n, 1}}) \geq - I (θ) .

(96)

Finally, the required lower bound is obtained by maximizing the right-hand side over all

θ \in G

. □

In the proof of the lower bound, it is clear that the choice of

{θ_{1}, \dots, θ_{k}}

plays a central role, and the rate function

I (θ)

will be minimized when k is small. As a simple example, suppose that our goal is to obtain a lower bound for

P_{g} ({\hat{θ}}_{n} \in G)

, where

G = \{(θ_{1}, θ_{2}) : θ_{1} > a_{1} or θ_{2} > a_{2}\} \subset R^{2}, θ_{g} \notin G,

which is a union of two halfspaces, This can be expressed as

a + C

, where

a = (a_{1}, a_{2})

and

C = {(θ_{1}, θ_{2}) : θ_{1} > 0 or θ_{2} > 0}

, which is an example of a translated cone. Now if

θ \in G

, then we can find two elements which generate the entire set

Θ - G

, in the sense that all other normalized differences lie between these two unit vectors. These two representative points are the unit vectors

e_{1} = (- 1, 0)

and

e_{2} = (0, - 1)

, and all other normalized differences

(θ - \tilde{θ} / ∥ θ - \tilde{θ} ∥

lie between these vectors for all

\tilde{θ} \in Θ - G .

Now going back to (73), we see that this equation again holds. Furthermore, (74) holds with

B (0; c_{δ} (θ))

now replaced by an intersection of two halfspaces rather than of all halfspaces, yielding an unbounded region in the definition of

I (θ)

. This potentially improves the quality of the lower bound compared with what is presented in the statement of Theorem 3. This idea can be potentially generalized to other sets, such as other unions of halfspaces, and so from a practical perspective, could apply somewhat generally.

4. Concluding Remarks

In this article, we have derived large deviation results for the minimum Hellinger distance estimators of a family of continuous distributions satisfying an equicontinuity condition. These results extend large deviation asymptotics for M-estimators given, e.g., in [6,9]. In contrast to the case for M-estimators, our setting is complicated due to its inherent nonlinearity, leading to complications in the proofs of both the upper and lower bounds, and an unexpected subtlety in the form of the rate function for the lower bound. Our results suggest that one can, under additional hypotheses, establish saddlepoint approximations to the density of MHDE, which would enable one to sharpen inference for small samples.

Similar results are expected to hold for discrete distributions. However, the equicontinuity condition is not required in that case, since

ℓ_{1}

, unlike

L_{1} (S)

, possesses the Schur property. Hence the LDP in the weak topology of

ℓ_{1}

can be derived (more easily) using a standard Gärtner–Ellis argument, and utilizing this, one can, in principle, repeat all of the arguments above to derive results analogous to Theorems 2 and 3. Large deviations for other divergences under weak family regularity (such as noncompactness of the parameter space

Θ

)—and their connections to estimation and test efficiency—are interesting open problems requiring new techniques beyond those described in this article.

Author Contributions

Conceptualization, A.N.V. and J.F.C.; Methodology, A.N.V. and J.F.C.; Validation, A.N.V. and J.F.C.; Writing—original draft, A.N.V. and J.F.C.; Writing—review & editing, A.N.V. and J.F.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

Beran, R. Minimum Hellinger distance estimates for parametric models. Ann. Stat. 1977, 5, 445–463. [Google Scholar] [CrossRef]
Lindsay, B.G. Efficiency versus robustness: The case for minimum Hellinger distance and related methods. Ann. Stat. 1994, 22, 1081–1114. [Google Scholar] [CrossRef]
Basu, A.; Shioya, H.; Park, C. Statistical Inference: The Minimum Distance Approach; CRC Press: Boca Raton, FL, USA, 2011. [Google Scholar]
Pardo, L. Statistical Inference Based on Divergence Measures; CRC Press: Boca Raton, FL, USA, 2006. [Google Scholar]
Bahadur, R.R. Rates of convergence of estimates and test statistics. Ann. Math. Stat. 1967, 38, 303–324. [Google Scholar] [CrossRef]
Borovkov, A.A.; Mogulskii, A.A. Large Deviations and Testing Statistical Hypotheses. Sib. Adv. Math. 1992, 2, 43–72. [Google Scholar]
Fu, J.C. On a theorem of Bahadur on the rate of convergence of point estimators. Ann. Stat. 1973, 1, 745–749. [Google Scholar] [CrossRef]
Arcones, M.A. Large deviations for M-estimators. Ann. Inst. Stat. Math. 2006, 58, 21–52. [Google Scholar] [CrossRef]
Joutard, C. Large deviations for M-estimators. Math. Methods Stat. 2004, 13, 179–200. [Google Scholar]
Dembo, A.; Zeitouni, O. Large Deviations Techniques and Applications; Springer: Berlin, Germany, 1998. [Google Scholar]
Puhalskii, A.; Spokoiny, V. On large-deviation efficiency in statistical inference. Bernoulli 1998, 4, 203–272. [Google Scholar] [CrossRef]
Nikitin, Y. Asymptotic Efficiency of Nonparametric Tests; Cambridge University Press: Cambridge, UK, 1995. [Google Scholar]
Biggins, J.; Bingham, N. Large deviations in the supercritical branching process. Adv. Appl. Probab. 1993, 25, 757–772. [Google Scholar] [CrossRef]
Billingsley, P. Convergence of Probability Measures, 2nd ed.; John Wiley & Sons, Inc.: New York, NY, USA, 1999. [Google Scholar]
de Acosta, A. On large deviations of empirical measures in the τ-topology. J. Appl. Probab. 1993, 31, 41–47. [Google Scholar] [CrossRef]
Basu, A.; Sarkar, S.; Vidyashankar, A.N. Minimum negative exponential disparity estimation in parametric models. J. Statist. Plann. Inference 1997, 58, 349–370. [Google Scholar] [CrossRef]
Cheng, A.-L.; Vidyashankar, A.N. Minimum Hellinger distance estimation for randomized play the winner design. J. Statist. Plann. Inference 2006, 136, 1875–1910. [Google Scholar] [CrossRef]
Devroye, L.; Györfi, L. Nonparametric Density Estimation: The L₁ View; Wiley Series in Probability and Mathematical Statistics: Tracts on Probability and Statistics; John Wiley & Sons, Inc.: New York, NY, USA, 1985. [Google Scholar]
Conway, J.B. A Course in Functional Analysis; Springer: New York, NY, USA, 1990. [Google Scholar]
Dupuis, P.; Ellis, R.S. A Weak Convergence Approach to the Theory of Large Deviations; John Wiley & Sons: New York, NY, USA, 1997. [Google Scholar]
Rockafellar, R.T. Convex Analysis; Princeton University Press: Princeton, NJ, USA, 1970. [Google Scholar]
Boos, D.D. A converse to Scheffé’s theorem. Ann. Stat. 1985, 13, 423–427. [Google Scholar] [CrossRef]
Lei, L. Large Deviations for Kernel Density Estimators and Study for Random Decrement Estimator. Ph. D. Thesis, Université Blaise Pascal-Clermont-Ferrand II, Clermont-Ferrand, France, 2005. [Google Scholar]
Louani, D.; Maouloud, S.M.O. Some functional large deviations principles in nonparametric function estimation. J. Theor. Probab. 2012, 25, 280–309. [Google Scholar] [CrossRef]
Ellis, R.S. Large deviations for a general class of random vectors. Ann. Probab. 1984, 12, 1–12. [Google Scholar] [CrossRef]

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Vidyashankar, A.N.; Collamore, J.F. Rare Event Analysis for Minimum Hellinger Distance Estimators via Large Deviation Theory. Entropy 2021, 23, 386. https://doi.org/10.3390/e23040386

AMA Style

Vidyashankar AN, Collamore JF. Rare Event Analysis for Minimum Hellinger Distance Estimators via Large Deviation Theory. Entropy. 2021; 23(4):386. https://doi.org/10.3390/e23040386

Chicago/Turabian Style

Vidyashankar, Anand N., and Jeffrey F. Collamore. 2021. "Rare Event Analysis for Minimum Hellinger Distance Estimators via Large Deviation Theory" Entropy 23, no. 4: 386. https://doi.org/10.3390/e23040386

APA Style

Vidyashankar, A. N., & Collamore, J. F. (2021). Rare Event Analysis for Minimum Hellinger Distance Estimators via Large Deviation Theory. Entropy, 23(4), 386. https://doi.org/10.3390/e23040386

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Rare Event Analysis for Minimum Hellinger Distance Estimators via Large Deviation Theory

Abstract

1. Introduction

1.1. Large Deviations

1.2. Minimum Hellinger Distance Estimator and Large Deviations

2. Notation, Assumptions, and Main Results

3. Proofs

4. Concluding Remarks

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI