Hellinger Information Matrix and Hellinger Priors

Hellinger information as a local characteristic of parametric distribution families was first introduced in 2011. It is related to the much older concept of the Hellinger distance between two points of a parametric set. Under certain regularity conditions, the local behavior of the Hellinger distance is closely connected to Fisher information and the geometry of Riemannian manifolds. Nonregular distributions (non-differentiable distribution densities, undefined Fisher information, or densities with support depending on the parameter), including uniform distributions, require analogues or extensions of Fisher information. Hellinger information may serve to construct information inequalities of the Cramér-Rao type, extending the lower bounds of the Bayes risk to the nonregular case. A construction of non-informative priors based on Hellinger information was also suggested by the author in 2011. Hellinger priors extend the Jeffreys rule to nonregular cases. For many examples, they are identical or close to the reference priors or probability matching priors. Most of that work was dedicated to the one-dimensional case, but a matrix definition of Hellinger information was also introduced for higher dimensions. Conditions for the existence and the nonnegative definiteness of the Hellinger information matrix were not discussed. Hellinger information for a vector parameter was applied by Lin et al. to problems of optimal experimental design. A special class of parametric problems was considered, requiring a directional definition of Hellinger information but not a full construction of the Hellinger information matrix. In the present paper, a general definition, the existence, and the nonnegative definiteness of the Hellinger information matrix are considered for nonregular settings.


Introduction
Information geometry plays an important role in parametric statistical analysis. Fisher information is the most common information measure that is instrumental in the construction of lower bounds for quadratic risk (information inequalities of Cramér-Rao type), optimal experimental designs (E-optimality), and noninformative priors in Bayesian analysis (Jeffreys prior). These applications of Fisher information require certain regularity conditions on the distributions of the parametric family, which include the existence and integrability of the partial derivatives of the distribution density function with respect to the components of the vector parameter and the independence of the density support from the parameter. If these regularity conditions are not satisfied, Cramér-Rao lower bounds might be violated, and the Jeffreys prior might not be defined.
There exist a number of ways to define information quantities (for the scalar parameter case) and matrices (for the vector parameter case) in the nonregular cases when Fisher information might not exist. One such suggestion is the Wasserstein information matrix [1], which has been recently applied to the construction of objective priors [2]. The Wasserstein matrix does not require the differentiability of the distribution density function, but it cannot be extended to the case of discontinuous densities. This latter case requires a more general definition of information.
Our approach is based on analyzing the local behavior of parametric sets using finite differences of density values at two adjacent points instead of derivatives at a single point. This allows us to include differentiable densities as a special case while also treating non-differentiable densities with jumps and other types of nonregular behavior (for a classification of nonregularities, see [3]). A natural tool to use in lieu of Fisher information is Hellinger information, which is closely related to the definition of the Hellinger distance between adjacent points of the parametric set.
Hellinger information for the case of a scalar parameter was first defined in [4] and suggested for the construction of noninformative Hellinger priors. Section 2 revisits this definition and relates Hellinger information to the information inequalities for the scalar parameter proven in [5,6]. It also contains examples of noninformative Hellinger priors, comparing the priors obtained in [4] with more recent results.
Lin et al. [7] extended the definition of Hellinger information to a special multiparameter case, where all components of the parameter exhibit the same type of nonregularity. This is effective for some optimization problems in experimental design. However, the most interesting patterns in the local behavior of the Hellinger distance, which drive the differences in the matrix lower bounds of the quadratic risk and in multidimensional noninformative priors, are observed when the parametric distribution family has different orders of nonregularity [3] for different components of the vector parameter. Thus, the main challenge in the construction of a Hellinger information matrix in the general case is the necessity to consider increments of different magnitudes in different directions of the vector parametric space.
A general definition of the Hellinger information matrix was attempted in [6] as related to information inequalities and in [4] as related to noninformative priors. Important questions were left out, such as the conditions under which the Hellinger information matrix is positive definite and the existence of non-trivial matrix lower bounds for the quadratic risk in the case of a vector parameter. These questions are addressed in Section 3 of the paper. The main results are formulated, and several new examples are considered. General conclusions and possible future directions of study are discussed in Section 4.

Hellinger Information for Scalar Parameter
In this section, we address the case of probability measures parametrized by a single parameter. We provide the necessary definitions of information measures along with a discussion of their properties, including the new definition of Hellinger information. Then, we consider applications of Hellinger information to information inequalities of the Cramér-Fréchet-Rao type, the construction of objective priors, and problems of optimal design. Definition (2) is modified from [8]. Inequality (5) was obtained in [5]. The examples in Section 2.3 are modified from [4].

Definitions
A family of probability measures {P_θ, θ ∈ Θ ⊂ R} is defined on a measurable space (X, B) so that all the measures of the family are absolutely continuous with respect to some σ-finite measure λ on B. The square of the Hellinger distance between any two parameter values can be defined in terms of the densities p(x; θ) = dP_θ/dλ as

h²(θ₁, θ₂) = ∫_X (√p(x; θ₁) − √p(x; θ₂))² dλ(x).    (1)

This definition of the Hellinger distance (also known as the Hellinger-Bhattacharyya distance) in its modern form was given in [9]. We use this definition to construct a new information measure. If for almost all θ from Θ (with regard to measure λ) there exists an α ∈ (0, 2] (index of regularity) such that the finite non-degenerate limit

J(θ) = lim_{ε→0} h²(θ, θ + ε) / |ε|^α    (2)

exists, we define the Hellinger information at the point θ as J(θ). The index of regularity is related to the local behavior of the density p(x; θ). Using the classification of [3], 1 < α < 2 corresponds to singularities of the first and the second type, α = 1 to densities with jumps, and 0 < α < 1 to singularities of the third type.

Notice that in the regular situations classified in [3] (p(x; θ) is twice continuously differentiable with respect to θ for almost all x ∈ X with respect to λ, the density support {x : p(x; θ) > 0} does not depend on the parameter θ, and Fisher information is continuous, strictly positive, and finite for almost all θ from Θ), it is true that α = 2 and

J(θ) = (1/4) I(θ).

Under the regularity conditions above, the score function (d/dθ) log p(x; θ) has mean zero and Fisher information as its variance. This helps to establish the connection of Fisher information to the limiting distribution of maximum likelihood estimators, its additivity with respect to i.i.d. sample observations, and its role in the lower bounds of risk (information inequalities).
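As a numerical illustration of ours (not part of the original definition), the limit can be computed for the uniform family U(0, θ), where the support depends on the parameter and Fisher information is undefined. Here h²(θ, θ + ε) = 2(1 − √(θ/(θ + ε))) ≈ ε/θ for ε > 0, so α = 1 and J(θ) = 1/θ. A minimal sketch using midpoint quadrature:

```python
import numpy as np

def hellinger_sq_uniform(theta1, theta2, n=400000):
    """Squared Hellinger distance between U(0, theta1) and U(0, theta2),
    computed by midpoint quadrature of (sqrt(p1) - sqrt(p2))^2."""
    hi = max(theta1, theta2)
    dx = hi / n
    x = (np.arange(n) + 0.5) * dx          # midpoints of the grid cells
    p1 = np.where(x < theta1, 1.0 / theta1, 0.0)
    p2 = np.where(x < theta2, 1.0 / theta2, 0.0)
    return np.sum((np.sqrt(p1) - np.sqrt(p2)) ** 2) * dx

theta, alpha = 2.0, 1.0                    # jump at the endpoint: index of regularity alpha = 1
for eps in [1e-1, 1e-2, 1e-3]:
    print(eps, hellinger_sq_uniform(theta, theta + eps) / eps ** alpha)
# the ratios approach J(theta) = 1/theta = 0.5
```

Shrinking ε makes the ratio h²/ε^α settle near 1/θ, while dividing by ε² instead would make it diverge, identifying α = 1 as the correct index.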
Wasserstein information [1] can be defined for the scalar parameter through the c.d.f. F(x; θ) as

W(θ) = ∫ (∂F(x; θ)/∂θ)² / p(x; θ) dx,

which does not require differentiability of the density function p(x; θ). That opens new possibilities for the construction of an objective prior in the case of non-differentiable densities; see [2]. However, we are interested in even less regular situations (including uniform densities with support depending on the parameter) for which neither Fisher information I(θ) nor Wasserstein information W(θ) can be helpful, while Hellinger information J(θ) may serve as their substitute.
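For intuition (an illustration of ours, not taken from [1,2]): writing the scalar Wasserstein information as W(θ) = ∫ (∂F(x; θ)/∂θ)² / p(x; θ) dx, a location family p(x; θ) = p₀(x − θ) has ∂F/∂θ = −p₀(x − θ), so the integrand reduces to p₀ and W(θ) = 1 identically, even when p₀ is not differentiable. A numerical check with the Laplace density, which has a kink at its mode:

```python
import numpy as np

def wasserstein_info_location(p0, lo=-30.0, hi=30.0, n=600001):
    """W(theta) for a location family p(x; theta) = p0(x - theta), at theta = 0,
    via W = integral of (dF/dtheta)^2 / p, where dF/dtheta = -p0(x - theta)."""
    x = np.linspace(lo, hi, n)
    dens = p0(x)
    dF = -dens                              # derivative of the c.d.f. in theta
    safe = np.where(dens > 0, dens, 1.0)    # avoid division by zero off the support
    integrand = np.where(dens > 0, dF ** 2 / safe, 0.0)
    return np.sum(integrand) * (x[1] - x[0])

laplace = lambda x: 0.5 * np.exp(-np.abs(x))   # non-differentiable at the mode
print(wasserstein_info_location(laplace))
# approximately 1.0: the integrand reduces to p0, so W(theta) = 1 for any location family
```

The same computation fails for U(0, θ): the scale-type derivative of the c.d.f. does not cancel the vanishing density at the moving endpoint, which is the kind of situation where J(θ) takes over.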

Information Inequalities
We define the quadratic Bayes risk of an estimator θ* = θ*(X^(n)) constructed from an independent identically distributed sample X^(n) = (X₁, . . . , X_n) of size n from the model considered above, with p(x^(n); θ) = dP^(n)_θ/dλⁿ and prior π(θ), as

r(n) = E(θ*(X^(n)) − Θ)² = ∫_Θ ∫ (θ*(x^(n)) − θ)² p(x^(n); θ) dλⁿ(x^(n)) π(θ) dθ.

Let us consider an integral version of the classical Cramér-Fréchet-Rao inequality, which under certain regularity conditions leads to the following asymptotic lower bound for the Bayes risk in terms of Fisher information:

lim inf_{n→∞} n · r(n) ≥ ∫_Θ I(θ)⁻¹ π(θ) dθ.

This lower bound, which can be proven to be tight, was first obtained in [10], and also in [11,12] under slightly different regularity assumptions. This bound can be extended to the nonregular case, when Fisher information may not exist. One of these extensions is the Hellinger information inequality, providing an asymptotic lower bound obtained in [6] under the assumptions that Hellinger information J(θ) is strictly positive, almost surely continuous, and bounded on any compact subset of Θ, where Θ is an open subset of the real line:

lim inf_{n→∞} n^δ r(n) ≥ C(α) ∫_Θ J(θ)^(−δ) π(θ) dθ,  δ = 2/α,    (5)

where the constant C(α) is related to technical details of the proof and is not necessarily tight.
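To see why the rate n^(−δ) with δ = 2/α is natural, consider a quick Monte Carlo sketch of ours (not part of the original argument) for U(0, θ), where α = 1 and δ = 2. The MLE θ̂ = max X_i has exact quadratic risk 2θ²/((n + 1)(n + 2)), so n² times the risk stabilizes near 2θ², far below the n^(−1) rate of regular models:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, delta = 2.0, 2.0                    # U(0, theta): alpha = 1, so delta = 2/alpha = 2

for n in [50, 200, 800]:
    samples = rng.uniform(0.0, theta, size=(20000, n))
    mle = samples.max(axis=1)              # MLE of the right endpoint
    risk = np.mean((mle - theta) ** 2)     # Monte Carlo estimate of the quadratic risk
    exact = 2 * theta ** 2 / ((n + 1) * (n + 2))
    print(n, n ** delta * risk, n ** delta * exact)
# n**delta * risk stabilizes near 2 * theta**2 = 8, confirming the n^(-2) rate
```
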
The key identity establishing the role of J(θ) in the case of i.i.d. samples x^(n) = (x₁, . . . , x_n),

1 − h²_n(θ₁, θ₂)/2 = (1 − h²(θ₁, θ₂)/2)ⁿ,

where h²_n denotes the squared Hellinger distance for the n-fold product model, easily follows from the definition and the independence of {X_i}. Similar to the additivity of Fisher information, it allows for a transition from a single observation to a sample.
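This multiplicativity can be checked directly in a sketch of ours: writing the Hellinger affinity ρ(θ₁, θ₂) = 1 − h²(θ₁, θ₂)/2, independence gives ρⁿ for the n-fold model, hence h²_n = 2(1 − ρⁿ) ≈ n h² locally, and Hellinger information grows linearly in n. For U(0, θ), ρ(θ, θ + ε) = √(θ/(θ + ε)):

```python
import math

def affinity_uniform(theta, eps):
    """Hellinger affinity rho = 1 - h^2/2 between U(0, theta) and U(0, theta + eps)."""
    return math.sqrt(theta / (theta + eps))

theta, alpha, n = 2.0, 1.0, 7
for eps in [1e-2, 1e-4, 1e-6]:
    rho_n = affinity_uniform(theta, eps) ** n   # affinity of an i.i.d. sample of size n
    print(eps, 2 * (1 - rho_n) / eps ** alpha)
# the ratios approach n * J(theta) = n / theta = 3.5, i.e., J is additive over observations
```
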
For many regular parametric families, probability matching and reference priors both satisfy the Jeffreys rule. However, this need not be the case for multi-parameter families or when regularity is lost. Let us focus on the nonregular case, when Fisher information may not be defined. The most comprehensive results on reference priors in the nonregular case were obtained in [18].
Define the Hellinger prior for the parametric set as in [4,8]:

π_H(θ) ∝ J(θ)^(1/α).

Hellinger priors will often coincide with well-known priors obtained by the approaches described above. However, there are some distinctions. A special role might be played by Hellinger priors in nonregular cases. We provide two simple examples of densities with support depending on the parameter.
The same prior can be constructed as the probability matching prior or the reference prior [18].
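As another sketch of ours, consider the location family U(θ, θ + 1). The squared Hellinger distance is the measure of the symmetric difference of the two supports, h²(θ, θ + ε) = 2|ε| for |ε| ≤ 1, so α = 1, J(θ) ≡ 2, and the rule π(θ) ∝ J(θ)^(1/α) yields a flat prior, the usual invariant choice for a location parameter:

```python
import numpy as np

def hellinger_sq_shifted(theta1, theta2, n=400000):
    """Squared Hellinger distance between U(theta1, theta1 + 1) and U(theta2, theta2 + 1)."""
    lo, hi = min(theta1, theta2), max(theta1, theta2) + 1.0
    dx = (hi - lo) / n
    x = lo + (np.arange(n) + 0.5) * dx
    p1 = ((x > theta1) & (x < theta1 + 1)).astype(float)   # both densities equal 1 on unit intervals
    p2 = ((x > theta2) & (x < theta2 + 1)).astype(float)
    return np.sum((np.sqrt(p1) - np.sqrt(p2)) ** 2) * dx

for eps in [1e-1, 1e-2, 1e-3]:
    print(eps, hellinger_sq_shifted(0.0, eps) / eps)
# each ratio equals J = 2 up to quadrature error; J^(1/alpha) is constant, so the prior is flat
```
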

Optimal Design
A polynomial model of the experimental design may be presented as in [20], where x_i are scalars, θ is the unknown vector parameter of interest, and the errors ε_i are non-negative i.i.d. variables with density p₀(y; α) ∼ α c(α) y^(α−1) near zero (e.g., Weibull or Gamma). The space of balanced designs can then be defined, and there exist several definitions of optimal design. Lin et al. [7] suggest a criterion based on a directional version of Hellinger information, which is similar to the definition given in Section 2, but notice the difference with (2) in the treatment of powers: α versus 2 in the denominator. Notice also that the index of regularity α is assumed to be the same for all components of the vector parameter.

Hellinger Information Matrix for Vector Parameter
In this section, we concentrate on the multivariate parameter case allowing for different degrees of regularity for different components of the vector parameter. We define the Hellinger information matrix, determine our understanding of matrix information inequalities, formulate main results establishing lower bounds for the Bayes risk in terms of Hellinger information matrix, and provide examples of Hellinger priors illustrating the conditions of Theorems 1 and 2.
Proofs of the main results use the approach developed in [5]; Example 3 was previously mentioned in [8].

Definitions
Extending the definitions of Section 2 to the vector case Θ ⊂ R^m, m = 1, 2, . . ., we first introduce, as in [6], the Hellinger distance matrix H(θ, U) with elements

H_ij(θ, U) = ∫_X (√p(x; θ + u_i) − √p(x; θ))(√p(x; θ + u_j) − √p(x; θ)) dλ(x),

where the increments u_i are the columns of an m × m matrix U. Define also the vectors α = (α₁, . . . , α_m) (index of regularity with components 0 < α_i ≤ 2) and δ = (δ₁, . . . , δ_m) with components δ_i = 2/α_i, ∆ = Diag(δ), and take the columns of U in the form u_i = ε^(δ_i) e_i, where e_i are the coordinate unit vectors, such that for all i = 1, . . . , m there exist finite non-degenerate limits

lim_{ε→0} ε^(−2) H_ii(θ, U).

Then, the Hellinger information matrix is defined by its components

J_ij(θ) = lim_{ε→0} ε^(−2) H_ij(θ, U).

Notice that the components of the vector index of regularity α can be different, and therefore the increments u_i = ε^(δ_i) e_i can have different orders of magnitude with respect to ε. As a result, while the elements of the matrix H(θ, ∆) may exhibit different local behavior depending on the components of α, the elements of the matrix J(θ) are all finite.
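A small numerical sketch of ours (taking H_ij as the integrated product of the two square-root density increments) for the two-parameter uniform model U(θ₁, θ₂): both endpoints produce jumps, so α₁ = α₂ = 1 and δ₁ = δ₂ = 2, and with L = θ₂ − θ₁ the limit matrix comes out as J = diag(1/L, 1/L), with the off-diagonal term vanishing:

```python
import numpy as np

def sqrt_pdf(x, a, b):
    """Square root of the U(a, b) density on a grid."""
    return np.where((x > a) & (x < b), 1.0 / np.sqrt(b - a), 0.0)

def hellinger_matrix_uniform(theta, eps, n=800000):
    """H_ij for U(theta1, theta2) with increments u_i = eps**delta_i * e_i, delta = (2, 2)."""
    a, b = theta
    u = eps ** 2                            # delta_i = 2/alpha_i = 2 for both endpoints
    lo, hi = a - u, b + u
    dx = (hi - lo) / n
    x = lo + (np.arange(n) + 0.5) * dx
    d1 = sqrt_pdf(x, a + u, b) - sqrt_pdf(x, a, b)   # increment in theta1
    d2 = sqrt_pdf(x, a, b + u) - sqrt_pdf(x, a, b)   # increment in theta2
    H = np.empty((2, 2))
    H[0, 0], H[1, 1] = np.sum(d1 * d1) * dx, np.sum(d2 * d2) * dx
    H[0, 1] = H[1, 0] = np.sum(d1 * d2) * dx
    return H

for eps in [0.3, 0.1]:
    print(eps, hellinger_matrix_uniform((0.0, 1.0), eps) / eps ** 2)
# eps**-2 * H approaches J = diag(1, 1) for L = 1; the off-diagonal entry vanishes
```

The diagonal entries each converge at the ε² scale, while the cross term is of order ε⁴, which is what makes the limiting matrix diagonal here.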

Information Inequalities
Define the matrix of the Bayes risk for an i.i.d. sample x^(n) = (x₁, . . . , x_n) with p(x^(n); θ) = dP^(n)_θ/dλⁿ as

R(n) = E (θ* − Θ)(θ* − Θ)ᵀ,    (14)

and the matrix ordering A(n) ≽ B(n) as the asymptotic positive semi-definite property

lim inf_{n→∞} zᵀ(A(n) − B(n))z ≥ 0 for all z ∈ R^m.    (15)

Let E also denote the expectation over (X^(n), Θ). The following results formulate conditions under which the lower bounds for the risk (14) in the sense of (15) are obtained in terms of Hellinger information and the index of regularity.

Main Results
Proofs of Theorems 1 and 2 are technically similar to the proof of the main results of [5], although the definition of Hellinger information was not explicitly provided in that paper.

Hellinger Priors
If J > 0, as in the conditions of Theorems 1 and 2, the vector Hellinger prior can be defined through the Hellinger information matrix J(θ) and the vector index of regularity α, by analogy with the scalar case [4,8]. In the case of all components α_i ≡ 2, Hellinger information reduces to the Fisher information matrix (up to the factor 1/4), and our approach leads to the Jeffreys prior [12].
where ψ(z) is the polygamma function of order 1.
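As a sanity check of the regular case α = 2 (a sketch of ours): for the normal location family N(θ, 1), the Hellinger affinity between N(θ, 1) and N(θ + ε, 1) is exp(−ε²/8), so h² = 2(1 − exp(−ε²/8)) ≈ ε²/4, giving J(θ) = 1/4 = I(θ)/4 and a flat (Jeffreys) prior for the location parameter:

```python
import math

theta, I_fisher = 0.0, 1.0                 # N(theta, 1): Fisher information equals 1
for eps in [1e-1, 1e-2, 1e-3]:
    h2 = 2 * (1 - math.exp(-eps ** 2 / 8))   # squared Hellinger distance to N(theta + eps, 1)
    print(eps, h2 / eps ** 2)
# the ratios approach J(theta) = I_fisher / 4 = 0.25 (regular case, alpha = 2)
```
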

Discussion
A Hellinger information matrix can be defined in a reasonable way to serve as a substitute for the Fisher information matrix in multivariate nonregular cases. It can be used as a technically simple tool for the elicitation of non-informative priors.
Properties of the Hellinger distance (symmetry, etc.) grant certain advantages over analogous constructions based on the Kullback-Leibler divergence.
Some interesting nonregularities are not covered by the conditions of Theorems 1 and 2 (see Example 6). More general results related to the positive definite property of J(θ) would be of interest.
It is tempting to obtain Hellinger priors as the solution of a particular optimization problem (similar to the reference priors).

Data Availability Statement:
No new data were created or analyzed in this study. Data sharing is not applicable to this article.