Identification in Parametric Models: The Minimum Hellinger Distance Criterion

Abstract: This note studies the criterion for identifiability in parametric models based on the minimization of the Hellinger distance and exhibits its relationship to the identifiability criterion based on the Fisher matrix. It shows that the Hellinger distance criterion serves to establish identifiability of parameters of interest, or the lack of it, in situations where the criterion based on the Fisher matrix does not apply, such as models where the support of the observed variables depends on the parameter of interest or models with irregular points of the Fisher matrix. Several examples illustrating this result are provided.


Introduction
There are values of unknown parameters of interest in data analysis that cannot be determined even in the most favorable situation, where the maximum amount of data is available, i.e., when the distribution of the population is known. This difficulty has been tackled either by introducing criteria ensuring that the parameter of interest is (locally) identifiable or by delineating the set of observationally equivalent values of the parameter of interest; for a review of these approaches, see, e.g., Paulino and Pereira (1998) or Lewbel (2019). This note contributes to these efforts by studying the criterion for identifiability based on the minimization of the Hellinger distance, which was introduced by Beran (1977), and exhibiting its relationship to the criterion for local identifiability based on the non-singularity of the Fisher matrix, which was introduced by Rothenberg (1971). The similarities and differences between these two criteria for identifiability have so far not been studied.
The main result in this note is to show that the Hellinger distance criterion can be used to verify the (local) identifiability of a parameter of interest, or the lack of it, in models, or at points in the parameter space, where the Fisher matrix criterion does not apply. This note illustrates this result with several examples, including a parametric procurement auction model and the uniform, normal squared, and Laplace location models. These models are irregular either because the support of the observed variables depends on the parameter of interest or because the parameter space contains irregular points of the Fisher matrix. Additional examples of irregular models and models with irregular points of the Fisher matrix are referenced below, after defining the concepts of a regular point of the Fisher matrix and a regular model according to conventional usage, see, e.g., Rothenberg (1971).
Let Y be a vector-valued random variable in Y ⊆ R^L with probability distribution P_o. Let the available data be a sample {Y_i}_{i=1}^{N} of independent and identically distributed replications of Y. Consider a family F of probability density functions f : Y → [0, ∞) defined with respect to a common dominating measure µ, which will allow us to dispense with the need to distinguish between continuous and discrete random variables. 1 Let F_Θ denote a subset of densities in F indexed by θ ∈ Θ, where the parameter space Θ is a subset of R^K, with K a positive integer. Let f_θ denote an element of F_Θ.
Definition 1 (Identifiability). A parameter point θ_o in Θ is said to be identifiable if there is no other θ in Θ such that f_θ(y) = f_{θ_o}(y) for µ-almost every y.
Definition 2 (Local Identifiability). A parameter point θ_o in Θ is said to be locally identifiable if there exists an open neighborhood Θ′ ⊆ Θ of θ_o containing no other θ such that f_θ(y) = f_{θ_o}(y) for µ-almost every y.
Definition 3 (Regular Points). The Fisher matrix I(θ) is the variance-covariance matrix of the score s(θ) := ∂ ln f_θ / ∂θ. The point θ_o ∈ Θ is said to be a regular point of the Fisher matrix if there exists an open neighborhood of θ_o in which I(θ) has constant rank.
The (local) identifiability of regular points of the Fisher matrix in parametric models has been extensively studied, see, e.g., Rothenberg (1971). In contrast, the identifiability of irregular points has received less attention, and the literature is rather unclear about what may happen to the (local) identifiability of irregular points of the Fisher matrix. The study of irregular points is worthy of consideration because, first, there are several models of interest with this type of point in the parameter space (see the list below), and second, because irregular points may correspond either to:
• points in the parameter space that are not locally identifiable and for which a consistent estimator cannot exist, e.g., the measurement error model studied by Reiersol (1950), or can only exist after a normalization; or
• points in the parameter space that are locally identifiable and for which a √N-consistent estimator cannot exist (and some algorithms, e.g., the Newton-Raphson method based on the Fisher matrix, will face difficulties in converging), or can only exist after a reparametrization of the model, see, e.g., the bivariate probit model in Han and McCloskey (2019).
Hinkley (1973) noted that an irregular point of the Fisher matrix arises in the normal unsigned location model when the location parameter is zero. Sargan (1983) constructed simultaneous equation models with irregular points of the Fisher matrix. Lee and Chesher (1986) showed that the normal regression model with non-ignorable non-response has irregular points of the Fisher matrix in the vicinity of ignorable non-response. Li et al. (2009) noted that finite-mixture density models have irregular points of the Fisher matrix in the vicinity of homogeneity. Hallin and Ley (2012) showed that skew-symmetric density models have irregular points of the Fisher matrix in the vicinity of symmetry.
We use below the normal squared location model (see Example 3) to illustrate in a transparent way the notion of an irregular point of the Fisher matrix.
The next section shows that the criterion for local identifiability based on minimizing the Hellinger distance, unlike the criterion based on the non-singularity of the Fisher matrix, applies both to regular and irregular points of the Fisher matrix and to regular and irregular models, to be defined below in Section 3. Section 3 shows that, for regular points of the Fisher matrix in the class of regular models studied by Rothenberg (1971), the criterion based on the Fisher matrix is a particular case of the criterion based on minimizing the Hellinger distance (but not for irregular models or irregular points of the Fisher matrix). Section 4 relates the minimum Hellinger distance criterion to the criterion based on the reversed Kullback-Leibler divergence, introduced by Bowden (1973), by showing that both are particular cases of the criterion for identifiability based on the minimization of a ϕ-divergence.

The Minimum Hellinger Distance Criterion
Identifying θ_o is the problem of distinguishing f_{θ_o} from the other members of F_Θ. It is then convenient to begin by introducing a notion of how densities differ from each other. The squared Hellinger distance for the pair of densities f_θ, f_{θ_o} in F_Θ is one half of the squared L_2(µ)-norm of the difference between the square roots of the densities:

ρ(θ) := (1/2) ∫ ( f_θ^{1/2} − f_{θ_o}^{1/2} )² dµ.

The squared Hellinger distance has the following well-known properties (see, e.g., Pardo 2005, p. 51), which are going to be used later.
Lemma 1. ρ takes values in [0, 1], independently of the choice of the dominating measure µ, and ρ(θ) = 0 if and only if f_θ(y) = f_{θ_o}(y) for µ-almost every y.
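Lemma 1 can be illustrated numerically. The sketch below (an illustration, not a formula from this note) takes two unit-variance normal densities, for which the squared Hellinger distance has the known closed form 1 − exp(−(µ₁ − µ₀)²/8), and approximates ρ by a midpoint Riemann sum:

```python
import math

def normal_pdf(mu):
    """Unit-variance normal density with mean mu."""
    return lambda y: math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2 * math.pi)

def sq_hellinger(f, g, lo=-12.0, hi=12.0, n=24000):
    """rho = (1/2) * integral of (sqrt(f) - sqrt(g))^2 dmu, midpoint Riemann sum."""
    h = (hi - lo) / n
    return 0.5 * h * sum(
        (math.sqrt(f(lo + (i + 0.5) * h)) - math.sqrt(g(lo + (i + 0.5) * h))) ** 2
        for i in range(n)
    )

rho = sq_hellinger(normal_pdf(1.0), normal_pdf(0.0))
# closed form for two unit-variance normals: 1 - exp(-(mu1 - mu0)^2 / 8)
assert abs(rho - (1 - math.exp(-1 / 8))) < 1e-5
# rho = 0 if and only if the densities coincide (Lemma 1)
assert sq_hellinger(normal_pdf(0.0), normal_pdf(0.0)) < 1e-12
```

The quadrature grid and the normal family are choices made for this illustration only; any pair of densities dominated by the same measure would serve.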
(All proofs are in Appendix A.) Alternative notions of divergence between densities, other than the squared Hellinger distance, are studied in Section 4. Since ρ(θ) is equal to zero if and only if f_θ and f_{θ_o} are equal, one has the following characterization of identifiability.

Lemma 2. A parameter point θ_o ∈ Θ is identifiable if and only if θ = θ_o is the unique solution in Θ of ρ(θ) = 0.
Moreover, since θ → ρ(θ) is non-negative and reaches a minimum at θ = θ_o, one obtains the following criterion for identifiability based on minimizing the squared Hellinger distance.

Proposition 1. A parameter point θ_o ∈ Θ is identifiable if and only if arg min_{θ∈Θ} ρ(θ) = {θ_o}.
This criterion applies to models where:
• the support of Y depends on the parameter of interest (see Examples 1 and 2 below);
• θ_o is not a regular point of the Fisher matrix (see Example 3 below);
• some elements of the Fisher matrix I(θ_o) are not defined (see Example 5 below);
• θ → I(θ) is not continuous (see Example 6 below);
• Θ is infinite-dimensional, as in semiparametric models (which are out of the scope of this note). 2
The following examples illustrate the use of Proposition 1 and the definitions introduced so far. They will also illustrate, in the next section, the regularity conditions employed by Rothenberg (1971) to obtain a criterion for local identifiability based on the Fisher matrix. In these examples, µ denotes the Lebesgue measure. The Supplementary Materials present step-by-step calculations of the squared Hellinger distance in Examples 1-5.
Example 1 (Uniform Location Model). Set Θ = (0, ∞) and let f_θ(y) = θ 1{0 ≤ y ≤ 1/θ} be the uniform density on [0, 1/θ]. Since the unique solution of ρ(θ) = 0 is θ = θ_o (see Figure 1a), one has arg min_{θ∈Θ} ρ(θ) = θ_o and, by Proposition 1, θ_o is identifiable. The Fisher matrix is I(θ_o) = 0, which is a singular matrix.
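The computation behind Example 1 can be sketched as follows, assuming f_θ is the uniform density θ·1{0 ≤ y ≤ 1/θ} implied by the support given in Section 3. The Hellinger affinity is then √(θ θ_o) · min(1/θ, 1/θ_o), so ρ(θ) = 1 − √(min(θ, θ_o)/max(θ, θ_o)):

```python
import math

def rho_uniform(theta, theta_o):
    # affinity = integral of sqrt(f_theta * f_theta_o) over the overlap [0, min(1/theta, 1/theta_o)]
    affinity = math.sqrt(theta * theta_o) * min(1 / theta, 1 / theta_o)
    return 1 - affinity  # equals 1 - sqrt(min(theta, theta_o) / max(theta, theta_o))

theta_o = 2.0
# rho vanishes only at theta = theta_o and is positive on both sides
assert abs(rho_uniform(theta_o, theta_o)) < 1e-12
assert rho_uniform(1.9, theta_o) > 0 and rho_uniform(2.1, theta_o) > 0
# unique minimizer over a grid of candidate values
grid = [0.5 + 0.01 * i for i in range(400)]
best = min(grid, key=lambda t: rho_uniform(t, theta_o))
assert abs(best - theta_o) < 0.01
```

Note that ρ is minimized at θ_o even though the score of this model is constant on the support, so the Fisher matrix carries no information here.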
Example 2 (First-Price Auction Model). Consider the first-price procurement auction model with m bidders introduced in Paarsch (1992, Section 4.2.2). For bidders with independent private valuations following an exponential distribution with parameter θ, Paarsch (1992, Display 4.18) derives the density of the winning bid Y_i in the i-th auction. Set Y = R_+ and Θ = (0, ∞). The squared Hellinger distance has a unique zero at θ = θ_o, see Figure 1b. Hence, by continuity of θ → ρ(θ), arg min_{θ∈Θ} ρ(θ) = θ_o. The Fisher matrix is I(θ_o) = 0, which is a singular matrix.

Example 3 (Normal Squared Location Model). Set Y = R and Θ = R. Consider the normal squared location model. This model would arise, for example, if Y is the difference between a matched pair of random variables whose control and treatment labels are not observed. The squared Hellinger distance satisfies ρ(θ) = ρ(−θ) and has zero curvature at the origin, which implies that I(0) = 0 is a singular matrix and θ_o = 0 is an irregular point of the Fisher matrix.
Example 4 (Demand-and-Supply Model). Let Y = (P, Q) denote the observed price and quantity of a good transacted in a market in a given period of time. Linear approximations to the demand and supply functions are

Q = α + βP + U (demand),    Q = γ + δP + V (supply),

where α, β, γ, δ are unknown parameters and (U, V) is an unobserved random vector. Assume that U and V are independent and jointly normally distributed with mean zero and unknown variances σ_11 and σ_22, respectively. Set θ = (α, β, γ, δ, σ_11, σ_22). The density of the observed variables then follows from the reduced form of this system. To show that θ_o is not identifiable, by Proposition 1, it suffices to verify that arg min_θ ρ(θ) is not a singleton. We elaborate on this point in the Supplementary Material.

Example 5 (Laplace Location Model). Set Y = R and Θ = R. Consider the Laplace location model f_θ(y) = (1/2) exp(−|y − θ|). The squared Hellinger distance is ρ(θ) = 1 − exp(−|θ − θ_o|/2)(1 + |θ − θ_o|/2). By continuity, θ → ρ(θ) has a unique minimizer at θ = θ_o, and, by Proposition 1, θ_o is identifiable. The Fisher matrix is I(θ) = 1, which is a non-singular matrix.
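The closed form of ρ in Example 5 follows from splitting the affinity integral at the kinks of the two Laplace densities (a computation detailed in the Supplementary Materials; the unit scale used below is an assumption of this sketch). It can be checked against direct quadrature:

```python
import math

def laplace_pdf(theta):
    return lambda y: 0.5 * math.exp(-abs(y - theta))

def rho_numeric(theta, theta_o, lo=-30.0, hi=30.0, n=60000):
    # rho = 1 - affinity, with the affinity computed by a midpoint Riemann sum
    h = (hi - lo) / n
    f, g = laplace_pdf(theta), laplace_pdf(theta_o)
    aff = h * sum(math.sqrt(f(lo + (i + 0.5) * h) * g(lo + (i + 0.5) * h)) for i in range(n))
    return 1 - aff

def rho_closed(theta, theta_o):
    # closed form: rho = 1 - exp(-d/2) * (1 + d/2), with d = |theta - theta_o|
    d = abs(theta - theta_o)
    return 1 - math.exp(-d / 2) * (1 + d / 2)

assert rho_closed(1.0, 1.0) == 0.0
for t in (0.0, 0.5, 1.0, 3.0):
    assert abs(rho_numeric(t, 1.0) - rho_closed(t, 1.0)) < 1e-4
```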
The previous examples also illustrate the difference between identifiable and locally identifiable points in the parameter space.
Example 7 (Normal Squared Location Model, Continued). In this example, any θ_o ∈ Θ is locally identifiable, even the irregular point θ_o = 0 of the Fisher matrix, and only θ_o = 0 is identifiable, see Figure 2.
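The paper's display for the normal squared location model is not reproduced in this extract. As a stand-in with the same qualitative features reported in Examples 3 and 7 (f_θ = f_{−θ} and I(0) = 0), one can take a unit-variance normal density with location θ², for which ρ(θ) = 1 − exp(−(θ² − θ_o²)²/8):

```python
import math

def rho(theta, theta_o):
    # squared Hellinger distance between N(theta**2, 1) and N(theta_o**2, 1)
    # (a stand-in density chosen for illustration; expm1 avoids cancellation)
    return -math.expm1(-((theta ** 2 - theta_o ** 2) ** 2) / 8)

theta_o = 1.5
# rho(theta) = rho(-theta): +theta_o and -theta_o are observationally equivalent,
# so theta_o != 0 is locally, but not globally, identifiable
assert rho(theta_o, theta_o) == 0.0
assert rho(-theta_o, theta_o) == 0.0
assert rho(1.4, theta_o) > 0
# at theta_o = 0 the minimizer of rho is unique, so 0 is identifiable
assert all(rho(t, 0.0) > 0 for t in (-1.0, -0.1, 0.1, 1.0))
```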
We also have the following criterion for local identifiability based on minimizing the squared Hellinger distance. This criterion, unlike the criterion based on the Fisher matrix by Rothenberg (1971), re-stated below as Lemma 3 for the sake of completeness, applies when:
• the support of Y depends on the parameter of interest;
• θ_o is not a regular point of the Fisher matrix;
• some elements of the Fisher matrix I(θ_o) are not defined.

Proposition 2. A parameter point θ_o ∈ Θ is locally identifiable if and only if there exists an open neighborhood Θ′ ⊆ Θ of θ_o such that arg min_{θ∈Θ′} ρ(θ) = {θ_o}.

Proposition 2 reduces local identifiability to the uniqueness of the solution of a well-defined minimization problem. One general criterion, and, as argued in, e.g., Rockafellar and Wets (1998), virtually the only available one, to check in advance for the uniqueness of a minimizer of an optimization problem is the strict convexity of the objective function. The application of this general criterion to the characterization of local identifiability in Proposition 2 yields the following result:

Proposition 3. If θ → ρ(θ) is strictly convex in an open neighborhood Θ′ ⊆ Θ of θ_o, then θ_o is locally identifiable.

Proposition 3 leads to the observation that local identifiability is related to the local convexity of the Hellinger distance. As with our earlier propositions, it holds when the support of Y depends on the parameter of interest, θ_o is not a regular point of the Fisher matrix, some elements of the Fisher matrix I(θ_o) are not defined, or θ → I(θ) is not continuous.

The Fisher Matrix Criterion

Rothenberg (1971) gives a criterion for local identifiability in terms of the non-singularity of the Fisher matrix. Additional insight about the relevance, and the limitations, of the Fisher matrix criterion for local identifiability may then be gained by relating it to the criterion based on minimizing the Hellinger distance. To study this relationship, we now focus on the regular models studied by Rothenberg (1971) and re-state the characterization of local identifiability in Rothenberg (1971, Theorem 1), based on the non-singularity of the Fisher matrix.

Lemma 3. Let the regularity conditions in Assumption R hold and let θ_o be a regular point of the Fisher matrix. Then θ_o is locally identifiable if and only if I(θ_o) is non-singular.

This characterization of local identifiability applies only to the regular models defined by Assumption R and to regular points of the Fisher matrix, which may form a strict subset of the parameter space (see Example 3). These conditions do not themselves have any direct statistical or economic interpretation: their role is simply to permit a characterization of local identifiability. 3 We have already referenced in the introduction a list of models with irregular points of the Fisher matrix, for which the characterization in Lemma 3 does not apply. We now use Examples 1-5 to illustrate the notions of regular and irregular models and their implications for the analysis of identifiability. The richness of the possibilities that follow is a reminder of the care needed when using the Fisher matrix criterion to show local identifiability (or the lack of it). It also highlights the convenience of the identifiability criterion based on minimizing the Hellinger distance as a unifying approach for studying the identifiability of regular or irregular points of the Fisher matrix in either regular or irregular models. Specifically:
• The uniform location model in Example 1 and the first-price auction model in Example 2 have, respectively, supp(f_θ) = [0, 1/θ] and supp(f_θ) = [θ/(m − 1), ∞), which means that these models violate the regularity condition (A3). We have seen that θ_o is identifiable in Examples 1 and 2, which implies that (A3) is not necessary for identifiability. These models also have a singular Fisher matrix, which implies that, in irregular models violating (A3), the non-singularity of the Fisher matrix is not a necessary condition for (local) identifiability.
• One can verify that the normal squared location model in Example 3 and the demand-and-supply model in Example 4 both satisfy the regularity conditions in Assumption R. We have seen that in Example 3 the parameter of interest is locally identifiable while in Example 4 it is not, which means that the regularity conditions in Assumption R are neither sufficient nor necessary for (local) identifiability; they are merely convenient. In Example 3, moreover, θ_o = 0 is not a regular point of the Fisher matrix and is locally identifiable, which implies that, for irregular points of the Fisher matrix, the non-singularity of the Fisher matrix is not a necessary condition for (local) identifiability.
• In Example 5, the function θ → ln(1/2) − |y − θ| is not differentiable when y = θ, which means that the Laplace location model is an irregular model because it violates (A4).
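The irregular point in Example 3 can be seen numerically. Again using the stand-in density N(θ², 1) for the normal squared location model (an assumption of this sketch, since the model's display is elided in this extract), the score is 2θ(Y − θ²), so I(θ) = 4θ²: the score vanishes identically at θ = 0, and the rank of I(θ) drops there, so no neighborhood of 0 has constant rank:

```python
import random

def fisher_info(theta, n=200_000, seed=0):
    # Monte Carlo Fisher information for the stand-in density N(theta**2, 1);
    # the score is d/dtheta ln f_theta(Y) = 2 * theta * (Y - theta**2)
    rng = random.Random(seed)
    scores = [2 * theta * (rng.gauss(theta * theta, 1.0) - theta * theta) for _ in range(n)]
    mean = sum(scores) / n
    return sum((s - mean) ** 2 for s in scores) / n

assert fisher_info(0.0) == 0.0            # score vanishes identically: I(0) = 0
assert abs(fisher_info(1.0) - 4.0) < 0.1  # closed form I(theta) = 4 * theta**2
assert abs(fisher_info(0.5) - 1.0) < 0.1  # rank of I(theta) is not constant near 0
```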
We also have the following result linking the Hellinger distance to the Fisher matrix, which we are going to use to show that, in regular models with irregular points of the Fisher matrix, the non-singularity of the Fisher matrix is only a sufficient condition for local identifiability.
Lemma 4. Let the regularity conditions in Assumption R hold and assume that θ → ξ(θ) := f_θ^{1/2} is continuously differentiable µ-a.e. Then, the Hellinger distance and the Fisher matrix are related by

∇²ρ(θ_o) = (1/4) I(θ_o).

Though this result is known, see, e.g., Borovkov (1998), its implications for local identifiability have so far not been drawn.
Since the Fisher matrix is a variance-covariance matrix, I(θ) is, under (A5), a real symmetric positive semi-definite matrix for every θ ∈ Θ. The following result then follows from Lemma 4 and the characterization of a convex function in terms of its Hessian, see, e.g., Rockafellar and Wets (1998, Theorem 2.14).
Proposition 4. Let the regularity conditions in Assumption R hold and assume that θ → f_θ^{1/2} is continuously differentiable µ-a.e. Then, θ → ρ(θ) is locally convex around θ_o. Furthermore, if I(θ_o) is non-singular, then θ → ρ(θ) is locally strictly convex around θ_o and θ_o is locally identifiable.
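The relation ∇²ρ(θ_o) = I(θ_o)/4 behind Proposition 4 can be checked directly in a model where both sides are known. A minimal sketch for the unit-variance normal location model, where ρ(θ) = 1 − exp(−(θ − θ_o)²/8) and I(θ) = 1:

```python
import math

def rho(theta, theta_o=0.0):
    # squared Hellinger distance between N(theta, 1) and N(theta_o, 1);
    # expm1 avoids cancellation for small arguments
    return -math.expm1(-((theta - theta_o) ** 2) / 8)

h = 1e-4
# finite-difference second derivative (Hessian) of rho at theta_o = 0
curvature = (rho(h) - 2 * rho(0.0) + rho(-h)) / h ** 2
fisher = 1.0  # Fisher information of the unit-variance normal location model
assert abs(curvature - fisher / 4) < 1e-6
```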
Two remarks are in order. First, notice that, unlike in Lemma 3, the result in Proposition 4 also applies when θ_o is not a regular point of the Fisher matrix; in that case, the non-singularity of the Fisher matrix is only sufficient for local identifiability. Second, if I(θ_o) is singular, the function ρ : Θ → [0, 1] is still locally convex (because I(θ_o) is positive semi-definite) and arg min_{θ∈Θ} ρ(θ) is a convex, but not necessarily bounded, set, a result that can be used to delineate the set of observationally equivalent values of θ_o. This note does not pursue this interesting direction. Table 1 summarizes the information in this note about the necessity and sufficiency of the non-singularity of the Fisher matrix for local identifiability. Table 1. For local identifiability, the non-singularity of the Fisher matrix is . . .

                   Regular Points                         Irregular Points
Regular Models     necessary and sufficient (Lemma 3)     only sufficient (Proposition 4 and Example 3)
Irregular Models   not necessary (Examples 1, 2, and 5)

We conclude this section by mentioning that, in response to the misbehavior of the Fisher matrix as a guide to the difficulty of estimating parameters of interest in parametric models, alternative notions of information, other than the Fisher matrix, have been proposed in the literature (see, e.g., Donoho and Liu 1987). Without further elaboration, these alternative notions of information are not directly applicable to the construction of new criteria of identifiability. In particular, the geometric notion of information based on the modulus of continuity of the parameter with respect to the Hellinger distance, introduced by Donoho and Liu (1987) to geometrize convergence rates, cannot be used to construct a criterion for local identifiability because this modulus of continuity, in its current form, is not defined for parameters that are not locally identifiable. 4

The Kullback-Leibler Divergence and Other Divergences
Some of the examples where we have had success in using the Hellinger distance to analyze identifiability share the same structure: the Hellinger distance is a locally convex function, see Figure 2, and so the results from convex optimization become available. If the Hellinger distance proves difficult to analyze, one can set out a criterion for identifiability based on another divergence function, such as the reversed Kullback-Leibler divergence (see, e.g., Bowden 1973)

κ(θ) := ∫ ln( f_{θ_o} / f_θ ) f_{θ_o} dµ.

One can unify the identification criteria based on the Hellinger distance and the reversed Kullback-Leibler divergence by using the family of ϕ-divergences defined as

δ_ϕ(θ) := ∫ ϕ( f_θ / f_{θ_o} ) f_{θ_o} dµ,

where f_θ / f_{θ_o} is the likelihood ratio and ϕ : R → [0, +∞] is a proper closed convex function with ϕ(1) = 0 and such that x → ϕ(x) is strictly convex on a neighborhood of x = 1. The squared Hellinger distance corresponds to the member of this family with ϕ(x) = (1/2)(1 − √x)², whereas the reversed Kullback-Leibler divergence corresponds to ϕ(x) = − ln x + x − 1. The following result is an immediate consequence of the property that δ_ϕ is non-negative and equal to zero if and only if f_θ = f_{θ_o} (see, e.g., Pardo 2005, Proposition 1.1).

Proposition 5. A parameter point θ_o ∈ Θ is locally identifiable if and only if there exists an open neighborhood Θ′ ⊆ Θ of θ_o such that arg min_{θ∈Θ′} δ_ϕ(θ) = {θ_o}.

This result, which is a generalization of Proposition 2, shows that the choice of a ϕ-divergence for analyzing the identifiability of a parameter of interest hinges only on the difficulty of characterizing the set arg min_θ δ_ϕ(θ) for a given ϕ-divergence. The choice of the Hellinger distance over the reversed Kullback-Leibler divergence is, however, not inconsequential when choosing a ϕ-divergence to construct an estimator for the parameter of interest: the use of the Hellinger distance may lead to an estimator that is more robust than the maximum likelihood estimator and equally efficient, see, e.g., Beran (1977) and Jimenez and Shao (2002). 5
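The two members of the ϕ-divergence family can be put side by side numerically. A sketch for unit-variance normal densities, for which the closed forms are ρ = 1 − exp(−(θ − θ_o)²/8) and κ = (θ − θ_o)²/2:

```python
import math

def normal_pdf(mu):
    return lambda y: math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2 * math.pi)

def phi_divergence(phi, f, g, lo=-12.0, hi=12.0, n=24000):
    # delta_phi(theta) = integral of phi(f_theta / f_theta_o) * f_theta_o dmu, midpoint rule
    h = (hi - lo) / n
    return h * sum(
        phi(f(lo + (i + 0.5) * h) / g(lo + (i + 0.5) * h)) * g(lo + (i + 0.5) * h)
        for i in range(n)
    )

phi_hel = lambda x: 0.5 * (1 - math.sqrt(x)) ** 2   # squared Hellinger distance
phi_rkl = lambda x: -math.log(x) + x - 1            # reversed Kullback-Leibler divergence

f, g = normal_pdf(1.0), normal_pdf(0.0)             # f_theta and f_theta_o
assert abs(phi_divergence(phi_hel, f, g) - (1 - math.exp(-1 / 8))) < 1e-5
assert abs(phi_divergence(phi_rkl, f, g) - 0.5) < 1e-4
assert phi_divergence(phi_hel, g, g) < 1e-12        # zero iff f_theta = f_theta_o
```

Both divergences vanish exactly when the two densities coincide, which is all that the identifiability criterion uses; they differ only in how easily the resulting minimization problem can be characterized.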
We conclude this section with the following result showing that, for the regular models analyzed by Rothenberg (1971), the Hellinger distance and the reversed Kullback-Leibler divergence are both locally convex around a minimizer.

Lemma 5. Let the regularity conditions in Assumption R hold and assume that θ → f_θ^{1/2} is continuously differentiable µ-a.e. Assume, furthermore, that, in a neighborhood of θ_o, f_θ and ln f_θ are twice differentiable in θ, with derivatives continuous in y ∈ supp(f_θ). Then, the Hellinger distance and the reversed Kullback-Leibler divergence are related by ∇²ρ(θ_o) = c ∇²κ(θ_o) for c = 1/4.

Acknowledgments: I would like to thank Sami Stouli and Vincent Han for offering constructive suggestions on previous versions of this paper. All errors are mine.

Conflicts of Interest:
The author declares no conflict of interest.

Appendix A. Proofs
Proof of Lemma 1. Write

ρ(θ) = (1/2) ∫ ( f_θ^{1/2} − f_{θ_o}^{1/2} )² dµ = 1 − ∫ ( f_θ f_{θ_o} )^{1/2} dµ.

Hence, ρ(θ) = 0 if and only if f_θ = f_{θ_o} µ-a.e., and ρ(θ) = 1 if and only if f_θ f_{θ_o} = 0 µ-a.e. To show that ρ(θ) does not depend on the choice of the dominating measure µ, let g_θ and g_{θ_o} denote the densities of P_θ and P_{θ_o} relative to another dominating measure ν, and let h and k denote the densities of µ and ν relative to µ + ν. The density of P_θ relative to µ + ν is f_θ h and also g_θ k. Thus, f_θ h = g_θ k and, likewise, f_{θ_o} h = g_{θ_o} k. Hence, ( f_θ f_{θ_o} )^{1/2} h = ( g_θ g_{θ_o} )^{1/2} k and

∫ ( f_θ f_{θ_o} )^{1/2} dµ = ∫ ( f_θ f_{θ_o} )^{1/2} h d(µ + ν) = ∫ ( g_θ g_{θ_o} )^{1/2} k d(µ + ν) = ∫ ( g_θ g_{θ_o} )^{1/2} dν.

Proof of Lemma 2. In the text.
Proof of Lemma 3. We replicate the proof of Rothenberg (1971, Theorem 1). Suppose θ_o is not locally identifiable, so that there is a sequence θ_j → θ_o with θ_j ≠ θ_o and f_{θ_j} = f_{θ_o} µ-a.e. By the mean value theorem, there is θ̄_j between θ_j and θ_o such that

0 = ln f_{θ_j}(y) − ln f_{θ_o}(y) = s(θ̄_j)′(θ_j − θ_o) = s(θ̄_j)′ q_j ‖θ_j − θ_o‖,  where q_j := (θ_j − θ_o)/‖θ_j − θ_o‖.

The sequence {q_j}_j belongs to the unit sphere and therefore has a subsequence convergent to a limit q_o. As θ_j approaches θ_o along this subsequence, q_j approaches q_o, and in the limit q_o′ s(θ_o) = 0 µ-a.e. However, this implies that

q_o′ I(θ_o) q_o = E[ ( q_o′ s(θ_o) )² ] = 0

and, hence, I(θ_o) must be singular.
To show the converse, suppose that I(θ) has constant rank r < K in a neighborhood of θ_o. Consider then the eigenvector v_θ associated with one of the zero eigenvalues of I(θ). Since

v_θ′ I(θ) v_θ = E[ ( v_θ′ s(θ) )² ] = 0,

one has v_θ′ s(θ) = 0 µ-a.e. Since I(θ) is continuous and has constant rank, the function θ → v_θ is continuous in a neighborhood of θ_o. Consider now the curve γ : [0, t̄] → R^K defined by the function θ(t) that solves the differential equation ∂θ(t)/∂t = v_{θ(t)} with θ(0) = θ_o for 0 ≤ t ≤ t̄. The log density is differentiable in t with

∂ ln f_{θ(t)}(y)/∂t = v_{θ(t)}′ s(θ(t)).

However, by the preceding display this is zero for all 0 ≤ t ≤ t̄. Thus θ → ln f_θ is constant on the curve γ and θ_o is not locally identifiable.

1
We use → in 'f : Y → [0, ∞)' to declare the domain (Y) and codomain ([0, ∞)) of the function f, and we use the arrow notation '↦' to define the rule of a function inline. We use ':=' to indicate that an expression is 'defined to be equal to'. This notation is in line with conventional usage.

2
See, e.g., Escanciano (2021) for a systematic approach to identification in semiparametric models.

3
As a referee has pointed out, necessary and sufficient conditions for (local) identification require different assumptions. Some of the conditions in R are not necessary if we only seek sufficient conditions: differentiability of the score function and non-singularity of the Fisher matrix would suffice.

4
A related modulus of continuity has been introduced by Escanciano (2021, Online Supplementary Materials, Lemma 1.3) to provide sufficient conditions for (local) identification in semiparametric models. The analysis of these models is out of the scope of this paper.

5
Appendix B elaborates more on this point by using the variational representation of the Hellinger distance to construct a minimum distance estimator which does not require a non-parametric estimator of the density of the data.