Article

Geometry of Statistical Manifolds

Paul W. Vos
Department of Public Health, Brody School of Medicine, East Carolina University, Greenville, NC 27834, USA
Entropy 2025, 27(11), 1110; https://doi.org/10.3390/e27111110
Submission received: 30 September 2025 / Revised: 21 October 2025 / Accepted: 23 October 2025 / Published: 27 October 2025

Abstract

A statistical manifold M can be defined as a Riemannian manifold each of whose points is a probability distribution on the same support. In fact, statistical manifolds possess a richer geometric structure beyond the Fisher information metric defined on the tangent bundle $TM$. Recognizing that points in M are distributions and not just generic points in a manifold, $TM$ can be extended to a Hilbert bundle $HM$. This extension proves fundamental when we generalize the classical notion of a point estimate—a single point in M—to a function on M that characterizes the relationship between observed data and each distribution in M. The log likelihood and score functions are important examples of generalized estimators. In terms of a parameterization $\theta : M \to \Theta \subseteq \mathbb{R}^k$, $\hat\theta$ is a distribution on $\Theta$ while its generalization $g_{\hat\theta} = \hat\theta - E\hat\theta$ as an estimate is a function over $\Theta$ that indicates inconsistency between the model and data. As an estimator, $g_{\hat\theta}$ is a distribution of functions. Geometric properties of these functions describe statistical properties of $g_{\hat\theta}$. In particular, the expected slopes of $g_{\hat\theta}$ are used to define $\Lambda(g_{\hat\theta})$, the $\Lambda$-information of $g_{\hat\theta}$. The Fisher information I is an upper bound for the $\Lambda$-information: for all g, $\Lambda(g) \le I$. We demonstrate the utility of this geometric perspective using the two-sample problem.

1. Introduction

Statistical manifolds provide a geometric framework for understanding families of probability distributions. While traditionally defined as Riemannian manifolds equipped with the Fisher information metric, their structure extends beyond this basic framework. Lauritzen [1] identified an additional skewness tensor, and Amari [2] also noticed this additional structure which he used to define a family of connections including both the metric connection and a dual pair—the mixture and exponential connections. This duality, first observed by Efron [3], reveals geometric structure beyond the Riemannian setting, though this previous work remained confined to the tangent bundle.
Amari [4] introduced a Hilbert space extension of the tangent bundle which Amari and Kumon [5] applied to estimating functions. Kass and Vos (Section 10.3) [6] also describe statistical Hilbert bundles which Pistone [7] extends to other statistical bundles in the nonparametric setting where extra care is required when the sample space is not finite. Recent developments have expanded the geometric perspective on the role of the Hilbert bundle in parametric inference when the traditional approach to statistical inference is replaced with Fisher’s view of estimation.
Classical statistical inference separates estimation and hypothesis testing into distinct frameworks. Point estimators map from the sample space to the parameter space, with their local properties described through the tangent bundle. Test statistics similarly rely on tangent bundle geometry. The log likelihood and its derivative, the score function, bridge these approaches by providing both estimation methods (maximum likelihood) and testing procedures (likelihood ratio and score tests). Godambe [8] extended the score’s role in estimation through estimating equations, yet the fundamental separation between testing and estimation persisted.
Building on Fisher’s [9] conception of estimation as a continuum of hypothesis tests, Vos [10] unified these approaches by replacing point estimators with generalized estimators—functions on the parameter space that geometrically represent surfaces over the manifold. These generalized estimators shift the inferential focus from individual parameter values to entire functions, whose properties are naturally characterized within the Hilbert bundle framework.
This paper demonstrates the advantages of generalized estimators and the utility of the Hilbert bundle perspective specifically for the two-sample problem. We show how the orthogonalized score achieves information bounds as a consequence of its membership in the tangent bundle, while other generalized estimators, residing only in the larger Hilbert bundle, suffer information loss measured by their angular deviation from the tangent space.

2. Statistical Manifolds

Let $M_{\mathcal{X}}$ be a family of probability measures with common support $\mathcal{X}$. While $\mathcal{X}$ can be an abstract space, for most applications $\mathcal{X} \subseteq \mathbb{R}^d$. Each point in $M_{\mathcal{X}}$ represents a candidate model for a population whose individuals take values in $\mathcal{X}$.
We consider inference based on a sample denoted by y, with corresponding sample space $\mathcal{Y}$. The relationship between $\mathcal{X}$ and $\mathcal{Y}$ depends on three factors: the sampling plan, any conditioning applied, and dimension reduction through sufficient statistics. In the simplest case—a simple random sample of size n without conditioning or dimension reduction—we have $\mathcal{Y} = \mathcal{X}^n$.
Let $M = M_{\mathcal{Y}}$ denote the family of probability measures on $\mathcal{Y}$ induced by $M_{\mathcal{X}}$ through the sampling plan. For the simple random sampling case:
$$M = \left\{ m : m(y) = \prod_{i=1}^{n} m_{\mathcal{X}}(x_i),\ m_{\mathcal{X}} \in M_{\mathcal{X}} \right\}.$$
For any real-valued measurable function h, we define its expected value at $m \in M$ as
$$E_m h = \int_{\mathcal{Y}} h(y)\, m(y)\, d\mu \ \text{ when } \mathcal{Y} \text{ is continuous}, \qquad E_m h = \sum_{y \in \mathcal{Y}} h(y)\, m(y) \ \text{ when } \mathcal{Y} \text{ is discrete}.$$
The Hilbert space associated with M consists of all square-integrable functions:
$$H_M = \left\{ h : E_m h^2 < \infty \ \text{ for all } m \in M \right\}.$$
This space carries a family of inner products indexed by points in M:
$$\langle h, h' \rangle_m = E_m (h h') \quad \text{for all } h, h' \in H_M.$$
When $\langle h, h' \rangle_m = 0$, we say that h and h′ are m-orthogonal and write $h \perp_m h'$.
We construct the Hilbert bundle over M by associating a copy of $H_M$ to each point:
$$HM = M \times H_M.$$
The fiber at m, denoted $H_m$ or $H_m M$, inherits the inner product $\langle \cdot, \cdot \rangle_m$. For inference purposes, we decompose each fiber into the space of constant functions and its orthogonal complement:
$$H_m = H_m^{\perp} \oplus H_m^{0} \quad \text{where} \quad H_m^{\perp} \perp_m H_m^{0}.$$
Here, $H_m^{\perp} = \{ h \in H_m : E_m h = 0 \}$ consists of centered functions, while $H_m^{0}$ contains the constants. Note that $E_m h = \langle h, 1 \rangle_m$ and $H_m^{0}$ is independent of m. As decomposition (1) holds fiberwise, we obtain a global decomposition:
$$HM = H^{\perp}M \oplus H^{0}M \quad \text{where} \quad H^{\perp}M \perp H^{0}M.$$
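For a finite sample space these objects can be computed directly. The following R sketch—our own illustration, assuming a single binomial family with n = 5 so that $H_M$ is simply the set of all functions on $\{0,\ldots,5\}$, with helper names of our choosing—evaluates the inner product $\langle h, h' \rangle_m$ and the decomposition of an arbitrary h into its constant and centered parts.

## One point m in the binomial(5, p) family; H_M = all functions on {0,...,5}
y <- 0:5
m <- dbinom(y, size = 5, prob = 0.3)

E_m   <- function(h) sum(h * m)          # expectation under m
inner <- function(h1, h2) E_m(h1 * h2)   # <h1, h2>_m

h  <- y^2                                # an arbitrary element of H_M
h0 <- rep(E_m(h), length(y))             # projection onto the constants H_m^0
hc <- h - h0                             # centered part, lies in H_m^perp

E_m(hc)                # 0: the centered part has mean zero under m
inner(hc, rep(1, 6))   # 0: <hc, 1>_m = E_m hc, so hc is m-orthogonal to constants
all.equal(h, hc + h0)  # TRUE: h decomposes as centered part plus constant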
The bundle H M extends the tangent bundle T M , which emerges naturally through parameterization. We assume that M admits a global parameterization—while not strictly necessary, this simplifies our exposition by avoiding coordinate charts. We require this parameterization to be a diffeomorphism.
Consider a parameterization $\theta : M \to \Theta \subseteq \mathbb{R}^d$ with inverse $\theta^{-1} : \Theta \to M$. For a specific distribution $m \in M$, we write $\theta = \theta(m)$ for its parameter value. When considering all distributions simultaneously, we write $\theta = \theta(m)$, where context distinguishes between $\theta$ as a point in $\Theta$ (left side) and as a function (right side).
For notational convenience, we denote the distribution corresponding to parameter value $\theta$ as $m_\theta = \theta^{-1}(\theta)$. This allows us to write the following:
$$m_\theta = \theta^{-1}(\theta)$$
where, again, context clarifies whether $m_\theta$ refers to the function $\theta^{-1}$ or its value.
With this parameterization, the Hilbert bundle can be expressed as
$$HM = \Theta \times H_M$$
allowing us to index fibers by parameter values: $H_\theta M = H_{m_\theta} M$.
The log likelihood function plays a fundamental role in our geometric framework. On M, it is the function $\ell_M : \mathcal{Y} \times M \to \mathbb{R}$ defined by $\ell_M(y, m) = \log m(y)$. Through the parameterization, this induces $\ell_\Theta : \mathcal{Y} \times \Theta \to \mathbb{R}$ given by $\ell_\Theta(y, \theta) = \log m_\theta(y)$. When the parameterization is clear from context, we simply write $\ell$ for $\ell_\Theta$.
The partial derivatives of $\ell$ with respect to the parameters,
$$\frac{\partial \ell}{\partial \theta^1}, \frac{\partial \ell}{\partial \theta^2}, \ldots, \frac{\partial \ell}{\partial \theta^d},$$
evaluated at $\theta$, form a basis for the tangent space $T_\theta M$. For all $i = 1, \ldots, d$ and all $m \in M$, $0 < E_m(\partial \ell / \partial \theta^i)^2 < \infty$, ensuring that $TM \subset HM$. In fact, $TM \subset H^{\perp}M$ as $E_m(\partial \ell / \partial \theta^i)$ vanishes on M.

3. Functions on M

The log likelihood function and its derivatives are central to statistical inference. Traditionally, these serve as tools to find point estimates—particularly the maximum likelihood estimate (MLE)—and to characterize the estimator’s properties. We adopt a different perspective: we treat and its derivatives as primary inferential objects rather than mere computational tools. This approach aligns with Fisher’s conception of estimation as a continuum of tests.
As the log likelihood ratio for comparing models with parameters $\theta_1$ and $\theta_2$ is the difference $\log m_{\theta_1}(y) - \log m_{\theta_2}(y)$ and adding an arbitrary constant to each term does not affect this difference, we define the log likelihood so that $\ell \le 0$ for each fixed y. Thus we work with
$$\ell(y, \theta) = \log m_\theta(y) - \sup_{m \in M} \log m(y).$$
As an inferential function, $|\ell(y, \theta)|$ quantifies the dissonance between observation y and distribution $m_\theta$. While the MLE set $\{ \theta : \ell(y, \theta) = 0 \}$ identifies parameters with minimal dissonance, our emphasis shifts to characterizing the full landscape of dissonance across the manifold. While “dissonance” lacks a precise mathematical definition, it can be thought of as the evidence in y against the model at $\theta$—essentially, a test statistic evaluated at y for the null hypothesis specifying $m_\theta$. We use the notation $\hat\theta$ for the MLE when it is unique. When the MLE set is empty we say that the MLE does not exist. Note that $\ell$ is defined even when the MLE does not exist or is not unique; the only requirement is that $\sup_m \log m(y) < \infty$.
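As a small numerical illustration—our own construction, for a single binomial observation—the R sketch below evaluates the normalized log likelihood and confirms that $\ell \le 0$ with equality exactly on the MLE set.

## Normalized log likelihood for a binomial sample: y = 3 successes in n = 10 trials.
## The sup over the family is attained at p = y/n, so l(y, p) = 0 there and l <= 0.
n <- 10; y <- 3
loglik <- function(p) dbinom(y, n, p, log = TRUE)
l <- function(p) loglik(p) - loglik(y / n)   # subtract the sup over the family

p.grid <- seq(0.01, 0.99, by = 0.01)
max(l(p.grid))            # 0, attained at the MLE p = y/n = 0.3
all(l(p.grid) <= 1e-12)   # TRUE: l(y, theta) <= 0 everywhere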
The log likelihood exemplifies a broader class of generalized estimators: functions $G : \mathcal{Y} \times \Theta \to \mathbb{R}$ where, for almost every $y \in \mathcal{Y}$, the function $G(y, \cdot)$ measures dissonance between y and distributions across M. Like $\ell$, we can normalize G so that $G \le 0$.
Consider the geometric interpretation. For a function $f : \Theta \to \mathbb{R}$, let $\Gamma(f) = \{ (\theta, f(\theta)) : \theta \in \Theta \} \subset \Theta \times \mathbb{R}$ denote its graph. The graphs $\Gamma(\ell(y, \cdot))$ and $\Gamma(G(y, \cdot))$ form d-dimensional surfaces over $\Theta \subseteq \mathbb{R}^d$. We compare these surfaces through their gradients:
$$s(y, \theta) = \nabla \ell(y, \theta) \quad \text{and} \quad g(y, \theta) = \nabla G(y, \theta)$$
where $\nabla = (\partial/\partial \theta^1, \ldots, \partial/\partial \theta^d)^t$.
Viewing these as estimators requires replacing the fixed observation y with the random variable Y. Then $\Gamma(\ell(Y, \cdot))$ and $\Gamma(G(Y, \cdot))$ become random surfaces, while $s(Y, \theta)$ and $g(Y, \theta)$ become random gradient fields. The score components $\{ s_i(Y, \theta) \}_{i=1}^{d}$ span the tangent space: $\mathrm{span}\{ s(Y, \theta) \} = T_\theta M$. The key difference between generalized estimation described in this paper and the estimating equations of Godambe and Thompson [11] lies in their inferential approach: the former focuses on the distribution of graph slopes (gradients in a linear space), while the latter examines the distribution of where graphs intersect the horizontal axis (roots of g).
Under mild regularity conditions, the components of $g(Y, \theta)$ span a subspace of $H_\theta M$ of dimension d although generally not $T_\theta M$. Strictly speaking, $\mathrm{span}\{ s(Y, \theta) \}$ is isomorphic rather than equal to $T_\theta M$, as the former consists of vectors attached to the surface $\Gamma(\ell(y, \cdot))$ while the latter are attached to M (equivalently, to $\Theta$). As shown in Vos and Wu [12], this precise relationship between the log likelihood surface and the manifold ensures that score-based estimators attain the information bound.
This perspective fundamentally shifts our focus. Rather than comparing point estimators through their variance or mean squared error on the parameter space Θ , we compare the linear spaces spanned by the components of generalized estimators within the Hilbert bundle H M .
For a point estimator $\check\theta$, define its associated generalized estimator:
$$g_{\check\theta} = \check\theta - E\check\theta.$$
The estimator must have nonzero variance, $0 < V(\check\theta) < \infty$ for all $\theta \in \Theta$, so that $\check\theta \in H_M$. Instead of traditional comparisons between $\hat\theta$ (the MLE) and $\check\theta$, we compare the spaces spanned by $s = s(Y, \theta)$ and $g_{\check\theta} = g_{\check\theta}(Y, \theta)$ through their $\Lambda$-information—a generalization of Fisher information to arbitrary generalized estimators. Geometrically, the relationship between s and $g_{\check\theta}$ is characterized by angles between their component vectors. Statistically, this translates to correlations between the corresponding random variables. The $\Lambda$-information is defined by the left-hand side of Equation (13), which also shows the role of the correlation.
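A minimal R sketch, using the binomial family that reappears in Section 5.2, shows the simplest instance of this construction: for $\hat p = y/n$ the generalized estimator $g_{\hat p} = \hat p - p$ is perfectly correlated with the score under every $m \in M$ (the variable names are ours).

## g = p.hat - E(p.hat) = p.hat - p is proportional to the score dl/dp,
## so both span the same one-dimensional subspace of H_m M.
n <- 10; p <- 0.4; y <- 0:n
m <- dbinom(y, n, p)

g     <- y / n - p                        # p.hat - E p.hat
score <- y / p - (n - y) / (1 - p)        # dl/dp = (y - n p) / (p (1 - p))

w.cov <- function(a, b) sum(m * a * b) - sum(m * a) * sum(m * b)
w.cov(g, score) / sqrt(w.cov(g, g) * w.cov(score, score))   # exactly 1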
Generalized estimators offer particular advantages when nuisance parameters are present. For point estimators, one seeks a parameterization where nuisance and interest parameters are orthogonal—a goal not always achievable. When working in $HM$ rather than $\Theta$, orthogonalization remains important but becomes more flexible: the choice of nuisance parameterization becomes immaterial as orthogonalization occurs within $HM$ itself.
The information bound for the interest parameter is attained by restricting generalized estimators to be orthogonal to the nuisance parameter’s score components. The general framework is developed in Vos and Wu [12]; we illustrate the approach through the special case for comparing two populations in the following section.

4. Comparing Two Populations

We now develop the general framework for comparing two distributions from the same parametric family. The next section applies this framework to the more specific case of 2 × 2 contingency tables.
Let $M_{\mathcal{X}}$ be a one-parameter family of distributions on $\mathcal{X} \subseteq \mathbb{R}$, and let $M_{\mathcal{Y}}$ be the corresponding family of sampling distributions on $\mathcal{Y}$. While we work primarily with sampling distributions in $M_{\mathcal{Y}}$, we use superscripts to distinguish when necessary: $m^{\mathcal{Y}}$ denotes a sampling distribution obtained from population distribution $m^{\mathcal{X}}$.
For simple random sampling outside exponential families, $\mathcal{Y} = \mathcal{X}^n$. Within exponential families, $\mathcal{Y}$ represents the support of the sufficient statistic. For example, when $M_{\mathcal{X}}$ consists of Bernoulli distributions with success probability $p \in (0, 1)$, the family $M_{\mathcal{Y}}$ consists of binomial distributions for n trials with sample space $\mathcal{Y} = \{0, 1, 2, \ldots, n\}$.
Let $\theta_{\mathcal{Y}} : M_{\mathcal{Y}} \to \Theta \subseteq \mathbb{R}$ parameterize $M_{\mathcal{Y}}$. We define the population parameterization $\theta_{\mathcal{X}}$ to ensure consistency: $\theta_{\mathcal{Y}}^{-1} \circ \theta_{\mathcal{X}}(m^{\mathcal{X}}) = m^{\mathcal{Y}}$. Thus, each parameter value simultaneously labels both a population distribution in $M_{\mathcal{X}}$ and its corresponding sampling distribution in $M_{\mathcal{Y}}$. As our focus is on sampling distributions, we simplify notation by dropping the subscript: $\theta = \theta_{\mathcal{Y}}$.
The score function for parameter $\theta$ is
$$\frac{\partial \ell}{\partial \theta} = \alpha Z$$
where we factor the score into its magnitude $\alpha = \sqrt{V(\partial \ell / \partial \theta)}$ and its standardized version Z with $EZ = 0$ and $V(Z) = 1$. Both $\alpha$ and Z depend on $\theta$ and, thus, vary across $M_{\mathcal{Y}}$.
Under reparameterization $\xi$ of $\theta$, the standardized score Z remains invariant while the coefficient transforms as $\alpha\, \partial\theta/\partial\xi$. The coefficient $\alpha$ equals the square root of the total Fisher information: $\alpha = \sqrt{I_\theta} = \sqrt{n\, i_\theta}$ where $i_\theta$ is the Fisher information per observation. For the binomial family, $Z = (Y - np)/\sqrt{np(1-p)}$.
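This factorization can be checked directly on a finite sample space; the R sketch below (variable names ours) does so for a binomial family with $\theta = p$.

## Score factorization dl/dp = alpha * Z for a binomial(n, p) family.
n <- 20; p <- 0.35; y <- 0:n
m <- dbinom(y, n, p)

score <- (y - n * p) / (p * (1 - p))            # dl/dp
alpha <- sqrt(sum(m * score^2))                  # E_m score = 0, so this is its norm
Z     <- (y - n * p) / sqrt(n * p * (1 - p))     # standardized score

c(alpha_squared = alpha^2,                       # total Fisher information ...
  n_times_i     = n / (p * (1 - p)),             # ... equals n * i_p
  E_Z           = sum(m * Z),                    # 0
  V_Z           = sum(m * Z^2))                  # 1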
Now consider independent samples of sizes $n_1$ and $n_2$ from two distributions in $M_{\mathcal{X}}$. The manifold of joint population distributions is
$$M_{\mathcal{X} \times \mathcal{X}} = \left\{ m = m_1 \otimes m_2 : m_1, m_2 \in M_{\mathcal{X}} \right\}$$
with corresponding manifold of joint sampling distributions:
$$M_{\mathcal{Y}_1 \times \mathcal{Y}_2} = \left\{ m = m_1 \otimes m_2 : m_1 \in M_{\mathcal{Y}_1},\ m_2 \in M_{\mathcal{Y}_2} \right\}.$$
The parameterization $\theta_{\mathcal{X}}$ of $M_{\mathcal{X}}$ induces natural parameterizations $\theta_{\mathcal{X} \times \mathcal{X}} = (\theta_{\mathcal{X}}, \theta_{\mathcal{X}})^t$ on $M_{\mathcal{X} \times \mathcal{X}}$ and $\theta_{\mathcal{Y}_1 \times \mathcal{Y}_2} = (\theta_{\mathcal{Y}_1}, \theta_{\mathcal{Y}_2})^t$ on $M_{\mathcal{Y}_1 \times \mathcal{Y}_2}$. These share the same image $\Theta_1 \times \Theta_2$ where $\Theta_1 = \Theta_2 = \theta_{\mathcal{X}}(M_{\mathcal{X}})$. Setting $M = M_{\mathcal{Y}_1 \times \mathcal{Y}_2}$ and $\theta = (\theta_1, \theta_2)^t$, each point in $\Theta = \Theta_1 \times \Theta_2$ labels both a joint sampling distribution and its generating population distribution.
The hypothesis that both samples arise from the same distribution corresponds to the diagonal submanifold:
$$M^{diag}_{\mathcal{X} \times \mathcal{X}} = \left\{ m = m_1 \otimes m_2 : m_1 = m_2 \in M_{\mathcal{X}} \right\}$$
with parameter space:
$$\Theta_{diag} = \left\{ (\theta_1, \theta_2)^t \in \Theta : \theta_1 = \theta_2 \right\}.$$
The joint parameter $\theta = (\theta_1, \theta_2)^t$ yields two score functions:
$$\frac{\partial \ell}{\partial \theta_1} = \alpha_1 Z_1 \quad \text{and} \quad \frac{\partial \ell}{\partial \theta_2} = \alpha_2 Z_2$$
where $Z_1$ and $Z_2$ are orthonormal at each $m \in M$:
$$\langle Z_1, Z_2 \rangle_m = E_m(Z_1 Z_2) = 0.$$
To compare the distributions, we reparameterize using the difference $\delta = \theta_1 - \theta_2$ as our interest parameter and $\tau = \theta_1 + \theta_2$ as the nuisance parameter. The inverse transformation gives $\theta_1 = \tfrac{1}{2}(\tau + \delta)$ and $\theta_2 = \tfrac{1}{2}(\tau - \delta)$, yielding scores:
$$\frac{\partial \ell}{\partial \delta} = \frac{\partial \theta_1}{\partial \delta}\frac{\partial \ell}{\partial \theta_1} + \frac{\partial \theta_2}{\partial \delta}\frac{\partial \ell}{\partial \theta_2} = \tfrac{1}{2}\alpha_1 Z_1 - \tfrac{1}{2}\alpha_2 Z_2$$
$$\frac{\partial \ell}{\partial \tau} = \frac{\partial \theta_1}{\partial \tau}\frac{\partial \ell}{\partial \theta_1} + \frac{\partial \theta_2}{\partial \tau}\frac{\partial \ell}{\partial \theta_2} = \tfrac{1}{2}\alpha_1 Z_1 + \tfrac{1}{2}\alpha_2 Z_2.$$
Let $Z_\nu$ denote the unit vector in the direction of $\partial \ell/\partial \tau$, satisfying $\langle Z_\nu, \partial \ell/\partial \tau \rangle > 0$, $EZ_\nu = 0$, and $\|Z_\nu\| = 1$. As $Z_\nu = (\partial \ell/\partial \tau)/\|\partial \ell/\partial \tau\|$ remains invariant under monotonic reparameterizations of $\tau$, we use the subscript $\nu$ (for nuisance). In terms of the basis $\{ Z_1, Z_2 \}$:
$$Z_\nu = \frac{\alpha_1 Z_1 + \alpha_2 Z_2}{\sqrt{\alpha_1^2 + \alpha_2^2}} = \frac{I_{\theta_1}^{1/2} Z_1 + I_{\theta_2}^{1/2} Z_2}{\sqrt{I_{\theta_1} + I_{\theta_2}}}.$$
Let h be a point estimator or test statistic for $\delta$. The function h is a generalized pre-estimator provided $h - Eh$ is a generalized estimator. For any pre-estimator h of $\delta$, define its orthogonalized version:
$$h^{\perp} = (h - Eh) - \langle h, Z_\nu \rangle Z_\nu = \|h\|\left( Z_h - \rho_{h\nu} Z_\nu \right)$$
where $Z_h = (h - Eh)/\|h\|$ is the standardized direction, $\|h\|$ denotes the norm of $h - Eh$, and $\rho_{h\nu} = \langle Z_h, Z_\nu \rangle$ is the correlation with the nuisance direction.
To ensure that inference is independent of the nuisance parameter, we work with orthogonalized generalized estimators $g = h^{\perp}$:
$$g = \|h\|(Z_h - \rho_{h\nu} Z_\nu) = \|g\|\, Z_g = \sqrt{1 - \rho_{h\nu}^2}\,\|h\|\, Z_g.$$
When h is the score for $\delta$, the orthogonalized score becomes
$$s^{\perp} = \sqrt{I_\delta^{\perp}}\, Z_s = \sqrt{(1 - \rho_{s\nu}^2)\, I_\delta}\, Z_s$$
where $I_\delta^{\perp}$ is the information after orthogonalization. The proportion of information loss due to the nuisance parameter is the square of the correlation between the interest and nuisance parameters
$$\rho_{s\nu}^2 = \frac{I_\delta - I_\delta^{\perp}}{I_\delta}.$$
This loss cannot be recovered by reparameterization. Geometrically, $\rho_{s\nu}$ is the cosine of the angle between the standardized score for $\delta$ and $Z_\nu$, so the proportional information loss equals the squared cosine of the angle between the score and the tangent space of $M_\delta = \delta^{-1}(\delta)$. The submanifold $M_\delta$ depends on the choice of interest parameter and is integral to the inference problem.
The orthogonalized Fisher information $I_\delta^{\perp}$ is additive on the reciprocal scale:
$$\left(I_\delta^{\perp}\right)^{-1} = I_{\theta_1}^{-1} + I_{\theta_2}^{-1}.$$
Equation (8) is established as follows. The orthogonalized score is a linear combination of the orthonormal basis vectors $Z_1$ and $Z_2$:
$$s^{\perp} = \frac{\partial \ell}{\partial \delta} - \frac{\langle \partial \ell/\partial \delta,\ \partial \ell/\partial \tau \rangle}{\langle \partial \ell/\partial \tau,\ \partial \ell/\partial \tau \rangle}\frac{\partial \ell}{\partial \tau} = \left( \tfrac{1}{2}\alpha_1 Z_1 - \tfrac{1}{2}\alpha_2 Z_2 \right) - \frac{\alpha_1^2 - \alpha_2^2}{\alpha_1^2 + \alpha_2^2}\left( \tfrac{1}{2}\alpha_1 Z_1 + \tfrac{1}{2}\alpha_2 Z_2 \right) = \frac{\alpha_1 \alpha_2}{\alpha_1^2 + \alpha_2^2}\left( \alpha_2 Z_1 - \alpha_1 Z_2 \right) = \frac{\alpha_1 \alpha_2}{\sqrt{\alpha_1^2 + \alpha_2^2}}\, Z_s.$$
As $\|s^{\perp}\|^2 = I_\delta^{\perp}$ and $\alpha_i^2 = I_{\theta_i}$:
$$I_\delta^{\perp} = \frac{\alpha_1^2 \alpha_2^2}{\alpha_1^2 + \alpha_2^2} = \frac{I_{\theta_1} I_{\theta_2}}{I_{\theta_1} + I_{\theta_2}}$$
and taking the reciprocal of both sides of (10) gives (8). Substituting into (7) with $I_\delta = \tfrac{1}{4}(I_{\theta_1} + I_{\theta_2})$ shows
$$\rho_{s\nu}^2 = \left( \frac{I_{\theta_1} - I_{\theta_2}}{I_{\theta_1} + I_{\theta_2}} \right)^2$$
which means that the information loss due to the nuisance parameter is proportional to the squared difference in the Fisher information for the distributions being compared. Using Equation (9), the orthogonalized score in terms of the basis vectors $Z_1$ and $Z_2$ is
$$s^{\perp} = I_\delta^{\perp}\left( \frac{Z_1}{\sqrt{I_{\theta_1}}} - \frac{Z_2}{\sqrt{I_{\theta_2}}} \right).$$
The basis $\{ Z_s, Z_\nu \}$ is obtained from $\{ Z_1, Z_2 \}$ using the linear transformation
$$\begin{pmatrix} Z_s \\ Z_\nu \end{pmatrix} = \frac{1}{\sqrt{\alpha_1^2 + \alpha_2^2}} \begin{pmatrix} \alpha_2 & -\alpha_1 \\ \alpha_1 & \alpha_2 \end{pmatrix} \begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix}$$
which is a rotation through an angle of $\cos^{-1}\sqrt{I_{\theta_2}/(I_{\theta_1} + I_{\theta_2})}$. When $\theta$ is a location parameter, $I_{\theta_1}$ and $I_{\theta_2}$ are constant on M. With equal sample sizes ($n_1 = n_2$), the rotation angle is $\pi/4$ and $Z_s \propto (Z_1 - Z_2)$.
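The identities (7) and (8), and the rotation above, are easy to verify numerically. The following R sketch does so for two binomial samples in the log odds parameterization, where $I_{\theta_i} = n_i p_i(1-p_i)$; the variable names are ours.

n1 <- 23; n2 <- 19; p1 <- 0.7; p2 <- 0.6            # unequal n: unequal information
I1 <- n1 * p1 * (1 - p1); I2 <- n2 * p2 * (1 - p2)  # I_theta_i for theta = log odds

I.delta      <- (I1 + I2) / 4                 # information in dl/ddelta
I.delta.perp <- 1 / (1 / I1 + 1 / I2)         # reciprocal additivity, Equation (8)
rho.sq       <- ((I1 - I2) / (I1 + I2))^2     # squared correlation with the nuisance

rho.sq - (I.delta - I.delta.perp) / I.delta   # 0: consistent with Equation (7)

a1 <- sqrt(I1); a2 <- sqrt(I2)
R  <- rbind(c(a2, -a1), c(a1, a2)) / sqrt(I1 + I2)  # takes (Z1, Z2) to (Z_s, Z_nu)
crossprod(R)                                  # identity matrix: R is a rotation
acos(a2 / sqrt(I1 + I2))                      # the rotation angle given in the text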
While $Z_s \in TM$, for general estimators g we have $Z_g \in HM$ but $Z_g \notin TM$ unless $g = s^{\perp}$. This distinction explains why general estimators fail to achieve the information bound. The $\Lambda$-information of g is $\Lambda(g) = \rho_{gs}^2\, I_\delta^{\perp}$, where $\rho_{gs}$ is the correlation between $Z_g$ and $Z_s$.
The null hypothesis $H_0 : \delta = 0$ deserves special attention. While a hypothesis specifying a general value of $\delta$ depends on the parameterization choice, $H_0 : \delta = 0$ is parameterization-invariant as it is equivalent to $H_0 : \theta_1 = \theta_2$. Under simple random sampling with $I_\theta = n\, i_\theta$, the standardized orthogonalized score on $M_0 = \delta^{-1}(0)$ becomes
$$Z_s = \frac{n_2^{1/2} Z_1 - n_1^{1/2} Z_2}{\sqrt{n_1 + n_2}}$$
which is invariant across all parameterizations $\theta$. This invariance does not hold for test statistics based on point estimators like $\hat\theta_1 - \hat\theta_2$, whose form depends on whether we parameterize using proportions, log-proportions, or log-odds.

5. Comparing Two Bernoulli Distributions

We now specialize the general framework to comparing two Bernoulli distributions, establishing the geometric structure that underlies inference for 2 × 2 contingency tables.
For the Bernoulli sample space $\mathcal{X} = \{0, 1\}$, the manifold of population distributions is
$$M_{\mathcal{X}} = \{ m : 0 < m(1) < 1,\ m(0) + m(1) = 1 \}$$
with natural parameterization $p_{\mathcal{X}}(m) = m(1)$. For a sample of size n, the sufficient statistic has support $\mathcal{Y} = \{0, 1, \ldots, n\}$, yielding the manifold of binomial sampling distributions:
$$M_{\mathcal{Y}} = \left\{ m : m(y) = \binom{n}{y} p^y (1 - p)^{n - y},\ 0 < p < 1 \right\}.$$
A natural bijection exists between $M_{\mathcal{X}}$ and $M_{\mathcal{Y}}$: each population distribution determines a unique sampling distribution. We define $p_{\mathcal{Y}}$ to make this bijection $p_{\mathcal{Y}}^{-1} \circ p_{\mathcal{X}}$. Similarly, for any alternative parameterization $\theta_{\mathcal{X}}$ (such as $\theta_{\mathcal{X}}(m) = \log(m(1)/m(0))$), we define $\theta_{\mathcal{Y}}$ so that the bijection equals $\theta_{\mathcal{Y}}^{-1} \circ \theta_{\mathcal{X}}$.
For independent samples of sizes $n_1$ and $n_2$, the joint manifolds are
$$M_{\mathcal{X} \times \mathcal{X}} = \{ m = m_1 \otimes m_2 : m_1, m_2 \in M_{\mathcal{X}} \}, \qquad M = M_{\mathcal{Y}_1 \times \mathcal{Y}_2} = \{ m = m_1 \otimes m_2 : m_1 \in M_{\mathcal{Y}_1},\ m_2 \in M_{\mathcal{Y}_2} \}.$$
Using the proportion parameterization $p = (p_1, p_2)^t \in (0, 1)^2$, the sampling distribution at p is
$$m(y_1, y_2) = \binom{n_1}{y_1}\binom{n_2}{y_2}\, p_1^{y_1}(1 - p_1)^{n_1 - y_1}\, p_2^{y_2}(1 - p_2)^{n_2 - y_2}$$
for $(y_1, y_2)^t \in \mathcal{Y}_1 \times \mathcal{Y}_2$, with corresponding population distribution:
$$m_{\mathcal{X} \times \mathcal{X}}(x_1, x_2) = p_1^{x_1}(1 - p_1)^{1 - x_1}\, p_2^{x_2}(1 - p_2)^{1 - x_2}$$
for $(x_1, x_2)^t \in \{0, 1\}^2$.
The Hilbert space for this manifold consists of all real-valued functions on the finite sample space:
$$H_M = \left\{ h : \mathcal{Y}_1 \times \mathcal{Y}_2 \to \mathbb{R} : \sum_{y_1, y_2} h^2(y_1, y_2)\, m(y_1, y_2) < \infty \ \text{ for all } m \in M \right\}.$$
As the support is finite, $H_M$ includes all finite-valued functions. The tangent space at m is the two-dimensional subspace:
$$T_m M = \mathrm{span}\{ Y_1 - n_1 p_1,\ Y_2 - n_2 p_2 \}$$
where $p_1 = p_1(m)$ and $p_2 = p_2(m)$.
Table 1 summarizes the Fisher information per observation for three common parameterizations of the Bernoulli distribution, each offering different advantages for inference.
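For reference, the entries of Table 1 can be written as one-line R functions and checked against the reparameterization rule $i_\theta = i_p\,(dp/d\theta)^2$ from Section 4; the function names below are ours, and the third entry uses the value $p/(1-p)$ for $\theta = \log p$ implied by that rule.

i.logodds <- function(p) p * (1 - p)         # theta = log(p / (1 - p))
i.prop    <- function(p) 1 / (p * (1 - p))   # theta = p
i.logp    <- function(p) p / (1 - p)         # theta = log p
# check the rule i_theta = i_p * (dp/dtheta)^2; for the log odds, dp/dtheta = p(1-p)
p <- 0.3
all.equal(i.logodds(p), i.prop(p) * (p * (1 - p))^2)   # TRUE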
We illustrate our geometric framework using data from Mendenhall et al. [13], who conducted a retrospective analysis of laryngeal carcinoma treatment. Disease was controlled in 18 of 23 patients treated with surgery alone and 11 of 19 patients treated with irradiation alone ($y_1 = 18$, $n_1 = 23$, $y_2 = 11$, $n_2 = 19$). We use these data to compare the orthogonalized score $s^{\perp}$ with other generalized estimators when the interest parameter is the log odds ratio $\delta = \log(p_1/(1 - p_1)) - \log(p_2/(1 - p_2))$.

5.1. Orthogonalized Score

The score has two key properties: at each point in the sample space, $\partial \ell/\partial \delta$ is a smooth function on the parameter space, and at each point in the manifold, $\partial \ell/\partial \delta$ is a distribution on the sample space. Formally, for y fixed, $\partial \ell/\partial \delta = \partial \ell/\partial \delta(y, \cdot) \in C^1(\Delta)$, and for $\delta$ fixed, $\partial \ell/\partial \delta = \partial \ell/\partial \delta(\cdot, \delta) \in H_m M$ when there is no nuisance parameter. As $\delta$ is the interest parameter, we use the notation s for the score $\partial \ell/\partial \delta$. These properties persist after orthogonalization and standardization to obtain $s^{\perp}$ and $Z_s$.
Figure 1 illustrates these properties for $Z_s$ using the cancer data. The black curve shows $Z_s$ evaluated at the observed sample ($y_1 = 18$, $y_2 = 11$) as a function of $\delta$, with the nuisance parameter fixed at $\xi = 29$. Each of the 480 points in the sample space $\mathcal{Y}_1 \times \mathcal{Y}_2$ generates such a curve; two additional examples appear in gray. We distinguish the family of curves $Z_s$ (uppercase) from the specific observed curve $z_s$ (lowercase).
For any fixed value of $\delta$, the vertical line at that value intersects all 480 curves, yielding a distribution of $Z_s$ values. Together with the probability mass function $m_{\delta, \xi}$, this defines the sampling distribution of $Z_s$ at that $\delta$ when $\xi = 29$. Crucially, every such vertical distribution has mean zero and variance one, reflecting the standardization of the score.
The intersection of horizontal lines with $z_s$ provides confidence intervals through inversion. The lines $z = \pm 2$ intersect the observed curve at points $(\delta_{lo}, 2)$ and $(\delta_{hi}, -2)$, partitioning the parameter space into three regions:
  • For $\delta \le \delta_{lo}$: the observed $s^{\perp}$ exceeds 2 standard deviations above its expectation.
  • For $\delta_{lo} < \delta < \delta_{hi}$: the observed $s^{\perp}$ lies within 2 standard deviations.
  • For $\delta \ge \delta_{hi}$: the observed $s^{\perp}$ falls below −2 standard deviations.
The interval ( δ l o , δ h i ) forms an approximate 95% confidence interval for δ . The approximation quality depends on the normality of the vertical distributions, while the interval width depends on the slope of z s —steeper slopes yield narrower intervals.
These calculations are conditional on $\xi = 29$. Different nuisance parameter values yield different intervals, motivating our choice of the orthogonal parameterization $(\delta, \xi)$ where $\xi = n_1 p_1 + n_2 p_2$. With this choice, the one-dimensional submanifolds $M_\xi = \xi^{-1}(\xi)$ and $M_\delta = \delta^{-1}(\delta)$ intersect transversally, and their tangent spaces are orthogonal at the intersection point.
Varying the horizontal line height provides confidence intervals at different levels. For all $z \neq 0$, these lines intersect each of the 480 curves, ensuring that confidence intervals exist for every sample point. The intersection of all confidence levels can be interpreted as a point estimate for $\delta$. For sample points other than $(0, 0)$ and $(n_1, n_2)$, this intersection equals the MLE—the point where $z_s$ crosses zero. At the boundary points $(0, 0)$ and $(n_1, n_2)$, the curves never cross zero, yielding an empty intersection that corresponds to the nonexistence of the MLE.
The 2-standard deviation confidence interval ( z = ± 2 ) for the log odds ratio δ is (−0.35, 2.27). The exact 95% confidence interval is (−0.40, 2.40) for nuisance parameter ξ equal to 29. This interval is a function of ξ . To obtain an interval that is the same for all values of the nuisance parameter, we take the union of intervals as ξ takes all values to obtain (−0.46, 2.42). The exact 95% confidence interval from Fisher’s exact test is (−0.57, 2.55).
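A compact R sketch of the inversion just described is given below. It reflects our reading of Section 5.1—the nuisance value is held at $\xi = n_1 p_1 + n_2 p_2 = 29$, the orthogonalized score is computed in the log odds parameterization using the expression for $s^{\perp}$ in terms of $Z_1$ and $Z_2$ derived in Section 4, and the crossings of $z = \pm 2$ are found numerically. All function names are ours, and the endpoints should only approximately reproduce the interval reported above.

y1 <- 18; n1 <- 23; y2 <- 11; n2 <- 19; xi <- 29

## (p1, p2) determined by (delta, xi): logit p1 - logit p2 = delta, n1 p1 + n2 p2 = xi
probs <- function(delta) {
  f  <- function(p2) n1 * plogis(qlogis(p2) + delta) + n2 * p2 - xi
  p2 <- uniroot(f, c(1e-9, 1 - 1e-9))$root
  c(p1 = plogis(qlogis(p2) + delta), p2 = p2)
}

## standardized orthogonalized score at the observed (y1, y2), log odds scale:
## s_i = y_i - n_i p_i,  I_i = n_i p_i (1 - p_i),  z_s = sqrt(I.perp) (s1/I1 - s2/I2)
z_s <- function(delta) {
  p  <- probs(delta)
  I1 <- n1 * p[1] * (1 - p[1]); I2 <- n2 * p[2] * (1 - p[2])
  Iperp <- 1 / (1 / I1 + 1 / I2)
  sqrt(Iperp) * ((y1 - n1 * p[1]) / I1 - (y2 - n2 * p[2]) / I2)
}

## invert z_s = 2 and z_s = -2 (z_s is decreasing in delta)
lo <- uniroot(function(d) z_s(d) - 2, c(-3, 1))$root
hi <- uniroot(function(d) z_s(d) + 2, c(0, 4))$root
round(c(lo, hi), 2)   # approximately (-0.35, 2.27), as reported in the text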

5.2. Other Generalized Estimators

Point estimators naturally induce generalized estimators, though the relationship depends on the parameterization. For a parameterization $\theta$ and point estimator $\hat\theta$, if $\hat\theta \in H_M$, then the generalized estimator is $g_{\hat\theta} = \hat\theta - E\hat\theta$ when no nuisance parameters exist, or $g_{\hat\theta} = (\hat\theta - E\hat\theta)^{\perp}$ with nuisance parameters present.
Consider the binomial family with proportion parameter p. The MLE $\hat p = y/n$ yields $g_{\hat p} = \hat p - p$, which is proportional to the score. However, for the log odds parameterization $\eta = \log(p/(1 - p))$, the MLE $\hat\eta = \log(\hat p/(1 - \hat p))$ satisfies $\hat\eta(0) = -\infty$ and $\hat\eta(n) = +\infty$, so $\hat\eta \notin H_M$. No generalized estimator exists for the unmodified log odds MLE.
A standard remedy adds a small constant $\epsilon$ to each cell, yielding the modified MLE:
$$\tilde\eta(y) = \log\frac{y + \epsilon}{n - y + \epsilon} \in H_M.$$
This modification ensures finite values throughout the sample space, enabling construction of the corresponding generalized estimator.
While the proportion MLE $\hat p$ could similarly be modified, this is rarely performed despite the MLE's failure at the boundaries. The MLE's parameter invariance allows its definition without reference to any specific parameterization: for $y \notin \{0, n\}$,
$$\hat m_y = \arg\max_{m \in M} m(y).$$
This coordinate-free definition emphasizes the MLE’s geometric nature but obscures its boundary behavior.
For comparing two populations using log odds, the modified MLE yields the difference estimator $\hat\delta = \tilde\eta_1 - \tilde\eta_2$, with orthogonalized generalized estimator:
$$g = g_{\hat\delta} = (\hat\delta - E\hat\delta)^{\perp}.$$
Like the orthogonalized score, g exhibits smoothness in parameters and distributional properties in the sample space: $g = g(y, \cdot) \in C^1(\Delta)$ for fixed y, and $g(\cdot, \delta) \in H_m M$ for fixed $\delta$. Both are orthogonal to the nuisance space. The critical distinction lies in their geometric location: while $s^{\perp} \in TM$, generally $g \notin TM$ unless $g = s^{\perp}$.
Figure 2 illustrates this distinction for the cancer data. The black curve shows z g for the observed sample with ϵ 1 = ϵ 2 = 0.5 (adding 0.5 to each cell) and nuisance parameter ξ = 29 . Each of the 480 sample points generates a smooth curve, with two shown in gray. Vertical lines at any δ intersect these curves to yield distributions with mean zero and unit variance.
As with $z_s$, horizontal lines at $z = \pm z_0$ determine confidence intervals through their intersections with $z_g$. Steeper slopes produce narrower intervals, making the expected slope a natural efficiency measure. Differentiating the identity $EZ_g = 0$ with respect to $\delta$ yields
$$E\frac{\partial Z_g}{\partial \delta} + E(Z_g Z_s)\sqrt{I_\delta^{\perp}} = 0.$$
Rearranging gives the fundamental inequality
$$\left( E\frac{\partial Z_g}{\partial \delta} \right)^2 = \rho_{gs}^2\, I_\delta^{\perp} \le I_\delta^{\perp}$$
where $\rho_{gs}$ is the correlation between g and $s^{\perp}$. Vos and Wu [12] define the left-hand side of (13) as $\Lambda_\delta(g)$, the $\Lambda$-information in g for parameter $\delta$. The bound is attained only when $g = s^{\perp}$, establishing the optimality of the orthogonalized score:
$$\Lambda_\delta(g) = \rho_{gs}^2\, I_\delta^{\perp}.$$
The square of the correlation is the same for any reparameterization of $\delta$, so we can define the $\Lambda$-efficiency of g as
$$\mathrm{Eff}_\Lambda(g) = \rho_{gs}^2.$$
$\Lambda$-efficiency is independent of the choice of interest or nuisance parameter. For example, $\Lambda$-efficiency will be the same whether we use the odds ratio or the log odds ratio. $\Lambda$-information, like Fisher information $I_\delta$, is a tensor.
The geometric interpretation is revealing: $\rho_{gs}$ measures the cosine of the angle between g and $s^{\perp}$ in $H_m M$. The information loss $(1 - \rho_{gs}^2)\, I_\delta^{\perp}$ equals the squared sine of this angle times the total information. Estimators achieve full $\Lambda$-efficiency only when perfectly aligned with the orthogonalized score.
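To make the comparison concrete, the R sketch below (our own construction, with names of our choosing) enumerates the 480-point sample space at one $(p_1, p_2)$, builds the centered and nuisance-orthogonalized version of the modified log odds difference with $\epsilon = 0.5$ as in the modified MLE above, and computes its $\Lambda$-efficiency $\rho_{gs}^2$ relative to the orthogonalized score.

n1 <- 23; n2 <- 19; p1 <- 0.7; p2 <- 0.6
grid <- expand.grid(y1 = 0:n1, y2 = 0:n2)                 # the 480 sample points
w <- dbinom(grid$y1, n1, p1) * dbinom(grid$y2, n2, p2)    # m(y1, y2)

Emean <- function(h) sum(w * h)
Ecov  <- function(a, b) Emean(a * b) - Emean(a) * Emean(b)

eta   <- function(y, n) log((y + 0.5) / (n - y + 0.5))    # modified log odds
d.hat <- eta(grid$y1, n1) - eta(grid$y2, n2)              # modified MLE difference

I1 <- n1 * p1 * (1 - p1); I2 <- n2 * p2 * (1 - p2)        # log odds information
s1 <- grid$y1 - n1 * p1;  s2 <- grid$y2 - n2 * p2         # score components
s.perp <- (1 / (1 / I1 + 1 / I2)) * (s1 / I1 - s2 / I2)   # orthogonalized score
Z.nu   <- (s1 + s2) / sqrt(I1 + I2)                       # unit nuisance direction

g      <- d.hat - Emean(d.hat)                            # center
g.perp <- g - Ecov(g, Z.nu) * Z.nu                        # project out Z_nu

Ecov(g.perp, s.perp)^2 / (Ecov(g.perp, g.perp) * Ecov(s.perp, s.perp))
# Lambda-efficiency rho_gs^2: close to, but no larger than, 1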
A crucial distinction emerges when testing $H_0 : \delta = 0$. Under this null hypothesis with simple random sampling, the standardized orthogonalized score becomes
$$Z_s = \frac{n_2^{1/2} Z_1 - n_1^{1/2} Z_2}{\sqrt{n_1 + n_2}}$$
which remains invariant across all parameterizations $\theta$. This invariance reflects the geometric fact that $H_0 : \delta = 0$ is equivalent to $H_0 : \theta_1 = \theta_2$ regardless of the choice of $\theta$.
In contrast, test statistics based on point estimators like θ ^ 1 θ ^ 2 depend critically on the parameterization. Tests based on proportions, log proportions, and log odds yield different statistics with different null distributions, even though they test the same hypothesis. The orthogonalized score provides a canonical, parameterization-invariant test that achieves maximum power against local alternatives.
The 2-standard deviation confidence interval ( z = ± 2 ) for the log odds ratio δ is (−1.09, 2.95). The exact 95% confidence interval is (−0.43, 2.39) for nuisance parameter ξ equal to 29. The union of intervals over values for ξ is at least (−0.68, 2.47).

5.3. Discussion

Table 2 presents confidence intervals for the log odds difference δ computed using various methods. These intervals reveal substantial variation in both width and location, highlighting the importance of understanding the underlying geometric principles.
The orthogonalized score interval, whether computed at ξ = 29 or maximized over all nuisance parameter values, falls within both the modified MLE and Fisher’s exact test intervals for this particular dataset. However, this nesting relationship is sample-specific and should not guide method selection. The choice among methods should depend on their theoretical properties rather than their behavior for any particular observed data.
The orthogonalized score offers three key advantages:
  • It attains the Fisher information bound, achieving maximum Λ -efficiency.
  • It requires no ad hoc modifications to handle boundary cases (unlike the MLE for log odds).
  • It provides parameterization-invariant inference for H 0 : δ = 0 , yielding identical test statistics whether we parameterize using proportions, log proportions, or log odds.
The R (version 4.4.1) package exact2x2 (version 1.6.8) [14,15] implements several additional unconditional methods, each corresponding to different generalized estimators. While this diversity offers flexibility, it also highlights the need for principled comparison methods.
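The conditional row of Table 2 can be reproduced with base R alone; the sketch below does so with fisher.test, whose interval is reported on the odds ratio scale and is therefore log-transformed here. The unconditional rows come from uncondExact2x2 in exact2x2, whose arguments we do not reproduce here (see that package's documentation).

tab <- matrix(c(18, 5, 11, 8), nrow = 2, byrow = TRUE,
              dimnames = list(treatment = c("surgery", "irradiation"),
                              outcome   = c("controlled", "not controlled")))
ft <- fisher.test(tab)     # conditional exact inference for the odds ratio
log(ft$conf.int)           # approximately (-0.57, 2.55), as in Table 2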
The geometric framework of generalized estimation provides this principled approach. By working in the Hilbert bundle, we obtain
  • Unified treatment: Point estimators and test statistics become special cases of generalized estimators.
  • Parameter invariance: Generalized estimators transform properly under reparameterization.
  • Linear structure: The Hilbert bundle provides a natural vector space framework for combining and comparing estimators.
  • Consistent comparison: Λ -information offers a single efficiency measure, replacing the multiple criteria (bias, variance, MSE) used for point estimators.
This geometric perspective reveals why the orthogonalized score achieves optimality: it lies in the tangent bundle TM, while other generalized estimators reside only in the larger Hilbert bundle HM. The information loss of any estimator equals $I_\delta^{\perp}$ times the squared sine of its angle from the tangent space—a geometric characterization that unifies and clarifies classical efficiency results.

6. Conclusions

This paper has demonstrated how the Hilbert bundle structure of statistical manifolds provides a unified geometric framework for statistical inference. By recognizing that points in a statistical manifold are probability distributions rather than abstract points, we extend the traditional tangent bundle framework to encompass a richer geometric structure that naturally accommodates both estimation and hypothesis testing.
The central insight is that generalized estimators—functions on the parameter space—serve as the fundamental inferential objects. The $\Lambda$-information of a generalized estimator g captures both its smooth structure across models in M and its distributional properties at each point. These dual aspects require different geometric descriptions: the smooth structure manifests through the graph of $Z_g$ in the $(\delta, z)$ plane, while the distributional properties are naturally characterized within the Hilbert bundle HM.
The information bound emerges as a geometric principle: the squared mean slope of $Z_g$ equals $\Lambda(g)$, and it is maximized precisely when g lies in the tangent bundle TM. Statistically, the bound is attained when $g = s^{\perp}$, the orthogonalized score. For any other generalized estimator, the information loss equals $(1 - \rho_{gs}^2)\, I_\delta^{\perp}$, where $\rho_{gs}$ measures the correlation between g and $s^{\perp}$ as elements of HM. This correlation has a direct geometric interpretation: it equals the cosine of the angle between these functions in the Hilbert space.
The presence of nuisance parameters introduces an additional layer of geometric structure. Information loss due to nuisance parameters equals $\rho_{s\nu}^2 I_\delta$, where $\rho_{s\nu}$ is the correlation between the score s and the nuisance direction $Z_\nu$. Crucially, this correlation—and hence the information loss—remains invariant under reparameterization of either interest or nuisance parameters. This invariance reflects a fundamental geometric fact: specifying a value $\delta$ for the interest parameter defines a submanifold $M_\delta = \delta^{-1}(\delta)$ rather than a single point in M. The increased inferential difficulty is precisely quantified by $\rho_{s\nu}^2$, the squared correlation between the score and the tangent space of $M_\delta$.
Our analysis of 2 × 2 contingency tables illustrates these principles concretely. The orthogonalized score achieves three key advantages over traditional approaches: it attains the information bound, requires no ad hoc modifications for boundary cases, and provides parameterization-invariant inference. The geometric framework explains why different confidence interval methods yield different results—they correspond to different generalized estimators with varying degrees of alignment with the tangent bundle.
This geometric perspective resolves longstanding tensions between estimation and testing frameworks. Rather than treating these as separate endeavors united only by computational tools like the likelihood function, we see them as complementary aspects of a single geometric structure. Point estimators, test statistics, and estimating equations all become special cases of generalized estimators, whose efficiency is uniformly measured by their Λ -information.
The Hilbert bundle framework thus provides both conceptual clarity and practical benefits. It reveals why certain statistical procedures are optimal, quantifies the cost of using suboptimal methods, and suggests principled ways to construct new inferential procedures. By shifting focus from points in parameter space to functions on the manifold, we gain a richer, more complete understanding of statistical evidence and its geometric foundations.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No data created.

Acknowledgments

The author thanks all reviewers for their valuable comments. Special acknowledgment goes to the reviewer who identified key references for the Hilbert bundle discussed in the introduction and emphasized the contribution of Godambe and Thompson [11] in Section 3, thereby strengthening the exposition of the distinction between generalized estimation and estimating equations. During the preparation of this manuscript, the author used Claude Opus 4.1 for the purposes of (1) clarifying and improving presentation of the penultimate draft, (2) summarizing the paper in the conclusion section with input from the author, and (3) writing the original R code (ggplot2) for the figures later modified by the author. The author has reviewed and edited the output and takes full responsibility for the content of this publication.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Lauritzen, S.L. Chapter 4: Statistical Manifolds. In Differential Geometry in Statistical Inference; IMS Lecture Notes—Monograph Series; Institute of Mathematical Statistics: Hayward, CA, USA, 1987; pp. 163–216.
  2. Amari, S.I. Differential-Geometrical Methods in Statistics. In Lecture Notes in Statistics; Springer: New York, NY, USA, 1990.
  3. Efron, B. The geometry of exponential families. Ann. Stat. 1978, 6, 362–376.
  4. Amari, S.I. Dual connections on the Hilbert bundles of statistical models. Geom. Stat. Theory 1987, 6, 123–152.
  5. Amari, S.I.; Kumon, M. Estimation in the presence of infinitely many nuisance parameters—geometry of estimating functions. Ann. Stat. 1988, 16, 1044–1068.
  6. Kass, R.E.; Vos, P.W. Geometrical Foundations of Asymptotic Inference; Wiley Series in Probability and Statistics; Wiley: New York, NY, USA, 1997.
  7. Pistone, G. Affine statistical bundle modeled on a Gaussian Orlicz–Sobolev space. Inf. Geom. 2024, 7, 109–130.
  8. Godambe, V.P. An optimum property of regular maximum likelihood estimation. Ann. Math. Stat. 1960, 31, 1208–1211.
  9. Fisher, R. Statistical methods and scientific induction. J. R. Stat. Soc. Ser. B Methodol. 1955, 17, 69–78.
  10. Vos, P. Generalized estimators, slope, efficiency, and Fisher information bounds. Inf. Geom. 2022, 7, S151–S170.
  11. Godambe, V.P.; Thompson, M.E. Some aspects of the theory of estimating equations. J. Stat. Plan. Inference 1978, 2, 95–104.
  12. Vos, P.; Wu, Q. Generalized estimation and information. Inf. Geom. 2025, 8, 99–123.
  13. Mendenhall, W.M.; Million, R.R.; Sharkey, D.E.; Cassisi, N.J. Stage T3 squamous cell carcinoma of the glottic larynx treated with surgery and/or radiation therapy. Int. J. Radiat. Oncol. Biol. Phys. 1984, 10, 357–363.
  14. Fay, M.P.; Lumbard, K. Confidence intervals for difference in proportions for matched pairs compatible with exact McNemar's or sign tests. Stat. Med. 2021, 40, 1147–1159.
  15. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2025.
Figure 1. Cancer data—orthogonalized score. Standardized orthogonalized score $z_s$ as a function of the log odds difference $\delta$. The observed data (18 of 23 surgery successes, 11 of 19 irradiation successes) yield the black curve. Two additional sample points shown in gray illustrate the family of 480 possible curves. The nuisance parameter is fixed at $\xi = n_1 p_1 + n_2 p_2 = 29$. Horizontal lines at $z = \pm 2$ intersect the observed curve to yield an approximate 95% confidence interval for $\delta$.
Figure 2. Cancer data—modified MLE estimator. Standardized generalized estimator $z_g$ based on the modified log odds difference $\hat\delta$, where 0.5 is added to each cell count. The observed data (18 of 23 surgery successes, 11 of 19 irradiation successes) yield the black curve. Two additional sample points are shown in gray. The nuisance parameter is fixed at $\xi = 29$. Compare with Figure 1 to observe the flatter slope indicating lower $\Lambda$-information.
Table 1. Fisher information per observation, $i_\theta$, for common parameterizations of the Bernoulli distribution expressed in the success parameter p.
Parameter $\theta$ | Information $i_\theta$ per Observation
$\log\dfrac{p}{1-p}$ | $p(1-p)$
$p$ | $\dfrac{1}{p(1-p)}$
$\log p$ | $\dfrac{p}{1-p}$
Table 2. Confidence intervals for the log odds difference $\delta$ using cancer treatment data. Computed using R version 4.4.1 with package exact2x2 version 1.6.8 [14]. All methods except Fisher's exact test are unconditional; Fisher's conditions on the observed total of 29 successes.
Method | R Function | 95% Confidence Interval
Fisher-type adjusted | uncondExact2x2 | ( , 0.57)
Simple asymptotic | uncondExact2x2 | (3.90, 2.48)
Score-based | uncondExact2x2 | (2.99, 0.43)
Orthogonalized score | $s^{\perp}$ | (−0.46, 2.42)
Fisher's exact test | fisher.test | (−0.57, 2.55)