1. Introduction
Statistical manifolds provide a geometric framework for understanding families of probability distributions. While traditionally defined as Riemannian manifolds equipped with the Fisher information metric, their structure extends beyond this basic framework. Lauritzen [1] identified an additional skewness tensor, and Amari [2] also noticed this additional structure, which he used to define a family of connections including both the metric connection and a dual pair—the mixture and exponential connections. This duality, first observed by Efron [3], reveals geometric structure beyond the Riemannian setting, though this previous work remained confined to the tangent bundle.
Amari [4] introduced a Hilbert space extension of the tangent bundle, which Amari and Kumon [5] applied to estimating functions. Kass and Vos [6] (Section 10.3) also describe statistical Hilbert bundles, which Pistone [7] extends to other statistical bundles in the nonparametric setting, where extra care is required when the sample space is not finite. Recent developments have expanded the geometric perspective on the role of the Hilbert bundle in parametric inference when the traditional approach to statistical inference is replaced with Fisher's view of estimation.
Classical statistical inference separates estimation and hypothesis testing into distinct frameworks. Point estimators map from the sample space to the parameter space, with their local properties described through the tangent bundle. Test statistics similarly rely on tangent bundle geometry. The log likelihood and its derivative, the score function, bridge these approaches by providing both estimation methods (maximum likelihood) and testing procedures (likelihood ratio and score tests). Godambe [8] extended the score's role in estimation through estimating equations, yet the fundamental separation between testing and estimation persisted.
Building on Fisher's [9] conception of estimation as a continuum of hypothesis tests, Vos [10] unified these approaches by replacing point estimators with generalized estimators—functions on the parameter space that geometrically represent surfaces over the manifold. These generalized estimators shift the inferential focus from individual parameter values to entire functions, whose properties are naturally characterized within the Hilbert bundle framework.
This paper demonstrates the advantages of generalized estimators and the utility of the Hilbert bundle perspective specifically for the two-sample problem. We show how the orthogonalized score achieves information bounds as a consequence of its membership in the tangent bundle, while other generalized estimators, residing only in the larger Hilbert bundle, suffer information loss measured by their angular deviation from the tangent space.
2. Statistical Manifolds
Let $M$ be a family of probability measures with common support $\mathcal{X}$. While $\mathcal{X}$ can be an abstract space, for most applications, $\mathcal{X} \subseteq \mathbb{R}^k$. Each point in $M$ represents a candidate model for a population whose individuals take values in $\mathcal{X}$.
We consider inference based on a sample denoted by $y$, with corresponding sample space $\mathcal{Y}$. The relationship between $\mathcal{X}$ and $\mathcal{Y}$ depends on three factors: the sampling plan, any conditioning applied, and dimension reduction through sufficient statistics. In the simplest case—a simple random sample of size $n$ without conditioning or dimension reduction—we have $\mathcal{Y} = \mathcal{X}^n$.
Let $M^{(n)}$ denote the family of probability measures on $\mathcal{Y}$ induced by $M$ through the sampling plan. For the simple random sampling case:
$$M^{(n)} = \{m^{(n)} = \underbrace{m \times \cdots \times m}_{n \text{ copies}} : m \in M\}.$$
For any real-valued measurable function $h$ on $\mathcal{Y}$, we define its expected value at $m$ as
$$E_m h = \int_{\mathcal{Y}} h \, dm^{(n)}.$$
The Hilbert space associated with $M$ consists of all square-integrable functions:
$$H = \{h : \mathcal{Y} \to \mathbb{R} \mid E_m h^2 < \infty \text{ for all } m \in M\}.$$
This space carries a family of inner products indexed by points in $M$:
$$\langle h, g \rangle_m = E_m(hg).$$
When $\langle h, g \rangle_m = 0$, we say that $h$ and $g$ are $m$-orthogonal and write $h \perp_m g$.
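To make the fiber geometry concrete, the following minimal numerical sketch (ours, not from the paper; the binomial family with $n = 5$ is an arbitrary choice) evaluates the $m$-inner product on a finite sample space, where a distribution is simply a weight vector on the support.

```python
# A minimal sketch of the m-inner product <h, g>_m = E_m(h g) on the finite
# sample space of a binomial model (illustrative choice: n = 5, p = 0.3).
import numpy as np
from scipy.stats import binom

n, p = 5, 0.3
y = np.arange(n + 1)              # support of the sufficient statistic
w = binom.pmf(y, n, p)            # the distribution m as a weight vector

def inner(h, g, w=w):
    """<h, g>_m = E_m(h g) for functions given as vectors on the support."""
    return np.sum(h * g * w)

h = y - n * p                     # a centered function: E_m h = 0
one = np.ones_like(y, dtype=float)
print(inner(h, one))              # ~0: h is m-orthogonal to the constants
```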
We construct the Hilbert bundle over $M$ by associating a copy of $H$ to each point:
$$HM = \bigcup_{m \in M} \{m\} \times H.$$
The fiber at $m$, denoted $H_m$ or $H_m M$, inherits the inner product $\langle \cdot, \cdot \rangle_m$. For inference purposes, we decompose each fiber into the space of constant functions and its orthogonal complement:
$$H_m = H^0_m \oplus \mathbb{R}. \tag{1}$$
Here, $H^0_m = \{h \in H : E_m h = 0\}$ consists of centered functions, while $\mathbb{R}$ contains the constants. Note that $H^0_m$ varies with $m$ while $\mathbb{R}$ is independent of $m$. As decomposition (1) holds fiberwise, we obtain a global decomposition:
$$HM = H^0M \oplus (M \times \mathbb{R}).$$
The bundle $H^0M$ extends the tangent bundle $TM$, which emerges naturally through parameterization. We assume that $M$ admits a global parameterization—while not strictly necessary, this simplifies our exposition by avoiding coordinate charts. We require this parameterization to be a diffeomorphism.
Consider a parameterization $\phi : \Theta \to M$ with inverse $\phi^{-1} : M \to \Theta$. For a specific distribution $m$, we write $\theta = \phi^{-1}(m)$ for its parameter value. When considering all distributions simultaneously, we write $\theta = \phi^{-1}$, where context distinguishes between $\theta$ as a point in $\Theta$ (left side) and $\theta$ as a function (right side).
For notational convenience, we denote the distribution corresponding to parameter value $\theta$ as $m_\theta = \phi(\theta)$. This allows us to write the following:
$$E_\theta h = E_{m_\theta} h,$$
where, again, context clarifies whether $\theta$ refers to the function $\phi^{-1}$ or its value.
With this parameterization, the Hilbert bundle can be expressed as
$$HM = \bigcup_{\theta \in \Theta} \{\theta\} \times H,$$
allowing us to index fibers by parameter values: $H_\theta = H_{m_\theta}$.
The log likelihood function plays a fundamental role in our geometric framework. On $M$, it is the function defined by $\ell_M(m; y) = \log m(y)$. Through the parameterization, this induces $\ell_\phi$ given by $\ell_\phi(\theta; y) = \ell_M(\phi(\theta); y)$. When the parameterization is clear from context, we simply write $\ell$ for $\ell_\phi$.
The partial derivatives of $\ell$ with respect to the parameters, $\partial_i \ell = \partial \ell / \partial \theta^i$, evaluated at $\theta$, form a basis for the tangent space $T_\theta M$. For all $\theta \in \Theta$ and all $i = 1, \ldots, d$, $E_\theta\, \partial_i \ell(\theta) = 0$, ensuring that $T_\theta M \subset H^0_\theta$. In fact, $TM \subset H^0M$, as $E_m\, \partial_i \ell$ vanishes on $M$.
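This inclusion is easy to check numerically; the sketch below is our illustration (any regular family would do), verifying that the binomial score component has mean zero at every parameter value.

```python
# The binomial score d(log pmf)/dp = (y - n p)/(p (1 - p)) has mean zero
# under every p, so it lies in the centered fiber H^0 at each point.
import numpy as np
from scipy.stats import binom

n = 5
y = np.arange(n + 1)
for p in (0.1, 0.3, 0.5, 0.9):
    w = binom.pmf(y, n, p)
    score = (y - n * p) / (p * (1 - p))
    print(p, np.sum(score * w))   # ~0 for each p
```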
3. Functions on $\Theta$
The log likelihood function ℓ and its derivatives are central to statistical inference. Traditionally, these serve as tools to find point estimates—particularly the maximum likelihood estimate (MLE)—and to characterize the estimator’s properties. We adopt a different perspective: we treat ℓ and its derivatives as primary inferential objects rather than mere computational tools. This approach aligns with Fisher’s conception of estimation as a continuum of tests.
As the log likelihood ratio for comparing models with parameters $\theta_1$ and $\theta_2$ is the difference $\ell(\theta_1; y) - \ell(\theta_2; y)$, and adding an arbitrary constant to each term does not affect this difference, we define the log likelihood so that $\sup_\theta \ell(\theta; y) = 0$ for each fixed $y$. Thus we work with
$$\ell(\theta; y) = \log m_\theta(y) - \sup_{\theta' \in \Theta} \log m_{\theta'}(y).$$
As an inferential function, $\ell(\cdot\,; y)$ quantifies the dissonance between observation $y$ and distribution $m_\theta$. While the MLE set $\{\theta : \ell(\theta; y) = 0\}$ identifies parameters with minimal dissonance, our emphasis shifts to characterizing the full landscape of dissonance across the manifold. While "dissonance" lacks a precise mathematical definition, it can be thought of as the evidence in $y$ against the model at $\theta$—essentially, a test statistic evaluated at $y$ for the null hypothesis specifying $\theta$. We use the notation $\hat\theta$ for the MLE when it is unique. When the MLE set is empty we say that the MLE does not exist. Note that $\ell$ is defined even when the MLE does not exist or is not unique; the only requirement is that $\sup_{\theta'} \log m_{\theta'}(y) < \infty$.
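A short sketch of this normalization follows (our illustration for the binomial family): subtracting the supremum makes $\sup_\theta \ell = 0$, and $\ell$ remains well defined at boundary samples where the MLE set in $M$ is empty.

```python
# Normalized log likelihood ell(p; y) = log m_p(y) - sup_p' log m_p'(y)
# for a binomial sample; the sup over p in (0, 1) equals the value at y/n,
# approached in the limit when y = 0 or y = n.
import numpy as np
from scipy.stats import binom

n = 5

def ell(p, y):
    sup = binom.logpmf(y, n, y / n)   # scipy evaluates p = 0 and p = 1 by continuity
    return binom.logpmf(y, n, p) - sup

print(ell(0.3, 2))   # <= 0, equal to 0 only at p = 2/5
print(ell(0.1, 0))   # defined even though no MLE exists in M = (0, 1)
```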
The log likelihood exemplifies a broader class of generalized estimators: functions $G : \Theta \times \mathcal{Y} \to \mathbb{R}$ where, for almost every $y \in \mathcal{Y}$, the function $G(\cdot\,; y)$ measures dissonance between $y$ and distributions across $M$. Like $\ell$, we can normalize $G$ so that $\sup_\theta G(\theta; y) = 0$.
Consider the geometric interpretation. For a function $f : \Theta \to \mathbb{R}$, let $\operatorname{gr}(f) = \{(\theta, f(\theta)) : \theta \in \Theta\}$ denote its graph. The graphs $\operatorname{gr} \ell(\cdot\,; y)$ and $\operatorname{gr} G(\cdot\,; y)$ form $d$-dimensional surfaces over $\Theta$. We compare these surfaces through their gradients:
$$\nabla \ell = (\partial_1 \ell, \ldots, \partial_d \ell), \qquad \nabla G = (\partial_1 G, \ldots, \partial_d G),$$
where $\partial_i = \partial / \partial \theta^i$.
Viewing these as estimators requires replacing the fixed observation $y$ with the random variable $Y$. Then $\operatorname{gr} \ell(\cdot\,; Y)$ and $\operatorname{gr} G(\cdot\,; Y)$ become random surfaces, while $\nabla \ell$ and $\nabla G$ become random gradient fields. The score components span the tangent space: $T_\theta M = \operatorname{span}\{\partial_1 \ell(\theta), \ldots, \partial_d \ell(\theta)\}$. The key difference between generalized estimation described in this paper and the estimating equations of Godambe and Thompson [11] lies in their inferential approach: the former focuses on the distribution of graph slopes (gradients in a linear space), while the latter examines the distribution of where graphs intersect the horizontal axis (roots of $g$).
Under mild regularity conditions, the components of $\nabla G$ span a subspace of $H^0_\theta$ of dimension $d$, although generally not $T_\theta M$. Strictly speaking, the span of the components of $\nabla \ell$ is isomorphic rather than equal to $T_\theta M$, as the former consists of vectors attached to the surface $\operatorname{gr} \ell$ while the latter are attached to $M$ (equivalently, to $\Theta$). As shown in Vos and Wu [12], this precise relationship between the log likelihood surface and the manifold ensures that score-based estimators attain the information bound.
This perspective fundamentally shifts our focus. Rather than comparing point estimators through their variance or mean squared error on the parameter space $\Theta$, we compare the linear spaces spanned by the components of generalized estimators within the Hilbert bundle $H^0M$.
For point estimator $t = t(Y)$, define its associated generalized estimator:
$$g_t(\theta; y) = t(y) - E_\theta t.$$
The estimator must have nonzero variance, $\operatorname{Var}_\theta(t) > 0$ for all $\theta$, so that $g_t(\theta; \cdot) \ne 0$ in $H^0_\theta$. Instead of traditional comparisons between $\hat\theta$ (the MLE) and $t$, we compare the spaces spanned by $s$ and by $g_t$ through their information—a generalization of Fisher information to arbitrary generalized estimators. Geometrically, the relationship between $s$ and $g_t$ is characterized by angles between their component vectors. Statistically, this translates to correlations between the corresponding random variables. This information is defined by the left-hand side of Equation (13), which also shows the role of the correlation.
Generalized estimators offer particular advantages when nuisance parameters are present. For point estimators, one seeks a parameterization where nuisance and interest parameters are orthogonal—a goal not always achievable. When working in the Hilbert bundle $HM$ rather than the parameter space $\Theta$, orthogonalization remains important but becomes more flexible: the choice of nuisance parameterization becomes immaterial, as orthogonalization occurs within $H$ itself.
The information bound for the interest parameter is attained by restricting generalized estimators to be orthogonal to the nuisance parameter's score components. The general framework is developed in Vos and Wu [12]; we illustrate the approach through the special case of comparing two populations in the following section.
4. Comparing Two Populations
We now develop the general framework for comparing two distributions from the same parametric family. The next section applies this framework to the more specific case of contingency tables.
Let $M$ be a one-parameter family of distributions on $\mathcal{X}$, and let $M^{(n)}$ be the corresponding family of sampling distributions on $\mathcal{Y}$. While we work primarily with sampling distributions in $M^{(n)}$, we use superscripts to distinguish when necessary: $m^{(n)}$ denotes a sampling distribution obtained from population distribution $m$.
For simple random sampling outside exponential families, $\mathcal{Y} = \mathcal{X}^n$. Within exponential families, $\mathcal{Y}$ represents the support of the sufficient statistic. For example, when $M$ consists of Bernoulli distributions with success probability $p$, the family $M^{(n)}$ consists of binomial distributions for $n$ trials with sample space $\mathcal{Y} = \{0, 1, \ldots, n\}$.
Let $\phi$ parameterize $M$. We define the population parameterization to ensure consistency: $m^{(n)}_\theta = (m_\theta)^{(n)}$. Thus, each parameter value simultaneously labels both a population distribution in $M$ and its corresponding sampling distribution in $M^{(n)}$. As our focus is on sampling distributions, we simplify notation by dropping the superscript $(n)$.
The score function for parameter $\theta$ is
$$s(\theta) = \frac{d\ell}{d\theta} = \lambda(\theta)\, Z(\theta),$$
where we factor the score into its magnitude $\lambda(\theta) = \|s(\theta)\|_\theta$ and its standardized version $Z$ with $E_\theta Z = 0$ and $E_\theta Z^2 = 1$. Both $\lambda$ and $Z$ depend on $\theta$ and, thus, vary across $M$.
Under reparameterization $\xi = \xi(\theta)$ of $M$, the standardized score $Z$ remains invariant while the coefficient transforms as $\lambda_\xi = \lambda_\theta\, |d\theta/d\xi|$. The coefficient equals the square root of the total Fisher information: $\lambda(\theta) = \sqrt{n\, i(\theta)}$, where $i(\theta)$ is the Fisher information per observation. For the binomial family, $\lambda(p) = \sqrt{n/\{p(1-p)\}}$.
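The factorization is easy to verify numerically; the sketch below (our code; the values of $n$ and $p$ are arbitrary) checks that $Z = s/\lambda$ has mean zero and variance one when $\lambda = \sqrt{n/\{p(1-p)\}}$.

```python
# Check of the factorization s = lambda * Z for the binomial proportion:
# lambda^2 = n i(p) with i(p) = 1/(p(1-p)); Z has mean 0 and variance 1.
import numpy as np
from scipy.stats import binom

n, p = 19, 0.4
y = np.arange(n + 1)
w = binom.pmf(y, n, p)

s = (y - n * p) / (p * (1 - p))         # score for the proportion parameter
lam = np.sqrt(n / (p * (1 - p)))        # square root of total Fisher information
Z = s / lam
print(np.sum(Z * w), np.sum(Z**2 * w))  # ~0 and ~1
```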
Now consider independent samples of sizes $n_1$ and $n_2$ from two distributions in $M$. The manifold of joint population distributions is
$$M^2 = \{(m_1, m_2) : m_1, m_2 \in M\},$$
with corresponding manifold of joint sampling distributions:
$$M^{(n_1, n_2)} = \{m_1^{(n_1)} \times m_2^{(n_2)} : m_1, m_2 \in M\}.$$
The parameterization of $M$ induces natural parameterizations on $M^2$ and on $M^{(n_1, n_2)}$. These share the same parameter space $\Theta^2 = \Theta \times \Theta$, where $\Theta = \phi^{-1}(M)$. Setting $\theta = (\theta_1, \theta_2)$ and $m_\theta = m_{\theta_1}^{(n_1)} \times m_{\theta_2}^{(n_2)}$, each point in $\Theta^2$ labels both a joint sampling distribution and its generating population distribution.
The hypothesis that both samples arise from the same distribution corresponds to the diagonal submanifold:
$$\Delta M = \{(m, m) : m \in M\},$$
with parameter space:
$$\Delta\Theta = \{(\theta, \theta) : \theta \in \Theta\}.$$
The joint parameter $\theta = (\theta_1, \theta_2)$ yields two score functions:
$$s_1 = \lambda_1 Z_1, \qquad s_2 = \lambda_2 Z_2,$$
where $Z_1$ and $Z_2$ are orthonormal at each $\theta \in \Theta^2$:
$$\|Z_1\|_\theta = \|Z_2\|_\theta = 1, \qquad \langle Z_1, Z_2 \rangle_\theta = 0.$$
To compare the distributions, we reparameterize using the difference $\delta = \theta_1 - \theta_2$ as our interest parameter and the sum $\nu = \theta_1 + \theta_2$ as the nuisance parameter. The inverse transformation gives $\theta_1 = (\nu + \delta)/2$ and $\theta_2 = (\nu - \delta)/2$, yielding scores:
$$s_\delta = \tfrac{1}{2}(s_1 - s_2), \qquad s_\nu = \tfrac{1}{2}(s_1 + s_2).$$
Let $Z_N$ denote the unit vector in the direction of $s_\nu$, satisfying $Z_N \propto s_\nu$, $E_\theta Z_N = 0$, and $\|Z_N\|_\theta = 1$. As $Z_N$ remains invariant under monotonic reparameterizations of $\nu$, we use the subscript $N$ (for nuisance). In terms of the basis $(Z_1, Z_2)$:
$$Z_N = \frac{\lambda_1 Z_1 + \lambda_2 Z_2}{\sqrt{\lambda_1^2 + \lambda_2^2}}.$$
Let $h$ be a point estimator or test statistic for $\delta$. The function $h$ is a generalized pre-estimator provided $h - E_\theta h$ is a generalized estimator. For any pre-estimator $h$ of $\delta$, define its orthogonalized version:
$$h^\perp = Z_h - \rho_N Z_N,$$
where $Z_h = (h - E_\theta h)/\|h - E_\theta h\|_\theta$ is the standardized direction and $\rho_N = \langle Z_h, Z_N \rangle_\theta$ is the correlation with the nuisance direction.
To ensure that inference is independent of the nuisance parameter, we work with orthogonalized generalized estimators $g$:
$$g = \frac{h^\perp}{\|h^\perp\|_\theta} = \frac{Z_h - \rho_N Z_N}{\sqrt{1 - \rho_N^2}}, \qquad g \perp_\theta Z_N.$$
When $h$ is the score for $\delta$, the orthogonalized score becomes
$$s^\perp = s_\delta - \langle s_\delta, Z_N \rangle_\theta\, Z_N = \lambda_\perp Z_\perp,$$
where $\lambda_\perp^2 = \|s^\perp\|_\theta^2$ is the information after orthogonalization. The proportion of information loss due to the nuisance parameter is the square of the correlation between the interest and nuisance parameters,
$$\rho_N^2 = \langle Z_\delta, Z_N \rangle_\theta^2 = \left(\frac{\lambda_1^2 - \lambda_2^2}{\lambda_1^2 + \lambda_2^2}\right)^2, \tag{7}$$
where $Z_\delta = s_\delta/\|s_\delta\|_\theta$.
This loss cannot be recovered by reparameterization. Geometrically, $\rho_N$ is a cosine, so the proportional information loss equals the squared cosine of the angle between the score and the tangent space of the submanifold $M_{\delta_0}$ obtained by fixing the interest parameter. The submanifold depends on the choice of interest parameter and is integral to the inference problem.
The orthogonalized Fisher information $\lambda_\perp^2$ is additive on the reciprocal scale:
$$\frac{1}{\lambda_\perp^2} = \frac{1}{\lambda_1^2} + \frac{1}{\lambda_2^2}. \tag{8}$$
Equation (8) is established as follows. The orthogonalized score is a linear combination of the orthonormal basis vectors $Z_1$ and $Z_2$,
$$s^\perp = c_1 Z_1 + c_2 Z_2. \tag{9}$$
As $\langle s^\perp, Z_N \rangle_\theta = 0$ and $\langle s^\perp, s_\delta \rangle_\theta = \lambda_\perp^2$, the coefficients satisfy $c_1 \lambda_1 + c_2 \lambda_2 = 0$ and $\tfrac{1}{2}(c_1 \lambda_1 - c_2 \lambda_2) = \lambda_\perp^2$, so that $c_1 = \lambda_\perp^2/\lambda_1$ and $c_2 = -\lambda_\perp^2/\lambda_2$. Substituting these coefficients into $\lambda_\perp^2 = \|s^\perp\|_\theta^2 = c_1^2 + c_2^2$ gives
$$\lambda_\perp^2 = \frac{\lambda_1^2 \lambda_2^2}{\lambda_1^2 + \lambda_2^2}, \tag{10}$$
and taking the reciprocal of both sides of (10) gives (8). Substituting $\lambda_k^2 = n\, i(\theta_k)$ (equal sample sizes) into (7) shows
$$\rho_N^2 = \left(\frac{i(\theta_1) - i(\theta_2)}{i(\theta_1) + i(\theta_2)}\right)^2,$$
which means that the information loss due to the nuisance parameter is proportional to the squared difference in the Fisher information for the distributions being compared. Using Equation (9), the orthogonalized score in terms of the basis vectors $Z_1$ and $Z_2$ is
$$s^\perp = \lambda_\perp^2 \left(\frac{Z_1}{\lambda_1} - \frac{Z_2}{\lambda_2}\right).$$
The basis $(Z_\perp, Z_N)$ is obtained from $(Z_1, Z_2)$ using the linear transformation
$$\begin{pmatrix} Z_\perp \\ Z_N \end{pmatrix} = \begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix} \begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix}, \qquad \cos\alpha = \frac{\lambda_2}{\sqrt{\lambda_1^2 + \lambda_2^2}}, \quad \sin\alpha = \frac{\lambda_1}{\sqrt{\lambda_1^2 + \lambda_2^2}},$$
which is a rotation through an angle of $\alpha = \arctan(\lambda_1/\lambda_2)$. When $\theta$ is a location parameter, $\lambda_1$ and $\lambda_2$ are constant on $M$. With equal sample sizes ($n_1 = n_2$), the rotation angle is $\alpha = \pi/4$ and
$$Z_\perp = \frac{Z_1 - Z_2}{\sqrt{2}}, \qquad Z_N = \frac{Z_1 + Z_2}{\sqrt{2}}.$$
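Because all of the above takes place in the two-dimensional span of $Z_1$ and $Z_2$, it can be checked with elementary linear algebra. The sketch below is ours (the magnitudes $\lambda_1 = 2$, $\lambda_2 = 1$ are arbitrary); it identifies $\operatorname{span}\{Z_1, Z_2\}$ with $\mathbb{R}^2$ and verifies Equations (8) and (10) and the rotation.

```python
# Coordinate check of the projection and rotation in span{Z1, Z2} ~ R^2.
import numpy as np

lam1, lam2 = 2.0, 1.0
Z1, Z2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])

s_delta = 0.5 * (lam1 * Z1 - lam2 * Z2)     # interest score
s_nu = 0.5 * (lam1 * Z1 + lam2 * Z2)        # nuisance score
Z_N = s_nu / np.linalg.norm(s_nu)

s_perp = s_delta - (s_delta @ Z_N) * Z_N    # orthogonalized score
lam_perp2 = s_perp @ s_perp
print(np.isclose(lam_perp2, lam1**2 * lam2**2 / (lam1**2 + lam2**2)))  # Eq. (10)
print(np.isclose(1 / lam_perp2, 1 / lam1**2 + 1 / lam2**2))            # Eq. (8)

alpha = np.arctan2(lam1, lam2)              # rotation angle arctan(lam1/lam2)
R = np.array([[np.cos(alpha), -np.sin(alpha)],
              [np.sin(alpha),  np.cos(alpha)]])
Z_perp = s_perp / np.sqrt(lam_perp2)
B = R @ np.vstack([Z1, Z2])                 # rows: Z_perp, Z_N
print(np.allclose(B[0], Z_perp), np.allclose(B[1], Z_N))
```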
While $Z_\perp \in TM$, for general estimators $g$ we have $g \in H^0M$ but $g \notin TM$ unless $g = \pm Z_\perp$. This distinction explains why general estimators fail to achieve the information bound. The information of $g$ is $\rho^2 \lambda_\perp^2$, where $\rho$ is the correlation between $g$ and $Z_\perp$.
The null hypothesis $\delta = 0$ deserves special attention. While the submanifold $M_{\delta_0}$ generally depends on the parameterization choice, $M_0$ is parameterization-invariant, as it is equivalent to $\Delta M$ (that is, $m_1 = m_2$). Under simple random sampling with $n_1 = n_2$, the standardized orthogonalized score on $M_0$ becomes
$$Z_\perp = \frac{Z_1 - Z_2}{\sqrt{2}},$$
which is invariant across all parameterizations of $M$. This invariance does not hold for test statistics based on point estimators like $\hat\theta_1 - \hat\theta_2$, whose form depends on whether we parameterize using proportions, log-proportions, or log-odds.
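The invariance is easy to see numerically. In the sketch below (our code, using the cancer data counts introduced in Section 5 and the general form of $Z_\perp$, which covers unequal sample sizes), the proportion and log odds parameterizations give different scores $s_k$ and magnitudes $\lambda_k$ but identical standardized statistics.

```python
# Under H0 the standardized orthogonalized score is the same in the
# proportion and log odds parameterizations of the Bernoulli family.
import numpy as np

n1, y1, n2, y2 = 23, 18, 19, 11
p = (y1 + y2) / (n1 + n2)              # any common null value works; pooled MLE shown
q = 1 - p

def z_perp(s1, s2, lam1, lam2):
    Z1, Z2 = s1 / lam1, s2 / lam2
    return (lam2 * Z1 - lam1 * Z2) / np.hypot(lam1, lam2)

# proportion parameterization: s_k = (y_k - n_k p)/(p q), lambda_k^2 = n_k/(p q)
zp = z_perp((y1 - n1 * p) / (p * q), (y2 - n2 * p) / (p * q),
            np.sqrt(n1 / (p * q)), np.sqrt(n2 / (p * q)))
# log odds parameterization: s_k = y_k - n_k p, lambda_k^2 = n_k p q
zt = z_perp(y1 - n1 * p, y2 - n2 * p,
            np.sqrt(n1 * p * q), np.sqrt(n2 * p * q))
print(zp, zt)                          # identical values
```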
5. Comparing Two Bernoulli Distributions
We now specialize the general framework to comparing two Bernoulli distributions, establishing the geometric structure that underlies inference for contingency tables.
For the Bernoulli sample space $\mathcal{X} = \{0, 1\}$, the manifold of population distributions is
$$M = \{m_p : 0 < p < 1\},$$
with natural parameterization given by the success probability $p$. For a sample of size $n$, the sufficient statistic has support $\{0, 1, \ldots, n\}$, yielding the manifold of binomial sampling distributions:
$$M^{(n)} = \{m_p^{(n)} : 0 < p < 1\}, \qquad m_p^{(n)} = \operatorname{binomial}(n, p).$$
A natural bijection exists between $M$ and $M^{(n)}$: each population distribution determines a unique sampling distribution. We define $\phi^{(n)}(p) = m_p^{(n)}$ to make this bijection $\phi^{(n)} \circ \phi^{-1}$. Similarly, for any alternative parameterization $\psi$ (such as the log odds $\theta = \log\{p/(1-p)\}$), we define $\psi^{(n)}$ so that the bijection equals $\psi^{(n)} \circ \psi^{-1}$.
For independent samples of sizes $n_1$ and $n_2$, the joint manifolds are
$$M^2 = M \times M \quad \text{and} \quad M^{(n_1, n_2)} = \{m_{p_1}^{(n_1)} \times m_{p_2}^{(n_2)} : 0 < p_1, p_2 < 1\}.$$
Using the proportion parameterization $p = (p_1, p_2)$, the sampling distribution at $p$ is
$$m_p^{(n_1, n_2)}(y) = \binom{n_1}{y_1} p_1^{y_1} (1 - p_1)^{n_1 - y_1} \binom{n_2}{y_2} p_2^{y_2} (1 - p_2)^{n_2 - y_2}$$
for $y = (y_1, y_2) \in \mathcal{Y} = \{0, \ldots, n_1\} \times \{0, \ldots, n_2\}$, with corresponding population distribution:
$$m_p(x) = p_1^{x_1} (1 - p_1)^{1 - x_1}\, p_2^{x_2} (1 - p_2)^{1 - x_2}$$
for $x = (x_1, x_2) \in \{0, 1\}^2$.
The Hilbert space for this manifold consists of all real-valued functions on the finite sample space:
$$H = \{h : \mathcal{Y} \to \mathbb{R}\}.$$
As the support is finite, $H$ includes all finite-valued functions. The tangent space at $m$ is the two-dimensional subspace:
$$T_m M^{(n_1, n_2)} = \operatorname{span}\{s_1, s_2\},$$
where $s_1 = \partial \ell/\partial p_1$ and $s_2 = \partial \ell/\partial p_2$.
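The following sketch (ours; the point $p = (0.6, 0.4)$ is arbitrary) realizes this fiber concretely as $\mathbb{R}^{480}$ for the sample sizes used below and verifies that $Z_1$ and $Z_2$ are orthonormal.

```python
# The 480-point joint sample space for (n1, n2) = (23, 19), with the two
# standardized score components as orthonormal vectors in H = R^480.
import numpy as np
from scipy.stats import binom

n1, n2 = 23, 19
p1, p2 = 0.6, 0.4                          # an arbitrary point of the manifold
y1, y2 = np.meshgrid(np.arange(n1 + 1), np.arange(n2 + 1), indexing="ij")
w = binom.pmf(y1, n1, p1) * binom.pmf(y2, n2, p2)   # joint pmf, shape (24, 20)

def inner(h, g):                           # <h, g>_m on the finite support
    return np.sum(h * g * w)

# scores for the log odds parameters (their span equals that of the
# proportion scores, since the two differ by positive scalar factors)
s1, s2 = y1 - n1 * p1, y2 - n2 * p2
lam1 = np.sqrt(n1 * p1 * (1 - p1))
lam2 = np.sqrt(n2 * p2 * (1 - p2))
Z1, Z2 = s1 / lam1, s2 / lam2
print(inner(Z1, Z1), inner(Z2, Z2), inner(Z1, Z2))  # ~1, ~1, ~0
```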
Table 1 summarizes the Fisher information per observation for three common parameterizations of the Bernoulli distribution, each offering different advantages for inference.
We illustrate our geometric framework using data from Mendenhall et al. [13], who conducted a retrospective analysis of laryngeal carcinoma treatment. Disease was controlled in 18 of 23 patients treated with surgery alone and 11 of 19 patients treated with irradiation alone ($n_1 = 23$, $y_1 = 18$, $n_2 = 19$, $y_2 = 11$). We use this data to compare the orthogonalized score $s^\perp$ with other generalized estimators when the interest parameter is the log odds ratio $\delta = \theta_1 - \theta_2$, where $\theta_k$ is the log odds for population $k$.
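For orientation, the observed value of the interest parameter is the log odds ratio of the two treatments:

```python
# Observed log odds ratio for the table (18/23 vs. 11/19 controlled):
import numpy as np
print(np.log((18 / 5) / (11 / 8)))   # ~0.96
```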
5.1. Orthogonalized Score
The score has two key properties: at each point in the sample space it is a smooth function on the parameter space, and at each point in the manifold it is a random variable on the sample space. Formally, $s(\cdot\,; y)$ is smooth for $y$ fixed, and $s(\theta; Y) \in H^0_\theta$ for $\theta$ fixed when there is no nuisance parameter. As $\delta$ is the interest parameter, we use the notation $s$ for the score $s_\delta$. These properties persist after orthogonalization and standardization to obtain $s^\perp$ and $Z_\perp$.
Figure 1 illustrates these properties for $Z_\perp$ using the cancer data. The black curve shows $Z_\perp(\cdot\,; y)$ evaluated at the observed sample $y = (18, 11)$ as a function of $\delta$, with the nuisance parameter fixed at $\hat\nu$. Each of the 480 points in the sample space $\mathcal{Y}$ generates such a curve; two additional examples appear in gray. We distinguish the family of curves $Z_\perp$ (uppercase) from the specific observed curve $z_\perp = Z_\perp(\cdot\,; y)$ (lowercase).
For any fixed $\delta_0$, the vertical line $\delta = \delta_0$ intersects all 480 curves, yielding a distribution of values. Together with the probability mass function $m^{(n_1, n_2)}_{(\delta_0, \hat\nu)}$, this defines the sampling distribution of $Z_\perp(\delta_0; Y)$ when $\delta = \delta_0$ and $\nu = \hat\nu$. Crucially, every such vertical distribution has mean zero and variance one, reflecting the standardization of the score.
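The mean-zero, unit-variance property of these vertical distributions can be verified directly. The sketch below is ours and, for concreteness, uses $\nu = \theta_1 + \theta_2$ (sum of log odds) as the nuisance coordinate; the paper's orthogonal nuisance parameterization differs, but the moment properties shown here hold for any smooth choice of nuisance coordinate.

```python
# At any fixed (delta, nu), the standardized orthogonalized score, viewed
# over all 480 sample points, has mean 0 and variance 1 under the joint
# sampling distribution. Illustrative nuisance: nu = theta_1 + theta_2.
import numpy as np
from scipy.stats import binom
from scipy.special import expit

n1, n2 = 23, 19
y1, y2 = np.meshgrid(np.arange(n1 + 1), np.arange(n2 + 1), indexing="ij")

def z_perp(delta, nu):
    p1, p2 = expit((nu + delta) / 2), expit((nu - delta) / 2)
    s1, s2 = y1 - n1 * p1, y2 - n2 * p2            # log odds scores
    l1sq, l2sq = n1 * p1 * (1 - p1), n2 * p2 * (1 - p2)
    lperp = np.sqrt(l1sq * l2sq / (l1sq + l2sq))   # Equation (10)
    z = lperp * (s1 / l1sq - s2 / l2sq)            # standardized orthogonalized score
    w = binom.pmf(y1, n1, p1) * binom.pmf(y2, n2, p2)
    return z, w

z, w = z_perp(0.96, 1.6)
print(np.sum(z * w), np.sum(z**2 * w))             # ~0 and ~1
```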
The intersection of horizontal lines with $z_\perp$ provides confidence intervals through inversion. The lines $z = \pm 2$ intersect the observed curve at points $\delta^-$ and $\delta^+$, partitioning the parameter space into three regions:
For $\delta < \delta^-$: the observed $z_\perp$ exceeds 2 standard deviations above its expectation.
For $\delta^- < \delta < \delta^+$: the observed $z_\perp$ lies within 2 standard deviations.
For $\delta > \delta^+$: the observed $z_\perp$ falls below −2 standard deviations.
The interval $(\delta^-, \delta^+)$ forms an approximate 95% confidence interval for $\delta$. The approximation quality depends on the normality of the vertical distributions, while the interval width depends on the slope of $z_\perp$—steeper slopes yield narrower intervals.
These calculations are conditional on $\nu = \hat\nu$. Different nuisance parameter values yield different intervals, motivating our choice of the orthogonal parameterization where $s_\delta \perp_\theta s_\nu$. With this choice, the one-dimensional submanifolds obtained by fixing $\delta$ and by fixing $\nu$ intersect transversally, and their tangent spaces are orthogonal at the intersection point.
Varying the height of the horizontal lines provides confidence intervals at different levels; these lines intersect each of the 480 curves, ensuring that confidence intervals exist for every sample point. The intersection of all confidence levels can be interpreted as a point estimate for $\delta$. For all but two extreme sample points, this intersection equals the MLE—the point where $Z_\perp(\cdot\,; y)$ crosses zero. At these two boundary points, the curves never cross zero, yielding an empty intersection that corresponds to the nonexistence of the MLE.
The 2-standard-deviation confidence interval ($z_\perp = \pm 2$) for the log odds ratio is (−0.35, 2.27). The exact 95% confidence interval is (−0.40, 2.40) for nuisance parameter equal to 29. This interval is a function of the nuisance parameter; to obtain an interval that is the same for all values of the nuisance parameter, we take the union of intervals as $\nu$ takes all values, obtaining (−0.46, 2.42). The exact 95% confidence interval from Fisher's exact test is (−0.57, 2.55).
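A minimal sketch of the inversion step follows (our code, not the paper's). We fix the nuisance at the observed sum of log odds under the same illustrative coordinate as above, whereas the paper fixes its orthogonal nuisance parameter at 29; the resulting endpoints therefore need not match the intervals quoted above.

```python
# Invert the observed curve z_perp(.; y): solve z = +2 and z = -2 for delta.
import numpy as np
from scipy.optimize import brentq
from scipy.special import expit

n1, y1, n2, y2 = 23, 18, 19, 11

def z_obs(delta, nu):
    p1, p2 = expit((nu + delta) / 2), expit((nu - delta) / 2)
    s1, s2 = y1 - n1 * p1, y2 - n2 * p2
    l1sq, l2sq = n1 * p1 * (1 - p1), n2 * p2 * (1 - p2)
    lperp = np.sqrt(l1sq * l2sq / (l1sq + l2sq))
    return lperp * (s1 / l1sq - s2 / l2sq)

nu_hat = np.log(18 / 5) + np.log(11 / 8)          # observed sum of log odds
lo = brentq(lambda d: z_obs(d, nu_hat) - 2.0, -10, 10)   # curve decreasing in delta
hi = brentq(lambda d: z_obs(d, nu_hat) + 2.0, -10, 10)
print(lo, hi)                                      # approximate 95% interval
```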
5.2. Other Generalized Estimators
Point estimators naturally induce generalized estimators, though the relationship depends on the parameterization. For a parameterization $\xi$ and point estimator $\hat\xi$, if $\hat\xi \in H$ (that is, $\hat\xi$ is finite on all of $\mathcal{Y}$), then the generalized estimator is $\hat\xi - \xi$ when no nuisance parameters exist, or its orthogonalized version with nuisance parameters present.
Consider the binomial family with proportion parameter $p$. The MLE $\hat p = y/n$ yields $g(p; y) = \hat p - p$, which is proportional to the score. However, for the log odds parameterization $\theta = \log\{p/(1-p)\}$, the MLE satisfies $\hat\theta(0) = -\infty$ and $\hat\theta(n) = +\infty$, so $\hat\theta \notin H$. No generalized estimator exists for the unmodified log odds MLE.
A standard remedy adds a small constant $c > 0$ to each cell, yielding the modified MLE:
$$\hat\theta_c = \log\frac{y + c}{n - y + c}.$$
This modification ensures finite values throughout the sample space, enabling construction of the corresponding generalized estimator.
While the proportion MLE $\hat p$ could similarly be modified, this is rarely performed despite the MLE's failure at the boundaries. The MLE's parameter invariance allows its definition without reference to any specific parameterization: for $y \in \mathcal{Y}$,
$$\hat m(y) = \operatorname*{arg\,max}_{m \in M} m(y).$$
This coordinate-free definition emphasizes the MLE's geometric nature but obscures its boundary behavior.
For comparing two populations using log odds, the modified MLE yields the difference estimator $\hat\delta_c = \hat\theta_{1,c} - \hat\theta_{2,c}$, with orthogonalized generalized estimator:
$$g = \frac{Z_h - \rho_N Z_N}{\sqrt{1 - \rho_N^2}}, \qquad Z_h = \frac{\hat\delta_c - E_\theta \hat\delta_c}{\|\hat\delta_c - E_\theta \hat\delta_c\|_\theta}.$$
Like the orthogonalized score, $g$ exhibits smoothness in parameters and distributional properties in the sample space: $g(\cdot\,; y)$ is smooth for fixed $y$, and $g(\theta; Y) \in H^0_\theta$ for fixed $\theta$. Both are orthogonal to the nuisance space. The critical distinction lies in their geometric location: while $Z_\perp \in TM$, generally $g \notin TM$ unless $g = \pm Z_\perp$.
Figure 2 illustrates this distinction for the cancer data. The black curve shows $g(\cdot\,; y)$ for the observed sample with $c = 0.5$ (adding 0.5 to each cell) and the nuisance parameter fixed at $\hat\nu$. Each of the 480 sample points generates a smooth curve, with two shown in gray. Vertical lines at any $\delta_0$ intersect these curves to yield distributions with mean zero and unit variance.
As with $z_\perp$, horizontal lines at $\pm 2$ determine confidence intervals through their intersections with $g(\cdot\,; y)$. Steeper slopes produce narrower intervals, making the expected slope a natural efficiency measure. Differentiating the identity $E_\theta\, g(\theta; Y) = 0$ with respect to $\delta$ yields
$$E_\theta\!\left(\frac{\partial g}{\partial \delta}\right) = -\langle g, s_\delta \rangle_\theta = -\langle g, s^\perp \rangle_\theta = -\rho\, \lambda_\perp,$$
since $g \perp_\theta Z_N$. Rearranging gives the fundamental inequality
$$\frac{\left\{E_\theta(\partial g/\partial \delta)\right\}^2}{\operatorname{Var}_\theta(g)} = \rho^2 \lambda_\perp^2 \le \lambda_\perp^2, \tag{13}$$
where $\rho = \langle g, Z_\perp \rangle_\theta$ is the correlation between $g$ and $Z_\perp$. Vos and Wu [12] define the left-hand side of (13) as the information in $g$ for parameter $\delta$. The bound is attained only when $\rho^2 = 1$, establishing the optimality of the orthogonalized score:
$$\frac{\left\{E_\theta(\partial Z_\perp/\partial \delta)\right\}^2}{\operatorname{Var}_\theta(Z_\perp)} = \lambda_\perp^2.$$
The square of the correlation is the same for any reparameterization of $\delta$, so we can define the efficiency of $g$ as
$$\operatorname{eff}_\theta(g) = \rho^2.$$
The efficiency is independent of the choice of interest or nuisance parameter. For example, the efficiency will be the same whether we use the odds ratio or the log odds ratio. The information in $g$, like Fisher information, is a tensor.
The geometric interpretation is revealing: $\rho$ measures the cosine of the angle between $g$ and $Z_\perp$ in $H^0_\theta$. The information loss equals the squared sine of this angle times the total information $\lambda_\perp^2$. Estimators achieve full efficiency only when perfectly aligned with the orthogonalized score.
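A small sketch (ours; the evaluation point $p = (0.6, 0.4)$ and $c = 0.5$ are arbitrary choices) computes this angle directly, realizing $g$ and $Z_\perp$ as vectors in the 480-dimensional fiber and reporting the squared correlation as the efficiency.

```python
# Efficiency of the modified-MLE generalized estimator at one point of the
# manifold: squared correlation with Z_perp in the fiber H^0 (dim 480).
import numpy as np
from scipy.stats import binom

n1, n2, c = 23, 19, 0.5
p1, p2 = 0.6, 0.4
y1, y2 = np.meshgrid(np.arange(n1 + 1), np.arange(n2 + 1), indexing="ij")
w = binom.pmf(y1, n1, p1) * binom.pmf(y2, n2, p2)
inner = lambda h, g: np.sum(h * g * w)

# modified log odds MLEs and their difference, centered to lie in H^0
d_hat = (np.log((y1 + c) / (n1 - y1 + c))
         - np.log((y2 + c) / (n2 - y2 + c)))
h = d_hat - inner(d_hat, np.ones_like(w))

# orthonormal score directions, nuisance direction, and Z_perp
s1, s2 = y1 - n1 * p1, y2 - n2 * p2
l1, l2 = np.sqrt(n1 * p1 * (1 - p1)), np.sqrt(n2 * p2 * (1 - p2))
Z1, Z2 = s1 / l1, s2 / l2
Z_N = (l1 * Z1 + l2 * Z2) / np.hypot(l1, l2)
Z_perp = (l2 * Z1 - l1 * Z2) / np.hypot(l1, l2)

g = h - inner(h, Z_N) * Z_N                    # project out the nuisance direction
rho = inner(g, Z_perp) / np.sqrt(inner(g, g))  # cosine of the angle to Z_perp
print(rho**2)                                  # efficiency, < 1 in general
```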
A crucial distinction emerges when testing $\delta = 0$. Under this null hypothesis with simple random sampling, the standardized orthogonalized score becomes
$$Z_\perp = \frac{\sqrt{n_2}\, Z_1 - \sqrt{n_1}\, Z_2}{\sqrt{n_1 + n_2}},$$
which remains invariant across all parameterizations of $M$. This invariance reflects the geometric fact that $\delta = 0$ is equivalent to $m_1 = m_2$ regardless of the choice of parameterization.
In contrast, test statistics based on point estimators like $\hat\delta_c$ depend critically on the parameterization. Tests based on proportions, log proportions, and log odds yield different statistics with different null distributions, even though they test the same hypothesis. The orthogonalized score provides a canonical, parameterization-invariant test that achieves maximum power against local alternatives.
The 2-standard-deviation confidence interval ($g = \pm 2$) for the log odds ratio is (−1.09, 2.95). The exact 95% confidence interval is (−0.43, 2.39) for nuisance parameter equal to 29. The union of intervals over values of the nuisance parameter is at least (−0.68, 2.47).
5.3. Discussion
Table 2 presents confidence intervals for the log odds difference $\delta$ computed using various methods. These intervals reveal substantial variation in both width and location, highlighting the importance of understanding the underlying geometric principles.
The orthogonalized score interval, whether computed at $\hat\nu$ or maximized over all nuisance parameter values, falls within both the modified MLE and Fisher's exact test intervals for this particular dataset. However, this nesting relationship is sample-specific and should not guide method selection. The choice among methods should depend on their theoretical properties rather than their behavior for any particular observed data.
The orthogonalized score offers three key advantages:
It attains the Fisher information bound, achieving maximum efficiency among generalized estimators.
It requires no ad hoc modifications to handle boundary cases (unlike the MLE for log odds).
It provides parameterization-invariant inference for $\delta = 0$, yielding identical test statistics whether we parameterize using proportions, log proportions, or log odds.
The R (version 4.4.1) package exact2x2 (version 1.6.8) [14,15] implements several additional unconditional methods, each corresponding to different generalized estimators. While this diversity offers flexibility, it also highlights the need for principled comparison methods.
The geometric framework of generalized estimation provides this principled approach. By working in the Hilbert bundle, we obtain the following:
Unified treatment: Point estimators and test statistics become special cases of generalized estimators.
Parameter invariance: Generalized estimators transform properly under reparameterization.
Linear structure: The Hilbert bundle provides a natural vector space framework for combining and comparing estimators.
Consistent comparison: the information defined by Equation (13) offers a single efficiency measure, replacing the multiple criteria (bias, variance, MSE) used for point estimators.
This geometric perspective reveals why the orthogonalized score achieves optimality: it lies in the tangent bundle $TM$, while other generalized estimators reside only in the larger Hilbert bundle $H^0M$. The information loss of any estimator equals the total information times the squared sine of its angle from the tangent space—a geometric characterization that unifies and clarifies classical efficiency results.
6. Conclusions
This paper has demonstrated how the Hilbert bundle structure of statistical manifolds provides a unified geometric framework for statistical inference. By recognizing that points in a statistical manifold are probability distributions rather than abstract points, we extend the traditional tangent bundle framework to encompass a richer geometric structure that naturally accommodates both estimation and hypothesis testing.
The central insight is that generalized estimators—functions on the parameter space—serve as the fundamental inferential objects. The information in a generalized estimator $g$ captures both its smooth structure across models in $M$ and its distributional properties at each point. These dual aspects require different geometric descriptions: the smooth structure manifests through the graph of $g(\cdot\,; y)$ in the plane $\Theta \times \mathbb{R}$, while the distributional properties are naturally characterized within the Hilbert bundle $HM$.
The information bound emerges as a geometric principle: the mean slope of $g(\cdot\,; Y)$ equals $-\rho\,\lambda_\perp$, and its magnitude is maximized precisely when $g$ lies in the tangent bundle $TM$. Statistically, the bound is attained when $g = \pm Z_\perp$, the orthogonalized score. For any other generalized estimator, the information loss equals $(1 - \rho^2)\lambda_\perp^2$, where $\rho$ measures the correlation between $g$ and $Z_\perp$ as elements of $H^0_\theta$. This correlation has a direct geometric interpretation: it equals the cosine of the angle between these functions in the Hilbert space.
The presence of nuisance parameters introduces an additional layer of geometric structure. Information loss due to nuisance parameters equals $\rho_N^2$ times the total information, where $\rho_N$ is the correlation between the score $s$ and the nuisance direction $Z_N$. Crucially, this correlation—and hence the information loss—remains invariant under reparameterization of either interest or nuisance parameters. This invariance reflects a fundamental geometric fact: specifying a value for the interest parameter defines a submanifold $M_{\delta_0}$ rather than a single point in $M$. The increased inferential difficulty is precisely quantified by $\rho_N^2$, the squared correlation between the score and the tangent space of $M_{\delta_0}$.
Our analysis of contingency tables illustrates these principles concretely. The orthogonalized score achieves three key advantages over traditional approaches: it attains the information bound, requires no ad hoc modifications for boundary cases, and provides parameterization-invariant inference. The geometric framework explains why different confidence interval methods yield different results—they correspond to different generalized estimators with varying degrees of alignment with the tangent bundle.
This geometric perspective resolves longstanding tensions between estimation and testing frameworks. Rather than treating these as separate endeavors united only by computational tools like the likelihood function, we see them as complementary aspects of a single geometric structure. Point estimators, test statistics, and estimating equations all become special cases of generalized estimators, whose efficiency is uniformly measured by the information they carry.
The Hilbert bundle framework thus provides both conceptual clarity and practical benefits. It reveals why certain statistical procedures are optimal, quantifies the cost of using suboptimal methods, and suggests principled ways to construct new inferential procedures. By shifting focus from points in parameter space to functions on the manifold, we gain a richer, more complete understanding of statistical evidence and its geometric foundations.