Weighted Chernoff Information and Optimal Loss Exponent in Context-Sensitive Hypothesis Testing

Kelbert, Mark; Kalimulina, El’mira Yu.

doi:10.3390/e28050536

by Mark Kelbert ^1,2,* and El’mira Yu. Kalimulina ^3,4

¹ Laboratory of Stochastic Analysis and Its Applications, Department of Statistics and Data Analysis, National Research University Higher School of Economics, 101000 Moscow, Russia

² Department of Mathematics, Swansea University, Swansea SA2 8PP, UK

³ Institute for Information Transmission Problems, Russian Academy of Sciences (IITP RAS), 127051 Moscow, Russia

⁴ Faculty of Mechanics and Mathematics, Lomonosov Moscow State University, 119991 Moscow, Russia

^* Author to whom correspondence should be addressed.

Entropy 2026, 28(5), 536; https://doi.org/10.3390/e28050536

Submission received: 20 March 2026 / Revised: 30 April 2026 / Accepted: 2 May 2026 / Published: 8 May 2026

(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

We study binary hypothesis testing for i.i.d. observations under a multiplicative context weight. For the optimal weighted total loss, defined as the sum of weighted type-I and type-II losses, we prove the logarithmic asymptotic $L_{n}^{*} = \exp\{- n D_{C}^{w} (P, Q) + o (n)\}$, $n \to \infty$, where $D_{C}^{w}$ is the weighted Chernoff information. The single-letter form of the exponent relies on a structural assumption that the weight factorises across observations, $φ (x_{1}^{n}) = \prod_{i = 1}^{n} φ (x_{i})$; this restriction is essential for the single-letter representation and should be distinguished from the weaker qualitative description “multiplicative context weight”. The proof embeds the weighted geometric mixtures $φ p^{α} q^{1 - α}$ into a likelihood-ratio exponential family and identifies the rate through its log-normaliser. We also derive concentration bounds for the tilted weighted log-likelihood, obtain closed forms for Gaussian, Poisson, and exponential models, and extend the exponent characterisation to finitely many hypotheses.

Keywords:

hypothesis testing; weighted Chernoff information; weighted Bhattacharyya coefficient; exponential family; information geometry; context-sensitive loss

MSC:

62F03; 60F10

1. Introduction

Let

X

be a Polish space with its Borel

σ

-algebra and let

X_{1}^{n} = (X_{1}, \dots, X_{n})

be i.i.d.

X

-valued observations. We consider the simple hypotheses

H_{0} : X_{1}^{n} \sim P^{\otimes n} versus H_{1} : X_{1}^{n} \sim Q^{\otimes n},

where

P

and

Q

are probability measures on

X

dominated by a reference measure

μ

. Without loss of generality, one may take

μ = \frac{1}{2} (P + Q)

and write

p = \frac{d P}{d μ}

and

q = \frac{d Q}{d μ}

. In the unweighted setting, the optimal sum of type-I and type-II error probabilities is characterized by

TV (P^{\otimes n}, Q^{\otimes n})

and can be written as

\int_{X^{n}} min {p (x_{1}^{n}), q (x_{1}^{n})} d μ^{\otimes n} (x_{1}^{n}), p (x_{1}^{n}) = \prod_{i = 1}^{n} p (x_{i}), q (x_{1}^{n}) = \prod_{i = 1}^{n} q (x_{i}) .

(1)

In the standard (unweighted) Bayesian setting, the decay rate of the optimal total error probability is governed by the Chernoff information [1,2]:

\begin{aligned} ρ_{α} (p, q) & := \int_{X} p (x)^{α} q (x)^{1 - α} \, d μ (x), \quad α \in [0, 1], \\ ρ (p, q) & := \inf_{α \in [0, 1]} ρ_{α} (p, q), \\ D_{C} (P, Q) & := - \ln ρ (p, q) = \max_{α \in [0, 1]} [- \ln ρ_{α} (p, q)] . \end{aligned}

(2)

Here

ρ_{α}

is usually called the

α

-skewed Bhattacharyya affinity coefficient, and

ρ (p, q) = inf_{α \in [0, 1]} ρ_{α} (p, q)

is the affinity coefficient. In view of Hölder’s inequality,

ρ_{α} (p, q) \in [0, 1]

.

Chernoff also introduced a notion of asymptotic efficiency for comparing two experimental designs. If $ρ_{1}$ and $ρ_{2}$ denote the affinity coefficients of the two designs, the ratio $e = \frac{\ln ρ_{1}}{\ln ρ_{2}}$ is such that $n$ observations under the first design are equivalent (i.e., they give asymptotically the same total loss as $n \to \infty$) to $e n$ observations under the second design; see [1].

The paper studies a context-sensitive (weighted) analogue of this criterion and the logarithmic asymptotics of the optimal total loss as

n \to \infty

, in the framework of [3,4]. In the weighted setting, a nonnegative weight function

φ (x_{1}^{n})

reweights the loss of a wrong decision according to the realised sample. Thus,

φ

acts as a context factor that changes the relevance of different observations for the statistical task.

Weights of this form arise naturally whenever observations are not equally informative for the inference task. Two canonical mechanisms produce such

φ

. In importance-type reweighting, samples drawn under a proposal density g are used to perform inference with respect to a target h, and the Radon–Nikodym factor

φ (x) = h (x) / g (x)

enters the loss as a strictly positive (non-indicator) tilt; this is the mechanism underlying the context-sensitive framework of [3,4].

The second mechanism is state-dependent reliability: in applications, the informational value of an observation often depends on the underlying channel state. A canonical example, directly relevant to multiple hypothesis testing of transmission regimes, is a mobile communication channel modulated by a multi-zone coverage process (e.g., strong/weak/outage) along the receiver trajectory: samples acquired in outage carry little information about the regime and are weighted accordingly. Such reliability-weighted aggregation in multi-state channels was studied within multi-valued frameworks in [5,6].

Under the standard assumption that the modulating state at time i is determined by

X_{i}

alone, the resulting weight is a strictly positive bounded function

φ (x)

and extends multiplicatively to

X_{1}^{n}

. The weighted Chernoff information

D_{C}^{w} (P, Q)

then quantifies the effective discrimination rate under channel-dependent reliability and reduces to the classical rate

D_{C} (P, Q)

when

φ \equiv 1

. Further parametric instances (Gaussian, Poisson, exponential) are worked out in Section 4.

Throughout we assume that the weight is compatible with the i.i.d. structure and factorises across observations; by abuse of notation,

φ

denotes both the one-step weight and its product extension.

Assumption 1

(Factorised weight). The weight function

φ (x_{1}^{n})

satisfies

φ (x_{1}^{n}) = \prod_{i = 1}^{n} φ (x_{i}), φ \geq 0 .

(3)

Assumption 1 is the key single-letter hypothesis. It yields the weighted affinities

ρ_{α}^{w} (p, q) = \int_{X} φ (x) p {(x)}^{α} q {(x)}^{1 - α} d μ (x),

hence an additive logarithmic rate. For one observation and equal priors, the weighted Bayes risk equals

\frac{1}{2} \int_{X} φ (x) min {p (x), q (x)} d μ (x) .

Since

\min {a, b} \leq a^{α} b^{1 - α}

for every

α \in [0, 1]

,

\int_{X} φ (x) min {p (x), q (x)} d μ (x) \leq ρ_{α}^{w} (p, q),

and therefore

\int_{X} φ (x) min {p (x), q (x)} d μ (x) \leq exp {- D_{C}^{w} (P, Q)},

where

D_{C}^{w} (P, Q) : = {max}_{α \in [0, 1]} [- ln ρ_{α}^{w} (p, q)]

(see Definition 2). Under Assumption 1, the same bound factorises over n observations and yields the exponential scale

exp {- n D_{C}^{w} (P, Q)}

. Theorem 1 shows that this scale is exact on the logarithmic level.
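The one-observation bound above can be checked numerically on a finite alphabet. The following is a minimal sketch: the three-letter distributions `p`, `q` and the weight `phi` are illustrative assumptions, not values from the paper, and the maximisation over α is done by a plain grid search.

```python
import math

# Hypothetical three-letter alphabet; p, q and phi are illustrative choices.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
phi = [1.0, 0.5, 0.8]

def rho_w(alpha):
    """Weighted affinity: sum_x phi(x) p(x)^alpha q(x)^(1 - alpha)."""
    return sum(f * a**alpha * b**(1.0 - alpha) for f, a, b in zip(phi, p, q))

# Weighted overlap of one observation: sum_x phi(x) min{p(x), q(x)}.
overlap = sum(f * min(a, b) for f, a, b in zip(phi, p, q))

# Weighted Chernoff information via a grid search over alpha in [0, 1].
grid = [k / 1000.0 for k in range(1001)]
alpha_star = min(grid, key=rho_w)
d_c_w = -math.log(rho_w(alpha_star))

print("overlap              :", overlap)
print("exp(-D_C^w)          :", math.exp(-d_c_w))
print("grid minimiser alpha :", alpha_star)
```

For these values the overlap (0.51) stays below the minimal affinity (about 0.716), as the inequality requires; the gap between the two is what the $o(n)$ term absorbs for large $n$.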

1.1. Main Result and Contributions

Let

L_{n}^{*}

denote the optimal total context-sensitive loss (sum of weighted type-I and type-II losses, minimised over decision rules) for n i.i.d. observations under Assumption 1. Our main theorem (Theorem 1) proves the single-letter logarithmic asymptotic

L_{n}^{*} = exp {- n D_{C}^{w} (P, Q) + o (n)}, n \to \infty,

(4)

where the rate is the weighted Chernoff information

D_{C}^{w} (P, Q) = max_{α \in [0, 1]} [- ln \int_{X} φ (x) p {(x)}^{α} q {(x)}^{1 - α} d μ (x)] .

(5)

For

φ \equiv 1

, (5) reduces to the classical Chernoff information.

We also extend the exponent characterisation to a finite family of simple hypotheses: the optimal M-ary rate is the minimum pairwise weighted Chernoff information (cf. [7] in the unweighted case). A central technical device is an exponential-family representation of the weighted geometric mixtures

α \mapsto φ p^{α} q^{1 - α}

. This embeds the mixtures into a likelihood-ratio exponential family and identifies the exponent through the corresponding log-normaliser. We further derive concentration bounds for tilted weighted log-likelihood ratios and closed-form expressions for

D_{C}^{w}

in several parametric models; see Section 4.

1.2. Contributions

Items (N1)–(N4) below indicate new results; items (A1)–(A3) summarise definitions, geometric context, and tools adopted from the existing literature.

(N1): (New.) Theorem 1 establishes the logarithmic asymptotic (4) for the optimal weighted total loss under the factorised weight of Assumption 1, with rate given by the weighted Chernoff information (5).
(N2): (New.) The exponential-family representation of the weighted geometric mixtures $α \mapsto φ p^{α} q^{1 - α}$ (Section 3.2) and the resulting uniqueness of the optimal skewing parameter $α^{*}$ .
(N3): (New.) Concentration bounds for the tilted weighted log-likelihood and the finite-n tail bound of Theorem 2 (Section 3.4).
(N4): (New.) Closed-form expressions for $D_{C}^{w}$ in the Gaussian, Poisson, and exponential models (Section 4), and the M-ary extension showing that the optimal rate equals the minimum pairwise weighted Chernoff information.
(A1): (Adapted definitions.) The definitions of the weighted Bhattacharyya affinities and the weighted Chernoff information generalise the classical unweighted quantities of [1,2] and follow the context-sensitive framework of [3,4]; their asymptotic and information-geometric consequences developed below are new.
(A2): (Geometric context.) The information-geometric identities of Section 3.3 are derived in the spirit of the Chentsov–Amari–Nielsen framework [8,9,10,11] but are stated and proved for the tilted log-normaliser $\hat{F} (θ) = ln \int φ (x) e^{θ^{T} t (x) + k (x)} d μ (x)$ ; the unweighted limit $φ \equiv 1$ recovers the classical statements of [11,12].
(A3): (Standard tool.) The concentration argument uses the Azuma–Hoeffding/McDiarmid inequality [13,14]; the novelty lies in its application to the tilted weighted log-likelihood.

1.3. Related Work

The exponential theory of testing errors goes back to Chernoff [1] and Hoeffding [2]. The context-sensitive framework and the weighted information quantities used here were developed in [3,4]. The information-geometric viewpoint on Chernoff information originates with Chentsov [9]; the dually flat structure of exponential and mixture families and the associated

α

-divergences are developed in [8,10], and the Chernoff point is characterised as the intersection of an exponential geodesic with the Kullback–Leibler bisector in [11]. For

φ \equiv 1

, the likelihood-ratio exponential family description is given in [12]; the present paper extends this picture to the tilted integrand

φ p^{α} q^{1 - α}

. The minimum-pairwise principle for multiple testing is due to [7]. Weighting mechanisms for covariate-dependent relevance have also been studied outside the asymptotic error-exponent framework, e.g., adaptive-kernel conditional-independence testing [15].

1.4. Structure of the Paper

Section 2 introduces the weighted Bhattacharyya affinities and the weighted Chernoff information. Section 3 proves the main asymptotic result (4) and develops the exponential-family and information-geometric identities. Section 3.4 studies the tilted weighted log-likelihood and derives finite-n concentration bounds. Section 4 examines Gaussian, Poisson, and exponential models and includes the M-ary extension. Auxiliary computations are collected in the appendices.

2. Problem Set-Up and Weighted Divergences

2.1. Context-Sensitive Losses and Weighted Total Variation

We keep the binary i.i.d. model from Section 1 and work under Assumption 1. In particular,

p (x_{1}^{n}) = \prod_{i = 1}^{n} p (x_{i}), q (x_{1}^{n}) = \prod_{i = 1}^{n} q (x_{i}), φ (x_{1}^{n}) = \prod_{i = 1}^{n} φ (x_{i}) .

Define the

φ

-tilted (reweighted) densities

p^{*} (x) : = \frac{φ (x) p (x)}{E_{φ} (p)}, q^{*} (x) : = \frac{φ (x) q (x)}{E_{φ} (q)}, E_{φ} (p) : = \int φ (x) p (x) d μ (x),

(and similarly for

E_{φ} (q)

). Throughout this section, we assume that

E_{φ} (p), E_{φ} (q) \in (0, \infty)

. (Equivalently,

ρ_{0}^{w} (p, q), ρ_{1}^{w} (p, q) \in (0, \infty) .

) Then

p^{*}, q^{*}

are probability densities and, under

φ (x_{1}^{n}) = \prod_{i = 1}^{n} φ (x_{i})

, we have

φ (x_{1}^{n}) p (x_{1}^{n}) = E_{φ} {(p)}^{n} {(p^{*})}^{\otimes n} (x_{1}^{n})

(and similarly for q).

Under Assumption 1, we have

\int_{X^{n}} φ (x_{1}^{n}) p (x_{1}^{n}) d μ^{\otimes n} = {(E_{φ} (p))}^{n}, \int_{X^{n}} φ (x_{1}^{n}) q (x_{1}^{n}) d μ^{\otimes n} = {(E_{φ} (q))}^{n} .

Let

D

denote the class of (possibly randomised) decision rules

D : X^{n} \to [0, 1]

, where

D (x_{1}^{n})

is the probability of deciding in favour of

H_{1}

after observing

x_{1}^{n}

. (Deterministic rules correspond to

D \in {0, 1}

.)

For

D \in D

, define the context-sensitive type-I and type-II losses by

\begin{matrix} α_{φ} (D) & : = E_{P^{\otimes n}} [φ (X_{1}^{n}) D (X_{1}^{n})] = \int_{X^{n}} φ (x_{1}^{n}) D (x_{1}^{n}) p (x_{1}^{n}) d μ^{\otimes n} (x_{1}^{n}), \end{matrix}

(6)

\begin{matrix} β_{φ} (D) & : = E_{Q^{\otimes n}} [φ (X_{1}^{n}) (1 - D (X_{1}^{n}))] = \int_{X^{n}} φ (x_{1}^{n}) (1 - D (x_{1}^{n})) q (x_{1}^{n}) d μ^{\otimes n} (x_{1}^{n}), \end{matrix}

(7)

and the corresponding total loss

L_{n} (D) : = α_{φ} (D) + β_{φ} (D), L_{n}^{*} : = inf_{D \in D} L_{n} (D) .

Proposition 1

(Pointwise form of the optimal total loss). For each

n \geq 1

,

L_{n}^{*} = \int_{X^{n}} φ (x_{1}^{n}) min {p (x_{1}^{n}), q (x_{1}^{n})} d μ^{\otimes n} (x_{1}^{n}) .

(8)

Moreover, an optimal (deterministic) decision rule is given by the likelihood-ratio test

D_{n}^{*} (x_{1}^{n}) = 1 {q (x_{1}^{n}) \geq p (x_{1}^{n})}

(with any measurable tie-breaking on

{p = q}

).

Proof.

Fix

x_{1}^{n}

. The integrand in

L_{n} (D)

equals

φ (x_{1}^{n}) (p (x_{1}^{n}) D (x_{1}^{n}) + q (x_{1}^{n}) (1 - D (x_{1}^{n}))) = φ (x_{1}^{n}) (q (x_{1}^{n}) + D (x_{1}^{n}) (p (x_{1}^{n}) - q (x_{1}^{n}))) .

Minimising pointwise over

D (x_{1}^{n}) \in [0, 1]

yields

D_{n}^{*} (x_{1}^{n}) = 1

when

p (x_{1}^{n}) \leq q (x_{1}^{n})

and

D_{n}^{*} (x_{1}^{n}) = 0

when

p (x_{1}^{n}) > q (x_{1}^{n})

, giving (8). □
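Proposition 1 can be sanity-checked by brute force on a small example. The sketch below assumes a three-letter alphabet with illustrative values for `p1`, `q1`, `phi1` and takes $n = 2$. Because the loss is linear in $D$ at every point, it is enough to enumerate the $2^{9}$ deterministic rules.

```python
from itertools import product

# Illustrative one-letter model (assumed values) and sample size.
p1 = [0.6, 0.3, 0.1]
q1 = [0.2, 0.3, 0.5]
phi1 = [1.0, 0.5, 0.8]
n = 2

def prod_over(v, x):
    """Product of v[x_i] over the coordinates of the word x."""
    out = 1.0
    for i in x:
        out *= v[i]
    return out

words = list(product(range(3), repeat=n))
pn = [prod_over(p1, x) for x in words]
qn = [prod_over(q1, x) for x in words]
phin = [prod_over(phi1, x) for x in words]

def total_loss(rule):
    """L_n(D) = sum_x phi(x) (p(x) D(x) + q(x) (1 - D(x))) for a 0/1 rule D."""
    return sum(f * (a * d + b * (1 - d))
               for f, a, b, d in zip(phin, pn, qn, rule))

# Minimum over all deterministic rules versus the pointwise formula (8).
brute = min(total_loss(rule) for rule in product((0, 1), repeat=len(words)))
formula = sum(f * min(a, b) for f, a, b in zip(phin, pn, qn))

# Likelihood-ratio test: decide H1 where q >= p.
lrt = [1 if b >= a else 0 for a, b in zip(pn, qn)]

print(brute, formula, total_loss(lrt))
```

All three printed numbers coincide, which is the content of (8) together with the optimality of the likelihood-ratio test.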

We also use the weighted total variation distance

{TV}_{φ} (P^{\otimes n}, Q^{\otimes n}) : = \frac{1}{2} \int_{X^{n}} φ (x_{1}^{n}) | p (x_{1}^{n}) - q (x_{1}^{n}) | d μ^{\otimes n} (x_{1}^{n}) .

(9)

Remark 1.

For

φ \equiv 1

, this reduces to the usual total variation distance. If φ vanishes on a non-negligible set,

{TV}_{φ}

is, in general, a pseudo-distance; this is sufficient for our purposes since it characterises the weighted losses.

Using

min {a, b} = \frac{1}{2} (a + b - | a - b |)

in (8) and the definition of

{TV}_{φ}

yields

L_{n}^{*} = \frac{1}{2} ({(E_{φ} (p))}^{n} + {(E_{φ} (q))}^{n}) - {TV}_{φ} (P^{\otimes n}, Q^{\otimes n}) .

(10)
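Identity (10) is a direct consequence of $\min\{a, b\} = \frac{1}{2}(a + b - |a - b|)$ and can be verified numerically. The sketch below uses an assumed three-letter model (the values of `p1`, `q1`, `phi1` are illustrative) with $n = 3$.

```python
from itertools import product

# Illustrative one-letter model on a three-point alphabet (assumed values).
p1 = [0.6, 0.3, 0.1]
q1 = [0.2, 0.3, 0.5]
phi1 = [1.0, 0.5, 0.8]
n = 3

def prod_over(v, x):
    """Product of v[x_i] over the coordinates of the word x."""
    out = 1.0
    for i in x:
        out *= v[i]
    return out

words = list(product(range(3), repeat=n))
pn = [prod_over(p1, x) for x in words]
qn = [prod_over(q1, x) for x in words]
phin = [prod_over(phi1, x) for x in words]

e_p = sum(f * a for f, a in zip(phi1, p1))  # E_phi(p)
e_q = sum(f * b for f, b in zip(phi1, q1))  # E_phi(q)

optimal_loss = sum(f * min(a, b) for f, a, b in zip(phin, pn, qn))   # formula (8)
tv_phi = 0.5 * sum(f * abs(a - b) for f, a, b in zip(phin, pn, qn))  # definition (9)
identity_rhs = 0.5 * (e_p**n + e_q**n) - tv_phi                      # right side of (10)

print(optimal_loss, identity_rhs)
```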

2.2. Weighted Affinities and Chernoff Information

We introduce the weighted Bhattacharyya affinities and the weighted Chernoff information. Assume that

ρ_{α}^{w} (p, q) \in (0, \infty)

for all

α \in [0, 1]

.

Definition 1

(Weighted Bhattacharyya coefficient and distance). For

α \in [0, 1]

define the weighted α-skewed Bhattacharyya affinity coefficient

ρ_{α}^{w} (p, q) : = \int_{X} φ (x) p {(x)}^{α} q {(x)}^{1 - α} d μ (x),

(11)

and the corresponding weighted Bhattacharyya distance

D_{B, α}^{w} (p, q) : = - ln ρ_{α}^{w} (p, q) .

(12)

Definition 2

(Weighted Chernoff information). The weighted Chernoff information divergence between

P

and

Q

is

D_{C}^{w} (P, Q) = max_{α \in [0, 1]} [- ln \int_{X} φ (x) p {(x)}^{α} q {(x)}^{1 - α} d μ (x)] = max_{α \in [0, 1]} D_{B, α}^{w} (p, q) .

(13)

A maximiser

α^{*} = α^{*} (p, q)

in (13) is called the optimal Chernoff parameter.

Remark 2.

The weighted Chernoff information is symmetric:

D_{C}^{w} (P, Q) = D_{C}^{w} (Q, P)

, since

ρ_{α}^{w} (p, q) = ρ_{1 - α}^{w} (q, p)

. In general, however,

D_{C}^{w}

does not satisfy the triangle inequality and is therefore a divergence rather than a metric.

Remark 3.

Under Assumption 1, for every

α \in [0, 1]

,

\int_{X^{n}} φ (x_{1}^{n}) p {(x_{1}^{n})}^{α} q {(x_{1}^{n})}^{1 - α} d μ^{\otimes n} (x_{1}^{n}) = {(ρ_{α}^{w} (p, q))}^{n} .

Consequently, the weighted Bhattacharyya distances are additive in n and the corresponding Chernoff exponent is of single-letter form.
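The factorisation in Remark 3 can be illustrated by summing the $n$-letter integrand over all words of a small alphabet and comparing with the $n$-th power of the one-letter affinity. The model values below are assumptions chosen for illustration only.

```python
from itertools import product

# Illustrative one-letter model (assumed values).
p1 = [0.6, 0.3, 0.1]
q1 = [0.2, 0.3, 0.5]
phi1 = [1.0, 0.5, 0.8]

def rho_w(alpha):
    """One-letter weighted affinity rho_alpha^w(p, q)."""
    return sum(f * a**alpha * b**(1.0 - alpha)
               for f, a, b in zip(phi1, p1, q1))

def prod_over(v, x):
    """Product of v[x_i] over the coordinates of the word x."""
    out = 1.0
    for i in x:
        out *= v[i]
    return out

def rho_w_n(alpha, n):
    """n-letter affinity computed from the product densities and product weight."""
    total = 0.0
    for x in product(range(3), repeat=n):
        pn, qn, phin = prod_over(p1, x), prod_over(q1, x), prod_over(phi1, x)
        total += phin * pn**alpha * qn**(1.0 - alpha)
    return total

for alpha in (0.0, 0.3, 0.5, 1.0):
    print(alpha, rho_w_n(alpha, 3), rho_w(alpha)**3)
```

The two columns agree for every α, so $- \ln$ of the $n$-letter affinity is exactly $n$ times the one-letter weighted Bhattacharyya distance.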

3. Asymptotics and Information-Geometric Identities

Before stating the main theorem, we separate the two optimisations that appear throughout this section and should not be conflated. The total loss $L_{n} (D) = α_{φ} (D) + β_{φ} (D)$ is a non-negative functional of the decision rule and is minimised over $D \in \mathcal{D}$, giving the optimal loss $L_{n}^{*} = \inf_{D \in \mathcal{D}} L_{n} (D)$. The map $α \mapsto - \ln ρ_{α}^{w} (p, q)$ is a concave function of the skewing parameter (non-negative whenever $φ \leq 1$, since then $ρ_{α}^{w} \leq ρ_{α} \leq 1$) and is maximised over $α \in [0, 1]$, giving the weighted Chernoff information $D_{C}^{w} (P, Q) = \max_{α \in [0, 1]} [- \ln ρ_{α}^{w} (p, q)]$. Theorem 1 below connects the two via the single-letter asymptotic $L_{n}^{*} = \exp\{- n D_{C}^{w} (P, Q) + o (n)\}$.

3.1. Asymptotics of the Optimal Sum of Losses

Recall from Section 2 that

L_{n}^{*} : = inf_{D \in D} [α_{φ} (D) + β_{φ} (D)],

and that Proposition 1 yields (8). The next theorem identifies its exact logarithmic asymptotic rate with the weighted Chernoff information from Definition 2.

Theorem 1

(Optimal sum of context-sensitive losses). Consider the binary hypotheses

H_{0} : X_{1}^{n} \sim P^{\otimes n}

versus

H_{1} : X_{1}^{n} \sim Q^{\otimes n}

, under Assumption 1. Assume also

sup_{α \in [0, 1]} \int φ (x) | ln \frac{p (x)}{q (x)} | p {(x)}^{α} q {(x)}^{1 - α} d μ (x) < \infty .

(14)

Let

D

be the class of (possibly randomised) decision rules

D : X^{n} \to [0, 1]

, and let

α_{φ} (D)

,

β_{φ} (D)

be defined by (6) and (7). Assume that

p, q > 0

μ-a.e. and that

ρ_{α}^{w} (p, q) \in (0, \infty)

for all

α \in [0, 1]

.

Then, as

n \to \infty

,

L_{n}^{*} = exp {- n D_{C}^{w} (P, Q) + o (n)} .

(15)

Equivalently,

lim_{n \to \infty} - \frac{1}{n} ln L_{n}^{*} = D_{C}^{w} (P, Q) .

Proof.

By Proposition 1,

L_{n}^{*} = \int_{X^{n}} φ (x_{1}^{n}) min {p (x_{1}^{n}), q (x_{1}^{n})} d μ^{\otimes n} (x_{1}^{n}) .

For any

α \in [0, 1]

,

min {a, b} \leq a^{α} b^{1 - α}

, hence, by factorisation,

L_{n}^{*} \leq {(ρ_{α}^{w} (p, q))}^{n} .

Taking the infimum over

α \in [0, 1]

gives

\underset{n \to \infty}{lim inf} - \frac{1}{n} ln L_{n}^{*} \geq D_{C}^{w} (P, Q) .

Now fix

α^{*} \in arg {min}_{α \in [0, 1]} ρ_{α}^{w} (p, q)

and define

r_{α^{*}} (x) : = \frac{φ (x) p {(x)}^{α^{*}} q {(x)}^{1 - α^{*}}}{ρ_{α^{*}}^{w} (p, q)}, S_{n} : = \sum_{i = 1}^{n} ln \frac{p (X_{i})}{q (X_{i})} .

A direct change of measure yields

L_{n}^{*} = {(ρ_{α^{*}}^{w} (p, q))}^{n} E_{r_{α^{*}}^{\otimes n}} [e^{- α^{*} S_{n}} 1 {S_{n} > 0} + e^{(1 - α^{*}) S_{n}} 1 {S_{n} \leq 0}] .

The bracket is bounded above by 1.

Let $F (α) := \ln ρ_{α}^{w} (p, q)$. By Hölder’s inequality, $F$ is convex on $[0, 1]$. Under the regularity assumption (14), differentiation under the integral sign gives

F^{'} (α) = E_{r_{α}} [ln \frac{p (X)}{q (X)}] .

If

α^{*} \in (0, 1)

, then

F^{'} (α^{*}) = 0

. Since the summands of $S_{n}$ then have mean $F^{'} (α^{*}) = 0$ under $r_{α^{*}}$, the law of large numbers gives

S_{n} / n \to 0

in

r_{α^{*}}

-probability. Therefore, for every

ε > 0

,

E_{r_{α^{*}}^{\otimes n}} [e^{(1 - α^{*}) S_{n}} 1 {S_{n} \leq 0} + e^{- α^{*} S_{n}} 1 {S_{n} > 0}] \geq e^{- ε n} r_{α^{*}}^{\otimes n} (| S_{n} | \leq ε n) = exp {o (n)} .

Combined with the upper bound by 1, this implies

L_{n}^{*} = exp {- n D_{C}^{w} (P, Q) + o (n)} .

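In the boundary cases, write $m := F^{'} (α^{*}) = E_{r_{α^{*}}} [\ln \frac{p (X)}{q (X)}]$ for the mean of the summands of $S_{n}$ under $r_{α^{*}}$. If $α^{*} = 0$, convexity of $F$ and minimality at the left endpoint give $m \geq 0$; when $m > 0$, the strong LLN yields $1 \{S_{n} > 0\} e^{- α^{*} S_{n}} |_{α^{*} = 0} \to 1$ a.s. as $n \to \infty$, so the expectation of the bracket tends to 1 by bounded convergence. Similarly, if $α^{*} = 1$, then $m \leq 0$; when $m < 0$, $1 \{S_{n} \leq 0\} e^{(1 - α^{*}) S_{n}} |_{α^{*} = 1} \to 1$ a.s. as $n \to \infty$. In either boundary case with $m = 0$, the interior argument above applies verbatim, since it uses only $F^{'} (α^{*}) = 0$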

. This completes the proof. □
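The convergence in Theorem 1 can be observed numerically for a binary alphabet, where $L_{n}^{*}$ is computable exactly by grouping words according to their number of ones. The sketch below is illustrative only: the values of `p`, `q`, `phi` are assumptions, the maximisation over α uses a ternary search (valid because $ρ_{α}^{w}$ is convex in α), and sums are evaluated in log-space to avoid underflow.

```python
import math

# Illustrative binary model (assumed values): index 0 and 1 are the two letters.
p = (0.7, 0.3)
q = (0.4, 0.6)
phi = (0.9, 0.6)

def rho_w(alpha):
    """One-letter weighted affinity rho_alpha^w(p, q)."""
    return sum(f * a**alpha * b**(1.0 - alpha) for f, a, b in zip(phi, p, q))

def chernoff_w(iters=200):
    """Return (D_C^w, alpha*) by ternary search for the minimiser of rho_w on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if rho_w(m1) < rho_w(m2):
            hi = m2
        else:
            lo = m1
    a = 0.5 * (lo + hi)
    return -math.log(rho_w(a)), a

def log_optimal_loss(n):
    """ln L_n^* from (8), summing over the number k of ones in the word."""
    terms = []
    for k in range(n + 1):
        log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        log_phi = (n - k) * math.log(phi[0]) + k * math.log(phi[1])
        log_p = (n - k) * math.log(p[0]) + k * math.log(p[1])
        log_q = (n - k) * math.log(q[0]) + k * math.log(q[1])
        terms.append(log_binom + log_phi + min(log_p, log_q))
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

d, a = chernoff_w()
for n in (10, 100, 1000, 5000):
    print(n, -log_optimal_loss(n) / n, d)
```

The empirical rate $- \frac{1}{n} \ln L_{n}^{*}$ always lies above $D_{C}^{w}$ (this is the upper bound $L_{n}^{*} \leq (ρ_{α}^{w})^{n}$) and approaches it slowly as $n$ grows, consistent with a sub-exponential correction.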

Corollary 1

(Asymptotics of the weighted total variation). Under the assumptions of Theorem 1, the weighted total variation satisfies

{TV}_{φ} (P^{\otimes n}, Q^{\otimes n}) = \frac{1}{2} ({(E_{φ} (p))}^{n} + {(E_{φ} (q))}^{n}) - exp {- n D_{C}^{w} (P, Q) + o (n)}, n \to \infty,

(16)

where

E_{φ} (r) = \int_{X} φ (x) r (x) d μ (x)

.

Proof.

Combine the identity (10) with (15). □

Remark 4.

The weighted α-skewed Bhattacharyya distance (12) appears in many papers; see, e.g., [12]. Definition 2 shows that the weighted Chernoff information divergence is the weighted Bhattacharyya distance maximised over the skewing parameter α.

3.2. Exponential-Family Representation and Uniqueness of $α^{*}$

In order to develop an effective computational procedure and to connect the weighted Chernoff information to information geometry, we embed the weighted geometric mixtures of p and q into a one-parameter likelihood-ratio exponential family. For

α \in [0, 1]

define

Z_{p q} (α) : = \int_{X} φ (x) p {(x)}^{α} q {(x)}^{1 - α} d μ (x) = ρ_{α}^{w} (p, q),

(17)

and the corresponding family of normalised densities

E_{p q} = \{{(p q)}_{α} (x) : = \frac{φ (x) p {(x)}^{α} q {(x)}^{1 - α}}{Z_{p q} (α)} : α \in [0, 1]\} .

(18)

By assumption

Z_{p q} (α) \in (0, \infty)

, so

{(p q)}_{α}

is well-defined as a probability density w.r.t.

μ

.

Set

t (x) : = ln \frac{p (x)}{q (x)}

and

k_{p q} (x) : = ln φ (x) + ln q (x)

. Then,

{(p q)}_{α}

admits the exponential-family form

\begin{matrix} {(p q)}_{α} (x) & = exp \{α t (x) - F_{p q} (α) + k_{p q} (x)\}, \end{matrix}

(19)

\begin{matrix} F_{p q} (α) : & = ln Z_{p q} (α) = - D_{B, α}^{w} (p, q) . \end{matrix}

(20)

In particular,

t (X)

is a sufficient statistic for the family

E_{p q}

.

The log-normaliser

F_{p q}

is convex on

[0, 1]

; if

ln \frac{p}{q}

is not

μ

-a.e. constant on

{φ > 0}

, then

F_{p q}

is strictly convex and the maximiser

α^{*}

in (13) is unique. By Hölder’s inequality,

Z_{p q} (α) = \int_{X} {(φ p)}^{α} {(φ q)}^{1 - α} d μ \leq {(E_{φ} (p))}^{α} {(E_{φ} (q))}^{1 - α} .

Finally, note that

Z_{p q} (1) = E_{φ} (p)

and

Z_{p q} (0) = E_{φ} (q)

; hence,

{(p q)}_{1} (x) = \frac{φ (x) p (x)}{E_{φ} (p)}, {(p q)}_{0} (x) = \frac{φ (x) q (x)}{E_{φ} (q)},

so

E_{p q}

is an exponential arc between the tilted versions of

P

and

Q

. By this definition, the following identities hold:

\begin{matrix} D_{C}^{w} (p, q) = D_{B, α^{*} (p, q)}^{w} (p, q) = D_{B, α^{*} (q, p)}^{w} (q, p) = D_{C}^{w} (q, p) . \end{matrix}

(21)

3.3. Weighted Bregman Divergence and Information-Geometric Identities

3.3.1. Weighted KL Divergence and Weighted Bregman Divergence

This subsection collects information-geometric identities useful for analysing

ρ_{α}^{w}

and for computing the optimal Chernoff parameter

α^{*}

. We follow [3] for weighted Bregman divergences.

Let

E = {p_{θ} : θ \in Θ \subset R^{d}}

be a regular exponential family of densities (with respect to

μ

),

p_{θ} (x) = exp {θ^{T} t (x) - F (θ) + k (x)}, x \in X .

(22)

For a density r, set

E_{φ} (r) : = \int_{X} φ (x) r (x) d μ (x)

and write

E_{φ} (θ) : = E_{φ} (p_{θ})

.

Assume

E_{φ} (θ) \in (0, \infty)

for

θ \in Θ

and define the tilted log-normaliser

\hat{F} (θ) : = ln \int_{X} φ (x) e^{θ^{T} t (x) + k (x)} d μ (x) = F (θ) + ln E_{φ} (θ) .

(23)

Equivalently, the tilted density is

p_{θ}^{*} (x) = \frac{φ (x) p_{θ} (x)}{E_{φ} (θ)} .

(24)

Definition 3

(Weighted Kullback–Leibler divergence). For densities

p, q

on

X

define

D_{KL}^{w} (p ∥ q) : = \int_{X} φ (x) p (x) ln \frac{p (x)}{q (x)} d μ (x),

(25)

whenever the integral is well defined in

(- \infty, \infty]

.

Definition 4

(Weighted Bregman divergence). The weighted Bregman divergence associated with

(F, \hat{F})

is

\begin{aligned} B_{φ, F}^{w} (θ_{1}, θ_{2}) & := e^{\hat{F} (θ_{2}) - F (θ_{2})} [F (θ_{1}) - F (θ_{2}) - {(θ_{1} - θ_{2})}^{T} \nabla \hat{F} (θ_{2})] \\ & = E_{φ} (θ_{2}) [F (θ_{1}) - F (θ_{2}) - {(θ_{1} - θ_{2})}^{T} \nabla \hat{F} (θ_{2})] . \end{aligned}

(26)

Proposition 2

(Weighted KL as weighted Bregman divergence). For a regular exponential family

E = \{p_{θ} : θ \in Θ\}

, assume that the integral in (25) is well-defined in

(- \infty, \infty]

. Then for any

θ_{1}, θ_{2} \in Θ

,

D_{KL}^{w} (p_{θ_{1}} ∥ p_{θ_{2}}) = B_{φ, F}^{w} (θ_{2}, θ_{1}) .

(27)

Proof.

This identity is stated in [3] (Proposition 4.1); we give a short derivation for completeness. By (22),

ln \frac{p_{θ_{1}} (x)}{p_{θ_{2}} (x)} = {(θ_{1} - θ_{2})}^{T} t (x) - (F (θ_{1}) - F (θ_{2})) .

Substituting this into (25) yields

D_{KL}^{w} (p_{θ_{1}} ∥ p_{θ_{2}}) = {(θ_{1} - θ_{2})}^{T} \int_{X} φ (x) t (x) p_{θ_{1}} (x) d μ (x) - (F (θ_{1}) - F (θ_{2})) E_{φ} (θ_{1}) .

Using (23) and differentiation under the integral sign (regularity of

E

),

\nabla \hat{F} (θ_{1}) = \frac{\int_{X} φ (x) t (x) e^{θ_{1}^{T} t (x) + k (x)} d μ (x)}{\int_{X} φ (x) e^{θ_{1}^{T} t (x) + k (x)} d μ (x)} = \frac{\int_{X} φ (x) t (x) p_{θ_{1}} (x) d μ (x)}{E_{φ} (θ_{1})} .

Hence,

\int φ t p_{θ_{1}} d μ = E_{φ} (θ_{1}) \nabla \hat{F} (θ_{1})

, and therefore

D_{KL}^{w} (p_{θ_{1}} ∥ p_{θ_{2}}) = E_{φ} (θ_{1}) [F (θ_{2}) - F (θ_{1}) - {(θ_{2} - θ_{1})}^{T} \nabla \hat{F} (θ_{1})] = B_{φ, F}^{w} (θ_{2}, θ_{1}),

which proves (27). □
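Identity (27) can be checked numerically in a concrete family. The sketch below assumes the Poisson family in its natural parametrisation ($t (x) = x$, $F (θ) = e^{θ}$, $k (x) = - \ln x!$) together with a geometric weight $φ (x) = s^{x}$; this weight is a hypothetical choice made here because it gives the closed form $\hat{F} (θ) = s e^{θ}$, and it is not claimed to be the weight used in Section 4.

```python
import math

s = 0.8                                   # weight phi(x) = s**x (assumption)
theta1, theta2 = math.log(4.0), math.log(1.5)

def F(theta):
    """Poisson log-normaliser."""
    return math.exp(theta)

def F_hat(theta):
    """Tilted log-normaliser: ln sum_x s^x e^{theta x} / x! = s e^theta."""
    return s * math.exp(theta)

def dF_hat(theta):
    """Derivative of the tilted log-normaliser."""
    return s * math.exp(theta)

def E_phi(theta):
    """E_phi(theta) = exp(F_hat(theta) - F(theta)), from (23)."""
    return math.exp(F_hat(theta) - F(theta))

def log_p(theta, x):
    """Log of the Poisson density p_theta(x)."""
    return theta * x - F(theta) - math.lgamma(x + 1)

def weighted_kl(t1, t2, x_max=200):
    """Definition 3 by direct (truncated) summation."""
    return sum(s**x * math.exp(log_p(t1, x)) * (log_p(t1, x) - log_p(t2, x))
               for x in range(x_max + 1))

def weighted_bregman(a, b):
    """Definition 4 with d = 1: E_phi(b) [F(a) - F(b) - (a - b) F_hat'(b)]."""
    return E_phi(b) * (F(a) - F(b) - (a - b) * dF_hat(b))

lhs = weighted_kl(theta1, theta2)
rhs = weighted_bregman(theta2, theta1)    # note the swapped arguments in (27)
print(lhs, rhs)
```

The truncation at `x_max = 200` is harmless here because the Poisson tails with means 4 and 1.5 are negligible far below that point.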

Proposition 3

(Primal–dual identities for (weighted) Bregman divergences). Let F be a log-normaliser of a regular exponential family and let

F^{*}

denote its Legendre transform. Write

θ^{*} = \nabla F (θ)

and

θ = \nabla F^{*} (θ^{*})

.

(a) Weighted one-parameter identity. Assume

d = 1

(one-parameter case) and let

θ_{i}^{*} : = F^{'} (θ_{i})

. Then, the following weighted analogue of the classical Bregman duality holds:

B_{φ, F}^{w} (θ_{1}, θ_{2}) = B_{φ, F^{*}}^{w} (θ_{2}^{*}, θ_{1}^{*}) - (θ_{1} - θ_{2}) ln E_{φ} (θ_{2}) + (θ_{2}^{*} - θ_{1}^{*}) ln E_{φ} (θ_{1}),

(28)

where

B_{φ, F}^{w}

is as in Definition 4 and

B_{φ, F^{*}}^{w}

is defined analogously (with F replaced by

F^{*}

and with the convention

E_{φ} (θ^{*}) : = E_{φ} (θ)

under

θ = \nabla F^{*} (θ^{*})

).

(b) Classical identity. For any

d \geq 1

, the (unweighted) Bregman divergence admits the standard Legendre representation

B_{F} (θ_{0}, θ_{1}) = F (θ_{0}) + F^{*} (θ_{1}^{*}) - θ_{0}^{T} θ_{1}^{*}, θ_{1}^{*} = \nabla F (θ_{1}),

(29)

where

B_{F} (θ_{0}, θ_{1}) = F (θ_{0}) - F (θ_{1}) - {(θ_{0} - θ_{1})}^{T} \nabla F (θ_{1})

is the usual (unweighted) Bregman divergence.

Proof.

Part (a) is a weighted extension of the classical duality

B_{F} (θ_{1}, θ_{2}) = B_{F^{*}} (θ_{2}^{*}, θ_{1}^{*})

and follows by combining the weighted representation (26) with Legendre relations; see also [3]. Part (b) is standard. □

3.3.2. Weighted Chernoff/Bhattacharyya Quantities Inside an Exponential Family

Let

p_{θ_{1}}, p_{θ_{2}} \in E

and

θ_{α} : = α θ_{1} + (1 - α) θ_{2}

. A direct calculation yields

\begin{aligned} \ln ρ_{α}^{w} (p_{θ_{1}}, p_{θ_{2}}) & = \ln \int_{X} φ (x) p_{θ_{1}} {(x)}^{α} p_{θ_{2}} {(x)}^{1 - α} \, d μ (x) \\ & = \hat{F} (θ_{α}) - α F (θ_{1}) - (1 - α) F (θ_{2}) . \end{aligned}

(30)

Consequently,

\begin{aligned} D_{B, α}^{w} (p_{θ_{1}}, p_{θ_{2}}) & = α F (θ_{1}) + (1 - α) F (θ_{2}) - \hat{F} (θ_{α}) \\ & = U_{F, α} (θ_{1}, θ_{2}) - \ln E_{φ} (θ_{α}), \end{aligned}

(31)

where

U_{F, α} (θ_{1}, θ_{2}) : = α F (θ_{1}) + (1 - α) F (θ_{2}) - F (θ_{α})

is the (unweighted) Jensen/Burbea–Rao divergence induced by F. In particular, when

φ \equiv 1

we have

\hat{F} \equiv F

and

D_{B, α}^{w} (p_{θ_{1}}, p_{θ_{2}}) = U_{F, α} (θ_{1}, θ_{2})

.
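The closed form (31) can be compared with a direct evaluation of the weighted affinity. As a minimal sketch, assume again the Poisson family with $F (θ) = e^{θ}$ and the hypothetical weight $φ (x) = s^{x}$, for which $\hat{F} (θ) = s e^{θ}$; the parameter values are illustrative.

```python
import math

s = 0.8                                   # weight phi(x) = s**x (assumption)
theta1, theta2 = math.log(4.0), math.log(1.5)

def F(theta):
    return math.exp(theta)                # Poisson log-normaliser

def F_hat(theta):
    return s * math.exp(theta)            # tilted log-normaliser for phi = s^x

def log_p(theta, x):
    return theta * x - F(theta) - math.lgamma(x + 1)

def d_b_direct(alpha, x_max=200):
    """-ln sum_x phi(x) p1(x)^alpha p2(x)^(1 - alpha), truncated at x_max."""
    total = sum(s**x * math.exp(alpha * log_p(theta1, x)
                                + (1.0 - alpha) * log_p(theta2, x))
                for x in range(x_max + 1))
    return -math.log(total)

def d_b_formula(alpha):
    """First line of (31)."""
    theta_a = alpha * theta1 + (1.0 - alpha) * theta2
    return alpha * F(theta1) + (1.0 - alpha) * F(theta2) - F_hat(theta_a)

def d_b_jensen(alpha):
    """Second line of (31): Jensen divergence minus ln E_phi(theta_alpha)."""
    theta_a = alpha * theta1 + (1.0 - alpha) * theta2
    jensen = alpha * F(theta1) + (1.0 - alpha) * F(theta2) - F(theta_a)
    log_e_phi = F_hat(theta_a) - F(theta_a)
    return jensen - log_e_phi

for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(alpha, d_b_direct(alpha), d_b_formula(alpha), d_b_jensen(alpha))
```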

Remark 5

(Geometric mixtures and tilting by

φ

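). When $φ \equiv 1$, we have $\hat{F} \equiv F$ and the normalised geometric mixture of $p_{θ_{1}}^{α} p_{θ_{2}}^{1 - α}$ belongs to the same exponential family: it equals $p_{θ_{α}}$ with $θ_{α} = α θ_{1} + (1 - α) θ_{2}$. For general $φ$, the normalised weighted mixture $φ p_{θ_{1}}^{α} p_{θ_{2}}^{1 - α}$ is the tilted density $p_{θ_{α}}^{*}$ of (24).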

Proposition 4

(Optimal Chernoff parameter in an exponential family). Assume that

\hat{F}

is strictly convex on the segment

[θ_{1}, θ_{2}]

and that the maximiser

α^{*} \in (0, 1)

exists. Then,

α^{*}

is unique and satisfies

{(θ_{1} - θ_{2})}^{T} \nabla \hat{F} (θ_{α^{*}}) = F (θ_{1}) - F (θ_{2}), θ_{α^{*}} = α^{*} θ_{1} + (1 - α^{*}) θ_{2},

(32)

with

\nabla \hat{F} (θ) = E_{p_{θ}^{*}} [t (X)]

.

Proof.

Differentiate (31) with respect to

α

and use

\frac{d}{d α} θ_{α} = θ_{1} - θ_{2}

. Strict convexity of

\hat{F}

on

[θ_{1}, θ_{2}]

implies strict concavity of

α \mapsto D_{B, α}^{w} (p_{θ_{1}}, p_{θ_{2}})

, hence uniqueness. □

Proposition 5

(Chernoff information as a Jensen-type divergence and a Bregman bisector). Let

p_{θ_{1}}, p_{θ_{2}} \in E

and assume that the maximiser

α^{*} \in (0, 1)

in Definition 2 exists and is unique. Set

θ_{α} = α θ_{1} + (1 - α) θ_{2}

. Then

D_{C}^{w} (p_{θ_{1}}, p_{θ_{2}}) = D_{B, α^{*}}^{w} (p_{θ_{1}}, p_{θ_{2}}) = α^{*} F (θ_{1}) + (1 - α^{*}) F (θ_{2}) - \hat{F} (θ_{α^{*}}) .

(33)

Moreover,

θ_{α^{*}}

is characterised by the weighted Bregman bisector condition

B_{φ, F}^{w} (θ_{1}, θ_{α^{*}}) = B_{φ, F}^{w} (θ_{2}, θ_{α^{*}}),

(34)

and the common value recovers the Chernoff information as

\begin{matrix} D_{C}^{w} (p_{θ_{1}}, p_{θ_{2}}) & = \frac{1}{E_{φ} (θ_{α^{*}})} B_{φ, F}^{w} (θ_{1}, θ_{α^{*}}) - ln E_{φ} (θ_{α^{*}}) \end{matrix}

(35)

\begin{matrix} = \frac{1}{E_{φ} (θ_{α^{*}})} B_{φ, F}^{w} (θ_{2}, θ_{α^{*}}) - ln E_{φ} (θ_{α^{*}}) . \end{matrix}

(36)

In the special case

φ \equiv 1

, we have

\hat{F} \equiv F

and

E_{φ} (θ) \equiv 1

, so that (33) reduces to the classical Jensen divergence induced by F and (35) becomes

D_{C} (p_{θ_{1}}, p_{θ_{2}}) = B_{F} (θ_{1}, θ_{α^{*}}) = B_{F} (θ_{2}, θ_{α^{*}})

.

Proof.

Equation (33) is (31) at

α = α^{*}

. For (34), expand

B_{φ, F}^{w} (θ_{i}, θ_{α^{*}}) = E_{φ} (θ_{α^{*}}) [F (θ_{i}) - F (θ_{α^{*}}) - {(θ_{i} - θ_{α^{*}})}^{T} \nabla \hat{F} (θ_{α^{*}})]

and use (32) to see that the difference vanishes. Finally, substituting

θ_{1} - θ_{α^{*}} = (1 - α^{*}) (θ_{1} - θ_{2})

into

B_{φ, F}^{w} (θ_{1}, θ_{α^{*}}) / E_{φ} (θ_{α^{*}})

and using (32) gives

\frac{1}{E_{φ} (θ_{α^{*}})} B_{φ, F}^{w} (θ_{1}, θ_{α^{*}}) = α^{*} F (θ_{1}) + (1 - α^{*}) F (θ_{2}) - F (θ_{α^{*}}) .

Since

\hat{F} (θ_{α^{*}}) = F (θ_{α^{*}}) + ln E_{φ} (θ_{α^{*}})

, this yields (35). □

3.3.3. Derivative and Weighted KL

Recall

F_{p q} (α) = ln Z_{p q} (α) = ln ρ_{α}^{w} (p, q)

. By (14), differentiation under the integral sign is justified, and we have

F_{p q}^{'} (α) = E_{{(p q)}_{α}} [ln \frac{p (X)}{q (X)}],

(37)

where

{(p q)}_{α}

is the Chernoff-tilted density from (18). In particular,

F_{p q}^{'} (1) = \frac{1}{E_{φ} (p)} \int_{X} φ (x) p (x) ln \frac{p (x)}{q (x)} d μ (x) = \frac{1}{E_{φ} (p)} D_{KL}^{w} (p ∥ q) .

(38)

Analogously,

F_{p q}^{'} (0) = - \frac{1}{E_{φ} (q)} D_{KL}^{w} (q ∥ p)

.
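Formulas (37) and (38) lend themselves to a quick numerical check, and (37) also gives a practical way to locate an interior $α^{*}$ by bisection on $F_{p q}^{'}$. The sketch below uses an assumed three-letter model; for these particular values the stationarity condition reduces to $6.25^{α} = 2$, which provides an independent reference value.

```python
import math

# Illustrative three-letter model (assumed values).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
phi = [1.0, 0.5, 0.8]

def Z(alpha):
    """Z_pq(alpha) = rho_alpha^w(p, q), as in (17)."""
    return sum(f * a**alpha * b**(1.0 - alpha) for f, a, b in zip(phi, p, q))

def F(alpha):
    return math.log(Z(alpha))

def dF(alpha):
    """Formula (37): mean of ln(p/q) under the tilted density (pq)_alpha."""
    z = Z(alpha)
    return sum(f * a**alpha * b**(1.0 - alpha) / z * math.log(a / b)
               for f, a, b in zip(phi, p, q))

def weighted_kl(a, b):
    """Definition 3 on a finite alphabet."""
    return sum(f * x * math.log(x / y) for f, x, y in zip(phi, a, b))

def bisect_root(g, lo, hi, tol=1e-12):
    """Root of an increasing function g with g(lo) < 0 < g(hi)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

h = 1e-5
finite_diff = (F(0.4 + h) - F(0.4 - h)) / (2.0 * h)
alpha_star = bisect_root(dF, 0.0, 1.0)

print("F'(0.4): formula", dF(0.4), "finite difference", finite_diff)
print("F'(1)  :", dF(1.0), "vs", weighted_kl(p, q) / Z(1.0))
print("F'(0)  :", dF(0.0), "vs", -weighted_kl(q, p) / Z(0.0))
print("alpha* :", alpha_star)
```

Bisection is applicable because $F_{p q}$ is convex, so $F_{p q}^{'}$ is non-decreasing; it requires $F_{p q}^{'} (0) < 0 < F_{p q}^{'} (1)$, which holds for the values chosen here.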

3.3.4. Chernoff–KL

Lemma 1.

Let

α^{*}

be a maximiser in Definition 2 and assume that

α^{*} \in (0, 1)

(so that

F_{p q}^{'} (α^{*}) = 0

below). Set

F_{p q} (α) : = ln ρ_{α}^{w} (p, q), r_{α} : = {(p q)}_{α} .

Then

D_{C}^{w} (p, q) = D_{KL} (r_{α^{*}} ∥ r_{1}) - ln E_{φ} (p) = D_{KL} (r_{α^{*}} ∥ r_{0}) - ln E_{φ} (q),

(39)

where

r_{1} (x) = φ (x) p (x) / E_{φ} (p)

and

r_{0} (x) = φ (x) q (x) / E_{φ} (q)

.

Here,

D_{KL}

denotes the standard (unweighted) Kullback–Leibler divergence.

Proof.

A direct computation yields, for

α \in [0, 1]

,

\begin{matrix} D_{KL} (r_{α} ∥ r_{1}) & = - (1 - α) F_{p q}^{'} (α) - F_{p q} (α) + F_{p q} (1), \\ D_{KL} (r_{α} ∥ r_{0}) & = α F_{p q}^{'} (α) - F_{p q} (α) + F_{p q} (0) . \end{matrix}

At

α = α^{*}

, we have

F_{p q}^{'} (α^{*}) = 0

. Since

F_{p q} (1) = ln E_{φ} (p)

and

F_{p q} (0) = ln E_{φ} (q)

, the claim follows from

D_{C}^{w} (p, q) = - F_{p q} (α^{*})

. □

Corollary 2

(Chernoff information as a Bregman divergence on the Chernoff arc). Let

F_{p q} (α) : = ln ρ_{α}^{w} (p, q)

and define the one-dimensional Bregman divergence

B_{F_{p q}} (a, b) : = F_{p q} (a) - F_{p q} (b) - (a - b) F_{p q}^{'} (b) .

Assume that the maximiser

α^{*} \in (0, 1)

in Definition 2 is interior, so that

F_{p q}^{'} (α^{*}) = 0

. Then

D_{C}^{w} (p, q) = B_{F_{p q}} (1, α^{*}) - ln E_{φ} (p) = B_{F_{p q}} (0, α^{*}) - ln E_{φ} (q) .

(40)

Equivalently,

\begin{matrix} B_{F_{p q}} (1, α^{*}) & = D_{KL} (r_{α^{*}} ∥ r_{1}) = D_{C}^{w} (p, q) + ln E_{φ} (p), \\ B_{F_{p q}} (0, α^{*}) & = D_{KL} (r_{α^{*}} ∥ r_{0}) = D_{C}^{w} (p, q) + ln E_{φ} (q), \end{matrix}

with

r_{α} = {(p q)}_{α}

,

r_{1} = φ p / E_{φ} (p)

and

r_{0} = φ q / E_{φ} (q)

.

Proof.

In the Chernoff exponential family

{r_{α}}

with log-normaliser

F_{p q}

, the KL–Bregman identity gives

D_{KL} (r_{α^{*}} ∥ r_{1}) = B_{F_{p q}} (1, α^{*})

and

D_{KL} (r_{α^{*}} ∥ r_{0}) = B_{F_{p q}} (0, α^{*})

. Since

F_{p q}^{'} (α^{*}) = 0

and

F_{p q} (1) = ln E_{φ} (p)

,

F_{p q} (0) = ln E_{φ} (q)

, we obtain

B_{F_{p q}} (1, α^{*}) = - F_{p q} (α^{*}) + F_{p q} (1) = D_{C}^{w} (p, q) + ln E_{φ} (p),

and similarly for 0. Substituting this into Lemma 1 yields (40). □
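The identities (39) and (40) can be checked numerically on a small finite alphabet. The sketch below is only an illustration: the three-point densities `p`, `q` and the weight `phi` are hypothetical values chosen so that the maximiser is interior, and all function names are ours rather than the paper's.

```python
import math

# Hypothetical three-point example (values are NOT from the paper).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
phi = [1.0, 2.0, 1.5]

def unnormalised(alpha):
    """phi(x) p(x)^alpha q(x)^(1-alpha) for each point x."""
    return [w * a**alpha * b**(1.0 - alpha) for w, a, b in zip(phi, p, q)]

def F(alpha):
    """Log-normaliser F_pq(alpha) = ln rho_alpha^w(p, q)."""
    return math.log(sum(unnormalised(alpha)))

def tilted(alpha):
    """Chernoff-tilted density r_alpha = (pq)_alpha."""
    u = unnormalised(alpha)
    z = sum(u)
    return [v / z for v in u]

def F_prime(alpha):
    """F'_pq(alpha) = E_{r_alpha}[ln p/q], cf. (37)."""
    return sum(r * math.log(a / b) for r, a, b in zip(tilted(alpha), p, q))

def kl(r, s):
    """Standard (unweighted) KL divergence between two strictly positive pmfs."""
    return sum(ri * math.log(ri / si) for ri, si in zip(r, s))

def bregman(a, b):
    """One-dimensional Bregman divergence B_F(a, b) of Corollary 2."""
    return F(a) - F(b) - (a - b) * F_prime(b)

# F is convex, so F' is increasing; locate the interior root by bisection.
lo, hi = 0.0, 1.0
assert F_prime(lo) < 0.0 < F_prime(hi)
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if F_prime(mid) < 0.0:
        lo = mid
    else:
        hi = mid
alpha_star = 0.5 * (lo + hi)

D_C = -F(alpha_star)                                   # weighted Chernoff information
r_star, r1, r0 = tilted(alpha_star), tilted(1.0), tilted(0.0)
lemma1_p = kl(r_star, r1) - F(1.0)                     # first expression in (39)
lemma1_q = kl(r_star, r0) - F(0.0)                     # second expression in (39)
cor2_p = bregman(1.0, alpha_star) - F(1.0)             # first expression in (40)
cor2_q = bregman(0.0, alpha_star) - F(0.0)             # second expression in (40)
print(alpha_star, D_C, lemma1_p, lemma1_q, cor2_p, cor2_q)
```

All four expressions should agree with `D_C` up to rounding, because the two KL formulas in the proof of Lemma 1 hold for every α and the derivative term vanishes at the interior maximiser.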

Remark 6

(One-parameter case). Assume

d = 1

,

θ_{1} \neq θ_{2}

, and that the maximiser

α^{*} \in (0, 1)

in Definition 2 is interior. Assume moreover that

{\hat{F}}^{'}

is strictly increasing on Θ and set

\hat{G} = {({\hat{F}}^{'})}^{- 1}

. Then (32) yields

α^{*} = \frac{1}{θ_{1} - θ_{2}} (\hat{G} (\frac{F (θ_{1}) - F (θ_{2})}{θ_{1} - θ_{2}}) - θ_{2}) .

(41)

When

φ \equiv 1

, we have

\hat{F} \equiv F

and (41) reduces to the classical formula.

To illustrate the general identities above, we provide explicit expressions for

D_{B, α}^{w}

and

D_{C}^{w}

in several parametric settings (Gaussian, Poisson, and exponential); see Section 4.

Remark 7.

The Bregman representation (40) identifies

D_{C}^{w} (P, Q)

with a Bregman divergence on the tilted log-normaliser

\hat{F}

; geometrically, the optimiser

α^{*}

marks the intersection of the exponential geodesic of the tilted family

{φ p^{α} q^{1 - α}}_{α \in [0, 1]}

with the weighted Kullback–Leibler bisector, generalising the unweighted characterisation of [11,12]. In the exponential-family setting, the computation of

D_{C}^{w}

therefore reduces to

\hat{F}

and its gradient: once

\hat{F}

is available in closed form,

α^{*}

is determined by (41) in the one-parameter case and by a monotone equation in the natural-parameter space in general, without evaluating

ρ_{α}^{w}

for each α. The examples of Section 4 illustrate this reduction.

3.4. Tilted Weighted Likelihood and Concentration Bounds

Although the optimal rule in the context-sensitive problem is still the usual likelihood-ratio test based on

q / p

(cf. Section 3), it is convenient to work with the tilted ratio

q^{*} / p^{*}

. The factor

φ

cancels pointwise in

q^{*} / p^{*}

and enters only through the normalisation constants

E_{φ} (p)

and

E_{φ} (q)

. We record two consequences: a large-deviation representation for

L^{*} / n

via the cumulant generating functions

ψ_{P}, ψ_{Q}

and their Legendre transforms, and a finite-n concentration bound based on a martingale argument.

For the tilted distributions, the log-likelihood takes the form

\begin{matrix} L^{*} (X_{1}^{n}) = L^{*} (X_{1}, \dots, X_{n}) = \sum_{i = 1}^{n} ln \frac{q^{*} (X_{i})}{p^{*} (X_{i})} \\ = \sum_{i = 1}^{n} ln \frac{q (X_{i})}{p (X_{i})} - n ln E_{φ} (q) + n ln E_{φ} (p) . \end{matrix}

(42)

Here

E_{φ} (p) = \int φ (x) p (x) d μ (x)

.

In particular, since

ln \frac{q^{*} (x)}{p^{*} (x)} = ln \frac{q (x)}{p (x)} + ln E_{φ} (p) - ln E_{φ} (q),

we may equivalently rewrite likelihood-ratio threshold rules in terms of

L^{*}

. For example,

\sum_{i = 1}^{n} ln \frac{q (X_{i})}{p (X_{i})} \geq 0 ⟺ L^{*} (X_{1}^{n}) \geq n (ln E_{φ} (p) - ln E_{φ} (q)) .

Thus,

L^{*}

is the usual log-likelihood ratio, shifted by a constant determined by the context weight

φ

.

The log of the moment generating function and its Legendre transform take the form

\begin{matrix} ψ_{P} (α) & = ln E_{P} [e^{α ln \frac{q^{*} (X)}{p^{*} (X)}}] \\ = ln \int_{X} q {(x)}^{α} p {(x)}^{1 - α} d μ (x) - α ln E_{φ} (q) + α ln E_{φ} (p), \\ I_{P} (r) & = sup_{α} [α r - ψ_{P} (α)], \end{matrix}

(43)

where

α

ranges over the set

{α \in R : ψ_{P} (α) < \infty}

(and similarly for

ψ_{Q}

).

Similarly,

\begin{matrix} ψ_{Q} (α) : & = ln E_{Q} [e^{α ln \frac{q^{*} (X)}{p^{*} (X)}}] = ln E_{P} [e^{ln \frac{q (X)}{p (X)} + α ln \frac{q^{*} (X)}{p^{*} (X)}}] \\ = ln \int_{X} q {(x)}^{α + 1} p {(x)}^{- α} d μ (x) - α ln E_{φ} (q) + α ln E_{φ} (p) . \end{matrix}

(44)

This implies the relation of Legendre transforms

I_{Q} (r) = I_{P} (r) - r + ln E_{φ} (p) - ln E_{φ} (q) .

(45)
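Relation (45) follows from the shift identity ψ_Q(α) = ψ_P(α + 1) − ln E_φ(p) + ln E_φ(q), which is read off from (43) and (44). A minimal numerical sketch on a two-point alphabet is given below; the densities, the weight and the helper names are hypothetical, and the supremum is approximated by a ternary search over a bounded interval that is assumed to contain the maximiser.

```python
import math

# Hypothetical two-point example (values are NOT from the paper).
p = [0.7, 0.3]
q = [0.4, 0.6]
phi = [1.0, 2.0]

E_p = sum(w * a for w, a in zip(phi, p))      # E_phi(p)
E_q = sum(w * b for w, b in zip(phi, q))      # E_phi(q)
shift = math.log(E_p) - math.log(E_q)         # ln E_phi(p) - ln E_phi(q)

def psi_P(alpha):
    """Log-MGF of ln(q*/p*) under P, cf. (43)."""
    s = sum(b**alpha * a**(1.0 - alpha) for a, b in zip(p, q))
    return math.log(s) + alpha * shift

def psi_Q(alpha):
    """Log-MGF of ln(q*/p*) under Q, cf. (44)."""
    s = sum(b**(alpha + 1.0) * a**(-alpha) for a, b in zip(p, q))
    return math.log(s) + alpha * shift

def legendre(psi, r, lo=-60.0, hi=60.0, iters=200):
    """sup_alpha [alpha*r - psi(alpha)] by ternary search (the objective is concave)."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if m1 * r - psi(m1) < m2 * r - psi(m2):
            lo = m1
        else:
            hi = m2
    a = 0.5 * (lo + hi)
    return a * r - psi(a)

for r in (-0.5, 0.0, 0.3):
    I_P = legendre(psi_P, r)
    I_Q = legendre(psi_Q, r)
    print(r, I_Q, I_P - r + shift)            # the last two columns should coincide
```

The values of r are chosen inside the range of ln(q*/p*), so that both suprema are attained at finite α.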

In particular,

I_{P} (0)

may be treated as a natural weighted version of the Chernoff divergence between q and p:

\begin{matrix} I_{P} (0) & = sup_{α} [- ln \int_{X} q {(x)}^{α} p {(x)}^{1 - α} d μ (x) + α ln E_{φ} (q) - α ln E_{φ} (p)] \\ = : {\hat{D}}_{C}^{w} (q, p) . \end{matrix}

(46)

Interpretation. The value

I_{P} (0)

is the Chernoff–Cramér exponent controlling the tail event

{L^{*} / n \geq 0}

under

P

, i.e., under standard regularity assumptions,

P (L^{*} (X_{1}^{n}) \geq 0) ≍ e^{- n I_{P} (0)}

. In the unweighted case

φ \equiv 1

, we have

E_{φ} (p) = E_{φ} (q) = 1

and

I_{P} (0)

reduces to the classical Chernoff information between p and q. We also stress that

{\hat{D}}_{C}^{w} (q, p) = I_{P} (0)

is a tilted-likelihood exponent and is distinct from the weighted Chernoff information

D_{C}^{w} (P, Q)

from Definition 2, which governs the optimal sum of context-sensitive losses.

Non-Asymptotic Concentration via a Doob Martingale

The rate functions

I_{P}

and

I_{Q}

capture the exponential scale of deviations of

L^{*} / n

as

n \to \infty

. To obtain explicit finite-n bounds, we now apply a standard martingale method to

L^{*}

under

Q

.

Consider the filtration

F_{k} = σ (X_{1}, \dots, X_{k})

and define the random variables

{U_{k}, k = 0, \dots, n}

by

\begin{matrix} U_{k} & = E_{Q} [L^{*} (X_{1}, \dots, X_{n}) | F_{k}] \\ = \sum_{j = 1}^{k} ln \frac{q (X_{j})}{p (X_{j})} + (n - k) D_{KL} (Q ∥ P) - n ln E_{φ} (q) + n ln E_{φ} (p), \end{matrix}

(47)

where

D_{KL} (Q ∥ P) = \int q (x) ln \frac{q (x)}{p (x)} d μ (x)

stands for the (unweighted) Kullback–Leibler divergence of

Q

and

P

. Then

U_{k} - U_{k - 1} = ln \frac{q (X_{k})}{p (X_{k})} - D_{KL} (Q ∥ P), k = 1, \dots, n .

(48)

Observe that

{U_{k}, F_{k}}

is a martingale w.r.t.

Q

.

Assume now that

d < \infty

and

σ^{2} < \infty

, where

d = sup_{x \in X} |ln \frac{q (x)}{p (x)} - D_{KL} (Q ∥ P)|

and

\begin{matrix} σ^{2} & = E_{Q} [{(U_{k} - U_{k - 1})}^{2} | F_{k - 1}] \\ = \int_{X} q (x) {(ln \frac{q (x)}{p (x)} - D_{KL} (Q ∥ P))}^{2} d μ (x) . \end{matrix}

(49)

In view of a refined Azuma–Hoeffding/McDiarmid inequality [13,14], for every

β \geq 0

,

\begin{matrix} P_{Q} (L^{*} (X_{1}, \dots, X_{n}) > (D_{KL} (Q ∥ P) + β) n - (ln E_{φ} (q) - ln E_{φ} (p)) n) \\ \leq exp \{- n D (\frac{δ + γ}{1 + γ} ∥ \frac{γ}{1 + γ})\}, \end{matrix}

(50)

where

γ = \frac{σ^{2}}{d^{2}}

,

δ = \frac{β}{d}

and

D (p ∥ q) = p ln \frac{p}{q} + (1 - p) ln \frac{1 - p}{1 - q}

stands for the Kullback–Leibler divergence between the two Bernoulli distributions

(p, 1 - p)

and

(q, 1 - q)

. Indeed, by (47) the threshold in (50) equals

U_{0} + β n

and

U_{n} = L^{*} (X_{1}, \dots, X_{n})

, so (50) is a one-sided deviation bound for the martingale

{U_{k}, F_{k}}

.

We use the following modified version of the Azuma–Hoeffding inequality; see [14].

Lemma 2.

Let

{U_{k}, F_{k}}

be a discrete-time real-valued martingale. Assume that, for some constants

d, σ > 0

, the following two requirements are satisfied a.s. for every

k \in {1, \dots, n} :

\begin{matrix} | U_{k} - U_{k - 1} | \leq d, \\ Var [U_{k} - U_{k - 1} | F_{k - 1}] \leq σ^{2} . \end{matrix}

(51)

Then, for every

β \geq 0

,

P (| U_{n} - U_{0} | \geq β n) \leq 2 exp \{- n D (\frac{δ + γ}{1 + γ} ∥ \frac{γ}{1 + γ})\},

(52)

where

γ = \frac{σ^{2}}{d^{2}}

and

δ = \frac{β}{d}

. (Note that

γ \in [0, 1]

automatically whenever

| U_{k} - U_{k - 1} | \leq d

a.s., and that the probability in (52) is zero when

δ > 1

.)

In particular, the one-sided bound

P (U_{n} - U_{0} \geq β n) \leq exp {- n D (\cdot)}

holds with the same exponent.

Theorem 2.

Set

β^{*} = β - D_{KL} (Q ∥ P) - ln E_{φ} (p) + ln E_{φ} (q), δ^{*} = \frac{β^{*}}{d} .

Under conditions

d < \infty

,

σ^{2} < \infty

, and

β^{*} \geq 0

,

P_{Q} (L^{*} (X_{1}, \dots, X_{n}) \geq β n) \leq exp \{- n D (\frac{δ^{*} + γ}{1 + γ} ∥ \frac{γ}{1 + γ})\} .

(53)

Lemma 2 is quoted from [14]. Theorem 2 is its direct application to the tilted log-likelihood ratio (42); the only dependence on the context weight

φ

is through

E_{φ} (p)

and

E_{φ} (q)

.

4. Examples and Applications

The identities of Section 3 reduce the computation of

D_{B, α}^{w}

and

D_{C}^{w}

to the single-letter weighted affinity

ρ_{α}^{w} (p, q) = \int φ (x) p {(x)}^{α} q {(x)}^{1 - α} d μ (x),

followed by optimisation over

α \in [0, 1]

. We work this out for Gaussian, Poisson, and exponential families, highlighting how the context weight

φ

modifies the classical formulas. When

φ \equiv 1

, the expressions reduce to the standard unweighted Bhattacharyya and Chernoff quantities; more involved non-exponential-family computations (such as the Cauchy location–scale family) are deferred to Appendix A.

4.1. A Numerical Illustration

This subsection illustrates the behaviour of

α^{*}

and

D_{C}^{w}

under a non-trivial factorised weight and serves as a direct numerical verification of the Bregman identities of Section 3.3: for the model below the affinity

ρ_{α}^{w}

is available in closed form, and we check that closed-form evaluation and direct numerical integration agree to machine precision.

Consider the asymmetric Gaussian hypotheses

H_{0} : P = N (μ_{0}, σ_{0}^{2}), H_{1} : Q = N (μ_{1}, σ_{1}^{2}), (μ_{0}, σ_{0}^{2}) = (0, 1), (μ_{1}, σ_{1}^{2}) = (3, 2),

with a non-indicator factorised weight

φ (x) = exp (- β {(x - x_{0})}^{2}), x_{0} = 0, β \geq 0 .

(54)

At

β = 0

, one has

φ \equiv 1

and the unweighted Chernoff information is recovered. For

β > 0

, the weight concentrates near

x_{0} = μ_{0}

; in particular, (54) is not an indicator-type weight, so the weighted problem does not reduce to the unweighted Chernoff information on a restricted domain.

The asymmetry

σ_{0}^{2} \neq σ_{1}^{2}

and the centring of

φ

at the

H_{0}

mean are essential for the illustration. In a fully symmetric configuration (

σ_{0} = σ_{1}

,

μ_{1} = - μ_{0}

,

x_{0} = 0

), the problem is invariant under

α \leftrightarrow 1 - α

, so the optimum is pinned to

α^{*} = 1 / 2

for every

β

and the effect of the weight on the Chernoff compromise is invisible. Asymmetric hypotheses are precisely where the weighted formalism is operationally distinct from the classical one, and it is this distinction that the numerics below are designed to expose.

Writing

A (α) = α / σ_{0}^{2} + (1 - α) / σ_{1}^{2} + 2 β

,

B (α) = α μ_{0} / σ_{0}^{2} + (1 - α) μ_{1} / σ_{1}^{2} + 2 β x_{0}

, and

C (α) = α μ_{0}^{2} / σ_{0}^{2} + (1 - α) μ_{1}^{2} / σ_{1}^{2} + 2 β x_{0}^{2}

, a direct Gaussian integration yields

ln ρ_{α}^{w} (p, q) = \frac{1}{2} ln \frac{2 π}{A (α)} - \frac{α}{2} ln (2 π σ_{0}^{2}) - \frac{1 - α}{2} ln (2 π σ_{1}^{2}) - \frac{1}{2} (C (α) - \frac{B {(α)}^{2}}{A (α)}),

(55)

and minimising (55) over

α \in [0, 1]

(equivalently, maximising the weighted Bhattacharyya distance, which is the negative of (55)) gives

α^{*} (β)

and

D_{C}^{w} (P, Q) (β)

. Table 1 reports their values at three selected

β

. Direct numerical integration of

ρ_{α}^{w}

from its definition agrees with (55) to machine precision on all tabulated entries, which confirms the Bregman identities of Section 3.3 numerically for this example.
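The comparison between (55) and direct integration can be sketched in a few lines of standard-library Python. This is not the authors' notebook: the function names, the Simpson grid and the value β = 0.1 are our own illustrative choices, and no entry of Table 1 is reproduced here.

```python
import math

# Model of Section 4.1: P = N(0, 1), Q = N(3, 2), phi(x) = exp(-beta (x - x0)^2), x0 = 0.
MU0, VAR0 = 0.0, 1.0
MU1, VAR1 = 3.0, 2.0
X0 = 0.0

def log_rho_closed(alpha, beta):
    """Closed form (55) for ln rho_alpha^w(p, q)."""
    A = alpha / VAR0 + (1.0 - alpha) / VAR1 + 2.0 * beta
    B = alpha * MU0 / VAR0 + (1.0 - alpha) * MU1 / VAR1 + 2.0 * beta * X0
    C = alpha * MU0**2 / VAR0 + (1.0 - alpha) * MU1**2 / VAR1 + 2.0 * beta * X0**2
    return (0.5 * math.log(2.0 * math.pi / A)
            - 0.5 * alpha * math.log(2.0 * math.pi * VAR0)
            - 0.5 * (1.0 - alpha) * math.log(2.0 * math.pi * VAR1)
            - 0.5 * (C - B * B / A))

def log_rho_numeric(alpha, beta, lo=-40.0, hi=40.0, n=40000):
    """ln of the integral of phi p^alpha q^(1-alpha), composite Simpson rule (n even)."""
    def integrand(x):
        log_p = -0.5 * math.log(2.0 * math.pi * VAR0) - (x - MU0)**2 / (2.0 * VAR0)
        log_q = -0.5 * math.log(2.0 * math.pi * VAR1) - (x - MU1)**2 / (2.0 * VAR1)
        return math.exp(-beta * (x - X0)**2 + alpha * log_p + (1.0 - alpha) * log_q)
    h = (hi - lo) / n
    total = integrand(lo) + integrand(hi)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * integrand(lo + i * h)
    return math.log(total * h / 3.0)

def chernoff(beta, iters=100):
    """Return (alpha_star, D_C^w) by minimising the convex map alpha -> ln rho_alpha^w on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_rho_closed(m1, beta) < log_rho_closed(m2, beta):
            hi = m2
        else:
            lo = m1
    alpha = 0.5 * (lo + hi)
    return alpha, -log_rho_closed(alpha, beta)

for beta in (0.0, 0.1):
    alpha_star, d_c = chernoff(beta)
    gap = abs(log_rho_closed(alpha_star, beta) - log_rho_numeric(alpha_star, beta))
    print(beta, alpha_star, d_c, gap)
```

The ternary search also handles the case in which the minimiser sits on the boundary of [0, 1], since it never leaves the interval.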

The monotone growth of

β \mapsto D_{C}^{w}

and the leftward shift of

α^{*} (β)

illustrate a qualitative conclusion of Section 3: localising

φ

near

μ_{0}

preferentially retains observations that are more probable under

H_{0}

and thereby increases the effective discrimination rate, while simultaneously moving the optimal tilting towards the

H_{0}

side. The classical unweighted limit is recovered at

β = 0

.

In the language of hypothesis testing,

α^{*}

is the parameter that balances the exponential rates of the type-I and type-II losses at the Bayes optimum: the type-I exponent equals

α^{*} D_{C}^{w}

and the type-II exponent equals

(1 - α^{*}) D_{C}^{w}

(cf. Section 3). A leftward shift of

α^{*}

therefore corresponds to reallocating the available exponential budget towards faster decay of the type-II loss at the expense of the type-I loss, which is the optimal response to a weight that concentrates mass in the region where

H_{0}

is more plausible.

Data and Code Availability

A Jupyter/Colab notebook reproducing Table 1 and Figures 1–3, together with the direct-integration verification of (55), is archived on Zenodo [16] and mirrored on GitHub.

4.2. Gaussian Models

Throughout this subsection, the reference measure is the Lebesgue measure on

R^{d}

. We compute the weighted Bhattacharyya coefficient

ρ_{α}^{w} (P, Q) = \int_{R^{d}} φ (x) p {(x)}^{α} q {(x)}^{1 - α} d x, α \in [0, 1],

together with the weighted Bhattacharyya distance

D_{B, α}^{w} (P, Q) : = - ln ρ_{α}^{w} (P, Q)

and the weighted Chernoff information

D_{C}^{w} (P, Q) = {max}_{α \in [0, 1]} D_{B, α}^{w} (P, Q)

(Definition 2). Note that, unlike the unweighted case,

ρ_{α}^{w}

is not restricted to

[0, 1]

and

D_{B, α}^{w}

(hence also

D_{C}^{w}

) may take negative values.

Example 1

(Gaussian weighted Bhattacharyya coefficient with exponential weight). Let

P = N (μ_{1}, Σ_{1})

and

Q = N (μ_{2}, Σ_{2})

on

R^{d}

, where

Σ_{1} ≻ 0

and

Σ_{2} ≻ 0

, and let

φ (x) = e^{γ^{T} x}

for some

γ \in R^{d}

. Denote by

p, q

the corresponding densities. For

α \in [0, 1]

define

\begin{matrix} Σ_{α} : = {(α Σ_{1}^{- 1} + (1 - α) Σ_{2}^{- 1})}^{- 1}, \\ {\tilde{μ}}_{α} : = Σ_{α} (α Σ_{1}^{- 1} μ_{1} + (1 - α) Σ_{2}^{- 1} μ_{2} + γ) . \end{matrix}

Then

\begin{matrix} ρ_{α}^{w} (P, Q) = \int_{R^{d}} e^{γ^{T} x} p {(x)}^{α} q {(x)}^{1 - α} d x \\ = \frac{| Σ_{α} |^{1 / 2}}{| Σ_{1} |^{α / 2} {| Σ_{2} |}^{(1 - α) / 2}} exp \{- \frac{1}{2} (α μ_{1}^{T} Σ_{1}^{- 1} μ_{1} + (1 - α) μ_{2}^{T} Σ_{2}^{- 1} μ_{2} - {\tilde{μ}}_{α}^{T} Σ_{α}^{- 1} {\tilde{μ}}_{α})\} . \end{matrix}

(56)

Consequently,

\begin{matrix} D_{B, α}^{w} (P, Q) \\ = \frac{1}{2} (α μ_{1}^{T} Σ_{1}^{- 1} μ_{1} + (1 - α) μ_{2}^{T} Σ_{2}^{- 1} μ_{2} - {\tilde{μ}}_{α}^{T} Σ_{α}^{- 1} {\tilde{μ}}_{α} + ln \frac{| Σ_{1} |^{α} {| Σ_{2} |}^{1 - α}}{| Σ_{α} |}) . \end{matrix}

(57)

In particular, setting

γ = 0

(i.e.,

φ \equiv 1

) reduces (57) to the classical (unweighted) Gaussian Bhattacharyya distance; see, e.g., [12].

Corollary 3

(Common covariance). In Example 1, assume

Σ_{1} = Σ_{2} = Σ ≻ 0

and keep the exponential weight

φ (x) = e^{γ^{T} x}

. Set

δ : = μ_{1} - μ_{2}

,

{∥ v ∥}_{Σ^{- 1}}^{2} : = v^{T} Σ^{- 1} v

, and

μ_{α} : = α μ_{1} + (1 - α) μ_{2}

. Then, for any

α \in [0, 1]

,

\begin{matrix} ρ_{α}^{w} (P, Q) & = \int_{R^{d}} e^{γ^{T} x} p {(x)}^{α} q {(x)}^{1 - α} d x \\ = exp \{- \frac{α (1 - α)}{2} {∥ δ ∥}_{Σ^{- 1}}^{2} + γ^{T} μ_{α} + \frac{1}{2} γ^{T} Σ γ\}, \end{matrix}

(58)

and therefore

D_{B, α}^{w} (P, Q) = - ln ρ_{α}^{w} (P, Q) = \frac{α (1 - α)}{2} {∥ δ ∥}_{Σ^{- 1}}^{2} - γ^{T} μ_{α} - \frac{1}{2} γ^{T} Σ γ .

(59)

If

μ_{1} \neq μ_{2}

and the unconstrained maximiser

\tilde{α} = \frac{1}{2} - \frac{γ^{T} δ}{{∥ δ ∥}_{Σ^{- 1}}^{2}}

belongs to

(0, 1)

, then

α^{*} = \tilde{α}

; otherwise the maximum over

α \in [0, 1]

is attained at the nearest boundary point

α^{*} \in {0, 1}

. In all cases,

D_{C}^{w} (P, Q) = max_{α \in [0, 1]} D_{B, α}^{w} (P, Q) = D_{B, α^{*}}^{w} (P, Q) .

In particular, for

γ = 0

(i.e.,

φ \equiv 1

) we recover

α^{*} = 1 / 2

and

D_{C} (P, Q) = {∥ δ ∥}_{Σ^{- 1}}^{2} / 8

.

Proof.

The expression for

ρ_{α}^{w}

follows by simplifying Example 1 under

Σ_{1} = Σ_{2} = Σ

, which makes the determinant prefactor equal to 1 and yields a Gaussian MGF term

exp (γ^{T} μ_{α} + \frac{1}{2} γ^{T} Σ γ)

. The maximiser follows by differentiating (59) in

α

. □

Choosing an exponential weight

φ (x) = e^{γ^{T} x}

corresponds to exponential tilting: for a Gaussian

X \sim N (μ, Σ)

, this tilting keeps the covariance and shifts the mean to

μ + Σ γ

(with normalisation factor

exp (γ^{T} μ + \frac{1}{2} γ^{T} Σ γ)

), which is why the weighted affinities remain available in closed form. In particular, the optimal Chernoff parameter is no longer forced to be

α^{*} = 1 / 2

and, as (59) shows, sufficiently strong tilting can push the maximiser to the boundary

α^{*} \in {0, 1}

.
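For concreteness, Corollary 3 can be evaluated in the scalar case d = 1. The sketch below assumes μ_1 ≠ μ_2; the function names and the numerical inputs are hypothetical and serve only to show the clipping of the unconstrained maximiser to [0, 1].

```python
def d_b_gauss(alpha, mu1, mu2, var, gamma):
    """Weighted Bhattacharyya distance (59) for d = 1 and common variance var."""
    delta = mu1 - mu2
    mu_alpha = alpha * mu1 + (1.0 - alpha) * mu2
    return (0.5 * alpha * (1.0 - alpha) * delta * delta / var
            - gamma * mu_alpha - 0.5 * gamma * gamma * var)

def weighted_chernoff_gauss(mu1, mu2, var, gamma):
    """Return (alpha_star, D_C^w) from Corollary 3; assumes mu1 != mu2."""
    delta = mu1 - mu2
    norm_sq = delta * delta / var                  # ||delta||^2 in the Sigma^{-1} norm
    alpha_tilde = 0.5 - gamma * delta / norm_sq    # unconstrained maximiser
    alpha_star = min(1.0, max(0.0, alpha_tilde))   # nearest point of [0, 1]
    return alpha_star, d_b_gauss(alpha_star, mu1, mu2, var, gamma)

# Hypothetical inputs: no tilt, moderate tilt, strong tilt.
for gamma in (0.0, 0.5, 2.0):
    print(gamma, weighted_chernoff_gauss(1.0, -1.0, 1.0, gamma))
```

Because (59) is a concave quadratic in α, clipping the stationary point to [0, 1] gives the constrained maximiser, which is what the second function relies on.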

4.3. Poisson Models

Example 2

(Poisson model with exponential weight). Let

X = N_{0} = {0, 1, 2, \dots}

and let μ be the counting measure on

X

. Fix two hypotheses

P = Poi (λ_{1})

and

Q = Poi (λ_{2})

with

λ_{1}, λ_{2} > 0

, and write

p = p_{λ_{1}}

and

q = q_{λ_{2}}

. Throughout this subsection we work under the standing assumption of Section 2, namely, that the observations

X_{1}, \dots, X_{n}

are i.i.d. (distributed as

P

under

H_{0}

and as

Q

under

H_{1}

), and that the weight φ factorises across observations (Assumption 1). As in Section 4.2, we consider the exponential weight

φ_{γ} (k) = e^{γ k}, γ \in R .

(For

γ = 0

we recover the unweighted case

φ \equiv 1

.) Equivalently, setting

ε : = e^{γ} > 0

, the weight takes the form

φ (k) = ε^{k}

; this reparameterisation is convenient in applications where

ε \in (0, 1)

models a per-event discount factor, while

γ \in R

is the natural parameter for the exponential-family calculations below. The two parameterisations are equivalent.

For

α \in [0, 1]

, set

λ_{α} : = λ_{1}^{α} λ_{2}^{1 - α} .

(a) Weighted Bhattacharyya coefficient and Chernoff arc. A direct summation gives

\begin{matrix} ρ_{α}^{w} (P, Q) & = \sum_{k = 0}^{\infty} φ_{γ} (k) p {(k)}^{α} q {(k)}^{1 - α} = exp \{- α λ_{1} - (1 - α) λ_{2} + e^{γ} λ_{α}\} . \end{matrix}

(60)

Hence, by Definition 1,

D_{B, α}^{w} (P, Q) = - ln ρ_{α}^{w} (P, Q) = α λ_{1} + (1 - α) λ_{2} - e^{γ} λ_{α} .

(61)

Moreover, the normalised weighted geometric mixture (Chernoff-tilted density) from (18) takes the form

{(p q)}_{α} (k) = \frac{φ_{γ} (k) p {(k)}^{α} q {(k)}^{1 - α}}{ρ_{α}^{w} (P, Q)} = exp {- e^{γ} λ_{α}} \frac{{(e^{γ} λ_{α})}^{k}}{k!} = Poi (e^{γ} λ_{α}) .

(b) Optimal Chernoff parameter. If

λ_{1} \neq λ_{2}

, then

α \mapsto D_{B, α}^{w} (P, Q)

is strictly concave on

[0, 1]

since

\frac{d^{2}}{d α^{2}} D_{B, α}^{w} (P, Q) = - e^{γ} λ_{α} {(ln \frac{λ_{1}}{λ_{2}})}^{2} < 0;

hence, the maximiser

α^{*}

in Definition 2 is unique. Differentiating (61) yields the critical point condition

0 = \frac{d}{d α} D_{B, α}^{w} (P, Q) = λ_{1} - λ_{2} - e^{γ} λ_{α} ln (\frac{λ_{1}}{λ_{2}}) .

Equivalently, the (unconstrained) critical point

α = \tilde{α}

satisfies

λ_{\tilde{α}} = e^{- γ} L (λ_{1}, λ_{2}), L (λ_{1}, λ_{2}) : = \frac{λ_{1} - λ_{2}}{ln λ_{1} - ln λ_{2}} .

Taking logarithms, the unconstrained maximiser is

\tilde{α} = \frac{ln L (λ_{1}, λ_{2}) - γ - ln λ_{2}}{ln λ_{1} - ln λ_{2}},

and the maximiser on

[0, 1]

is

α^{*} = Π_{[0, 1]} (\tilde{α})

, where

Π_{[0, 1]} (a) : = min {1, max {0, a}} .

In the unweighted case (

γ = 0

) the logarithmic mean

L (λ_{1}, λ_{2})

lies strictly between

λ_{1}

and

λ_{2}

, so

\tilde{α} \in (0, 1)

; in contrast, a nonzero context tilt γ may push the optimal Chernoff parameter to the boundary

α^{*} \in {0, 1}

.

Finally,

D_{C}^{w} (P, Q) = max_{α \in [0, 1]} D_{B, α}^{w} (P, Q) = D_{B, α^{*}}^{w} (P, Q) .

If

λ_{1} = λ_{2}

, then

ρ_{α}^{w} (P, Q)

does not depend on α and

D_{C}^{w} (P, Q) = D_{B, α}^{w} (P, Q)

for any

α \in [0, 1]

.

Derivation of (60). For

k \in N_{0}

,

p {(k)}^{α} q {(k)}^{1 - α} = exp {- α λ_{1} - (1 - α) λ_{2}} \frac{λ_{α}^{k}}{k!},

so multiplying by

e^{γ k}

and summing over k gives

\begin{matrix} ρ_{α}^{w} (P, Q) & = exp {- α λ_{1} - (1 - α) λ_{2}} \sum_{k \geq 0} {(e^{γ} λ_{α})}^{k} / k! \\ = exp {- α λ_{1} - (1 - α) λ_{2} + e^{γ} λ_{α}} . \end{matrix}
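The Poisson formulas can be checked against a truncated version of the defining sum. The following sketch uses hypothetical rates and tilt; the truncation level `kmax` and all function names are our own assumptions.

```python
import math

def d_b_poisson(alpha, lam1, lam2, gamma):
    """Weighted Bhattacharyya distance (61)."""
    lam_alpha = lam1**alpha * lam2**(1.0 - alpha)
    return alpha * lam1 + (1.0 - alpha) * lam2 - math.exp(gamma) * lam_alpha

def rho_direct(alpha, lam1, lam2, gamma, kmax=200):
    """Truncated sum of e^{gamma k} p(k)^alpha q(k)^(1-alpha) over k = 0..kmax."""
    total = 0.0
    for k in range(kmax + 1):
        log_p = -lam1 + k * math.log(lam1) - math.lgamma(k + 1)
        log_q = -lam2 + k * math.log(lam2) - math.lgamma(k + 1)
        total += math.exp(gamma * k + alpha * log_p + (1.0 - alpha) * log_q)
    return total

def alpha_star_poisson(lam1, lam2, gamma):
    """Projection onto [0, 1] of the unconstrained maximiser; assumes lam1 != lam2."""
    log_ratio = math.log(lam1) - math.log(lam2)
    log_mean = (lam1 - lam2) / log_ratio           # logarithmic mean L(lam1, lam2)
    alpha_tilde = (math.log(log_mean) - gamma - math.log(lam2)) / log_ratio
    return min(1.0, max(0.0, alpha_tilde))

# Hypothetical parameters (not from the paper).
lam1, lam2, gamma = 2.0, 5.0, 0.1
a_star = alpha_star_poisson(lam1, lam2, gamma)
print(a_star, d_b_poisson(a_star, lam1, lam2, gamma))
print(rho_direct(a_star, lam1, lam2, gamma), math.exp(-d_b_poisson(a_star, lam1, lam2, gamma)))
```

With these rates the tilted mean is of order 5, so truncating the sum at 200 terms is far more than enough for double precision.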

4.4. Exponential Models

Example 3

(Exponential model with exponential weight). Let

X = R_{+} = [0, \infty)

with Lebesgue measure. Fix two hypotheses

P = Exp (λ_{1})

and

Q = Exp (λ_{2})

with rates

λ_{1}, λ_{2} > 0

, and write

p (x) = λ_{1} e^{- λ_{1} x} 1 {x \geq 0}, q (x) = λ_{2} e^{- λ_{2} x} 1 {x \geq 0} .

Consider the exponential weight

φ_{γ} (x) = e^{γ x}

with

γ < min {λ_{1}, λ_{2}},

so that

ρ_{α}^{w} (P, Q) \in (0, \infty)

for all

α \in [0, 1]

. For

α \in [0, 1]

set

λ_{α} : = α λ_{1} + (1 - α) λ_{2} .

(a) Weighted Bhattacharyya coefficient and Chernoff arc. A direct computation gives

ρ_{α}^{w} (P, Q) = \int_{0}^{\infty} e^{γ x} p {(x)}^{α} q {(x)}^{1 - α} d x = \frac{λ_{1}^{α} λ_{2}^{1 - α}}{λ_{α} - γ} .

Hence

D_{B, α}^{w} (P, Q) = - ln ρ_{α}^{w} (P, Q) = ln (λ_{α} - γ) - α ln λ_{1} - (1 - α) ln λ_{2} .

Moreover, the Chernoff-tilted density

{(p q)}_{α}

from (18) is again exponential:

{(p q)}_{α} (x) = \frac{e^{γ x} p {(x)}^{α} q {(x)}^{1 - α}}{ρ_{α}^{w} (P, Q)} = (λ_{α} - γ) e^{- (λ_{α} - γ) x} 1 {x \geq 0} = Exp (λ_{α} - γ) .

(b) Optimal Chernoff parameter. If

λ_{1} \neq λ_{2}

, then

α \mapsto D_{B, α}^{w} (P, Q)

is strictly concave on

[0, 1]

; hence, the maximiser

α^{*}

in Definition 2 is unique. Differentiating yields the critical point condition

\frac{λ_{1} - λ_{2}}{λ_{α} - γ} = ln (\frac{λ_{1}}{λ_{2}}) .

Equivalently, the (unconstrained) critical point

α = \tilde{α}

satisfies

λ_{\tilde{α}} - γ = L (λ_{1}, λ_{2}), L (λ_{1}, λ_{2}) : = \frac{λ_{1} - λ_{2}}{ln λ_{1} - ln λ_{2}},

so that

\tilde{α} = \frac{γ + L (λ_{1}, λ_{2}) - λ_{2}}{λ_{1} - λ_{2}} .

The maximiser on

[0, 1]

is

α^{*} = Π_{[0, 1]} (\tilde{α})

(projection onto

[0, 1]

), and

D_{C}^{w} (P, Q) = max_{α \in [0, 1]} D_{B, α}^{w} (P, Q) = D_{B, α^{*}}^{w} (P, Q) .

If

λ_{1} = λ_{2} = λ

, then

ρ_{α}^{w} (P, Q) = λ / (λ - γ)

does not depend on α, so any

α \in [0, 1]

is optimal and

D_{C}^{w} (P, Q) = ln (λ - γ) - ln λ

. Setting

γ = 0

(i.e.,

φ \equiv 1

) recovers the classical unweighted expressions.
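The exponential example can be evaluated directly and cross-checked against Lemma 1, since the tilted density is again exponential. In the sketch below, the parameter values and function names are hypothetical; the closed form used for the KL divergence between two exponential laws with rates a and b, namely ln(a/b) + b/a − 1, is a standard fact rather than a formula from the paper.

```python
import math

def d_b_exp(alpha, lam1, lam2, gamma):
    """Weighted Bhattacharyya distance of Example 3; requires gamma < min(lam1, lam2)."""
    lam_alpha = alpha * lam1 + (1.0 - alpha) * lam2
    return math.log(lam_alpha - gamma) - alpha * math.log(lam1) - (1.0 - alpha) * math.log(lam2)

def alpha_star_exp(lam1, lam2, gamma):
    """Projection onto [0, 1] of the unconstrained maximiser; assumes lam1 != lam2."""
    if gamma >= min(lam1, lam2):
        raise ValueError("gamma must be smaller than both rates")
    log_mean = (lam1 - lam2) / (math.log(lam1) - math.log(lam2))
    alpha_tilde = (gamma + log_mean - lam2) / (lam1 - lam2)
    return min(1.0, max(0.0, alpha_tilde))

def kl_exp(a, b):
    """KL divergence between Exp(a) and Exp(b), rates a, b > 0."""
    return math.log(a / b) + b / a - 1.0

# Hypothetical parameters (not from the paper).
lam1, lam2, gamma = 1.0, 3.0, 0.5
a_star = alpha_star_exp(lam1, lam2, gamma)
d_c = d_b_exp(a_star, lam1, lam2, gamma)

# Lemma 1 with r_{alpha*} = Exp(lam_{alpha*} - gamma), r_1 = Exp(lam1 - gamma),
# and E_phi(p) = lam1 / (lam1 - gamma); valid here because a_star is interior.
lam_star = a_star * lam1 + (1.0 - a_star) * lam2
via_lemma1 = kl_exp(lam_star - gamma, lam1 - gamma) - math.log(lam1 / (lam1 - gamma))
print(a_star, d_c, via_lemma1)
```

For these inputs the weighted Chernoff information comes out negative, in line with the remark in Section 4.2 that the weighted quantities are not sign-constrained.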

Additional Example (Baseline, Non-Exponential Family)

Appendix A contains a closed-form illustration for the Cauchy location–scale family. Since the Cauchy family is not an exponential family, this example complements the main text by showing that, even in the unweighted baseline case

φ \equiv 1

, the Bhattacharyya coefficient (in particular

ρ_{1 / 2}

) and the Chernoff information may involve special functions (complete elliptic integrals). For nontrivial weights

φ

, the symmetry

ρ_{α} = ρ_{1 - α}

(hence

α^{*} = 1 / 2

) typically fails and a comparable closed form is not available, so we keep the baseline Cauchy computation in the appendix.

4.5. Extension to M-ary Hypothesis Testing

We now record the finite-M analogue of Theorem 1. The key observation is that the optimal M-ary pointwise loss is squeezed between pairwise minima (Lemma 3), and each pairwise term has logarithmic rate given by the corresponding weighted Chernoff information. Hence the overall M-ary rate is determined by the closest pair in terms of

D_{C}^{w}

.

Fix an integer

M \geq 2

and let

P_{1}, \dots, P_{M}

be probability measures on

X

dominated by

μ

, with strictly positive densities

p_{1}, \dots, p_{M}

. Let

X_{1}^{n} = (X_{1}, \dots, X_{n})

be i.i.d. under each hypothesis

H_{i} : X_{1}^{n} \sim P_{i}^{\otimes n}

. Assume that the weight function factorises as in Assumption 1.

Assume moreover that for every

1 \leq i < j \leq M

and every

α \in [0, 1]

,

ρ_{α}^{w} (p_{i}, p_{j}) = \int_{X} φ (x) p_{i} {(x)}^{α} p_{j} {(x)}^{1 - α} d μ (x) \in (0, \infty),

so that all pairwise weighted Chernoff information values are well-defined, and that the integrability condition

max_{i \neq j} sup_{α \in [0, 1]} \int φ (x) | ln \frac{p_{i} (x)}{p_{j} (x)} | p_{i} {(x)}^{α} p_{j} {(x)}^{1 - α} d μ (x) < \infty

(62)

holds.

A (deterministic) M-ary decision rule is a measurable map

δ_{n} : X^{n} \to {1, \dots, M}

. Define the context-sensitive loss under

H_{i}

by

L_{i, n} (δ_{n}) : = E_{P_{i}^{\otimes n}} [φ (X_{1}^{n}) 1 {δ_{n} (X_{1}^{n}) \neq i}],

and the total loss

L_{n, M} (δ_{n}) : = \sum_{i = 1}^{M} L_{i, n} (δ_{n}), L_{n, M}^{*} : = inf_{δ_{n}} L_{n, M} (δ_{n}) .

Proposition 6

(Pointwise form of the optimal M-ary total loss). For each

n \geq 1

,

L_{n, M}^{*} = \int_{X^{n}} φ (x_{1}^{n}) (\sum_{i = 1}^{M} p_{i} (x_{1}^{n}) - max_{1 \leq j \leq M} p_{j} (x_{1}^{n})) d μ^{\otimes n} (x_{1}^{n}),

(63)

where

p_{i} (x_{1}^{n}) = \prod_{k = 1}^{n} p_{i} (x_{k})

. Moreover, an optimal rule is given by the maximum-likelihood classifier

δ_{n}^{*} (x_{1}^{n}) \in arg max_{1 \leq j \leq M} p_{j} (x_{1}^{n})

(with any measurable tie-breaking).

Proof.

Fix

δ_{n}

. Using

\sum_{i = 1}^{M} 1 {δ_{n} \neq i} p_{i} = \sum_{i = 1}^{M} p_{i} - p_{δ_{n}}

pointwise, we obtain

L_{n, M} (δ_{n}) = \int_{X^{n}} φ (x_{1}^{n}) (\sum_{i = 1}^{M} p_{i} (x_{1}^{n}) - p_{δ_{n} (x_{1}^{n})} (x_{1}^{n})) d μ^{\otimes n} (x_{1}^{n}) .

Minimisation over

δ_{n}

is therefore pointwise in

x_{1}^{n}

and is achieved by selecting an index maximising

p_{j} (x_{1}^{n})

, yielding (63). □

Lemma 3

(Pairwise minima). For any non-negative numbers

a_{1}, \dots, a_{M}

,

max_{1 \leq i < j \leq M} min (a_{i}, a_{j}) \leq \sum_{k = 1}^{M} a_{k} - max_{1 \leq k \leq M} a_{k} \leq \sum_{1 \leq i < j \leq M} min (a_{i}, a_{j}) .

(64)

Consequently, defining for

i < j

I_{n}^{i, j} : = \int_{X^{n}} φ (x_{1}^{n}) min {p_{i} (x_{1}^{n}), p_{j} (x_{1}^{n})} d μ^{\otimes n} (x_{1}^{n}),

we have the sandwich inequality

max_{i < j} I_{n}^{i, j} \leq L_{n, M}^{*} \leq \sum_{i < j} I_{n}^{i, j} .

(65)

Proof.

Let

a_{(1)} \geq a_{(2)} \geq \dots \geq a_{(M)}

be the decreasing rearrangement of

(a_{1}, \dots, a_{M})

. Then

\sum_{k} a_{k} - {max}_{k} a_{k} = \sum_{r = 2}^{M} a_{(r)} \geq a_{(2)}

. Moreover,

max_{i < j} min (a_{i}, a_{j}) = a_{(2)}

, proving the left inequality in (64).

For the right inequality, let

k^{*} \in arg {max}_{k} a_{k}

. Then

\sum_{1 \leq i < j \leq M} min (a_{i}, a_{j}) \geq \sum_{k \neq k^{*}} min (a_{k}, a_{k^{*}}) = \sum_{k \neq k^{*}} a_{k} = \sum_{k = 1}^{M} a_{k} - max_{k} a_{k} .

Applying (64) pointwise to

a_{i} = p_{i} (x_{1}^{n})

, multiplying by

φ (x_{1}^{n})

and integrating yields (65). □
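Inequality (64) is elementary, and a quick randomised check makes the three quantities concrete. The helper name `sandwich` and the random test vectors below are our own.

```python
import random
from itertools import combinations

def sandwich(a):
    """Return the three quantities in (64) for non-negative numbers a_1, ..., a_M (M >= 2)."""
    pair_mins = [min(x, y) for x, y in combinations(a, 2)]
    lower = max(pair_mins)          # max over pairs of min(a_i, a_j)
    middle = sum(a) - max(a)        # sum of all entries except one largest
    upper = sum(pair_mins)          # sum over pairs of min(a_i, a_j)
    return lower, middle, upper

random.seed(0)
for _ in range(1000):
    values = [random.random() for _ in range(random.randint(2, 6))]
    lower, middle, upper = sandwich(values)
    # Small tolerance for floating-point rounding in the sums.
    assert lower <= middle + 1e-12 and middle <= upper + 1e-12

print(sandwich([3, 1, 2]))
```

For M = 2 all three quantities coincide, which is why the binary case needs no sandwich argument.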

Theorem 3

(M-ary exponent equals the minimum pairwise weighted Chernoff information). For

1 \leq i < j \leq M

, let

D_{C}^{w} (P_{i}, P_{j})

be the weighted Chernoff information as in Definition 2, and assume that (62) holds. Set

C_{M}^{w} : = min_{1 \leq i < j \leq M} D_{C}^{w} (P_{i}, P_{j}) .

Then the optimal M-ary total loss satisfies

L_{n, M}^{*} = exp {- n C_{M}^{w} + o (n)}, n \to \infty,

(66)

or equivalently,

lim_{n \to \infty} - \frac{1}{n} ln L_{n, M}^{*} = C_{M}^{w} .

Proof.

Fix

1 \leq i < j \leq M

and consider the binary testing problem between

P_{i}

and

P_{j}

with the same factorised weight

φ

. The optimal binary total loss equals

I_{n}^{i, j} = \int_{X^{n}} φ (x_{1}^{n}) min {p_{i} (x_{1}^{n}), p_{j} (x_{1}^{n})} d μ^{\otimes n} (x_{1}^{n}),

and by Theorem 1 applied to the pair

(P_{i}, P_{j})

,

I_{n}^{i, j} = exp {- n D_{C}^{w} (P_{i}, P_{j}) + o_{i, j} (n)} .

Since the number of pairs is finite, letting

r_{n} : = {max}_{i < j} | o_{i, j} (n) |

yields

r_{n} = o (n)

and

I_{n}^{i, j} = exp {- n D_{C}^{w} (P_{i}, P_{j}) + O (r_{n})} uniformly over i < j .

Now use the sandwich inequality (65). Let

(i^{*}, j^{*})

attain the minimum

C_{M}^{w} = D_{C}^{w} (P_{i^{*}}, P_{j^{*}})

. From the lower bound,

L_{n, M}^{*} \geq I_{n}^{i^{*}, j^{*}} = exp {- n C_{M}^{w} + O (r_{n})} .

From the upper bound,

L_{n, M}^{*} \leq \sum_{i < j} I_{n}^{i, j} \leq \binom{M}{2} exp {- n C_{M}^{w} + O (r_{n})} .

Taking

- \frac{1}{n} ln (\cdot)

and letting

n \to \infty

yields (66). □
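Theorem 3 can be illustrated on a binary alphabet, where the optimal loss (63) reduces to a sum over the number k of ones in the sample. The sketch below is an assumption-laden toy: the three Bernoulli parameters, the weight values and the sample size n = 2000 are hypothetical, and the empirical rate is only expected to be close to, not equal to, the limiting exponent.

```python
import math
from itertools import combinations

THETA = [0.2, 0.5, 0.8]   # hypothetical Bernoulli hypotheses P_i = Bernoulli(theta_i)
W = (1.0, 0.8)            # hypothetical weight values phi(0), phi(1)

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def d_c_w(ti, tj, iters=100):
    """Pairwise weighted Chernoff information via ternary search on the convex ln rho_alpha^w."""
    def log_rho(a):
        return math.log(W[0] * (1 - ti)**a * (1 - tj)**(1 - a) + W[1] * ti**a * tj**(1 - a))
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_rho(m1) < log_rho(m2):
            hi = m2
        else:
            lo = m1
    return -log_rho(0.5 * (lo + hi))

def log_optimal_loss(n):
    """ln L*_{n,M} from (63), grouping sequences by their number k of ones."""
    terms = []
    for k in range(n + 1):
        log_count_weight = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                            + (n - k) * math.log(W[0]) + k * math.log(W[1]))
        log_liks = sorted(k * math.log(t) + (n - k) * math.log(1 - t) for t in THETA)
        # sum_i p_i - max_j p_j: drop one largest likelihood, sum the rest in log space
        terms.append(log_count_weight + logsumexp(log_liks[:-1]))
    return logsumexp(terms)

C_M = min(d_c_w(ti, tj) for ti, tj in combinations(THETA, 2))
n = 2000
rate = -log_optimal_loss(n) / n
print(C_M, rate)
```

The gap between `rate` and `C_M` reflects the o(n) term in (66); by the upper bound in the proof, `rate` cannot fall below `C_M` by more than ln(3)/n in this three-hypothesis example.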

Remark 8

(Nonzero priors do not change the exponent). Let

w_{1}, \dots, w_{M} > 0

,

\sum_{i} w_{i} = 1

, and consider the Bayesian weighted total loss

L_{n, M}^{(w)} (δ_{n}) = \sum_{i = 1}^{M} w_{i} L_{i, n} (δ_{n})

with optimum

L_{n, M}^{(w) *} : = {inf}_{δ_{n}} L_{n, M}^{(w)} (δ_{n})

. Then, the exponent remains

C_{M}^{w}

. Indeed, writing

w_{min} : = {min}_{i} w_{i}

and

w_{max} : = {max}_{i} w_{i}

, for any

δ_{n}

,

w_{min} L_{n, M} (δ_{n}) \leq L_{n, M}^{(w)} (δ_{n}) \leq w_{max} L_{n, M} (δ_{n}),

and taking infimum over

δ_{n}

gives

w_{min} L_{n, M}^{*} \leq L_{n, M}^{(w) *} \leq w_{max} L_{n, M}^{*}

. Hence,

- \frac{1}{n} ln L_{n, M}^{(w) *}

and

- \frac{1}{n} ln L_{n, M}^{*}

have the same limit

C_{M}^{w}

.

5. Conclusions

We studied context-sensitive simple hypothesis testing under a multiplicative weight and proved that the optimal total loss admits the single-letter logarithmic asymptotic

L_{n}^{*} = exp {- n D_{C}^{w} (P, Q) + o (n)} .

The rate is the weighted Chernoff information. The main structural ingredient is an exponential-family embedding of the weighted geometric mixtures

φ p^{α} q^{1 - α}

, which yields the characterisation of the optimal Chernoff parameter through the log-normaliser and leads to weighted information-geometric identities. We also derived finite-n concentration bounds for the tilted weighted log-likelihood, obtained explicit formulas in Gaussian, Poisson, and exponential models, and extended the logarithmic asymptotic to finitely many hypotheses through the minimum pairwise weighted Chernoff information.

Open Problems

The single-letter representation (4) rests on Assumption 1 (factorised weights), on the integrability of the tilted log-normaliser

\hat{F}

, and on i.i.d. sampling of simple hypotheses. Relaxing these assumptions defines several natural open directions.

(a): Non-factorised weights. Replacing Assumption 1 by a weight $φ (x_{1}^{n}) = ψ (\frac{1}{n} \sum_{i = 1}^{n} h (x_{i}))$ or a pairwise-interaction weight; the single-letter rate is then expected to be replaced by a variational formula over the space of probability measures, in the spirit of Sanov/Gibbs conditioning.
(b): Integrability. Weights with heavy tails in the sufficient statistic $t (x)$ may violate the finiteness of $\hat{F}$ near $α^{*} θ_{1} + (1 - α^{*}) θ_{2}$; the boundary cases $α^{*} \in {0, 1}$ (cf. Proposition 5 and [3]) call for a systematic treatment via truncation or a change in base measure.
(c): Dependent observations. A weighted counterpart of the Gärtner–Ellis theorem for stationary ergodic sequences, along the lines of the weighted extensions in [4].
(d): Composite hypotheses. A weighted analogue of the generalised likelihood-ratio test and its exponent, with sup/inf characterisations in terms of $D_{C}^{w}$ over the composite parameter sets.
(e): Information geometry of weighted manifolds. Extending the Chentsov–Amari framework [8,9,10,11,12] to weighted statistical manifolds; the Fisher metric and the dually flat structure depend on symmetries that $φ$ breaks in a controlled way.

Author Contributions

Conceptualization, methodology, and supervision, M.K.; formal analysis and validation, M.K. and E.Y.K.; software, visualization, data curation, and project administration, E.Y.K.; writing–original draft preparation, M.K. and E.Y.K.; writing–review and editing, E.Y.K.; funding acquisition, M.K. and E.Y.K. All authors have read and agreed to the published version of the manuscript.

Funding

The work by MK was carried out in the framework of a research project HSE-BR-2025-039 implemented as part of the Basic Research Program at HSE University. The second author (E.Yu. Kalimulina) was supported by the Ministry of Science and Higher Education of the Russian Federation under the state assignment (project FFNU-2025-0029).

Data Availability Statement

The Python/Jupyter notebook reproducing Table 1 and Figures 1–3, together with the direct numerical-integration verification of the Gaussian formula in Section 4.1, is openly available on Zenodo. No new experimental datasets were generated.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. Cauchy Location–Scale Family

We consider the univariate Cauchy location–scale family

$$p_{l, s} (x) = \frac{s}{π (s^{2} + (x - l)^{2})}, \quad x \in \mathbb{R}, \; l \in \mathbb{R}, \; s > 0 .$$

In this appendix, we set $φ \equiv 1$, so the weighted quantities from Section 2 coincide with their classical counterparts (e.g., $ρ_{α}^{w} = ρ_{α}$, $D_{B, α}^{w} = D_{B, α}$, $D_{C}^{w} = D_{C}$). This example provides a closed-form illustration outside the exponential-family setting.

Although the main body of the paper develops weighted Chernoff/Bhattacharyya quantities, we include the Cauchy location–scale family as an unweighted closed-form benchmark; note that complete elliptic integrals appear even in the classical case. This appendix is not used in the proofs of the weighted results; it serves as a sanity check and illustrates the analytic complexity of $ρ_{α}$ beyond exponential families. Nontrivial weights must satisfy the finiteness conditions from Section 2; for heavy-tailed Cauchy laws, the common exponential tilts $φ (x) = e^{γ x}$ violate these conditions for any $γ \neq 0$, and in general, a nonconstant weight also breaks the symmetry leading to $α^{*} = 1 / 2$.

Appendix A.1. KL Divergence (Closed Form)

Proposition A1 (KL divergence between two Cauchy laws). Let $P = \mathrm{Cauchy} (l_{1}, s_{1})$ and $Q = \mathrm{Cauchy} (l_{2}, s_{2})$ with $s_{1}, s_{2} > 0$. Then

$$D_{KL} (P \| Q) = \int_{\mathbb{R}} p_{l_{1}, s_{1}} (x) \ln \frac{p_{l_{1}, s_{1}} (x)}{p_{l_{2}, s_{2}} (x)} \, d x = \ln \frac{(s_{1} + s_{2})^{2} + (l_{1} - l_{2})^{2}}{4 s_{1} s_{2}} . \tag{A1}$$
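The closed form (A1) can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names, the parameter values, and the choice of a midpoint rule are illustrative assumptions and are not taken from the companion notebook.

```python
import math


def cauchy_pdf(x, l, s):
    """Density of Cauchy(l, s) at x."""
    return s / (math.pi * (s * s + (x - l) ** 2))


def kl_closed_form(l1, s1, l2, s2):
    """Right-hand side of (A1)."""
    return math.log(((s1 + s2) ** 2 + (l1 - l2) ** 2) / (4.0 * s1 * s2))


def kl_quadrature(l1, s1, l2, s2, n=20_000):
    """Midpoint-rule value of the integral in (A1).

    The substitution x = l1 + s1 * tan(u) turns p(x) dx into du / pi
    on (-pi/2, pi/2), so only the log-ratio has to be averaged.
    """
    h = math.pi / n
    total = 0.0
    for k in range(n):
        u = -0.5 * math.pi + (k + 0.5) * h
        x = l1 + s1 * math.tan(u)
        total += math.log(cauchy_pdf(x, l1, s1) / cauchy_pdf(x, l2, s2))
    return total * h / math.pi


if __name__ == "__main__":
    params = (0.0, 1.0, 2.0, 3.0)  # illustrative (l1, s1, l2, s2)
    print("closed form:", kl_closed_form(*params))
    print("quadrature: ", kl_quadrature(*params))
```

The right-hand side of (A1) is symmetric in the two parameter pairs, so swapping the arguments of `kl_closed_form` should leave its value unchanged.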

Appendix A.2. Chernoff Parameter and Bhattacharyya Coefficient

Recall the (unweighted) $α$-Bhattacharyya coefficient and distance

$$ρ_{α} (P, Q) = \int_{\mathbb{R}} p_{l_{1}, s_{1}} (x)^{α} \, p_{l_{2}, s_{2}} (x)^{1 - α} \, d x, \qquad D_{B, α} (P, Q) = - \ln ρ_{α} (P, Q), \qquad α \in [0, 1] .$$

Proposition A2 (Chernoff parameter for Cauchy). For the Cauchy location–scale family (with $φ \equiv 1$), one has the symmetry

$$ρ_{α} (P, Q) = ρ_{1 - α} (P, Q), \qquad α \in [0, 1]; \tag{A2}$$

equivalently, $D_{B, α} (P, Q) = D_{B, 1 - α} (P, Q)$. Consequently, if $P \neq Q$, then the Chernoff maximiser is unique and equals $α^{*} = \frac{1}{2}$, and hence

$$D_{C} (P, Q) = \max_{α \in [0, 1]} D_{B, α} (P, Q) = D_{B, 1 / 2} (P, Q) .$$

If $P = Q$, then $D_{B, α} (P, Q) \equiv 0$ and every $α \in [0, 1]$ is a maximiser.

Proof. Set $F (α) := \ln ρ_{α} (P, Q)$. By Hölder’s inequality, $F$ is convex on $[0, 1]$, and it is strictly convex unless $p_{l_{1}, s_{1}} / p_{l_{2}, s_{2}}$ is $μ$-a.e. constant (equivalently, unless $P = Q$). Hence $D_{B, α} (P, Q) = - F (α)$ is concave (strictly concave when $P \neq Q$). For Cauchy laws, the symmetry (A2) holds; see, e.g., [17] for an invariance-based proof. Therefore, $D_{B, α}$ is symmetric about $α = 1 / 2$, and concavity gives $D_{B, 1 / 2} \geq \frac{1}{2} D_{B, α} + \frac{1}{2} D_{B, 1 - α} = D_{B, α}$ for every $α \in [0, 1]$, so the maximum is attained at $α = 1 / 2$. Strict concavity yields uniqueness when $P \neq Q$. □
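The symmetry (A2) and the location of the maximiser can be illustrated numerically. The sketch below is an assumption-laden check rather than part of the proof: the function names, the two parameter pairs, and the 21-point grid in $α$ are chosen only for illustration, and the integral is approximated by a midpoint rule after the substitution $x = \tan u$.

```python
import math


def cauchy_pdf(x, l, s):
    """Density of Cauchy(l, s) at x."""
    return s / (math.pi * (s * s + (x - l) ** 2))


def rho(alpha, l1, s1, l2, s2, n=20_000):
    """Midpoint-rule value of the integral of p^alpha * q^(1 - alpha).

    The real line is mapped onto (-pi/2, pi/2) by x = tan(u),
    so that dx = (1 + tan(u)^2) du.
    """
    h = math.pi / n
    total = 0.0
    for k in range(n):
        t = math.tan(-0.5 * math.pi + (k + 0.5) * h)
        p = cauchy_pdf(t, l1, s1)
        q = cauchy_pdf(t, l2, s2)
        total += p ** alpha * q ** (1.0 - alpha) * (1.0 + t * t)
    return total * h


if __name__ == "__main__":
    P, Q = (0.0, 1.0), (2.0, 3.0)  # illustrative (l, s) pairs
    for a in (0.1, 0.25, 0.4):
        print(a, rho(a, *P, *Q), rho(1.0 - a, *P, *Q))
    grid = [i / 20 for i in range(21)]
    best = max(grid, key=lambda a: -math.log(rho(a, *P, *Q)))
    print("grid maximiser of D_B:", best)
```

For each $α$ the two printed values of $ρ$ should agree up to quadrature error, and the grid maximiser of $D_{B, α}$ should be the midpoint of the grid.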

Appendix A.3. Closed Form for $ρ_{1 / 2}$ and $D_{C}$

Lemma A1 (A standard elliptic-integral identity). Let $a, b > 0$ and $d \in \mathbb{R}$. Define the complete elliptic integral of the first kind

$$K (m) := \int_{0}^{π / 2} \frac{d u}{\sqrt{1 - m \sin^{2} u}}, \qquad m \in [0, 1) .$$

Then

$$\int_{- \infty}^{\infty} \frac{d x}{\sqrt{(x^{2} + a^{2}) ((x - d)^{2} + b^{2})}} = \frac{4}{\sqrt{(a + b)^{2} + d^{2}}} \, K \left( \frac{(a - b)^{2} + d^{2}}{(a + b)^{2} + d^{2}} \right) . \tag{A3}$$
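The identity (A3) can be tested numerically without a special-function library. The sketch below evaluates $K (m)$ through the classical arithmetic-geometric mean relation $K (m) = π / (2 \, \mathrm{AGM} (1, \sqrt{1 - m}))$, which uses the same parameter convention as the lemma; the function names and the test values of $a$, $b$, $d$ are assumptions made for illustration.

```python
import math


def ellipk(m, iterations=40):
    """K(m) for the parameter m in [0, 1), as defined in Lemma A1.

    Uses the classical identity K(m) = pi / (2 * AGM(1, sqrt(1 - m))).
    """
    if not 0.0 <= m < 1.0:
        raise ValueError("m must lie in [0, 1)")
    a, g = 1.0, math.sqrt(1.0 - m)
    for _ in range(iterations):  # the AGM converges quadratically
        a, g = 0.5 * (a + g), math.sqrt(a * g)
    return math.pi / (2.0 * a)


def lemma_rhs(a, b, d):
    """Right-hand side of (A3)."""
    big = (a + b) ** 2 + d ** 2
    return 4.0 / math.sqrt(big) * ellipk(((a - b) ** 2 + d ** 2) / big)


def lemma_lhs(a, b, d, n=20_000):
    """Left-hand side of (A3) by the midpoint rule with x = tan(u)."""
    h = math.pi / n
    total = 0.0
    for k in range(n):
        t = math.tan(-0.5 * math.pi + (k + 0.5) * h)
        total += (1.0 + t * t) / math.sqrt((t * t + a * a) * ((t - d) ** 2 + b * b))
    return total * h


if __name__ == "__main__":
    a, b, d = 1.0, 3.0, 2.0  # illustrative values
    print("lhs:", lemma_lhs(a, b, d))
    print("rhs:", lemma_rhs(a, b, d))
```

In the degenerate case $a = b$, $d = 0$ the integrand is $1 / (x^{2} + a^{2})$ and both sides reduce to $π / a$, which gives a quick hand check of the normalisation.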

Proposition A3 (Bhattacharyya coefficient for Cauchy, closed form). Let $P = \mathrm{Cauchy} (l_{1}, s_{1})$ and $Q = \mathrm{Cauchy} (l_{2}, s_{2})$ with $s_{1}, s_{2} > 0$, and set $δ := l_{1} - l_{2}$. Then

$$ρ_{1 / 2} (P, Q) = \int_{\mathbb{R}} \sqrt{p_{l_{1}, s_{1}} (x) \, p_{l_{2}, s_{2}} (x)} \, d x = \frac{4 \sqrt{s_{1} s_{2}}}{π \sqrt{(s_{1} + s_{2})^{2} + δ^{2}}} \, K \left( \frac{(s_{1} - s_{2})^{2} + δ^{2}}{(s_{1} + s_{2})^{2} + δ^{2}} \right) . \tag{A4}$$

Consequently,

$$D_{C} (P, Q) = D_{B, 1 / 2} (P, Q) = - \ln ρ_{1 / 2} (P, Q), \tag{A5}$$

with $ρ_{1 / 2} (P, Q)$ given by (A4).

Proof. By definition,

$$\sqrt{p_{l_{1}, s_{1}} (x) \, p_{l_{2}, s_{2}} (x)} = \frac{\sqrt{s_{1} s_{2}}}{π \sqrt{((x - l_{1})^{2} + s_{1}^{2}) ((x - l_{2})^{2} + s_{2}^{2})}} .$$

Shift $x \mapsto x + l_{2}$ to obtain

$$ρ_{1 / 2} (P, Q) = \frac{\sqrt{s_{1} s_{2}}}{π} \int_{- \infty}^{\infty} \frac{d x}{\sqrt{(x^{2} + s_{2}^{2}) ((x - δ)^{2} + s_{1}^{2})}} .$$

Apply Lemma A1 with $a = s_{2}$, $b = s_{1}$, $d = δ$ to get (A4). Finally, Proposition A2 yields $D_{C} (P, Q) = D_{B, 1 / 2} (P, Q)$, hence (A5). □

Remark A1 (Useful special cases). If $l_{1} = l_{2}$ (common location), then $δ = 0$ and (A4) reduces to

$$ρ_{1 / 2} (P, Q) = \frac{4 \sqrt{s_{1} s_{2}}}{π (s_{1} + s_{2})} \, K \left( \left( \frac{s_{1} - s_{2}}{s_{1} + s_{2}} \right)^{2} \right) .$$

If $s_{1} = s_{2} = s$ (common scale), then

$$ρ_{1 / 2} (P, Q) = \frac{4 s}{π \sqrt{4 s^{2} + δ^{2}}} \, K \left( \frac{δ^{2}}{4 s^{2} + δ^{2}} \right) .$$
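The closed forms (A4) and (A5) are straightforward to evaluate. The sketch below is a standard-library illustration under stated assumptions: the function names and parameter values are hypothetical, and $K (m)$ is computed through the arithmetic-geometric mean. As a consistency check it also compares $D_{C}$ with the KL divergence (A1), using the Jensen-type bound $- \ln \int \sqrt{p q} \leq \frac{1}{2} D_{KL} (P \| Q)$.

```python
import math


def ellipk(m, iterations=40):
    """K(m), parameter m in [0, 1), via K(m) = pi / (2 * AGM(1, sqrt(1 - m)))."""
    if not 0.0 <= m < 1.0:
        raise ValueError("m must lie in [0, 1)")
    a, g = 1.0, math.sqrt(1.0 - m)
    for _ in range(iterations):
        a, g = 0.5 * (a + g), math.sqrt(a * g)
    return math.pi / (2.0 * a)


def rho_half(l1, s1, l2, s2):
    """Bhattacharyya coefficient of two Cauchy laws, formula (A4)."""
    delta = l1 - l2
    big = (s1 + s2) ** 2 + delta ** 2
    m = ((s1 - s2) ** 2 + delta ** 2) / big
    return 4.0 * math.sqrt(s1 * s2) / (math.pi * math.sqrt(big)) * ellipk(m)


def chernoff(l1, s1, l2, s2):
    """Unweighted Chernoff information D_C = -ln rho_{1/2}, formula (A5)."""
    return -math.log(rho_half(l1, s1, l2, s2))


def kl(l1, s1, l2, s2):
    """KL divergence between two Cauchy laws, formula (A1)."""
    return math.log(((s1 + s2) ** 2 + (l1 - l2) ** 2) / (4.0 * s1 * s2))


if __name__ == "__main__":
    # Common scale s = 1 and location shift delta = 2 (illustrative values).
    print("D_C     :", chernoff(0.0, 1.0, 2.0, 1.0))
    print("D_KL / 2:", 0.5 * kl(0.0, 1.0, 2.0, 1.0))
```

For the common-scale pair with $s = 1$ and $δ = 2$, the parameter is $m = 1 / 2$ and $D_{C}$ comes out at roughly $0.181$, below $\frac{1}{2} D_{KL} = \frac{1}{2} \ln 2$.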

Remark A2 (Weighted case). For a general weight $φ$, one has

$$ρ_{1 / 2}^{w} (P, Q) = \int_{\mathbb{R}} φ (x) \sqrt{p_{l_{1}, s_{1}} (x) \, p_{l_{2}, s_{2}} (x)} \, d x .$$

The symmetry (A2) (and hence $α^{*} = 1 / 2$) typically fails unless $φ$ is compatible with the invariance argument used in [17]. We therefore use the Cauchy model mainly as a closed-form baseline example at $φ \equiv 1$.
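The loss of symmetry under a nonconstant weight can be seen in a small numerical experiment. The sketch below assumes the weighted coefficient $ρ_{α}^{w} = \int φ \, p^{α} q^{1 - α} \, d x$ and uses the hypothetical bounded weight $φ (x) = 1 / (1 + x^{2})$, which is not taken from the paper; the function names and parameter values are likewise illustrative.

```python
import math


def cauchy_pdf(x, l, s):
    """Density of Cauchy(l, s) at x."""
    return s / (math.pi * (s * s + (x - l) ** 2))


def weight(x):
    """Hypothetical bounded weight, chosen only for illustration."""
    return 1.0 / (1.0 + x * x)


def rho_w(alpha, l1, s1, l2, s2, n=20_000):
    """Midpoint-rule value of the integral of phi * p^alpha * q^(1 - alpha),
    using x = tan(u) so that dx = (1 + tan(u)^2) du."""
    h = math.pi / n
    total = 0.0
    for k in range(n):
        t = math.tan(-0.5 * math.pi + (k + 0.5) * h)
        p = cauchy_pdf(t, l1, s1)
        q = cauchy_pdf(t, l2, s2)
        total += weight(t) * p ** alpha * q ** (1.0 - alpha) * (1.0 + t * t)
    return total * h


if __name__ == "__main__":
    P, Q = (0.0, 1.0), (2.0, 3.0)  # illustrative (l, s) pairs
    print("rho_w at alpha = 0:", rho_w(0.0, *P, *Q))
    print("rho_w at alpha = 1:", rho_w(1.0, *P, *Q))
    grid = [i / 20 for i in range(21)]
    best = max(grid, key=lambda a: -math.log(rho_w(a, *P, *Q)))
    print("grid maximiser of -ln rho_w:", best)
```

With this weight the endpoint values are about $0.2$ at $α = 0$ and $0.5$ at $α = 1$, whereas in the unweighted case both equal $1$; the relation $ρ_{α}^{w} = ρ_{1 - α}^{w}$ therefore already fails at the endpoints.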

References

1. Chernoff, H. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Stat. 1952, 23, 493–507.
2. Hoeffding, W. Asymptotically optimal tests for multinomial distributions. Ann. Math. Stat. 1965, 36, 369–401.
3. Kelbert, M.; Suhov, Y. Context-sensitive hypothesis-testing and exponential families. Statistics 2025, 59, 845–878.
4. Kelbert, M.; Suhov, Y. On basic context-dependent concepts of Information Theory and Statistics. Theory Probab. Appl. 2026, 70, 563–583.
5. Kalimulina, E.Y. Application of multi-valued logic models in traffic aggregation problems in mobile networks. In Proceedings of the 2021 IEEE 15th International Conference on Application of Information and Communication Technologies (AICT), Baku, Azerbaijan, 13–15 October 2021; pp. 1–6.
6. Esin, A.A.; Kalimulina, E.Y. Markov-modulated queueing network for mobile traffic aggregation with threshold-controlled buffers. Math. Model. Numer. Simul. Appl. 2026, 6, 4.
7. Nielsen, F. Hypothesis testing, information divergence and computational geometry. In Geometric Science of Information; Nielsen, F., Barbaresco, F., Eds.; GSI 2013, Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2013; Volume 8085, pp. 241–248.
8. Amari, S.-I.; Nagaoka, H. Methods of Information Geometry; Translations of Mathematical Monographs; American Mathematical Society: Providence, RI, USA, 2000; Volume 191.
9. Chentsov, N.N. Statistical Decision Rules and Optimal Inference; Translations of Mathematical Monographs; American Mathematical Society: Providence, RI, USA, 1982; Volume 53.
10. Amari, S.-I. Information Geometry and Its Applications; Applied Mathematical Sciences; Springer: Tokyo, Japan, 2016; Volume 194.
11. Nielsen, F. An information-geometric characterization of Chernoff information. IEEE Signal Process. Lett. 2013, 20, 269–272.
12. Nielsen, F. Revisiting Chernoff information with likelihood ratio exponential families. Entropy 2022, 24, 1400.
13. Azuma, K. Weighted sums of certain dependent random variables. Tôhoku Math. J. 1967, 19, 357–367.
14. McDiarmid, C. On the method of bounded differences. In Surveys in Combinatorics; Cambridge University Press: Cambridge, UK, 1989; Volume 141, pp. 148–188.
15. Ren, Y.; Zhang, J.; Xia, Y.; Wang, R.; Xie, F.; Guan, J.; Zhang, H.; Zhou, S. Regression-based conditional independence test with adaptive kernels. Artif. Intell. 2025, 347, 104391.
16. Kalimulina, E.Y. Weighted Chernoff Information–Numerical Illustration. Software, companion to the present paper, version 1.0.0. Zenodo 2026.
17. Nielsen, F.; Okamura, K. On f-divergences between Cauchy distributions. IEEE Trans. Inf. Theory 2023, 69, 3150–3170.

Figure 1. The map $α \mapsto - \ln ρ_{α}^{w} (p, q)$ for the Gaussian hypotheses $N (0, 1)$, $N (3, 2)$ with weight (54), for $β \in \{0, 1 / 16, 1 / 4\}$. The optimum $α^{*}$ is marked by a bullet on each curve and shifts to the left as $β$ increases.

Figure 2. Optimal skewing parameter $α^{*} (β)$ for the Gaussian example. The dashed line marks the unweighted value $α^{*} (0)$.

Figure 3. Weighted Chernoff information $β \mapsto D_{C}^{w} (P, Q)$. The dashed line marks the classical value $D_{C}$ recovered at $β = 0$.

Table 1. Optimal skewing parameter and weighted Chernoff information for the Gaussian example (54), obtained by maximising (55).

$β$	$α^{*} (β)$	$D_{C}^{w} (P, Q)$
0 (unweighted)	$0.4153$	$0.8018$
$1 / 16$	$0.3355$	$0.9827$
$1 / 4$	$0.0963$	$1.4935$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

