Article

Bounds on the Excess Minimum Risk via Generalized Information Divergence Measures

Department of Mathematics and Statistics, Queen’s University, Kingston, ON K7L 3N6, Canada
*
Author to whom correspondence should be addressed.
Entropy 2025, 27(7), 727; https://doi.org/10.3390/e27070727
Submission received: 28 May 2025 / Revised: 30 June 2025 / Accepted: 1 July 2025 / Published: 5 July 2025
(This article belongs to the Special Issue Information Theoretic Learning with Its Applications)

Abstract

Given finite-dimensional random vectors Y, X, and Z that form a Markov chain in that order (Y → X → Z), we derive upper bounds on the excess minimum risk using generalized information divergence measures. Here, Y is a target vector to be estimated from an observed feature vector X or its stochastically degraded version Z. The excess minimum risk is defined as the difference between the minimum expected loss in estimating Y from X and from Z. We present a family of bounds that generalize a prior bound based on mutual information, using the Rényi and α-Jensen–Shannon divergences, as well as Sibson's mutual information. Our bounds are similar to recently developed bounds for the generalization error of learning algorithms. However, unlike these works, our bounds do not require the sub-Gaussian parameter to be constant and therefore apply to a broader class of joint distributions over Y, X, and Z. We also provide numerical examples under both constant and non-constant sub-Gaussianity assumptions, illustrating that our generalized divergence-based bounds can be tighter than those based on mutual information for certain regimes of the parameter α.

1. Introduction

The excess minimum risk in statistical inference quantifies the difference between the minimum expected loss attained by estimating a (target) hidden random vector from a feature (observed) random vector and the minimum expected loss incurred by estimating the hidden vector from a stochastically degraded version of the feature vector. The aim of this work is to derive upper bounds on the excess minimum risk in terms of generalized information divergence measures such as the Rényi divergence [1], the α -Jensen–Shannon divergence [2,3] and the Sibson mutual information [4,5].
Recently, several bounds of this nature, expressed in terms of information-theoretic measures, have appeared in the literature, including [6,7,8,9,10,11,12,13,14,15,16,17] among others. Most of these works have focused on the (expected) generalization error of learning algorithms. In [6], Xu and Raginsky established bounds on the generalization error in terms of Shannon’s mutual information between the (input) training dataset and the (output) hypothesis; these bounds are tightened in [7] by using the mutual information between individual data samples and the hypothesis. In [11], Modak et al. extend these works by obtaining upper bounds on the generalization error in terms of the Rényi divergence, employing the variational characterization of the Rényi divergence [18,19,20]. The authors also derive bounds on the probability of generalization error via Rényi’s divergence, which recover the bounds of Esposito et al. [9] (see also [8,10] for bounds expressed in terms of the f-divergence [21]). More recently, Aminian et al. [17] obtained a family of bounds on the generalization error and excess risk applicable to supervised learning settings using a so-called “auxiliary distribution method.” In particular, they derive new bounds based on the α -Jensen–Shannon and α -Rényi mutual information measures. Here, both measures are defined via divergences between a joint distribution and a product of its marginals: the former using the Jensen–Shannon divergence of weight α [3] (Equation (4.1)) (which is always finite), and the latter using the Rényi divergence of order α . Beyond learning-theoretic settings, Rényi divergence-based measures have also been successfully applied to classification problems, including time series and pattern classification, via belief and fractal extensions of the divergence [22,23,24]. In addition to information-theoretic approaches, generalization bounds based on PAC-Bayesian theory [25,26], particularly those involving f-divergences and Rényi-type divergences, have been actively studied. Separately, generalization bounds based on the Wasserstein distance [27] have also been established as an alternative approach based on optimal transport techniques. Connections between generalization error and transportation cost inequalities were explored in [28], recovering previous mutual information-based bounds and deriving a family of new bounds. A convex analytic approach is taken in [29], where information-theoretic measures of the dependence between input and output are replaced with arbitrary strongly convex functions of the input and output joint distribution. The resulting new generalization bounds either complement prior results or improve on these. Other works concerning the analysis of generalization error include [12,30] for deep learning generative adversarial networks [31] and [16] for the Gibbs algorithm (see also the extensive lists of references therein).
In this paper, we focus on the excess minimum risk in statistical inference. Our motivation is to generalize the results of Györfi et al. [14], who derived a mutual information-based upper bound that applies to a broad class of loss functions under standard sub-Gaussianity assumptions. Related but distinct work includes [13,15], where information-theoretic bounds on excess risk are developed in a Bayesian learning framework involving training data. Lower bounds on the Bayes risk in terms of information measures were recently developed in [32]. The contributions of our paper are as follows:
  • We extend the bound in [14] by introducing a family of bounds based on generalized information divergence measures, namely, the Rényi divergence, the α-Jensen–Shannon divergence, and the Sibson mutual information, parameterized by the order α ∈ (0, 1). Unlike [11] and [17], where the sub-Gaussian parameter is assumed to be constant, our setup allows this parameter to depend on the (target) random vector being estimated. This makes our bounds applicable to a broader class of joint distributions over the random vectors involved.
  • For the Rényi divergence-based bounds, we adopt an approach similar to that of [11], deriving upper bounds by making use of the variational representation of the Rényi divergence.
  • For the bounds involving the α -Jensen–Shannon divergence and the Sibson mutual information, we follow the methodology of [17], employing the auxiliary distribution method together with the variational representation of the Kullback–Leibler (KL) divergence [33].
  • We provide simple conditions under which the α -Jensen–Shannon divergence bound is tighter than the other two bounds for bounded loss functions.
  • We compare the bounds based on the aforementioned information divergence measures with mutual information-based bounds by providing numerical examples.
Our problem of bounding the excess minimum risk is closely related to recent work on generalization error in learning theory. In both settings, the goal is to understand how much performance is lost when a target variable is estimated from a less informative or transformed version of the input. In learning theory, this is often studied through generalization bounds, which compare the performance of a learned predictor on training and test data. As already stated, several recent works have used information-theoretic tools—such as mutual information and its generalizations—to bound the generalization error (e.g., [6,11,17]). Although these works focus on algorithm-dependent error, the structure of the bounds is similar to ours. Our bounds, instead, are on the excess minimum risk, which compares the best possible performance using full observations versus using degraded ones. Still, both approaches rely on similar tools, including variational characterizations and divergence measures. In this sense, our work takes a different but related approach by studying the basic limits of inference, rather than how well a particular algorithm performs.
This paper is organized as follows. In Section 2, we provide preliminary definitions and introduce the statistical inference problem. In Section 3, we establish a family of upper bounds on the excess minimum risk, expressed in terms of the Rényi divergence, the α-Jensen–Shannon divergence, and the Sibson mutual information, all parameterized by the order α ∈ (0, 1); Section 3 also includes an analytical comparison of the proposed bounds under bounded loss functions. In Section 4, we present several numerical examples, including cases with both constant and non-constant sub-Gaussian parameters, all of which demonstrate that the proposed bounds are tighter than the mutual information bound for a range of values of α. In Section 5, we provide concluding remarks and suggest directions for future work.

2. Preliminaries

2.1. Problem Setup

Consider a random vector Y ∈ R^p, p ≥ 1, that is to be estimated (predicted) from a random observation vector X taking values in R^q, q ≥ 1. Given a measurable estimator (predictor) f : R^q → R^p and a loss function l : R^p × R^p → R_+, the loss (risk) realized in estimating Y by f(X) is given by l(Y, f(X)). The minimum expected risk in predicting Y from X is defined by
\[
L_l^*(Y|X) = \inf_{f:\,\mathbb{R}^q \to \mathbb{R}^p} \mathbb{E}\bigl[l(Y, f(X))\bigr],
\]
where the infimum is over all measurable f.
We also consider another random observation vector Z that is a random transformation or stochastically degraded version of X obtained, for example, by observing X through a noisy channel. Here, Z takes values in R^r, r ≥ 1, and Y, X and Z form a Markov chain in this order, which we denote as Y → X → Z. We similarly define the minimum expected risk in predicting Y from Z as
\[
L_l^*(Y|Z) = \inf_{g:\,\mathbb{R}^r \to \mathbb{R}^p} \mathbb{E}\bigl[l(Y, g(Z))\bigr],
\]
where the infimum is over all measurable predictors g. With the notation introduced above, we define the excess minimum risk as the difference L_l^*(Y|Z) − L_l^*(Y|X), which is always non-negative due to the Markov chain condition Y → X → Z (e.g., see the data processing inequality for expected risk in [13] (Lemma 1)). Our objective is to establish upper bounds on this difference using generalized information divergence measures.
In [14], the random vector Z is taken as T(X), a transformation of the random vector X, where T : R^q → R^r is measurable. The authors derive bounds on the excess minimum risk using Shannon's mutual information. Here, we generalize these bounds by employing a family of information-divergence measures of order α ∈ (0, 1), which recover Shannon's mutual information in the limits α → 0 or α → 1. Furthermore, we use an arbitrary random vector Z as the degraded version of the observation X instead of T(X). We also provide examples where the various generalized information divergence-based bounds perform better than the mutual information-based bounds of [14] for a certain range of α.
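For concreteness, the following minimal Python sketch illustrates the excess minimum risk for finite alphabets and the 0–1 loss, where the minimum expected risk is the Bayes risk. The binary chain, channel parameters and input distribution below are hypothetical and purely illustrative; they are not taken from the paper.

import numpy as np

def bayes_risk_01(p_joint):
    """Minimum expected 0-1 loss L*(Y | O) for a finite joint pmf p_joint[y, o]:
    the Bayes predictor picks argmax_y P(Y=y, O=o), so the risk is 1 - sum_o max_y P(y, o)."""
    return 1.0 - p_joint.max(axis=0).sum()

# Hypothetical toy chain Y -> X -> Z: binary Y passed through two binary symmetric channels.
p_y = np.array([0.4, 0.6])
def bsc(eps):
    return np.array([[1 - eps, eps], [eps, 1 - eps]])
p_yx = p_y[:, None] * bsc(0.1)                                  # P(Y, X)
p_yz = (p_yx[:, :, None] * bsc(0.2)[None, :, :]).sum(axis=1)    # P(Y, Z)
excess = bayes_risk_01(p_yz) - bayes_risk_01(p_yx)              # excess minimum risk, always >= 0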
We next state some definitions that we will invoke in deriving our results.

2.2. Definitions

Consider two arbitrary jointly distributed random variables U and V defined on the same probability space (Ω, M) and taking values in 𝒰 and 𝒱, respectively. Let P_U and P_V be the marginal distributions of U and V, respectively, and P_{U,V} be a joint distribution over 𝒰 × 𝒱. We first provide definitions for the Rényi divergence-based measures.
Definition 1 
([1,34]). The Rényi divergence of order α ∈ (0, ∞), α ≠ 1, between the two probability measures P_U and P_V is denoted by D_α(P_U ∥ P_V) and defined as follows. Let ν be a sigma-finite positive measure such that P_U and P_V are absolutely continuous with respect to ν, written as P_U, P_V ≪ ν, with Radon–Nikodym derivatives dP_U/dν = p_U and dP_V/dν = p_V, respectively. Then
\[
D_\alpha(P_U \,\|\, P_V) =
\begin{cases}
\dfrac{1}{\alpha-1}\log \displaystyle\int (p_U)^{\alpha} (p_V)^{1-\alpha}\, d\nu & \text{if } 0<\alpha<1, \text{ or } \alpha>1 \text{ and } P_U \ll P_V,\\[2mm]
+\infty & \text{if } \alpha>1 \text{ and } P_U \not\ll P_V.
\end{cases}
\]
Definition 2. 
The conditional Rényi divergence of order α between the conditional distributions P_{V|U} and Q_{V|U} given P_U is denoted by D_α(P_{V|U} ∥ Q_{V|U} | P_U) and given by
\[
D_\alpha\bigl(P_{V|U} \,\|\, Q_{V|U} \mid P_U\bigr) = \mathbb{E}_{P_U}\Bigl[ D_\alpha\bigl(P_{V|U}(\cdot\,|U) \,\|\, Q_{V|U}(\cdot\,|U)\bigr) \Bigr],
\]
where E_{P_U}[·] denotes expectation with respect to the distribution P_U.
Note that the above definition of conditional Rényi divergence differs from the somewhat standard one, which is given as D_α(P_{V|U} P_U ∥ Q_{V|U} P_U), e.g., see [35] (Definition 3). We adopt the above definition because it is well-tailored to our setting, which allows sub-Gaussianity parameters to be random. However, as α → 1, both notions of the conditional Rényi divergence recover the conditional KL divergence, which is
\[
D_{\mathrm{KL}}\bigl(P_{V|U} \,\|\, Q_{V|U} \mid P_U\bigr) = D_{\mathrm{KL}}\bigl(P_{V|U} P_U \,\|\, Q_{V|U} P_U\bigr) = \mathbb{E}_{P_U}\!\left[ \int p_{V|U} \log \frac{p_{V|U}}{q_{V|U}}\, d\nu \right].
\]
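For finite alphabets with full-support distributions, Definitions 1 and 2 can be evaluated directly. The short Python sketch below is our own illustration (not code from the paper) and computes the Rényi divergence for 0 < α < 1 together with the conditional version of Definition 2.

import numpy as np

def renyi_divergence(p, q, alpha):
    """Rényi divergence D_alpha(p || q) for discrete pmfs p, q and order alpha in (0, 1)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1)

def conditional_renyi_divergence(p_v_given_u, q_v_given_u, p_u, alpha):
    """E_{P_U}[ D_alpha(P_{V|U}(.|u) || Q_{V|U}(.|u)) ], as in Definition 2.
    p_v_given_u, q_v_given_u: arrays of shape (|U|, |V|) whose rows are conditional pmfs."""
    divs = [renyi_divergence(p_v_given_u[u], q_v_given_u[u], alpha) for u in range(len(p_u))]
    return float(np.dot(p_u, divs))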
We next provide the definitions for the α -Jensen–Shannon divergence-based measures.
Definition 3 
([2,3]). The α-Jensen–Shannon divergence for α ∈ (0, 1) between two probability measures P_U and P_V on a measurable space (Ω, M) is denoted by JS_α(P_U ∥ P_V) and given by
\[
JS_\alpha(P_U \,\|\, P_V) = \alpha\, D_{\mathrm{KL}}\bigl(P_U \,\|\, \alpha P_U + (1-\alpha) P_V\bigr) + (1-\alpha)\, D_{\mathrm{KL}}\bigl(P_V \,\|\, \alpha P_U + (1-\alpha) P_V\bigr),
\]
where D_KL(·∥·) is the KL divergence.
Definition 4. 
The conditional α-Jensen–Shannon divergence between the conditional distributions P_{V|U} and Q_{V|U} given P_U is denoted by JS_α(P_{V|U} ∥ Q_{V|U} | P_U) and given by
\[
JS_\alpha\bigl(P_{V|U} \,\|\, Q_{V|U} \mid P_U\bigr) = \mathbb{E}_{P_U}\Bigl[ JS_\alpha\bigl(P_{V|U}(\cdot\,|U) \,\|\, Q_{V|U}(\cdot\,|U)\bigr) \Bigr],
\]
where E_{P_U}[·] denotes expectation with respect to the distribution P_U.
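Definitions 3 and 4 are likewise straightforward to evaluate numerically. The sketch below (illustrative Python for discrete distributions, not from the paper) computes JS_α from two KL terms against the α-mixture, exactly as in Definition 3.

import numpy as np

def kl_divergence(p, q):
    """KL divergence for discrete pmfs, with the convention 0 log 0 = 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_alpha(p, q, alpha):
    """alpha-Jensen-Shannon divergence JS_alpha(p || q) of Definition 3, for alpha in (0, 1)."""
    m = alpha * np.asarray(p, float) + (1 - alpha) * np.asarray(q, float)
    return alpha * kl_divergence(p, m) + (1 - alpha) * kl_divergence(q, m)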
Next we define the Sibson mutual information (of order α ).
Definition 5 
([4,5]). Let α ∈ (0, 1) ∪ (1, ∞). The Sibson mutual information of order α between V and U is denoted by I_α^S(V; U) and given by
\[
I_\alpha^S(V;U) = \min_{Q_U \in \mathcal{P}(\mathcal{U})} D_\alpha\bigl(P_{U,V} \,\|\, Q_U \otimes P_V\bigr),
\]
where 𝒫(𝒰) denotes the set of probability distributions on 𝒰.
It is known that D_α(P_{U,V} ∥ Q_U ⊗ P_V) is convex in Q_U [34], which allows for a closed-form expression for the minimizer and, consequently, for the Sibson mutual information I_α^S(V; U) [9,36]. Let U* denote a random variable whose distribution achieves the minimum, with the corresponding distribution P_{U*}. Then, the Sibson mutual information of order α can equivalently be written as follows.
Definition 6 
([9]). Let ν be a sigma-finite positive measure such that P_{U,V} and P_U ⊗ P_V are absolutely continuous with respect to ν × ν, written as P_{U,V}, P_U ⊗ P_V ≪ ν × ν, with Radon–Nikodym derivatives (densities) dP_{U,V}/d(ν × ν) = p_{U,V} and d(P_U ⊗ P_V)/d(ν × ν) = p_U p_V, respectively. For α ∈ (0, 1) ∪ (1, ∞), the Sibson mutual information of order α between V and U can be written as follows:
\[
I_\alpha^S(V;U) = D_\alpha\bigl(P_{U,V} \,\|\, P_{U^*} \otimes P_V\bigr),
\]
where the distribution P_{U*} has density
\[
p_{U^*}(u) = \frac{dP_{U^*}}{d\nu}(u) = \frac{\left[ \displaystyle\int \left( \frac{p_{U,V}(u,v)}{p_U(u)\, p_V(v)} \right)^{\alpha} p_V(v)\, dv \right]^{\frac{1}{\alpha}} p_U(u)}{\displaystyle\int \left[ \displaystyle\int \left( \frac{p_{U,V}(u,v)}{p_U(u)\, p_V(v)} \right)^{\alpha} p_V(v)\, dv \right]^{\frac{1}{\alpha}} p_U(u)\, du}.
\]
Remark 1. 
From Definition 6, we note that the Sibson mutual information of order α is a functional of the distributions P_{U,V} and P_{U*}. Hence, from this point onward, we denote with a slight abuse of notation the Sibson mutual information of order α between V and U by I_α^S(P_{U,V}, P_{U*}).
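For finite alphabets, the minimizing distribution in Definition 6 and the resulting Sibson mutual information can be evaluated from the closed form in (8). The Python sketch below is our own illustration (not the paper's code); it assumes a full-support joint pmf and α ∈ (0, 1), and evaluates I_α^S by plugging the minimizer P_{U*} back into the Rényi divergence, which simplifies to the logarithmic expression used below.

import numpy as np

def sibson_mi(p_uv, alpha):
    """Sibson mutual information of order alpha in (0,1) for a full-support joint pmf
    p_uv of shape (|U|, |V|), together with the minimizing distribution P_{U*} of (8)."""
    p_u = p_uv.sum(axis=1, keepdims=True)
    p_v = p_uv.sum(axis=0, keepdims=True)
    # inner(u) = sum_v p_{V|U}(v|u)^alpha * p_V(v)^(1-alpha)
    inner = ((p_uv / p_u) ** alpha * p_v ** (1 - alpha)).sum(axis=1)
    weights = p_u.ravel() * inner ** (1.0 / alpha)
    p_u_star = weights / weights.sum()                       # density of P_{U*} as in (8)
    i_sibson = alpha / (alpha - 1) * np.log(weights.sum())   # = D_alpha(P_{U,V} || P_{U*} x P_V)
    return i_sibson, p_u_star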
We end this section with the definitions of the sub-Gaussian and conditional sub-Gaussian properties.
Definition 7. 
A real random variable U with finite expectation is said to be σ²-sub-Gaussian for some σ² > 0 if
\[
\log \mathbb{E}\bigl[ e^{\lambda (U - \mathbb{E}[U])} \bigr] \le \frac{\sigma^2 \lambda^2}{2}
\]
for all λ ∈ R.
Definition 8. 
A real random variable U is said to be conditionally σ²-sub-Gaussian given another random variable V (i.e., under P_{U|V}) for some σ² > 0 if we have almost surely that
\[
\log \mathbb{E}\bigl[ e^{\lambda (U - \mathbb{E}[U|V])} \,\big|\, V \bigr] \le \frac{\sigma^2 \lambda^2}{2}
\]
for all λ ∈ R.
Throughout the paper, we omit stating explicitly that the conditional sub-Gaussian inequality holds almost surely for the sake of simplicity.
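As a quick illustration of Definition 7, the following Python snippet numerically checks the sub-Gaussian moment-generating-function bound for a bounded variable, whose sub-Gaussian parameter (b − a)²/4 = 1/4 follows from Hoeffding's lemma. The Bernoulli distribution and the grid of λ values are a hypothetical test case of ours, not an example from the paper.

import numpy as np

# Check Definition 7 for U ~ Bernoulli(0.3) on {0, 1}: Hoeffding's lemma gives sigma^2 = 1/4.
p = 0.3
sigma2 = 0.25
for lam in np.linspace(-10, 10, 9):
    mgf = p * np.exp(lam * (1 - p)) + (1 - p) * np.exp(lam * (0 - p))  # E[e^{lambda (U - E[U])}]
    assert np.log(mgf) <= sigma2 * lam**2 / 2 + 1e-12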

3. Bounding Excess Minimum Risk

In this section, we establish a series of bounds on the excess minimum risk based on different information-theoretic measures. Our approach combines the variational characterizations of the KL divergence [33], the Rényi divergence [18], and the Sibson mutual information [9] (Theorem 2), along with the auxiliary distribution method introduced in [17].

3.1. Rényi Divergence-Based Upper Bound

We first state the variational characterization of the Rényi divergence [18], which generalizes the Donsker–Varadhan variational formula for KL divergence [33].
Lemma 1 
([18] (Theorem 3.1)). Let P and Q be two probability measures on (Ω, M) and α ∈ (0, ∞), α ≠ 1. Let g be a measurable function such that e^{(α−1)g} ∈ L¹(P) and e^{αg} ∈ L¹(Q), where L¹(μ) denotes the collection of all measurable functions with finite L¹-norm. Then,
\[
D_\alpha(P \,\|\, Q) \ge \frac{\alpha}{\alpha-1} \log \mathbb{E}_P\bigl[ e^{(\alpha-1) g(X)} \bigr] - \log \mathbb{E}_Q\bigl[ e^{\alpha g(X)} \bigr].
\]
We next provide the following lemma, whose proof is a slight generalization of [11] (Lemma 2) and [14] (Lemma 1).
Lemma 2. 
Consider two arbitrary jointly distributed random variables U and V defined on the same probability space and taking values in 𝒰 and 𝒱, respectively. Given a measurable function h : 𝒰 × 𝒱 → R, assume that h(u, V) is σ²(u)-sub-Gaussian under P_V and P_{V|U=u} for all u ∈ 𝒰, where E[σ²(U)] < ∞. Then for α ∈ (0, 1),
\[
\bigl| \mathbb{E}[h(U,V)] - \mathbb{E}[h(\bar U, \bar V)] \bigr| \le \sqrt{ \frac{2\, \mathbb{E}[\sigma^2(U)]\, D_\alpha\bigl(P_{V|U} \,\|\, P_V \mid P_U\bigr)}{\alpha} },
\]
where Ū and V̄ are independent copies of U and V, respectively (i.e., P_{Ū,V̄} = P_U ⊗ P_V).
Proof. 
By the sub-Gaussian property, we have that
\[
\log \mathbb{E}\Bigl[ e^{(\alpha-1)\lambda h(u,V) - \mathbb{E}[(\alpha-1)\lambda h(u,V) \mid U=u]} \,\Big|\, U=u \Bigr] \le \frac{\lambda^2 (\alpha-1)^2 \sigma^2(u)}{2}
\]
and
\[
\log \mathbb{E}\Bigl[ e^{\alpha\lambda h(u,V) - \mathbb{E}[\alpha\lambda h(u,V)]} \Bigr] \le \frac{\lambda^2 \alpha^2 \sigma^2(u)}{2}.
\]
Re-arranging the terms gives us
\[
\log \mathbb{E}\bigl[ e^{(\alpha-1)\lambda h(u,V)} \,\big|\, U=u \bigr] \le \frac{\lambda^2 (\alpha-1)^2 \sigma^2(u)}{2} - \mathbb{E}\bigl[(1-\alpha)\lambda h(u,V) \mid U=u\bigr]
\]
and
\[
\log \mathbb{E}\bigl[ e^{\alpha\lambda h(u,V)} \bigr] \le \frac{\lambda^2 \alpha^2 \sigma^2(u)}{2} + \mathbb{E}\bigl[\alpha\lambda h(u,V)\bigr].
\]
Note that by (12) and (13), e^{(α−1)λh(u,V)} ∈ L¹(P_{V|U=u}) and e^{αλh(u,V)} ∈ L¹(P_V). By the variational formula in (11), we have that
\[
D_\alpha\bigl(P_{V|U=u} \,\|\, P_V\bigr) \ge \frac{\alpha}{\alpha-1} \log \mathbb{E}\bigl[ e^{(\alpha-1)\lambda h(u,V)} \,\big|\, U=u \bigr] - \log \mathbb{E}\bigl[ e^{\alpha\lambda h(u,V)} \bigr].
\]
Substituting (14) and (15) in (16) yields
\[
\begin{aligned}
D_\alpha\bigl(P_{V|U=u} \,\|\, P_V\bigr)
&\ge \frac{\alpha}{\alpha-1}\left( \frac{\lambda^2 (\alpha-1)^2 \sigma^2(u)}{2} - \mathbb{E}\bigl[(1-\alpha)\lambda h(u,V) \mid U=u\bigr] \right) - \frac{\lambda^2 \alpha^2 \sigma^2(u)}{2} - \mathbb{E}\bigl[\alpha\lambda h(u,V)\bigr] \\
&= \alpha\lambda\bigl( \mathbb{E}[h(u,V) \mid U=u] - \mathbb{E}[h(u,V)] \bigr) - \frac{\lambda^2 \alpha(1-\alpha)\sigma^2(u)}{2} - \frac{\lambda^2 \alpha^2 \sigma^2(u)}{2} \\
&= \alpha\lambda\bigl( \mathbb{E}[h(u,V) \mid U=u] - \mathbb{E}[h(u,V)] \bigr) - \frac{\lambda^2 \alpha \sigma^2(u)}{2}.
\end{aligned}
\]
The left-hand side of the resulting inequality
\[
\frac{\lambda^2 \alpha \sigma^2(u)}{2} - \alpha\lambda\bigl( \mathbb{E}[h(u,V) \mid U=u] - \mathbb{E}[h(u,V)] \bigr) + D_\alpha\bigl(P_{V|U=u} \,\|\, P_V\bigr) \ge 0
\]
is a non-negative quadratic polynomial in λ. Thus, the discriminant is non-positive and we have
\[
\Bigl( \alpha\bigl( \mathbb{E}[h(u,V) \mid U=u] - \mathbb{E}[h(u,V)] \bigr) \Bigr)^2 \le 4\,\alpha\,\frac{\sigma^2(u)}{2}\, D_\alpha\bigl(P_{V|U=u} \,\|\, P_V\bigr).
\]
Therefore,
\[
\bigl| \mathbb{E}[h(u,V) \mid U=u] - \mathbb{E}[h(u,V)] \bigr| \le \sqrt{ \frac{2\,\sigma^2(u)\, D_\alpha\bigl(P_{V|U=u} \,\|\, P_V\bigr)}{\alpha} }.
\]
Since Ū and V̄ are independent and P_V̄ = P_V, we have that
\[
\mathbb{E}[h(u,V)] = \mathbb{E}\bigl[ h(\bar U, \bar V) \,\big|\, \bar U = u \bigr].
\]
Therefore, we have
\[
\begin{aligned}
\bigl| \mathbb{E}[h(U,V)] - \mathbb{E}[h(\bar U, \bar V)] \bigr|
&= \left| \int \bigl( \mathbb{E}[h(U,V) \mid U=u] - \mathbb{E}[h(\bar U, \bar V) \mid \bar U = u] \bigr)\, P_U(du) \right| \\
&= \left| \int \bigl( \mathbb{E}[h(u,V) \mid U=u] - \mathbb{E}[h(u,V)] \bigr)\, P_U(du) \right| \\
&\le \int \bigl| \mathbb{E}[h(u,V) \mid U=u] - \mathbb{E}[h(u,V)] \bigr|\, P_U(du) \\
&\le \int \sqrt{ \frac{2\,\sigma^2(u)\, D_\alpha\bigl(P_{V|U=u} \,\|\, P_V\bigr)}{\alpha} }\, P_U(du) \\
&\le \sqrt{ \int 2\,\sigma^2(u)\, P_U(du) }\, \sqrt{ \int \frac{D_\alpha\bigl(P_{V|U=u} \,\|\, P_V\bigr)}{\alpha}\, P_U(du) } \\
&= \sqrt{ \frac{2\,\mathbb{E}[\sigma^2(U)]\, D_\alpha\bigl(P_{V|U} \,\|\, P_V \mid P_U\bigr)}{\alpha} },
\end{aligned}
\]
where (18) follows from Jensen's inequality, (19) follows from (17), and (20) follows from the Cauchy–Schwarz inequality and the definition of conditional Rényi divergence in (3) with D_α(P_{V|U} ∥ P_V | P_U) = E_U[D_α(P_{V|U}(·|U) ∥ P_V)]. □
Note that the Rényi divergence-based bound in Lemma 2 differs from that in [17] (Theorem 3). In our approach, we consider sub-Gaussianity under both P_V and P_{V|U=u} for all u ∈ 𝒰, which allows for non-constant sub-Gaussian parameters. This leads to a more general bound that applies to a broader class of loss functions.
We next use Lemma 2 to derive our theorem for the Rényi divergence-based bound; its proof is an adaptation of [14] (Theorem 3).
Theorem 1. 
Let X, Y and Z be random vectors such that Y → X → Z, as described in Section 2.1. Assume that there exists an optimal estimator f of Y from X such that l(y, f(X)) is conditionally σ²(y)-sub-Gaussian under P_{X|Z} and P_{X|Z,Y=y} for all y ∈ R^p, i.e.,
\[
\log \mathbb{E}\Bigl[ e^{\lambda\bigl( l(y, f(X)) - \mathbb{E}[l(y, f(X)) \mid Z] \bigr)} \,\Big|\, Z \Bigr] \le \frac{\sigma^2(y)\lambda^2}{2}
\]
and
\[
\log \mathbb{E}\Bigl[ e^{\lambda\bigl( l(y, f(X)) - \mathbb{E}[l(y, f(X)) \mid Z, Y=y] \bigr)} \,\Big|\, Z, Y=y \Bigr] \le \frac{\sigma^2(y)\lambda^2}{2}
\]
for all λ ∈ R and y ∈ R^p, where σ² : R^p → R satisfies E[σ²(Y)] < ∞. Then for α ∈ (0, 1), the excess minimum risk satisfies
\[
L_l^*(Y|Z) - L_l^*(Y|X) \le \sqrt{ \frac{2\, \mathbb{E}[\sigma^2(Y)]}{\alpha}\, D_\alpha\bigl(P_{X|Y,Z} \,\|\, P_{X|Z} \mid P_{Y,Z}\bigr) }.
\]
Proof of Theorem 1. 
Let X̄, Ȳ and Z̄ be random variables such that P_{Ȳ|Z̄} = P_{Y|Z}, P_{X̄|Z̄} = P_{X|Z}, P_{Z̄} = P_Z, and Ȳ and X̄ are conditionally independent given Z̄, i.e., P_{Ȳ,X̄,Z̄} = P_{Y|Z} P_{X|Z} P_Z.
We apply Lemma 2 by setting U = Y, V = X and h(u, v) = l(y, f(x)). Consider E[l(Y, f(X)) | Z = z] and E[l(Ȳ, f(X̄)) | Z = z] as regular expectations taken with respect to P_{Y,X|Z=z} and P_{Ȳ,X̄|Z=z}. Since Ȳ and X̄ are conditionally independent given Z̄ = z and P_{Z̄} = P_Z, we have that
\[
\bigl| \mathbb{E}[l(Y, f(X)) \mid Z=z] - \mathbb{E}[l(\bar Y, f(\bar X)) \mid Z=z] \bigr| \le \sqrt{ \frac{2\, \mathbb{E}[\sigma^2(Y) \mid Z=z]}{\alpha}\, D_\alpha\bigl(P_{X|Y,Z=z} \,\|\, P_{X|Z=z} \mid P_{Y|Z=z}\bigr) }.
\]
Now,
\[
\begin{aligned}
\bigl| \mathbb{E}[l(Y,f(X))] - \mathbb{E}[l(\bar Y, f(\bar X))] \bigr|
&\le \int \bigl| \mathbb{E}[l(Y,f(X)) \mid Z=z] - \mathbb{E}[l(\bar Y, f(\bar X)) \mid Z=z] \bigr|\, P_Z(dz) \\
&\le \int \sqrt{ \frac{2\, \mathbb{E}[\sigma^2(Y)\mid Z=z]}{\alpha}\, D_\alpha\bigl(P_{X|Y,Z=z} \,\|\, P_{X|Z=z} \mid P_{Y|Z=z}\bigr) }\; P_Z(dz) \\
&\le \sqrt{ \int 2\, \mathbb{E}[\sigma^2(Y)\mid Z=z]\, P_Z(dz) }\, \sqrt{ \int \frac{D_\alpha\bigl(P_{X|Y,Z=z} \,\|\, P_{X|Z=z} \mid P_{Y|Z=z}\bigr)}{\alpha}\, P_Z(dz) } \\
&= \sqrt{ \frac{2\, \mathbb{E}[\sigma^2(Y)]}{\alpha}\, D_\alpha\bigl(P_{X|Y,Z} \,\|\, P_{X|Z} \mid P_{Y,Z}\bigr) },
\end{aligned}
\]
where the first inequality follows from Jensen's inequality and since P_{Z̄} = P_Z, the second inequality follows from (23), the third from the Cauchy–Schwarz inequality, and the equality follows from (3). Since Ȳ and X̄ are conditionally independent given Z̄, we obtain the Markov chain Ȳ → Z̄ → X̄. Then, we have
\[
\mathbb{E}[l(\bar Y, f(\bar X))] \ge L_l^*(\bar Y \mid \bar X) \ge L_l^*(\bar Y \mid \bar Z) = L_l^*(Y \mid Z),
\]
where the first inequality follows since Ȳ → X̄ → f(X̄), the second inequality holds since Ȳ → Z̄ → X̄ by construction, and the equality follows since (Ȳ, Z̄) and (Y, Z) have the same distribution by construction. Since f is an optimal estimator of Y from X, we also have
\[
\mathbb{E}[l(Y, f(X))] = L_l^*(Y|X).
\]
Therefore, using (25) and (26) in (24), combined with the fact that L_l^*(Y|Z) ≥ L_l^*(Y|X), we arrive at the desired inequality:
\[
L_l^*(Y|Z) - L_l^*(Y|X) \le \sqrt{ \frac{2\, \mathbb{E}[\sigma^2(Y)]}{\alpha}\, D_\alpha\bigl(P_{X|Y,Z} \,\|\, P_{X|Z} \mid P_{Y,Z}\bigr) }. \qquad\square
\]
Remark 2. 
Taking the limit as α → 1 of the right-hand side of (22) in Theorem 1, we have that
\[
L_l^*(Y|Z) - L_l^*(Y|X) \le \sqrt{ 2\, \mathbb{E}[\sigma^2(Y)]\, D_{\mathrm{KL}}\bigl(P_{X|Y,Z} \,\|\, P_{X|Z} \mid P_{Y,Z}\bigr) } = \sqrt{ 2\, \mathbb{E}[\sigma^2(Y)]\, \bigl( I(X;Y) - I(Z;Y) \bigr) },
\]
recovering the bound in [14] (Theorem 3).
As a special case, we consider bounded loss functions, which naturally satisfy the conditional sub-Gaussian condition. The following corollary is an application of Theorem 1 under a fixed sub-Gaussian parameter. For completeness, we include the full proof.
Corollary 1. 
Suppose the loss function l is bounded, i.e., ‖l‖_∞ = sup_{y,y′} l(y, y′) < ∞. Then for random vectors X, Y and Z such that Y → X → Z as described in Section 2.1, we have the following inequality for α ∈ (0, 1) on the excess minimum risk:
\[
L_l^*(Y|Z) - L_l^*(Y|X) \le \frac{\|l\|_\infty}{\sqrt{2}} \sqrt{ \frac{D_\alpha\bigl(P_{X|Y,Z} \,\|\, P_{X|Z} \mid P_{Y,Z}\bigr)}{\alpha} }.
\]
Proof. 
We show that the bounded loss function l satisfies the conditional sub-Gaussian properties of Theorem 1. Since l is bounded, we have that for any f : R^q → R^p, x ∈ R^q and y ∈ R^p, l(y, f(x)) ∈ [0, ‖l‖_∞]. Then, by Hoeffding's lemma [37], we can write
\[
\log \mathbb{E}\bigl[ e^{\lambda l(y, f(X))} \,\big|\, Z \bigr] \le \mathbb{E}\bigl[ \lambda l(y, f(X)) \,\big|\, Z \bigr] + \frac{\|l\|_\infty^2 \lambda^2}{8}
\]
and
\[
\log \mathbb{E}\bigl[ e^{\lambda l(y, f(X))} \,\big|\, Z, Y=y \bigr] \le \mathbb{E}\bigl[ \lambda l(y, f(X)) \,\big|\, Z, Y=y \bigr] + \frac{\|l\|_\infty^2 \lambda^2}{8}
\]
for all λ ∈ R and y ∈ R^p. Rearranging the above inequalities gives us that l(y, f(X)) is conditionally (‖l‖_∞²/4)-sub-Gaussian under both P_{X|Z} and P_{X|Z,Y=y} for all y ∈ R^p. Then by (22), we have
\[
L_l^*(Y|Z) - L_l^*(Y|X) \le \frac{\|l\|_\infty}{\sqrt{2}} \sqrt{ \frac{D_\alpha\bigl(P_{X|Y,Z} \,\|\, P_{X|Z} \mid P_{Y,Z}\bigr)}{\alpha} }. \qquad\square
\]
Remark 3. 
Taking the limit as α → 1 of (29) in Corollary 1 yields the mutual information-based bound:
\[
L_l^*(Y|Z) - L_l^*(Y|X) \le \frac{\|l\|_\infty}{\sqrt{2}} \sqrt{ D_{\mathrm{KL}}\bigl(P_{X|Y,Z} \,\|\, P_{X|Z} \mid P_{Y,Z}\bigr) } = \frac{\|l\|_\infty}{\sqrt{2}} \sqrt{ I(X;Y) - I(Z;Y) },
\]
which recovers the bound in [14] (Corollary 1).

3.2. α -Jensen–Shannon Divergence-Based Upper Bound

We next derive α-Jensen–Shannon divergence-based bounds on the excess minimum risk.
We consider two arbitrary jointly distributed random variables U and V defined on the same probability space and taking values in 𝒰 and 𝒱, respectively. Throughout this section, we work with the joint distribution P_{U,V} over 𝒰 × 𝒱 and the corresponding product of marginals P_U ⊗ P_V. For convenience, we also define additional distributions that will play an important role in the derivation of our bounds.
Definition 9. 
The α-convex combination of the joint distribution P_{U,V} and the product of the two marginals P_U ⊗ P_V is denoted by P^{(α)}_{U,V} and given by
\[
P^{(\alpha)}_{U,V} = \alpha\, P_{U,V} + (1-\alpha)\, P_U \otimes P_V
\]
for α ∈ (0, 1).
Definition 10. 
The α-conditional convex combination of the conditional distribution P_{V|U} and the marginal P_V is denoted by P^{(α)}_{V|U} and given by
\[
P^{(\alpha)}_{V|U} = \alpha\, P_{V|U} + (1-\alpha)\, P_V
\]
for α ∈ (0, 1).
We first provide the following lemma, whose proof, given in Appendix A, is a slight generalization of [17] (Lemma 2) and [14] (Lemma 1).
Lemma 3. 
Given a function h : 𝒰 × 𝒱 → R, assume that h(u, V) is σ²(u)-sub-Gaussian under P^{(α)}_{V|U=u} for all u ∈ 𝒰, where E[σ²(U)] < ∞. Then for α ∈ (0, 1),
\[
\bigl| \mathbb{E}_{P_{U,V}}[h(U,V)] - \mathbb{E}_{P_U \otimes P_V}[h(U,V)] \bigr| \le \sqrt{ \frac{2\, \mathbb{E}[\sigma^2(U)]\, JS_\alpha\bigl(P_{U,V} \,\|\, P_U \otimes P_V\bigr)}{\alpha(1-\alpha)} }.
\]
We next use Lemma 3 to derive our theorem for α -Jensen–Shannon divergence-based bound. The proof of the theorem is relegated to Appendix A.
Theorem 2. 
Let X, Y and Z be random vectors such that Y → X → Z, as described in Section 2.1. Assume that there exists an optimal estimator f of Y from X such that l(y, f(X)) is conditionally σ²(y)-sub-Gaussian under P^{(α)}_{X|Z,Y=y} = α P_{X|Z,Y=y} + (1−α) P_{X|Z} for all y ∈ R^p, i.e.,
\[
\log \mathbb{E}_{P^{(\alpha)}_{X|Z,Y=y}}\Bigl[ e^{\lambda\bigl( l(y,f(X)) - \mathbb{E}_{P^{(\alpha)}_{X|Z,Y=y}}[l(y,f(X))] \bigr)} \Bigr] \le \frac{\sigma^2(y)\lambda^2}{2}
\]
for all λ ∈ R and y ∈ R^p, where σ² : R^p → R satisfies E[σ²(Y)] < ∞. Then for α ∈ (0, 1), the excess minimum risk satisfies
\[
L_l^*(Y|Z) - L_l^*(Y|X) \le \sqrt{ \frac{2\, \mathbb{E}[\sigma^2(Y)]}{\alpha(1-\alpha)}\, JS_\alpha\bigl(P_{Y,X|Z} \,\|\, P_{Y|Z} \otimes P_{X|Z} \,\big|\, P_Z\bigr) }.
\]
Remark 4. 
Taking the limit as α → 0 on the right-hand side of (33) in Theorem 2, we obtain
\[
L_l^*(Y|Z) - L_l^*(Y|X) \le \sqrt{ 2\, \mathbb{E}[\sigma^2(Y)]\, \bigl( I(X;Y) - I(Z;Y) \bigr) },
\]
recovering the bound (27) of [14] (Theorem 3). Furthermore, taking the limit as α → 1 of the right-hand side of (33) in Theorem 2 yields
\[
L_l^*(Y|Z) - L_l^*(Y|X) \le \sqrt{ 2\, \mathbb{E}[\sigma^2(Y)]\, D_{\mathrm{KL}}\bigl(P_{Y|Z} \otimes P_{X|Z} \,\|\, P_{Y,X|Z} \,\big|\, P_Z\bigr) } = \sqrt{ 2\, \mathbb{E}[\sigma^2(Y)]\, \bigl( L(X;Y) - L(Z;Y) \bigr) },
\]
where L(U;V) := D_KL(P_U ⊗ P_V ∥ P_{U,V}) is called the Lautum information between U and V, defined as the reverse KL divergence (i.e., the KL divergence between the product of marginals and the joint distribution) [38]. We, therefore, obtain an upper bound on the excess minimum risk in terms of the reverse KL divergence.
We close this section by specializing Theorem 2 to the case of bounded loss functions, hence obtaining a counterpart result to Corollary 1.
Corollary 2. 
Suppose the loss function l is bounded. Then for random vectors X, Y and Z such that Y → X → Z as described in Section 2.1, we have the following inequality for α ∈ (0, 1) on the excess minimum risk:
\[
L_l^*(Y|Z) - L_l^*(Y|X) \le \frac{\|l\|_\infty}{\sqrt{2}} \sqrt{ \frac{JS_\alpha\bigl(P_{Y,X|Z} \,\|\, P_{Y|Z} \otimes P_{X|Z} \,\big|\, P_Z\bigr)}{\alpha(1-\alpha)} }.
\]

3.3. Sibson Mutual Information-Based Upper Bound

Here we bound the excess minimum risk based on Sibson's mutual information. We recall from Definition 6 that U and V are jointly distributed on measurable spaces 𝒰 and 𝒱, with joint distribution P_{U,V} and marginals P_U and P_V, assuming that all distributions are absolutely continuous with respect to a common sigma-finite measure ν, with densities p_U, p_V, and p_{U,V}. Let U* denote the random variable whose distribution P_{U*} attains the minimum in the definition of the Sibson mutual information I_α^S(P_{U,V}, P_{U*}), with density p_{U*} as given in (8). We now define an auxiliary distribution that will be central to the derivation of the main bounds in this section.
Definition 11. 
Let P_{Û,V̂} be a joint distribution on 𝒰 × 𝒱 determined by the density p_{Û,V̂} that is obtained by tilting (using parameter α) the densities p_{U,V}, p_{U*} and p_V as follows:
\[
p_{\hat U, \hat V}(u,v) = \frac{ \bigl(p_{U,V}(u,v)\bigr)^{\alpha} \bigl(p_{U^*}(u)\, p_V(v)\bigr)^{1-\alpha} }{ \displaystyle\int\!\!\int \bigl(p_{U,V}(u,v)\bigr)^{\alpha} \bigl(p_{U^*}(u)\, p_V(v)\bigr)^{1-\alpha}\, du\, dv }
\]
for α ∈ (0, 1).
We state the following lemma based on the variational representation of the Sibson mutual information [9] (Theorem 2), which establishes a connection to the KL divergence. The proof of the lemma follows from [17] (Lemma 3) and [36] (Theorem 5.1).
Lemma 4. 
For the distributions P_{Û,V̂}, P_{U,V} and P_{U*} ⊗ P_V, we have
\[
\alpha\, D_{\mathrm{KL}}\bigl(P_{\hat U,\hat V} \,\|\, P_{U,V}\bigr) + (1-\alpha)\, D_{\mathrm{KL}}\bigl(P_{\hat U,\hat V} \,\|\, P_{U^*} \otimes P_V\bigr) = (1-\alpha)\, I_\alpha^S\bigl(P_{U,V}, P_{U^*}\bigr).
\]
We now invoke a basic but important property of sub-Gaussian random variables that will be used later in our analysis. Specifically, the set of all sub-Gaussian random variables has a linear structure. This property is well established in the literature [39,40].
Lemma 5. 
If X is a σ_X²-sub-Gaussian random variable, then for any α ∈ R, the random variable αX is α²σ_X²-sub-Gaussian. If Y is a σ_Y²-sub-Gaussian random variable, then the sum X + Y is sub-Gaussian with parameter (σ_X + σ_Y)².
We next provide the following lemma, whose proof is a slight generalization of [17] (Theorem 4) and [14] (Lemma 1). The proof is given in Appendix B.
Lemma 6. 
Given a function h : 𝒰 × 𝒱 → R, assume that h(u, V) is γ²(u)-sub-Gaussian under P_V for all u ∈ 𝒰 and that h(U, V) is (σ²/4)-sub-Gaussian under both P_U ⊗ P_V and P_{U,V}. Assume also that log E_{P_{U*}}[e^{γ²(U*)}] < ∞. Then for α ∈ (0, 1),
\[
\Bigl| \mathbb{E}_{P_{U,V}}[h(U,V)] - \mathbb{E}_{P_U \otimes P_V}[h(U,V)] \Bigr| \le \sqrt{ \frac{2\Bigl( (1-\alpha)\,\sigma^2 + \alpha \log \mathbb{E}_{P_{U^*}}\bigl[e^{\gamma^2(U^*)}\bigr] \Bigr)\, I_\alpha^S\bigl(P_{U,V}, P_{U^*}\bigr)}{\alpha} }.
\]
We next use Lemma 6 to derive our upper bound on the excess minimum risk in terms of the Sibson mutual information. The proof of the theorem is in Appendix B.
Theorem 3. 
Let X, Y and Z be random vectors such that Y → X → Z form a Markov chain as described in Section 2.1. Assume that there exists an optimal estimator f of Y from X such that l(y, f(X)) is conditionally γ²(y)-sub-Gaussian under P_{X|Z} for all y ∈ R^p, where log E_{P_Z P_{Y*|Z}}[e^{γ²(Y*)}] < ∞, and l(Y, f(X)) is conditionally (σ²/4)-sub-Gaussian under both P_{Y|Z} ⊗ P_{X|Z} and P_{Y,X|Z}, i.e., for all λ ∈ R,
\[
\log \mathbb{E}_{P_{X|Z}}\Bigl[ e^{\lambda\bigl( l(y,f(X)) - \mathbb{E}_{P_{X|Z}}[l(y,f(X))] \bigr)} \Bigr] \le \frac{\gamma^2(y)\lambda^2}{2}
\]
for all y ∈ R^p,
\[
\log \mathbb{E}_{P_{Y|Z} \otimes P_{X|Z}}\Bigl[ e^{\lambda\bigl( l(Y,f(X)) - \mathbb{E}_{P_{Y|Z} \otimes P_{X|Z}}[l(Y,f(X))] \bigr)} \Bigr] \le \frac{\sigma^2\lambda^2}{8}
\]
and
\[
\log \mathbb{E}_{P_{Y,X|Z}}\Bigl[ e^{\lambda\bigl( l(Y,f(X)) - \mathbb{E}_{P_{Y,X|Z}}[l(Y,f(X))] \bigr)} \Bigr] \le \frac{\sigma^2\lambda^2}{8}.
\]
Then for α ∈ (0, 1), the excess minimum risk satisfies
\[
L_l^*(Y|Z) - L_l^*(Y|X) \le \sqrt{ \frac{2\Bigl( (1-\alpha)\,\sigma^2 + \alpha\, \mathbb{E}_{P_Z}\bigl[ \Phi_{Y^*|Z}\bigl(\gamma^2(Y^*)\bigr) \bigr] \Bigr)}{\alpha}\, \mathbb{E}_{P_Z}\Bigl[ I_\alpha^S\bigl(P_{Y,X|Z}, P_{Y^*|Z}\bigr) \Bigr] },
\]
where Φ_{P_U}(V) = log E_{P_U}[e^V] and the distribution P_{Y*|Z} has density
\[
p_{Y^*|Z}(y|z) = \frac{ \left[ \displaystyle\int \left( \frac{p_{Y,X|Z}(y,x|z)}{p_{Y|Z}(y|z)\, p_{X|Z}(x|z)} \right)^{\alpha} p_{X|Z}(x|z)\, dx \right]^{\frac{1}{\alpha}} p_{Y|Z}(y|z) }{ \displaystyle\int \left[ \displaystyle\int \left( \frac{p_{Y,X|Z}(y,x|z)}{p_{Y|Z}(y|z)\, p_{X|Z}(x|z)} \right)^{\alpha} p_{X|Z}(x|z)\, dx \right]^{\frac{1}{\alpha}} p_{Y|Z}(y|z)\, dy }.
\]
Setting γ²(Y*) = σ² and taking the limit as α → 1 on the right-hand side of (37) recovers the mutual information-based bound (27) of [14] (Theorem 3) in the case of a constant sub-Gaussian parameter. We conclude this section by presenting a specialization of Theorem 3 to the case of bounded loss functions.
Corollary 3. 
Suppose the loss function l is bounded. Then for random vectors X, Y and Z such that Y → X → Z as described in Section 2.1, we have the following inequality for α ∈ (0, 1) on the excess minimum risk:
\[
L_l^*(Y|Z) - L_l^*(Y|X) \le \frac{\|l\|_\infty}{\sqrt{2}} \sqrt{ \frac{(4-3\alpha)}{\alpha}\, \mathbb{E}_{P_Z}\Bigl[ I_\alpha^S\bigl(P_{Y,X|Z}, P_{Y^*|Z}\bigr) \Bigr] }.
\]

3.4. Comparison of Proposed Upper Bounds

In this section, we give a simple comparison of the upper bounds based on the α -Jensen–Shannon divergence (Theorem 2) with those based on the Rényi divergence (Theorem 1) and the Sibson mutual information (Theorem 3) for bounded loss functions. Similar to [17] (Proposition 8), we provide a simple condition under which the upper bound based on the α -Jensen–Shannon divergence is tighter than those obtained using the other two divergence measures.
Proposition 1. 
Suppose the loss function is bounded. Let X, Y, and Z be random vectors such that Y → X → Z, as described in Section 2.1. Then, for any α ∈ (0, 1), the α-Jensen–Shannon divergence-based bound on the excess minimum risk,
\[
\frac{\|l\|_\infty}{\sqrt{2}} \sqrt{ \frac{JS_\alpha\bigl(P_{Y,X|Z} \,\|\, P_{Y|Z} \otimes P_{X|Z} \,\big|\, P_Z\bigr)}{\alpha(1-\alpha)} },
\]
is no larger than both the Rényi divergence-based bound,
\[
\frac{\|l\|_\infty}{\sqrt{2}} \sqrt{ \frac{D_\alpha\bigl(P_{X|Y,Z} \,\|\, P_{X|Z} \mid P_{Y,Z}\bigr)}{\alpha} },
\]
and the Sibson mutual information-based bound,
\[
\frac{\|l\|_\infty}{\sqrt{2}} \sqrt{ \frac{(4-3\alpha)}{\alpha}\, \mathbb{E}_{P_Z}\Bigl[ I_\alpha^S\bigl(P_{Y,X|Z}, P_{Y^*|Z}\bigr) \Bigr] },
\]
provided that
\[
\frac{h_b(\alpha)}{1-\alpha} \le D_\alpha\bigl(P_{X|Y,Z} \,\|\, P_{X|Z} \mid P_{Y,Z}\bigr) \quad\text{and}\quad \frac{h_b(\alpha)}{1-\alpha} \le \mathbb{E}_{P_Z}\Bigl[ I_\alpha^S\bigl(P_{Y,X|Z}, P_{Y^*|Z}\bigr) \Bigr],
\]
where h_b(α) = −α log α − (1−α) log(1−α) is the binary entropy function.
Remark 5. 
The function g(α) = h_b(α)/(1−α) is strictly increasing for α ∈ (0, 1), with lim_{α→0} g(α) = 0 and lim_{α→1} g(α) = ∞. On the other hand, at least for finite alphabets, the information quantities on the right-hand sides of the inequalities in (40) in general converge to a positive constant as α → 0. In this case, there always exists an α* ∈ (0, 1) such that the inequalities in (40) are satisfied for all 0 < α ≤ α*.
Proof. 
It is known that the α-Jensen–Shannon divergence is bounded above by the binary entropy h_b(α), with equality if and only if P and Q are mutually singular [3,41]. Applying this, we obtain
\[
JS_\alpha\bigl(P_{Y,X|Z} \,\|\, P_{Y|Z} \otimes P_{X|Z} \,\big|\, P_Z\bigr) = \mathbb{E}_{P_Z}\Bigl[ JS_\alpha\bigl(P_{Y,X|Z}(\cdot \mid Z) \,\|\, P_{Y|Z}(\cdot \mid Z) \otimes P_{X|Z}(\cdot \mid Z)\bigr) \Bigr] \le \mathbb{E}_{P_Z}\bigl[ h_b(\alpha) \bigr] = h_b(\alpha).
\]
Consequently, we obtain the bound
\[
\frac{\|l\|_\infty}{\sqrt{2}} \sqrt{ \frac{JS_\alpha\bigl(P_{Y,X|Z} \,\|\, P_{Y|Z} \otimes P_{X|Z} \,\big|\, P_Z\bigr)}{\alpha(1-\alpha)} } \le \frac{\|l\|_\infty}{\sqrt{2}} \sqrt{ \frac{h_b(\alpha)}{\alpha(1-\alpha)} }.
\]
Therefore, under the assumption that
\[
\frac{h_b(\alpha)}{1-\alpha} \le D_\alpha\bigl(P_{X|Y,Z} \,\|\, P_{X|Z} \mid P_{Y,Z}\bigr) \quad\text{and}\quad \frac{h_b(\alpha)}{1-\alpha} \le \mathbb{E}_{P_Z}\Bigl[ I_\alpha^S\bigl(P_{Y,X|Z}, P_{Y^*|Z}\bigr) \Bigr],
\]
we conclude that the α-Jensen–Shannon bound is tighter than the other two bounds:
\[
\begin{aligned}
\frac{\|l\|_\infty}{\sqrt{2}} \sqrt{ \frac{JS_\alpha\bigl(P_{Y,X|Z} \,\|\, P_{Y|Z} \otimes P_{X|Z} \,\big|\, P_Z\bigr)}{\alpha(1-\alpha)} } &\le \frac{\|l\|_\infty}{\sqrt{2}} \sqrt{ \frac{D_\alpha\bigl(P_{X|Y,Z} \,\|\, P_{X|Z} \mid P_{Y,Z}\bigr)}{\alpha} }, \\
\frac{\|l\|_\infty}{\sqrt{2}} \sqrt{ \frac{JS_\alpha\bigl(P_{Y,X|Z} \,\|\, P_{Y|Z} \otimes P_{X|Z} \,\big|\, P_Z\bigr)}{\alpha(1-\alpha)} } &\le \frac{\|l\|_\infty}{\sqrt{2}} \sqrt{ \frac{(4-3\alpha)}{\alpha}\, \mathbb{E}_{P_Z}\Bigl[ I_\alpha^S\bigl(P_{Y,X|Z}, P_{Y^*|Z}\bigr) \Bigr] }. \qquad\square
\end{aligned}
\]
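As a simple numerical companion to Remark 5, the Python sketch below (our illustration; the divergence value D is hypothetical) evaluates g(α) = h_b(α)/(1−α) on a grid and locates the largest α for which a sufficient condition of the form in (40) holds.

import numpy as np

def g(alpha):
    """g(alpha) = h_b(alpha) / (1 - alpha), the threshold function of Remark 5 (natural log)."""
    hb = -alpha * np.log(alpha) - (1 - alpha) * np.log(1 - alpha)
    return hb / (1 - alpha)

# Hypothetical value for the divergence on the right-hand side of (40).
D = 0.5
alphas = np.linspace(1e-3, 1 - 1e-3, 1000)
alpha_star = alphas[g(alphas) <= D].max()   # largest grid point satisfying g(alpha) <= D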

4. Numerical Results

In this section, we present three examples where some of the proposed information divergence-based bounds outperform the mutual information-based bound. The first example considers a concatenated q-ary symmetric channel with a bounded loss function. The remaining two examples involve Gaussian additive noise channels and loss functions with non-constant sub-Gaussian parameters.
Example 1. 
We consider a concatenation of two q-ary symmetric channels, with input Y and noise variables U_1 and U_2, all taking values in {0, 1, …, q−1}. We assume that Y, U_1 and U_2 are independent. The input Y has distribution p = [p_0, p_1, …, p_{q−1}], while the noise variables U_1 and U_2 are governed by P(U_i = 0) = 1 − ε_i and P(U_i = a) = ε_i/(q−1) for all a ∈ {1, …, q−1} and i = 1, 2, where ε_1, ε_2 ∈ (0, 1) are the crossover probabilities. The output X of the first channel is given by
\[
X = (Y + U_1) \bmod q
\]
and serves as the input to the second channel. The final output Z is then given by
\[
Z = (X + U_2) \bmod q,
\]
which can also be written as Z = (Y + U_1 + U_2) mod q. This construction naturally induces the Markov chain Y → X → Z.
Using a 0–1 loss function (defined as l(y, y′) = 1(y ≠ y′), where 1(·) denotes the indicator function), we compute the bounds in Corollaries 1 and 2, corresponding to Equations (28) and (35), respectively, as functions of α ∈ (0, 1). Figure 1 compares the Rényi-based bound (28) and the α-Jensen–Shannon-based bound (35) with the mutual information-based bound (27). Among the two, the α-Jensen–Shannon-based bound consistently performs the best over a wide range of α values. Moreover, as q increases in the q-ary symmetric channel, both the interval of α for which the proposed bounds outperform the mutual information-based bound and the magnitude of improvement become more pronounced. For this example, we set ε_1 = 0.15 and ε_2 = 0.05. For q = 10, 100, 200, we generate input distributions by sampling from a symmetric (i.e., with identical parameters) Dirichlet distribution on R^q. Using a Dirichlet parameter greater than one gives balanced distributions that avoid placing too much weight on any single symbol. For q = 2, 3, 5, the input distributions are explicitly specified in the figure captions.
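The computations behind Figure 1 can be reproduced along the following lines. The Python sketch below is our own illustration (the function names and the sample input distribution are not from the paper); it assumes full-support distributions, uses ‖l‖_∞ = 1 for the 0–1 loss, and evaluates the mutual information-based bound (27), the Rényi divergence-based bound (28), and the α-Jensen–Shannon-based bound (35) for the concatenated q-ary symmetric channels.

import numpy as np

def renyi_div(p, q, alpha):
    return np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1)

def kl_div(p, q):
    m = p > 0
    return np.sum(p[m] * np.log(p[m] / q[m]))

def js_alpha_div(p, q, alpha):
    mix = alpha * p + (1 - alpha) * q
    return alpha * kl_div(p, mix) + (1 - alpha) * kl_div(q, mix)

def example1_bounds(p_y, eps1, eps2, alpha):
    q = len(p_y)
    def channel(eps):   # q-ary symmetric channel matrix W[a, b] = P(output=b | input=a)
        W = np.full((q, q), eps / (q - 1))
        np.fill_diagonal(W, 1 - eps)
        return W
    W1, W2 = channel(eps1), channel(eps2)
    p_yx = p_y[:, None] * W1                       # P(Y=y, X=x)
    p_yxz = p_yx[:, :, None] * W2[None, :, :]      # P(Y=y, X=x, Z=z)
    p_z = p_yxz.sum(axis=(0, 1))
    p_yz = p_yxz.sum(axis=1)
    p_xz = p_yxz.sum(axis=0)
    # mutual information-based bound (27): sqrt((I(X;Y) - I(Z;Y)) / 2) with ||l||_inf = 1
    p_x = p_yx.sum(axis=0)
    i_xy = kl_div(p_yx.ravel(), np.outer(p_y, p_x).ravel())
    i_zy = kl_div(p_yz.ravel(), np.outer(p_y, p_z).ravel())
    mi_bound = np.sqrt((i_xy - i_zy) / 2)
    # conditional Renyi divergence D_alpha(P_{X|Y,Z} || P_{X|Z} | P_{Y,Z}) for bound (28)
    d_renyi = 0.0
    for y in range(q):
        for z in range(q):
            if p_yz[y, z] > 0:
                p_x_yz = p_yxz[y, :, z] / p_yz[y, z]
                p_x_z = p_xz[:, z] / p_z[z]
                d_renyi += p_yz[y, z] * renyi_div(p_x_yz, p_x_z, alpha)
    renyi_bound = np.sqrt(d_renyi / (2 * alpha))
    # conditional alpha-JS divergence JS_alpha(P_{Y,X|Z} || P_{Y|Z} P_{X|Z} | P_Z) for bound (35)
    js = 0.0
    for z in range(q):
        p_yx_z = p_yxz[:, :, z] / p_z[z]
        prod = np.outer(p_yz[:, z] / p_z[z], p_xz[:, z] / p_z[z])
        js += p_z[z] * js_alpha_div(p_yx_z.ravel(), prod.ravel(), alpha)
    js_bound = np.sqrt(js / (2 * alpha * (1 - alpha)))
    return mi_bound, renyi_bound, js_bound

# example usage (hypothetical input distribution with q = 3):
# mi_b, re_b, js_b = example1_bounds(np.array([0.2, 0.3, 0.5]), 0.15, 0.05, alpha=0.3)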
Finally, we note that in this example, the specialized bound for bounded loss functions derived from the Sibson mutual information in Corollary 3 does not offer any improvement over the standard mutual information-based bound (27) and is, therefore, not presented. In the next two examples, we compare the α-Jensen–Shannon-based bound of Theorem 2 with the mutual information-based bound for loss functions with non-constant sub-Gaussian parameters.
Example 2. 
Consider a Gaussian additive noise channel with input Y and noise random variables W_1 and W_2, where Y ∼ N(0, σ̂²), W_1 ∼ N(0, σ_1²), and W_2 ∼ N(0, σ_2²). Assume that Y is independent of (W_1, W_2) and W_1 is independent of W_2. Define
\[
X = Y + W_1
\]
and
\[
Z = X + W_2 = Y + W_1 + W_2,
\]
inducing the Markov chain Y → X → Z.
We consider the loss function l(y, y′) = min{|y − y′|, |y − c|} for some c > 0. For this model, we observe that
\[
l(y, f^*(X)) = \min\{ |y - f^*(X)|,\, |y - c| \} \le |y - c| \le |y| + |c| = |y| + c,
\]
where f* denotes the optimal estimator of Y from X. Thus, l(y, f*(X)) is a non-negative random variable that is almost surely bounded by |y| + c. By Hoeffding's lemma, it follows that this loss is conditionally σ²(y)-sub-Gaussian under P_{X|Z}, P_{X|Z,Y=y} and P^{(α)}_{X|Z,Y=y} for all y ∈ R^p, with
\[
\sigma^2(y) = \frac{(|y| + c)^2}{4}.
\]
Furthermore, for σ̂² = 1 and σ_i² = 1, i = 1, 2, we have that
\[
\mathbb{E}[\sigma^2(Y)] = \mathbb{E}\!\left[ \frac{(|Y| + c)^2}{4} \right] = \frac{1}{4}\,\mathbb{E}\bigl[ |Y|^2 + c^2 + 2|Y|c \bigr] = \frac{1 + c^2 + 2c\sqrt{2/\pi}}{4}.
\]
Hence, the conditions of Theorem 2 are satisfied. Figure 2 compares the α-Jensen–Shannon-based bound in (33) with the mutual information-based bound in (27) for c = 1. We observe that the α-Jensen–Shannon-based bound is tighter for values of α approximately in the range (0, 0.3).
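The expected sub-Gaussian parameter above is easy to verify by simulation; the following short Python check (illustrative only, with c = 1 and our own variable names) compares a Monte Carlo estimate of E[(|Y| + c)²]/4 with the closed form.

import numpy as np

rng = np.random.default_rng(0)
c = 1.0
y = rng.standard_normal(1_000_000)                     # samples of Y ~ N(0, 1)
mc_estimate = np.mean((np.abs(y) + c) ** 2) / 4        # Monte Carlo estimate of E[sigma^2(Y)]
closed_form = (1 + c**2 + 2 * c * np.sqrt(2 / np.pi)) / 4
print(mc_estimate, closed_form)                        # both approximately 0.899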
Example 3. 
Consider a Gaussian additive noise model with input Z ∼ N(0, σ̂²) and two noise variables W_1 ∼ N(0, σ_1²) and W_2 ∼ N(0, σ_2²), all mutually independent. Let X = Z + W_1 and Y = X + W_2 = Z + W_1 + W_2, inducing the Markov chain Z → X → Y, which is equivalent to the Markov chain Y → X → Z.
We again consider the loss function l(y, y′) = min{|y − y′|, |y − c|} for some c > 0, and observe that l(y, f*(X)) ≤ |y| + c, where f* is the optimal estimator of Y from X. Hence, the loss is (conditionally) σ²(y)-sub-Gaussian as in the previous example with σ²(y) = (|y| + c)²/4. For σ̂² = 2, σ_1² = 39 and σ_2² = 1, the expected sub-Gaussian parameter is
\[
\mathbb{E}[\sigma^2(Y)] = \frac{42 + c^2 + 2c\sqrt{84/\pi}}{4}.
\]
Therefore, the conditions of Theorem 2 continue to hold.
In contrast to the previous example, where Y was the input and Z the degraded observation, this example reverses that direction. Figure 3 compares the α-Jensen–Shannon-based bound in (33) with the mutual information-based bound in (27) for c = 1. We observe that the α-Jensen–Shannon-based bound is tighter for values of α approximately in the range (0, 0.7).

5. Conclusions

In this paper, we studied the problem of bounding the excess minimum risk in statistical inference using generalized information divergence measures. Our results extend the mutual information-based bound in [14] by developing a family of bounds parameterized by the order α ∈ (0, 1), involving the Rényi divergence, the α-Jensen–Shannon divergence, and Sibson's mutual information. For the Rényi divergence-based bounds, we employed the variational representation of the divergence, following the approach in [11], and for the α-Jensen–Shannon and Sibson-based bounds, we adopted the auxiliary distribution method introduced in [17].
Unlike the bounds in [11,17], which assume the sub-Gaussian parameter to be constant, our framework allows this parameter to depend on the (target) random vector, thereby making the bounds applicable to a broader class of joint distributions. We demonstrated the effectiveness of our approach through three numerical examples: one involving concatenated discrete q-ary symmetric channels, and two based on additive Gaussian noise channels. In all cases, we observed that at least one of our α-parametric bounds is tighter than the mutual information-based bound over certain ranges of α, with the improvements becoming more pronounced in the discrete example as the channel alphabet size q increased.
Future directions include exploring bounds under alternative f-divergence measures, developing tighter bounds for high-dimensional settings, and determining divergence rates in infinite-dimensional cases.

Author Contributions

Conceptualization, investigation and manuscript preparation, all authors; formal analysis and derivation, A.O.; validation and supervision, F.A. and T.L.; numerical simulation, A.O. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

This work was presented in part at the International Symposium on Information Theory and Its Applications, Taipei, Taiwan, November 2024 [42].

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Proof of Lemma 3. 
Assume that the function h(u, V) is σ²(u)-sub-Gaussian under the distribution P^{(α)}_{V|U=u} for all u ∈ 𝒰. Then, by applying the Donsker–Varadhan representation [33] for D_KL(P_{V|U=u} ∥ P^{(α)}_{V|U=u}), we obtain
\[
D_{\mathrm{KL}}\bigl(P_{V|U=u} \,\|\, P^{(\alpha)}_{V|U=u}\bigr) \ge \mathbb{E}_{P_{V|U=u}}[\lambda h(u,V)] - \log \mathbb{E}_{P^{(\alpha)}_{V|U=u}}\bigl[e^{\lambda h(u,V)}\bigr] \ge \mathbb{E}_{P_{V|U=u}}[\lambda h(u,V)] - \mathbb{E}_{P^{(\alpha)}_{V|U=u}}[\lambda h(u,V)] - \frac{\lambda^2 \sigma^2(u)}{2},
\]
for all λ ∈ R. Rearranging terms, we obtain for all λ ∈ R and u ∈ 𝒰 that
\[
\lambda\Bigl( \mathbb{E}_{P_{V|U=u}}[h(u,V)] - \mathbb{E}_{P^{(\alpha)}_{V|U=u}}[h(u,V)] \Bigr) \le D_{\mathrm{KL}}\bigl(P_{V|U=u} \,\|\, P^{(\alpha)}_{V|U=u}\bigr) + \frac{\lambda^2 \sigma^2(u)}{2}.
\]
Similarly, using the assumption that the function h(u, V) is σ²(u)-sub-Gaussian under P^{(α)}_{V|U=u} for all u ∈ 𝒰 and the Donsker–Varadhan representation [33] for D_KL(P_V ∥ P^{(α)}_{V|U=u}), we have for all λ′ ∈ R that
\[
\lambda'\Bigl( \mathbb{E}_{P_V}[h(u,V)] - \mathbb{E}_{P^{(\alpha)}_{V|U=u}}[h(u,V)] \Bigr) \le D_{\mathrm{KL}}\bigl(P_V \,\|\, P^{(\alpha)}_{V|U=u}\bigr) + \frac{\lambda'^2 \sigma^2(u)}{2}.
\]
If λ < 0, then taking λ′ = αλ/(α−1) > 0 yields that
\[
\mathbb{E}_{P^{(\alpha)}_{V|U=u}}[h(u,V)] - \mathbb{E}_{P_{V|U=u}}[h(u,V)] \le \frac{D_{\mathrm{KL}}\bigl(P_{V|U=u} \,\|\, P^{(\alpha)}_{V|U=u}\bigr)}{|\lambda|} + \frac{|\lambda|\, \sigma^2(u)}{2},
\]
and
\[
\mathbb{E}_{P_V}[h(u,V)] - \mathbb{E}_{P^{(\alpha)}_{V|U=u}}[h(u,V)] \le \frac{D_{\mathrm{KL}}\bigl(P_V \,\|\, P^{(\alpha)}_{V|U=u}\bigr)}{\lambda'} + \frac{\lambda'\, \sigma^2(u)}{2}.
\]
Adding (A2) and (A3) yields for all λ < 0 that
\[
\begin{aligned}
\mathbb{E}_{P_V}[h(u,V)] - \mathbb{E}_{P_{V|U=u}}[h(u,V)]
&\le \frac{D_{\mathrm{KL}}\bigl(P_{V|U=u} \,\|\, P^{(\alpha)}_{V|U=u}\bigr)}{|\lambda|} + \frac{|\lambda|\, \sigma^2(u)}{2} + \frac{D_{\mathrm{KL}}\bigl(P_V \,\|\, P^{(\alpha)}_{V|U=u}\bigr)}{\lambda'} + \frac{\lambda'\, \sigma^2(u)}{2} \\
&= \frac{D_{\mathrm{KL}}\bigl(P_{V|U=u} \,\|\, P^{(\alpha)}_{V|U=u}\bigr)}{|\lambda|} + \frac{|\lambda|\, \sigma^2(u)}{2} + \frac{(1-\alpha)\, D_{\mathrm{KL}}\bigl(P_V \,\|\, P^{(\alpha)}_{V|U=u}\bigr)}{\alpha |\lambda|} + \frac{\alpha |\lambda|\, \sigma^2(u)}{2(1-\alpha)} \\
&= \frac{\alpha\, D_{\mathrm{KL}}\bigl(P_{V|U=u} \,\|\, P^{(\alpha)}_{V|U=u}\bigr) + (1-\alpha)\, D_{\mathrm{KL}}\bigl(P_V \,\|\, P^{(\alpha)}_{V|U=u}\bigr)}{\alpha |\lambda|} + \frac{|\lambda|\, \sigma^2(u)}{2(1-\alpha)}.
\end{aligned}
\]
Similarly, for λ > 0, taking λ′ = αλ/(α−1) < 0, we have
\[
\mathbb{E}_{P^{(\alpha)}_{V|U=u}}[h(u,V)] - \mathbb{E}_{P_{V|U=u}}[h(u,V)] \ge -\frac{D_{\mathrm{KL}}\bigl(P_{V|U=u} \,\|\, P^{(\alpha)}_{V|U=u}\bigr)}{\lambda} - \frac{\lambda\, \sigma^2(u)}{2},
\]
and
\[
\mathbb{E}_{P_V}[h(u,V)] - \mathbb{E}_{P^{(\alpha)}_{V|U=u}}[h(u,V)] \ge -\frac{D_{\mathrm{KL}}\bigl(P_V \,\|\, P^{(\alpha)}_{V|U=u}\bigr)}{|\lambda'|} - \frac{|\lambda'|\, \sigma^2(u)}{2}.
\]
Adding (A5) and (A6) yields for all λ > 0 that
\[
\mathbb{E}_{P_V}[h(u,V)] - \mathbb{E}_{P_{V|U=u}}[h(u,V)] \ge -\frac{\alpha\, D_{\mathrm{KL}}\bigl(P_{V|U=u} \,\|\, P^{(\alpha)}_{V|U=u}\bigr) + (1-\alpha)\, D_{\mathrm{KL}}\bigl(P_V \,\|\, P^{(\alpha)}_{V|U=u}\bigr)}{\alpha \lambda} - \frac{\lambda\, \sigma^2(u)}{2(1-\alpha)}.
\]
From (A4) and (A7), we obtain the following non-negative parabola in terms of λ:
\[
\frac{\lambda^2 \sigma^2(u)}{2(1-\alpha)} + \lambda\Bigl( \mathbb{E}_{P_V}[h(u,V)] - \mathbb{E}_{P_{V|U=u}}[h(u,V)] \Bigr) + \frac{\alpha\, D_{\mathrm{KL}}\bigl(P_{V|U=u} \,\|\, P^{(\alpha)}_{V|U=u}\bigr) + (1-\alpha)\, D_{\mathrm{KL}}\bigl(P_V \,\|\, P^{(\alpha)}_{V|U=u}\bigr)}{\alpha} \ge 0,
\]
where (A8) holds trivially for λ = 0. Thus, its discriminant is non-positive, and we have for all α ∈ (0, 1) that
\[
\Bigl| \mathbb{E}_{P_V}[h(u,V)] - \mathbb{E}_{P_{V|U=u}}[h(u,V)] \Bigr|^2 \le \frac{2\sigma^2(u)\Bigl( \alpha\, D_{\mathrm{KL}}\bigl(P_{V|U=u} \,\|\, P^{(\alpha)}_{V|U=u}\bigr) + (1-\alpha)\, D_{\mathrm{KL}}\bigl(P_V \,\|\, P^{(\alpha)}_{V|U=u}\bigr) \Bigr)}{\alpha(1-\alpha)}.
\]
Hence,
\[
\Bigl| \mathbb{E}_{P_V}[h(u,V)] - \mathbb{E}_{P_{V|U=u}}[h(u,V)] \Bigr| \le \sqrt{ \frac{2\sigma^2(u)\Bigl( \alpha\, D_{\mathrm{KL}}\bigl(P_{V|U=u} \,\|\, P^{(\alpha)}_{V|U=u}\bigr) + (1-\alpha)\, D_{\mathrm{KL}}\bigl(P_V \,\|\, P^{(\alpha)}_{V|U=u}\bigr) \Bigr)}{\alpha(1-\alpha)} }.
\]
As a result, we obtain that
\[
\begin{aligned}
\Bigl| \mathbb{E}_{P_{U,V}}[h(U,V)] - \mathbb{E}_{P_U \otimes P_V}[h(U,V)] \Bigr|
&= \left| \int \Bigl( \mathbb{E}_{P_V}[h(u,V)] - \mathbb{E}_{P_{V|U=u}}[h(u,V)] \Bigr)\, P_U(du) \right| \\
&\le \int \Bigl| \mathbb{E}_{P_V}[h(u,V)] - \mathbb{E}_{P_{V|U=u}}[h(u,V)] \Bigr|\, P_U(du) \\
&\le \int \sqrt{ \frac{2\sigma^2(u)\bigl( \alpha D_{\mathrm{KL}}(P_{V|U=u} \| P^{(\alpha)}_{V|U=u}) + (1-\alpha) D_{\mathrm{KL}}(P_V \| P^{(\alpha)}_{V|U=u}) \bigr)}{\alpha(1-\alpha)} }\, P_U(du) \\
&\le \sqrt{ \int 2\sigma^2(u)\, P_U(du) }\, \sqrt{ \int \frac{\alpha D_{\mathrm{KL}}(P_{V|U=u} \| P^{(\alpha)}_{V|U=u}) + (1-\alpha) D_{\mathrm{KL}}(P_V \| P^{(\alpha)}_{V|U=u})}{\alpha(1-\alpha)}\, P_U(du) } \\
&= \sqrt{ \frac{2\,\mathbb{E}[\sigma^2(U)]\bigl( \alpha D_{\mathrm{KL}}(P_{U,V} \| P^{(\alpha)}_{U,V}) + (1-\alpha) D_{\mathrm{KL}}(P_U \otimes P_V \| P^{(\alpha)}_{U,V}) \bigr)}{\alpha(1-\alpha)} } \\
&= \sqrt{ \frac{2\,\mathbb{E}[\sigma^2(U)]\, JS_\alpha\bigl(P_{U,V} \,\|\, P_U \otimes P_V\bigr)}{\alpha(1-\alpha)} },
\end{aligned}
\]
where (A10) follows from Jensen's inequality, (A11) follows from (A9), (A12) follows from the Cauchy–Schwarz inequality, (A13) follows from the definition of the conditional KL divergence and Definition 9, and (A14) follows from the definition of the α-Jensen–Shannon divergence. □
Proof of Theorem 2. 
Let X̄, Ȳ and Z̄ be the same random variables defined in Theorem 1. Similar to the proof of Theorem 1, we apply Lemma 3 by setting U = Y, V = X and h(u, v) = l(y, f(x)), and taking regular expectations with respect to P_{Y,X|Z=z} and P_{Ȳ,X̄|Z=z}. Since Ȳ and X̄ are conditionally independent given Z̄ = z such that P_{Ȳ,X̄|Z=z} = P_{Y|Z} ⊗ P_{X|Z} and P_{Z̄} = P_Z, we have that
\[
\bigl| \mathbb{E}_{P_{Y,X|Z=z}}[l(Y,f(X))] - \mathbb{E}_{P_{\bar Y,\bar X|Z=z}}[l(\bar Y,f(\bar X))] \bigr| = \bigl| \mathbb{E}_{P_{Y,X|Z=z}}[l(Y,f(X))] - \mathbb{E}_{P_{Y|Z=z} \otimes P_{X|Z=z}}[l(Y,f(X))] \bigr| \le \sqrt{ \frac{2\,\mathbb{E}[\sigma^2(Y)\mid Z=z]}{\alpha(1-\alpha)}\, JS_\alpha\bigl(P_{Y,X|Z=z} \,\|\, P_{Y|Z=z} \otimes P_{X|Z=z}\bigr) }.
\]
Thus,
\[
\begin{aligned}
\bigl| \mathbb{E}_{P_{Y,X}}[l(Y,f(X))] - \mathbb{E}_{P_{\bar Y,\bar X}}[l(\bar Y,f(\bar X))] \bigr|
&\le \int \bigl| \mathbb{E}_{P_{Y,X|Z=z}}[l(Y,f(X))] - \mathbb{E}_{P_{\bar Y,\bar X|Z=z}}[l(\bar Y,f(\bar X))] \bigr|\, P_Z(dz) \\
&\le \int \sqrt{ \frac{2\,\mathbb{E}[\sigma^2(Y)\mid Z=z]}{\alpha(1-\alpha)}\, JS_\alpha\bigl(P_{Y,X|Z=z} \,\|\, P_{Y|Z=z} \otimes P_{X|Z=z}\bigr) }\; P_Z(dz) \\
&\le \sqrt{ \int 2\,\mathbb{E}[\sigma^2(Y)\mid Z=z]\, P_Z(dz) }\, \sqrt{ \int \frac{JS_\alpha\bigl(P_{Y,X|Z=z} \,\|\, P_{Y|Z=z} \otimes P_{X|Z=z}\bigr)}{\alpha(1-\alpha)}\, P_Z(dz) } \\
&= \sqrt{ \frac{2\,\mathbb{E}[\sigma^2(Y)]}{\alpha(1-\alpha)}\, JS_\alpha\bigl(P_{Y,X|Z} \,\|\, P_{Y|Z} \otimes P_{X|Z} \,\big|\, P_Z\bigr) },
\end{aligned}
\]
where the first inequality follows from Jensen's inequality and since P_{Z̄} = P_Z, the second inequality follows from (A15), the third from the Cauchy–Schwarz inequality, and the equality follows from (5). From the proof of Theorem 1, we know that Ȳ → Z̄ → X̄ forms a Markov chain. Hence,
\[
\mathbb{E}_{P_{\bar Y,\bar X}}[l(\bar Y,f(\bar X))] \ge L_l^*(Y|Z).
\]
Since f is an optimal estimator of Y from X, we also have
\[
\mathbb{E}_{P_{Y,X}}[l(Y,f(X))] = L_l^*(Y|X).
\]
Therefore, using (A17) and (A18) in (A16), combined with the fact that L_l^*(Y|Z) ≥ L_l^*(Y|X), we arrive at the desired inequality:
\[
L_l^*(Y|Z) - L_l^*(Y|X) \le \sqrt{ \frac{2\,\mathbb{E}[\sigma^2(Y)]}{\alpha(1-\alpha)}\, JS_\alpha\bigl(P_{Y,X|Z} \,\|\, P_{Y|Z} \otimes P_{X|Z} \,\big|\, P_Z\bigr) }. \qquad\square
\]

Appendix B

Proof of Lemma 6. 
We first show that h(U,V) − E_{P_V}[h(U,V)] is σ²-sub-Gaussian under P_{U,V}. By assumption, h(U,V) is (σ²/4)-sub-Gaussian under P_{U,V}. It remains to show that the term E_{P_V}[h(U,V)] is also (σ²/4)-sub-Gaussian under P_{U,V}. Observe that
\[
\begin{aligned}
\mathbb{E}_{P_{U,V}}\Bigl[ e^{\lambda\bigl( \mathbb{E}_{P_V}[h(U,V)] - \mathbb{E}_{P_{U,V}}[\mathbb{E}_{P_V}[h(U,V)]] \bigr)} \Bigr]
&= \mathbb{E}_{P_U}\Bigl[ e^{\lambda\bigl( \mathbb{E}_{P_V}[h(U,V)] - \mathbb{E}_{P_U \otimes P_V}[h(U,V)] \bigr)} \Bigr] \\
&\le \mathbb{E}_{P_U \otimes P_V}\Bigl[ e^{\lambda\bigl( h(U,V) - \mathbb{E}_{P_U \otimes P_V}[h(U,V)] \bigr)} \Bigr] \\
&\le \exp\!\left( \frac{\lambda^2 \sigma^2}{8} \right),
\end{aligned}
\]
where (A19) follows from Jensen's inequality, and (A20) follows from the assumption that h(U,V) is (σ²/4)-sub-Gaussian under P_U ⊗ P_V. Thus, E_{P_V}[h(U,V)] is (σ²/4)-sub-Gaussian under P_{U,V}.
Therefore, by Lemma 5, it follows that E_{P_V}[h(U,V)] is (σ²/4)-sub-Gaussian under P_{U,V}, and hence the difference h(U,V) − E_{P_V}[h(U,V)] is σ²-sub-Gaussian under P_{U,V}, as claimed.
Let g : 𝒰 × 𝒱 → R be defined by g(u, v) = h(u, v) − E_{P_V}[h(u, V)]. Since h(U,V) − E_{P_V}[h(U,V)] is σ²-sub-Gaussian under P_{U,V} and by the definition of g, we obtain that for all λ ∈ R:
\[
\log \mathbb{E}_{P_{U,V}}\Bigl[ e^{\lambda\bigl( g(U,V) - \mathbb{E}_{P_{U,V}}[g(U,V)] \bigr)} \Bigr] \le \frac{\sigma^2 \lambda^2}{2}.
\]
This can be re-written as
\[
\log \mathbb{E}_{P_{U,V}}\bigl[ e^{\lambda g(U,V)} \bigr] \le \frac{\sigma^2 \lambda^2}{2} + \mathbb{E}_{P_{U,V}}[\lambda g(U,V)].
\]
Moreover, since h(u, V) is γ²(u)-sub-Gaussian under P_V for all u ∈ 𝒰, it follows that for all λ ∈ R,
\[
\log \mathbb{E}_{P_V}\Bigl[ e^{\lambda\bigl( h(u,V) - \mathbb{E}_{P_V}[h(u,V)] \bigr)} \Bigr] \le \frac{\gamma^2(u) \lambda^2}{2}.
\]
Using the definition of g and applying the exponential to both sides, we obtain:
\[
\mathbb{E}_{P_V}\bigl[ e^{\lambda g(u,V)} \bigr] \le e^{\frac{\gamma^2(u) \lambda^2}{2}}.
\]
Taking the expectation with respect to P_{U*} yields
\[
\mathbb{E}_{P_{U^*} \otimes P_V}\bigl[ e^{\lambda g(U^*,V)} \bigr] \le \mathbb{E}_{P_{U^*}}\Bigl[ e^{\frac{\gamma^2(U^*) \lambda^2}{2}} \Bigr].
\]
Finally, taking the logarithm and noting that
\[
\mathbb{E}_{P_{U^*} \otimes P_V}[g(U^*,V)] = \mathbb{E}_{P_{U^*} \otimes P_V}[h(U^*,V)] - \mathbb{E}_{P_{U^*} \otimes P_V}[h(U^*,V)] = 0,
\]
we conclude that
\[
\log \mathbb{E}_{P_{U^*} \otimes P_V}\bigl[ e^{\lambda g(U^*,V)} \bigr] \le \frac{\lambda^2}{2} \log \mathbb{E}_{P_{U^*}}\bigl[ e^{\gamma^2(U^*)} \bigr] + \mathbb{E}_{P_{U^*} \otimes P_V}[\lambda g(U^*,V)].
\]
Using the Donsker–Varadhan representation for D_KL(P_{Û,V̂} ∥ P_{U,V}) [33] and inequality (A21), we have for all λ ∈ R that
\[
D_{\mathrm{KL}}\bigl(P_{\hat U,\hat V} \,\|\, P_{U,V}\bigr) \ge \mathbb{E}_{P_{\hat U,\hat V}}[\lambda g(\hat U,\hat V)] - \log \mathbb{E}_{P_{U,V}}\bigl[ e^{\lambda g(U,V)} \bigr] \ge \mathbb{E}_{P_{\hat U,\hat V}}[\lambda g(\hat U,\hat V)] - \mathbb{E}_{P_{U,V}}[\lambda g(U,V)] - \frac{\sigma^2 \lambda^2}{2}.
\]
Rearranging terms yields
\[
\lambda\Bigl( \mathbb{E}_{P_{\hat U,\hat V}}[g(\hat U,\hat V)] - \mathbb{E}_{P_{U,V}}[g(U,V)] \Bigr) \le D_{\mathrm{KL}}\bigl(P_{\hat U,\hat V} \,\|\, P_{U,V}\bigr) + \frac{\sigma^2 \lambda^2}{2}.
\]
Note that
\[
\mathbb{E}_{P_{U,V}}[g(U,V)] = \mathbb{E}_{P_{U,V}}\bigl[ h(U,V) - \mathbb{E}_{P_V}[h(U,V)] \bigr] = \mathbb{E}_{P_{U,V}}[h(U,V)] - \mathbb{E}_{P_U \otimes P_V}[h(U,V)].
\]
Similarly, using the Donsker–Varadhan representation for D_KL(P_{Û,V̂} ∥ P_{U*} ⊗ P_V) [33] and inequality (A23), we have for all λ′ ∈ R:
\[
\lambda'\Bigl( \mathbb{E}_{P_{\hat U,\hat V}}[g(\hat U,\hat V)] - \mathbb{E}_{P_{U^*} \otimes P_V}[g(U^*,V)] \Bigr) \le D_{\mathrm{KL}}\bigl(P_{\hat U,\hat V} \,\|\, P_{U^*} \otimes P_V\bigr) + \frac{\lambda'^2}{2} \log \mathbb{E}_{P_{U^*}}\bigl[ e^{\gamma^2(U^*)} \bigr].
\]
If λ > 0, then choosing λ′ = αλ/(α−1) < 0, we have from (A24) that
\[
\mathbb{E}_{P_{\hat U,\hat V}}[g(\hat U,\hat V)] - \mathbb{E}_{P_{U,V}}[g(U,V)] \le \frac{D_{\mathrm{KL}}\bigl(P_{\hat U,\hat V} \,\|\, P_{U,V}\bigr)}{\lambda} + \frac{\sigma^2 \lambda}{2}.
\]
On the other hand, since E_{P_{U*}⊗P_V}[g(U*,V)] = 0, (A25) yields with λ′ = αλ/(α−1) < 0:
\[
\mathbb{E}_{P_{\hat U,\hat V}}[g(\hat U,\hat V)] \ge -\frac{D_{\mathrm{KL}}\bigl(P_{\hat U,\hat V} \,\|\, P_{U^*} \otimes P_V\bigr)}{|\lambda'|} - \frac{|\lambda'|}{2} \log \mathbb{E}_{P_{U^*}}\bigl[ e^{\gamma^2(U^*)} \bigr].
\]
Adding (A26) and (A27) yields that for all λ > 0:
\[
\begin{aligned}
\mathbb{E}_{P_{U,V}}[g(U,V)]
&\ge -\frac{D_{\mathrm{KL}}\bigl(P_{\hat U,\hat V} \,\|\, P_{U,V}\bigr)}{\lambda} - \frac{\sigma^2 \lambda}{2} - \frac{D_{\mathrm{KL}}\bigl(P_{\hat U,\hat V} \,\|\, P_{U^*} \otimes P_V\bigr)}{|\lambda'|} - \frac{|\lambda'|}{2} \log \mathbb{E}_{P_{U^*}}\bigl[ e^{\gamma^2(U^*)} \bigr] \\
&= -\frac{D_{\mathrm{KL}}\bigl(P_{\hat U,\hat V} \,\|\, P_{U,V}\bigr)}{\lambda} - \frac{\sigma^2 \lambda}{2} - \frac{(1-\alpha)\, D_{\mathrm{KL}}\bigl(P_{\hat U,\hat V} \,\|\, P_{U^*} \otimes P_V\bigr)}{\alpha \lambda} - \frac{\alpha \lambda}{2(1-\alpha)} \log \mathbb{E}_{P_{U^*}}\bigl[ e^{\gamma^2(U^*)} \bigr] \\
&= -\frac{\alpha\, D_{\mathrm{KL}}\bigl(P_{\hat U,\hat V} \,\|\, P_{U,V}\bigr) + (1-\alpha)\, D_{\mathrm{KL}}\bigl(P_{\hat U,\hat V} \,\|\, P_{U^*} \otimes P_V\bigr)}{\alpha \lambda} - \frac{\lambda\bigl( (1-\alpha)\sigma^2 + \alpha \log \mathbb{E}_{P_{U^*}}[e^{\gamma^2(U^*)}] \bigr)}{2(1-\alpha)}.
\end{aligned}
\]
Similarly, for λ < 0, choosing λ′ = αλ/(α−1) > 0, we have from (A24) and (A25) that
\[
\mathbb{E}_{P_{\hat U,\hat V}}[g(\hat U,\hat V)] - \mathbb{E}_{P_{U,V}}[g(U,V)] \ge -\frac{D_{\mathrm{KL}}\bigl(P_{\hat U,\hat V} \,\|\, P_{U,V}\bigr)}{|\lambda|} - \frac{\sigma^2 |\lambda|}{2}
\]
and
\[
\mathbb{E}_{P_{\hat U,\hat V}}[g(\hat U,\hat V)] \le \frac{D_{\mathrm{KL}}\bigl(P_{\hat U,\hat V} \,\|\, P_{U^*} \otimes P_V\bigr)}{\lambda'} + \frac{\lambda'}{2} \log \mathbb{E}_{P_{U^*}}\bigl[ e^{\gamma^2(U^*)} \bigr].
\]
Adding (A29) and (A30) yields for all λ < 0 and λ′ = αλ/(α−1) > 0 that
\[
\begin{aligned}
\mathbb{E}_{P_{U,V}}[g(U,V)]
&\le \frac{D_{\mathrm{KL}}\bigl(P_{\hat U,\hat V} \,\|\, P_{U,V}\bigr)}{|\lambda|} + \frac{\sigma^2 |\lambda|}{2} + \frac{D_{\mathrm{KL}}\bigl(P_{\hat U,\hat V} \,\|\, P_{U^*} \otimes P_V\bigr)}{\lambda'} + \frac{\lambda'}{2} \log \mathbb{E}_{P_{U^*}}\bigl[ e^{\gamma^2(U^*)} \bigr] \\
&= \frac{D_{\mathrm{KL}}\bigl(P_{\hat U,\hat V} \,\|\, P_{U,V}\bigr)}{|\lambda|} + \frac{\sigma^2 |\lambda|}{2} + \frac{(1-\alpha)\, D_{\mathrm{KL}}\bigl(P_{\hat U,\hat V} \,\|\, P_{U^*} \otimes P_V\bigr)}{\alpha |\lambda|} + \frac{\alpha |\lambda|}{2(1-\alpha)} \log \mathbb{E}_{P_{U^*}}\bigl[ e^{\gamma^2(U^*)} \bigr] \\
&= \frac{\alpha\, D_{\mathrm{KL}}\bigl(P_{\hat U,\hat V} \,\|\, P_{U,V}\bigr) + (1-\alpha)\, D_{\mathrm{KL}}\bigl(P_{\hat U,\hat V} \,\|\, P_{U^*} \otimes P_V\bigr)}{\alpha |\lambda|} + \frac{|\lambda|\bigl( (1-\alpha)\sigma^2 + \alpha \log \mathbb{E}_{P_{U^*}}[e^{\gamma^2(U^*)}] \bigr)}{2(1-\alpha)}.
\end{aligned}
\]
Considering (A28) and (A31), we have a non-negative parabola in λ given by
\[
\frac{\lambda^2 \bigl( (1-\alpha)\sigma^2 + \alpha \log \mathbb{E}_{P_{U^*}}[e^{\gamma^2(U^*)}] \bigr)}{2(1-\alpha)} + \lambda\Bigl( \mathbb{E}_{P_{U,V}}[h(U,V)] - \mathbb{E}_{P_U \otimes P_V}[h(U,V)] \Bigr) + \frac{\alpha\, D_{\mathrm{KL}}\bigl(P_{\hat U,\hat V} \,\|\, P_{U,V}\bigr) + (1-\alpha)\, D_{\mathrm{KL}}\bigl(P_{\hat U,\hat V} \,\|\, P_{U^*} \otimes P_V\bigr)}{\alpha} \ge 0,
\]
whose discriminant must be non-positive (for λ = 0, the above inequality holds trivially). Thus, for all α ∈ (0, 1),
\[
\Bigl| \mathbb{E}_{P_{U,V}}[h(U,V)] - \mathbb{E}_{P_U \otimes P_V}[h(U,V)] \Bigr| \le \sqrt{ \frac{2\bigl( (1-\alpha)\sigma^2 + \alpha \log \mathbb{E}_{P_{U^*}}[e^{\gamma^2(U^*)}] \bigr)}{1-\alpha} \times \frac{\alpha\, D_{\mathrm{KL}}\bigl(P_{\hat U,\hat V} \,\|\, P_{U,V}\bigr) + (1-\alpha)\, D_{\mathrm{KL}}\bigl(P_{\hat U,\hat V} \,\|\, P_{U^*} \otimes P_V\bigr)}{\alpha} }.
\]
Finally, invoking Lemma 4, we obtain
\[
\Bigl| \mathbb{E}_{P_{U,V}}[h(U,V)] - \mathbb{E}_{P_U \otimes P_V}[h(U,V)] \Bigr| \le \sqrt{ \frac{2\bigl( (1-\alpha)\sigma^2 + \alpha \log \mathbb{E}_{P_{U^*}}[e^{\gamma^2(U^*)}] \bigr)\, I_\alpha^S\bigl(P_{U,V}, P_{U^*}\bigr)}{\alpha} }. \qquad\square
\]
Proof of Theorem 3. 
Let X ¯ , Y ¯ and Z ¯ be the same random variables as defined in Theorem 1. We also consider the distribution P Y * | Z given by p Y * | Z ( y | z ) in (38) obtained from Definition 6 by considering the distributions P Y , X | Z and P Y | Z P X | Z .
We apply Lemma 6 by setting U = Y , V = X and h ( u , v ) = l ( y , f ( x ) ) and taking regular expectations with respect to P Y , X | Z = z and P Y ¯ , X ¯ | Z = z . Since Y ¯ and X ¯ are conditionally independent given Z ¯ = z such that P Y ¯ , X ¯ | Z = z = P Y | Z P X | Z , and P Z ¯ = P Z , we have that
| E P Y , X | Z = z [ l ( Y , f ( X ) ) ] E P Y ¯ , X ¯ | Z = z [ l ( Y ¯ , f ( X ¯ ) ) ] | = | E P Y , X | Z = z [ l ( Y , f ( X ) ) ] E P Y | Z = z P X | Z = z [ l ( Y , f ( X ) ) ] | 2 ( ( 1 α ) σ 2 + α Φ Y * | Z = z ( γ 2 ( Y * ) ) ) α I α S ( P Y , X | Z = z , P Y * | Z = z ) .
Now,
E P Y , X [ l ( Y , f ( X ) ) ] E P Y ¯ , X ¯ [ l ( Y ¯ , f ( X ¯ ) ) ] = E P Z E P Y , X | Z = z [ l ( Y , f ( X ) ) ] E P Y ¯ , X ¯ | Z = z [ l ( Y ¯ , f ( X ¯ ) ) ] E P Z E P Y , X | Z = z [ l ( Y , f ( X ) ) ] E P Y ¯ , X ¯ | Z = z [ l ( Y ¯ , f ( X ¯ ) ) ] ) E P Z 2 ( ( 1 α ) σ 2 + α Φ Y * | Z = z ( γ 2 ( Y * ) ) ) I α S ( P Y , X | Z = z , P Y * | Z = z ) α 2 E P Z ( ( 1 α ) σ 2 + α Φ Y * | Z = z ( γ 2 ( Y * ) ) ) E P Z I α S ( P Y , X | Z = z , P Y * | Z = z ) α = 2 ( ( 1 α ) σ 2 + α E P Z [ Φ Y * | Z ( γ 2 ( Y * ) ) ] ) α E P Z I α S ( P Y , X | Z , P Y * | Z ) ,
where the first equality holds since P Z ¯ = P Z , the first inequality follows from Jensen’s inequality, the second inequality follows from (A34), and the third inequality follows from the Cauchy–Schwarz inequality. From the proof of Theorem 1, we know that Y ¯ Z ¯ X ¯ forms a Markov chain. Hence,
\[
E_{P_{\bar{Y},\bar{X}}}\big[\ell(\bar{Y}, f(\bar{X}))\big] \ge L_\ell^*(Y|Z). \tag{A36}
\]
Since $f$ is an optimal estimator of $Y$ from $X$, we also have
\[
E_{P_{Y,X}}\big[\ell(Y, f(X))\big] = L_\ell^*(Y|X). \tag{A37}
\]
Therefore, using (A36) and (A37) in (A35), along with the fact that $L_\ell^*(Y|Z) \ge L_\ell^*(Y|X)$, we arrive at the desired inequality:
\[
L_\ell^*(Y|Z) - L_\ell^*(Y|X)
\le \sqrt{\frac{2\big((1-\alpha)\sigma^2 + \alpha\, E_{P_Z}\big[\Phi_{Y^*|Z}(\gamma^2(Y^*))\big]\big)}{\alpha}\; E_{P_Z}\big[I_\alpha^S\big(P_{Y,X|Z}, P_{Y^*|Z}\big)\big]}.
\]
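We note that the Cauchy–Schwarz step in (A35) is the elementary bound $E_{P_Z}\big[\sqrt{A(Z) B(Z)}\big] \le \sqrt{E_{P_Z}[A(Z)]\, E_{P_Z}[B(Z)]}$, applied with
\[
A(z) = \frac{2\big((1-\alpha)\sigma^2 + \alpha\, \Phi_{Y^*|Z=z}(\gamma^2(Y^*))\big)}{\alpha}, \qquad
B(z) = I_\alpha^S\big(P_{Y,X|Z=z}, P_{Y^*|Z=z}\big).
\]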

References

  1. Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, Berkeley, CA, USA, 20 June–30 July 1960; University of California Press: Berkeley, CA, USA, 1961; Volume 4, pp. 547–562. [Google Scholar]
  2. Nielsen, F. On a generalization of the Jensen–Shannon divergence and the Jensen–Shannon centroid. Entropy 2020, 22, 221. [Google Scholar] [CrossRef]
  3. Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 1991, 37, 145–151. [Google Scholar] [CrossRef]
  4. Sibson, R. Information radius. Z. Wahrscheinlichkeitstheorie Verwandte Geb. 1969, 14, 149–160. [Google Scholar] [CrossRef]
  5. Csiszár, I. Generalized cutoff rates and Rényi's information measures. IEEE Trans. Inf. Theory 1995, 41, 26–34. [Google Scholar] [CrossRef]
  6. Xu, A.; Raginsky, M. Information-theoretic analysis of generalization capability of learning algorithms. In Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  7. Bu, Y.; Zou, S.; Veeravalli, V.V. Tightening mutual information-based bounds on generalization error. IEEE J. Sel. Areas Inf. Theory 2020, 1, 121–130. [Google Scholar] [CrossRef]
  8. Esposito, A.R.; Gastpar, M.; Issa, I. Robust generalization via f-mutual information. In Proceedings of the 2020 IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA, 21–26 June 2020; pp. 2723–2728. [Google Scholar]
  9. Esposito, A.R.; Gastpar, M.; Issa, I. Variational characterizations of Sibson’s α-mutual information. In Proceedings of the 2024 IEEE International Symposium on Information Theory (ISIT), Athens, Greece, 7–12 July 2024; pp. 2110–2115. [Google Scholar]
  10. Esposito, A.R.; Gastpar, M.; Issa, I. Generalization error bounds via Rényi-, f-divergences and maximal leakage. IEEE Trans. Inf. Theory 2021, 67, 4986–5004. [Google Scholar] [CrossRef]
  11. Modak, E.; Asnani, H.; Prabhakaran, V.M. Rényi divergence based bounds on generalization error. In Proceedings of the 2021 IEEE Information Theory Workshop (ITW), Kanazawa, Japan, 17–21 October 2021; pp. 1–6. [Google Scholar]
  12. Ji, K.; Zhou, Y.; Liang, Y. Understanding estimation and generalization error of generative adversarial networks. IEEE Trans. Inf. Theory 2021, 67, 3114–3129. [Google Scholar] [CrossRef]
  13. Xu, A.; Raginsky, M. Minimum excess risk in Bayesian learning. IEEE Trans. Inf. Theory 2022, 68, 7935–7955. [Google Scholar] [CrossRef]
  14. Györfi, L.; Linder, T.; Walk, H. Lossless transformations and excess risk bounds in statistical inference. Entropy 2023, 25, 1394. [Google Scholar] [CrossRef]
  15. Hafez-Kolahi, H.; Moniri, B.; Kasaei, S. Information-theoretic analysis of minimax excess risk. IEEE Trans. Inf. Theory 2023, 69, 4659–4674. [Google Scholar] [CrossRef]
  16. Aminian, G.; Bu, Y.; Toni, L.; Rodrigues, M.R.; Wornell, G.W. Information-theoretic characterizations of generalization error for the Gibbs algorithm. IEEE Trans. Inf. Theory 2023, 70, 632–655. [Google Scholar] [CrossRef]
  17. Aminian, G.; Masiha, S.; Toni, L.; Rodrigues, M.R. Learning algorithm generalization error bounds via auxiliary distributions. IEEE J. Sel. Areas Inf. Theory 2024, 5, 273–284. [Google Scholar] [CrossRef]
  18. Birrell, J.; Dupuis, P.; Katsoulakis, M.A.; Rey-Bellet, L.; Wang, J. Variational representations and neural network estimation of Rényi divergences. SIAM J. Math. Data Sci. 2021, 3, 1093–1116. [Google Scholar] [CrossRef]
  19. Atar, R.; Chowdhary, K.; Dupuis, P. Robust bounds on risk-sensitive functionals via Rényi divergence. SIAM/ASA J. Uncertain. Quantif. 2015, 3, 18–33. [Google Scholar] [CrossRef]
  20. Anantharam, V. A variational characterization of Rényi divergences. IEEE Trans. Inf. Theory 2018, 64, 6979–6989. [Google Scholar] [CrossRef]
  21. Csiszár, I. Information-type measures of difference of probability distributions and indirect observations. Stud. Sci. Math. Hung. 1967, 2, 299–318. [Google Scholar]
  22. Huang, Y.; Xiao, F.; Cao, Z.; Lin, C.T. Fractal belief Rényi divergence with its applications in pattern classification. IEEE Trans. Knowl. Data Eng. 2024, 36, 8297–8312. [Google Scholar] [CrossRef]
  23. Huang, Y.; Xiao, F.; Cao, Z.; Lin, C.T. Higher order fractal belief Rényi divergence with its applications in pattern classification. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 14709–14726. [Google Scholar] [CrossRef]
  24. Zhang, L.; Xiao, F. Belief Rényi divergence of divergence and its application in time series classification. IEEE Trans. Knowl. Data Eng. 2024, 36, 3670–3681. [Google Scholar] [CrossRef]
  25. McAllester, D.A. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory (COLT), Santa Cruz, CA, USA, 7–9 July 1999; pp. 164–170. [Google Scholar]
  26. Alquier, P.; Ridgway, J.; Chopin, N. Properties of variational approximations of Gibbs posteriors. J. Mach. Learn. Res. 2016, 17, 8372–8414. [Google Scholar]
  27. Lopez, A.; Jog, V. Generalization bounds via Wasserstein distance based algorithmic stability. In Proceedings of the 37th International Conference on Machine Learning (ICML), Virtual, 13–18 July 2020; pp. 6326–6335. [Google Scholar]
  28. Esposito, A.R.; Gastpar, M. From generalisation error to transportation-cost inequalities and back. In Proceedings of the 2022 IEEE International Symposium on Information Theory (ISIT), Espoo, Finland, 26 June–1 July 2022; pp. 294–299. [Google Scholar]
  29. Lugosi, G.; Neu, G. Generalization bounds via convex analysis. In Proceedings of the 35th Conference on Learning Theory (COLT), London, UK, 2–5 July 2022; pp. 1–23. [Google Scholar]
  30. Welfert, M.; Kurri, G.R.; Otstot, K.; Sankar, L. Addressing GAN training instabilities via tunable classification losses. IEEE J. Sel. Areas Inf. Theory 2024, 5, 534–553. [Google Scholar] [CrossRef]
  31. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Annual Conference on Neural Information Processing Systems 2014, Montreal, QC, Canada, 8–13 December 2014; Volume 27. [Google Scholar]
  32. Esposito, A.R.; Vandenbroucque, A.; Gastpar, M. Lower bounds on the Bayesian risk via information measures. J. Mach. Learn. Res. 2024, 25, 1–45. [Google Scholar]
  33. Donsker, M.; Varadhan, S. Asymptotic evaluation of certain Markov process expectations for large time. IV. Commun. Pure Appl. Math. 1983, 36, 183–212. [Google Scholar] [CrossRef]
  34. Van Erven, T.; Harremos, P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820. [Google Scholar] [CrossRef]
  35. Verdú, S. α-mutual information. In Proceedings of the 2015 Information Theory and Applications Workshop (ITA), San Diego, CA, USA, 1–6 February 2015; pp. 1–6. [Google Scholar]
  36. Esposito, A.R.; Gastpar, M.; Issa, I. Sibson’s α-mutual information and its variational representations. arXiv 2024, arXiv:2405.08352. [Google Scholar]
  37. Hoeffding, W. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding; Springer: New York, NY, USA, 1994; pp. 409–426. [Google Scholar]
  38. Palomar, D.P.; Verdú, S. Lautum information. IEEE Trans. Inf. Theory 2008, 54, 964–975. [Google Scholar] [CrossRef]
  39. Buldygin, V.V.; Kozachenko, Y.V. Sub-Gaussian random variables. Ukr. Math. J. 1980, 32, 483–489. [Google Scholar] [CrossRef]
  40. Rivasplata, O. Subgaussian Random Variables: An Expository Note. 2012. Available online: https://www.stat.cmu.edu/~arinaldo/36788/subgaussians.pdf (accessed on 24 May 2025).
  41. Endres, D.M.; Schindelin, J.E. A new metric for probability distributions. IEEE Trans. Inf. Theory 2003, 49, 1858–1860. [Google Scholar] [CrossRef]
  42. Omanwar, A.; Alajaji, F.; Linder, T. Bounding excess minimum risk via Rényi’s divergence. In Proceedings of the 2024 International Symposium on Information Theory and Its Applications (ISITA), Taipei, Taiwan, 10–13 November 2024; pp. 59–63. [Google Scholar]
Figure 1. Comparison of bounds versus $\alpha$ on minimum excess risk for two concatenated $q$-ary symmetric channels, where $\epsilon_1 = 0.15$ and $\epsilon_2 = 0.05$.
Figure 2. Comparison of bounds versus $\alpha$ on minimum excess risk for a Gaussian additive noise channel with $c = 1$, $\hat{\sigma}^2 = 1$, and $\sigma_i^2 = 1$ for all $i = 1, 2$.
Figure 3. Comparison of bounds versus $\alpha$ on minimum excess risk for a reverse Gaussian additive noise channel with $c = 1$, $\hat{\sigma}^2 = 2$, $\sigma_1^2 = 39$, and $\sigma_2^2 = 1$.
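To give a concrete sense of the computations underlying a comparison such as the one in Figure 1, the following minimal Python sketch evaluates the order-$\alpha$ Rényi divergence (and its Kullback–Leibler limit) between two discrete distributions. The inputs below are placeholder rows of hypothetical $q$-ary symmetric channels with illustrative crossover parameters; they are not the exact quantities plotted in the figures, which also involve the sub-Gaussianity and Sibson mutual information terms of the respective bounds.

```python
import numpy as np

def qsc_row(q, eps):
    """One row of a q-ary symmetric channel matrix: mass 1 - eps on the
    'correct' symbol and eps split evenly over the remaining q - 1 symbols."""
    row = np.full(q, eps / (q - 1))
    row[0] = 1.0 - eps
    return row

def kl_divergence(p, r):
    """Kullback-Leibler divergence D(p || r) in nats (full support assumed)."""
    return float(np.sum(p * np.log(p / r)))

def renyi_divergence(p, r, alpha):
    """Renyi divergence of order alpha between discrete distributions p and r:
    D_alpha(p || r) = (1 / (alpha - 1)) * log sum_i p_i^alpha * r_i^(1 - alpha),
    with the KL divergence recovered in the limit alpha -> 1."""
    if np.isclose(alpha, 1.0):
        return kl_divergence(p, r)
    return float(np.log(np.sum(p**alpha * r**(1.0 - alpha))) / (alpha - 1.0))

if __name__ == "__main__":
    q = 4                          # hypothetical alphabet size
    p = qsc_row(q, eps=0.15)       # placeholder crossover probabilities
    r = qsc_row(q, eps=0.05)
    for alpha in (0.25, 0.5, 0.75, 1.0, 2.0):
        print(f"alpha = {alpha:4.2f}:  D_alpha(p || r) = {renyi_divergence(p, r, alpha):.4f}")
```

Sweeping $\alpha$ over $(0,1)$ with such divergence evaluations, once combined with the remaining terms of the bound expressions, is how bound-versus-$\alpha$ curves of the type shown above can be generated.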