
Information-Theoretic Generalization Bounds for Batch Reinforcement Learning

School of Computing Science, Simon Fraser University, 8888 University Dr W, Burnaby, BC V5A 1S6, Canada
Entropy 2024, 26(11), 995; https://doi.org/10.3390/e26110995
Submission received: 27 September 2024 / Revised: 12 November 2024 / Accepted: 16 November 2024 / Published: 18 November 2024
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

We analyze the generalization properties of batch reinforcement learning (batch RL) with value function approximation from an information-theoretic perspective. We derive generalization bounds for batch RL using (conditional) mutual information. In addition, we demonstrate how to establish a connection between certain structural assumptions on the value function space and conditional mutual information. As a by-product, we derive a high-probability generalization bound via conditional mutual information, which was left open and may be of independent interest.

1. Introduction

Generalization is a fundamental concept in statistical machine learning. It measures how well a learning system performs on unseen data after being trained on a finite dataset. Effective generalization ensures that the learning approach captures the essential patterns in the data. Generalization in supervised learning has been studied for several decades. However, in reinforcement learning (RL), agnostic learning is generally infeasible and realizability is not a sufficient condition for efficient learning. Consequently, the study of generalization in RL poses more challenges.
In this work, we focus on batch reinforcement learning (batch RL), a branch of reinforcement learning where the agent learns a policy from a fixed dataset of previously collected experiences. This setting is favorable when online interaction is expensive, dangerous, or impractical. Batch RL, despite being a special case of supervised learning, still presents distinct challenges due to the complex temporal structures inherent in the data.
Originating from the work of [1,2], an information-theoretic framework has been developed to bound the generalization error of learning algorithms using the mutual information between the input dataset and the output hypothesis. This methodology formalizes the intuition that overfitted learning algorithms are less likely to generalize effectively. Unlike traditional approaches such as VC-dimension and Rademacher complexity, this information-theoretic framework offers the significant advantage of capturing all dependencies on the data distribution, hypothesis space, and learning algorithm. Given that reinforcement learning is a learning paradigm in which all the aforementioned aspects differ significantly from those in supervised learning, we believe this novel approach will provide us with more profound insights.

2. Preliminaries

2.1. Batch Reinforcement Learning with Function Approximation

An episodic Markov decision process (MDP) is defined by a tuple $M(\mathcal{S}, \mathcal{A}, P, r, H)$. We use $\Delta(\mathcal{X})$ to denote the set of probability distributions over a set $\mathcal{X}$. $M(\mathcal{S}, \mathcal{A}, P, r, H)$ is specified by a finite state space $\mathcal{S}$, a finite action space $\mathcal{A}$, transition functions $P_h : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ at each step $h \in [H]$, reward functions $r_h : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ at each step $h$, and the horizon $H$, the number of steps in each episode. We assume the reward is bounded, i.e., $r_h(s,a) \in [0,1]$ for all $(s,a,h)$ (rewards in $[R_{\min}, R_{\max}]$ simply rescale these bounds). See Figure 1 for a graphical illustration.
Let $\pi = \{\pi_h : \mathcal{S} \to \Delta(\mathcal{A})\}_{h \in [H]}$, where $\pi_h(\cdot \mid s)$ is the action distribution of policy $\pi$ at state $s$ and step $h$. Given a policy $\pi$, the value function $V_h^{\pi} : \mathcal{S} \to \mathbb{R}$ at step $h$ is defined as
$V_h^{\pi}(s) := \mathbb{E}_{\pi}\Big[ \sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}) \,\Big|\, s_h = s \Big].$
The action-value function $Q_h^{\pi} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ at step $h$ is defined as
$Q_h^{\pi}(s, a) := \mathbb{E}_{\pi}\Big[ \sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}) \,\Big|\, s_h = s, a_h = a \Big].$
The Bellman operators $\mathcal{T}_h^{\pi}$ and $\mathcal{T}_h^{*}$ project functions forward by one step through the following dynamics:
$(\mathcal{T}_h^{\pi} Q)(s, a) = r_h(s, a) + \mathbb{E}_{s' \sim P_h(\cdot \mid s, a)}\big[ \mathbb{E}_{a' \sim \pi(\cdot \mid s')}[Q(s', a')] \big],$
$(\mathcal{T}_h^{*} Q)(s, a) = r_h(s, a) + \mathbb{E}_{s' \sim P_h(\cdot \mid s, a)}\big[ \max_{a'} Q(s', a') \big].$
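To make the backup concrete, the short sketch below applies the Bellman optimality operator $\mathcal{T}_h^{*}$ by backward induction on a small tabular MDP. The random MDP, the array shapes, and all variable names are illustrative assumptions for this sketch only, not part of the paper's setting.

import numpy as np

# Illustrative tabular episodic MDP: S states, A actions, horizon H (all assumed).
S, A, H = 5, 3, 4
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(H, S, A))   # P[h, s, a] is a distribution over next states
r = rng.uniform(0.0, 1.0, size=(H, S, A))       # rewards r_h(s, a) in [0, 1]

def bellman_optimality_backup(h, Q_next):
    """(T_h^* Q)(s, a) = r_h(s, a) + E_{s' ~ P_h(.|s,a)}[ max_{a'} Q(s', a') ]."""
    V_next = Q_next.max(axis=1)                  # V(s') = max_{a'} Q(s', a'), shape (S,)
    return r[h] + P[h] @ V_next                  # shape (S, A)

# Backward induction with Q_{H+1} = 0, i.e., Q_h = T_h^* Q_{h+1} for h = H, ..., 1.
Q = np.zeros((H + 1, S, A))
for h in reversed(range(H)):
    Q[h] = bellman_optimality_backup(h, Q[h + 1])
print("Optimal values at the first step:", Q[0].max(axis=1))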
Now, we denote the dataset $Z = \{(s, a, r, s', h)\}$, where $(s, a) \sim \mu_h$, $r \sim r_h(s, a)$, and $s' \sim P_h(\cdot \mid s, a)$ for a fixed $h$. We also denote $\mathcal{D} = \mathcal{D}_1 \times \cdots \times \mathcal{D}_H$, where $(s, a, r, s', h) \sim \mathcal{D}_h$. We consider batch RL with value function approximation. The learner is given a function class $\mathcal{F} = \mathcal{F}_1 \times \cdots \times \mathcal{F}_H$ to approximate the optimal Q-value function. Denote $f = (f_1, \ldots, f_H) \in \mathcal{F}$. As no reward is collected in the $(H+1)$-th step, we set $f_{H+1} = 0$. For each $f \in \mathcal{F}$, define $\pi_f = \{\pi_{f,h}\}_{h=1}^{H}$, where $\pi_{f,h}(a \mid s) = \mathbb{1}\{a = \arg\max_{a'} f_h(s, a')\}$. Next, we introduce the Bellman error and its empirical version.
Definition 1
(Bellman error). Under data distribution $\mu$, we define the Bellman error of function $f = (f_1, \ldots, f_H)$ as
$\mathcal{E}(f) := \frac{1}{H} \sum_{h=1}^{H} \| f_h - \mathcal{T}_h f_{h+1} \|_{\mu_h}^{2}.$
Definition 2
(Mean squared empirical Bellman error (MSBE)). Given a dataset $Z \sim \mathcal{D}$, we define the mean squared empirical Bellman error (MSBE) of function $f = (f_1, \ldots, f_H)$ as
$L(f, Z) = \frac{1}{H} \sum_{h=1}^{H} \frac{1}{n} \sum_{(s,a,r,s',h) \in Z_h} \big( f_h(s, a) - r - V_{f_{h+1}}(s') \big)^{2},$
where $V_{f_{h+1}}(s') := \max_{a \in \mathcal{A}} f_{h+1}(s', a)$.
For convenience, we denote $\ell(f_h, Z_h) = \frac{1}{n} \sum_{(s,a,r,s',h) \in Z_h} \big( f_h(s, a) - r - V_{f_{h+1}}(s') \big)^{2}$.
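As a concrete illustration of Definition 2, the sketch below evaluates $L(f, Z)$ on a dataset of $(s, a, r, s')$ tuples grouped by step; the tabular representation of $f_h$ and the dataset layout are assumptions made only for this example.

import numpy as np

def msbe(f, data, n, H):
    """Mean squared empirical Bellman error L(f, Z).

    f    : array of shape (H + 1, S, A) with f[H] = 0 (no value after the last step).
    data : per-step lists; data[h] holds n tuples (s, a, r, s_next).
    """
    total = 0.0
    for h in range(H):
        step_loss = 0.0
        for (s, a, r, s_next) in data[h]:
            v_next = f[h + 1][s_next].max()       # V_{f_{h+1}}(s') = max_{a'} f_{h+1}(s', a')
            step_loss += (f[h][s, a] - r - v_next) ** 2
        total += step_loss / n                    # this inner average is ell(f_h, Z_h)
    return total / H

# Tiny synthetic example (all sizes and data are purely illustrative).
S, A, H, n = 4, 2, 3, 10
rng = np.random.default_rng(1)
f = np.zeros((H + 1, S, A))
f[:H] = rng.uniform(0, H, size=(H, S, A))
data = [[(rng.integers(S), rng.integers(A), rng.uniform(), rng.integers(S))
         for _ in range(n)] for _ in range(H)]
print("MSBE:", msbe(f, data, n, H))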
The Bellman error is used in RL as a surrogate loss function to minimize the difference between the estimated value function and the true value function under a policy. It serves as a proxy for the optimality gap, the difference between the current value function and the optimal value function. Under the concentrability assumption, minimizing the Bellman error also reduces the optimality gap.
Lemma 1
(Bellman error to value suboptimality [3]). If there exists a constant $C$ such that for any policy $\pi$,
$\sup_{(s,a,h) \in \mathcal{S} \times \mathcal{A} \times [H]} \frac{d P_h^{\pi}}{d \mu_h}(s, a) \le C,$
then for any $f \in \mathcal{F}$, we have
$V_1^{*}(s_1) - V_1^{\pi_f}(s_1) \le 2H\sqrt{C \cdot \mathcal{E}(f)}.$
We note that $L(f, Z)$ is a biased estimate of $\mathcal{E}(f)$. A common remedy is the double sampling method, in which at least two independent next states are generated for each state-action pair in the sample [3,4,5], and the unbiased MSBE is defined as
$L_{\mathrm{DS}}(f, \tilde{Z}) = \frac{1}{nH} \sum_{(s,a,r,s',\tilde{s}',h) \in \tilde{Z}} \Big[ \big( f_h(s, a) - r - V_{f_{h+1}}(s') \big)^{2} - \frac{1}{2} \big( V_{f_{h+1}}(s') - V_{f_{h+1}}(\tilde{s}') \big)^{2} \Big].$
Note that $L(f, Z) \in [0, 4H^2]$, $L_{\mathrm{DS}}(f, \tilde{Z}) \in [-2H^2, 4H^2]$, and double sampling does not increase the sample size, except that it requires an additional generated $\tilde{s}' \sim P_h(\cdot \mid s, a)$. Therefore, the results presented in this paper can be easily extended to the double sampling setting.
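A minimal sketch of the double-sampling correction $L_{\mathrm{DS}}$ follows; it assumes each record carries two independently drawn next states $s'$ and $\tilde{s}'$ and reuses the tabular layout of the MSBE sketch above.

def msbe_double_sampling(f, data_ds, n, H):
    """Double-sampling estimator: each tuple is (s, a, r, s1, s2), where s1 and s2
    are two next states drawn independently from P_h(. | s, a)."""
    total = 0.0
    for h in range(H):
        for (s, a, r, s1, s2) in data_ds[h]:
            v1 = f[h + 1][s1].max()
            v2 = f[h + 1][s2].max()
            # Squared Bellman residual minus the variance-correction term.
            total += (f[h][s, a] - r - v1) ** 2 - 0.5 * (v1 - v2) ** 2
    return total / (n * H)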

2.2. Generalization Bounds

Definition 3
(Expected generalization bounds). Given a dataset $Z \sim \mathcal{D}$ and an algorithm $\mathcal{A}$, let $L(\mathcal{A}(Z), Z)$ denote the training loss and let $L(\mathcal{A}(Z), \mathcal{D})$ denote the true loss. The expected generalization error is defined as
$\mathbb{E}_{Z \sim \mathcal{D}}\big[ L(\mathcal{A}(Z), Z) - L(\mathcal{A}(Z), \mathcal{D}) \big].$
Definition 4
(High-probability generalization bounds). Given a dataset $Z \sim \mathcal{D}$ and an algorithm $\mathcal{A}$, let $L(\mathcal{A}(Z), Z)$ denote the training loss and let $L(\mathcal{A}(Z), \mathcal{D})$ denote the true loss. Given a failure probability $\delta$ and an error tolerance $\eta$, the high-probability generalization error is defined as
$\Pr\big( \big| L(\mathcal{A}(Z), Z) - L(\mathcal{A}(Z), \mathcal{D}) \big| \ge \eta \big) \le \delta.$

2.3. Mutual Information

First, we define the KL-divergence of two distributions.
Definition 5
(KL-Divergence [6]). Let $P$, $Q$ be two distributions over the space $\Omega$ and suppose $P$ is absolutely continuous with respect to $Q$. The Kullback–Leibler (KL) divergence from $Q$ to $P$ is
$D(P \,\|\, Q) = \mathbb{E}_{X \sim P}\Big[ \log \frac{P_X(X)}{Q_X(X)} \Big],$
where $P_X$ and $Q_X$ denote the probability mass/density functions of $P$ and $Q$, respectively.
Based on KL-divergence, we can define mutual information and conditional mutual information as follows.
Definition 6
([6]). Let $X$, $Y$, and $Z$ be arbitrary random variables, and let $D_{\mathrm{KL}}$ denote the Kullback–Leibler (KL) divergence. The mutual information between $X$ and $Y$ is defined as:
$I(X; Y) := D_{\mathrm{KL}}(P_{X,Y} \,\|\, P_X P_Y).$
The conditional mutual information is defined as:
$I(X; Y \mid Z) := \mathbb{E}_Z\big[ D_{\mathrm{KL}}(P_{X,Y|Z} \,\|\, P_{X|Z} P_{Y|Z}) \big].$
Next, we introduce Rényi’s $\alpha$-divergence, which is a generalization of the KL-divergence. Rényi’s $\alpha$-divergence has found many applications, including hypothesis testing, differential privacy, and various statistical inference and coding problems [7,8,9,10].
Definition 7
(Rényi’s $\alpha$-Divergence [11]). Let $(\Omega, \mathcal{F}, P)$, $(\Omega, \mathcal{F}, Q)$ be two probability spaces. Let $\alpha > 0$ be a positive real different from 1. Consider a measure $\mu$ such that $P \ll \mu$ and $Q \ll \mu$ (such a measure always exists, e.g., $\mu = (P + Q)/2$), and denote by $p, q$ the densities of $P, Q$ with respect to $\mu$. The $\alpha$-divergence of $P$ from $Q$ is defined as follows:
$D_{\alpha}(P \,\|\, Q) = \frac{1}{\alpha - 1} \log \int p^{\alpha} q^{1-\alpha} \, d\mu.$
Note that the above definition is independent of the chosen measure $\mu$. With the definition of Rényi’s $\alpha$-divergence, we are ready to state the definitions of $\alpha$-mutual information and $\alpha$-conditional mutual information.
Definition 8
($\alpha$-mutual information [7]). Let $X, Y$ be two random variables jointly distributed according to $P_{XY}$. Let $Q_Y$ be any probability measure over $\mathcal{Y}$. For $\alpha > 0$, the $\alpha$-mutual information between $X$ and $Y$ is defined as follows:
$I_{\alpha}(X; Y) = \min_{Q_Y} D_{\alpha}(P_{XY} \,\|\, P_X Q_Y).$
Definition 9
(Conditional $\alpha$-mutual information). Let $X, Y, Z$ be three random variables jointly distributed according to $P_{XYZ}$. Let $Q_{Y|Z}$ be any conditional probability measure of $Y$ given $Z$. For $\alpha > 0$, a conditional $\alpha$-mutual information of order $\alpha$ between $X$ and $Y$ given $Z$ is defined as follows:
$I_{\alpha}^{Y|Z}(X; Y \mid Z) = \min_{Q_{Y|Z}} D_{\alpha}(P_{XYZ} \,\|\, P_{X|Z} Q_{Y|Z} P_Z).$

3. Generalization Bounds via Mutual Information

Mutual information bounds provide a direct link between the generalization error and the amount of information shared between the training data and the learned hypothesis. This offers a clear information-theoretic understanding of how overfitting can be controlled by reducing the dependency on the training data. Mutual information bounds are applicable to a wide range of learning algorithms and settings, including those with unbounded loss functions and complex hypothesis spaces. Moreover, the use of mutual information can simplify the analysis of generalization compared with traditional methods, particularly in cases where those traditional measures are difficult to compute. See Appendix A for related work.
Theorem 1
([2]). Let $\mathcal{D}$ be a distribution on $\mathcal{Z}$. Let $\mathcal{A} : \mathcal{Z} \to \mathcal{W}$ be a randomized algorithm. Let $\ell : \mathcal{W} \times \mathcal{Z} \to \mathbb{R}$ be a loss function which is $\sigma$-subgaussian with respect to $Z$. Let $L : \mathcal{W} \times \mathcal{Z} \to \mathbb{R}$ be the empirical risk. Then
$\mathbb{E}_{Z \sim \mathcal{D}}\big[ L(\mathcal{A}(Z), Z) - L(\mathcal{A}(Z), \mathcal{D}) \big] \le \sqrt{\frac{2\sigma^{2}}{n} I(\mathcal{A}(Z); Z)}.$
The above theorem provides a bound on the expected generalization error. High-probability generalization bounds can be obtained using the $\alpha$-mutual information. Note that the $\alpha$-mutual information shares many properties with standard mutual information.
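As a quick numerical illustration of Theorem 1, the snippet below plugs the crude bound $I(\mathcal{A}(Z); Z) \le \log |\mathcal{W}|$, valid for a finite hypothesis class, into the expected-generalization bound; the loss range, class size, and sample size are arbitrary assumptions chosen only for the example.

import math

def xu_raginsky_bound(sigma, n, mutual_info):
    """Expected generalization gap bound: sqrt(2 * sigma^2 * I(A(Z); Z) / n)."""
    return math.sqrt(2.0 * sigma ** 2 * mutual_info / n)

# A loss taking values in [0, 1] is (1/2)-sub-Gaussian; with |W| = 1024 hypotheses,
# I(A(Z); Z) <= H(A(Z)) <= log 1024.
sigma, n = 0.5, 5000
print("Expected generalization gap <=", xu_raginsky_bound(sigma, n, math.log(1024)))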
Proposition 1
([7]). For discrete random variables $X$, $Y$, and $Z$, the following holds:
(i) 
Data processing inequality: given $\alpha > 0$, $I_{\alpha}(X; Z) \le \min\{ I_{\alpha}(X; Y), I_{\alpha}(Y; Z) \}$ if the Markov chain $X \to Y \to Z$ holds.
(ii) 
$I_{\alpha}(X; Y)$ is non-decreasing in $\alpha$.
(iii) 
$I_{\alpha}(X; Y) \le \min\{ \log |\mathcal{X}|, \log |\mathcal{Y}| \}$.
(iv) 
$I_{\alpha}(X; Y) \ge 0$, with equality if and only if $X$ and $Y$ are independent.
Theorem 2
([11]). Let $\mathcal{D}$ be a distribution on $\mathcal{Z}$. Let $\mathcal{A} : \mathcal{Z} \to \mathcal{W}$ be a randomized algorithm. Let $\ell : \mathcal{W} \times \mathcal{Z} \to \mathbb{R}$ be a loss function which is $\sigma$-subgaussian with respect to $Z$. Let $L : \mathcal{W} \times \mathcal{Z} \to \mathbb{R}$ be the empirical risk. Given $\eta, \delta \in (0, 1)$ and a fixed $\alpha > 1$, if the number of samples $n$ satisfies
$n \ge \frac{2\sigma^{2}}{\eta^{2}}\Big( I_{\alpha}(\mathcal{A}(Z); Z) + \log 2 + \frac{\alpha}{\alpha - 1}\log\frac{1}{\delta} \Big),$
then we have
$\Pr\big( \big| L(\mathcal{A}(Z), Z) - L(\mathcal{A}(Z), \mathcal{D}) \big| \le \eta \big) \ge 1 - \delta.$
The mutual information bound can be infinite in some cases and thus be vacuous. To address this, the conditional mutual information (CMI) approach was introduced. CMI bounds normalize the information content for each data point, preventing the problem of infinite information content, particularly in continuous data distributions. This makes CMI a more robust and applicable method in scenarios where mutual information would otherwise be unbounded.
Definition 10.
Let $Z \sim \mathcal{D}^{2n}$ consist of $2n$ samples drawn independently from $\mathcal{D}$. Let $U \in \{0,1\}^{n}$ be uniformly random and independent of $Z$ and of the randomness of $\mathcal{A}$. Define $Z_U \subseteq Z$ such that $(Z_U)_i$ is the $(2i - U_i)$-th sample in $Z$; that is, $Z_U$ is the subset of $Z$ indexed by $U$. The conditional mutual information of $\mathcal{A}$ with respect to $\mathcal{D}$ is defined as $I(\mathcal{A}(Z_U); U \mid Z)$.
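The supersample construction of Definition 10 is easy to write out explicitly; the sketch below pairs up $2n$ placeholder samples and uses the bits $U$ to select one element per pair, translating the 1-based index $(2i - U_i)$ of the definition into 0-based array indexing.

import numpy as np

rng = np.random.default_rng(0)
n = 6
supersample = rng.normal(size=2 * n)       # Z: 2n samples, viewed as n pairs (z_{2i-1}, z_{2i})
U = rng.integers(0, 2, size=n)             # U uniform on {0, 1}^n, independent of Z

idx_selected = 2 * np.arange(n) + 1 - U    # 0-based index of sample (2i - U_i) in pair i
idx_ghost    = 2 * np.arange(n) + U        # the unselected "ghost" sample in each pair
Z_U    = supersample[idx_selected]         # the training set Z_U
Z_Ubar = supersample[idx_ghost]            # its complement Z_{U-bar}
print(U, Z_U, Z_Ubar, sep="\n")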
Theorem 3
([12]). Let $\mathcal{D}$ be a distribution on $\mathcal{Z}$. Let $\mathcal{A} : \mathcal{Z} \to \mathcal{W}$ be a randomized algorithm. Let $L : \mathcal{W} \times \mathcal{Z} \to \mathbb{R}$ be a function such that $|L(w, z_1) - L(w, z_2)| \le \Delta(z_1, z_2)$ for all $z_1, z_2 \in \mathcal{Z}$ and $w \in \mathcal{W}$, for a given $\Delta : \mathcal{Z}^{2} \to \mathbb{R}$. Let $U \in \{0,1\}^{n}$ be uniformly random. Then
$\mathbb{E}_{Z \sim \mathcal{D}}\big[ L(\mathcal{A}(Z_U), Z_U) - L(\mathcal{A}(Z_U), \mathcal{D}) \big] \le \sqrt{\frac{2\, \mathbb{E}_{z_1, z_2}[\Delta(z_1, z_2)^{2}]}{n} I(\mathcal{A}(Z_U); U \mid Z)}.$
Another advantage of the CMI bounds is that they can be derived from various concepts such as VC-dimension, compression schemes, stability, and differential privacy, offering a unified framework for generalization analysis. However, because CMI is defined as an expectation, i.e., $I(X; Y \mid Z) := \mathbb{E}_Z[D_{\mathrm{KL}}(P_{X,Y|Z} \,\|\, P_{X|Z} P_{Y|Z})]$, the above theorem does not provide a high-probability bound. Modifying this framework to ensure high-probability guarantees was left as future work in [12]. In the following, we use conditional $\alpha$-mutual information to address this issue.
Theorem 4.
Let $U \in \{0,1\}^{n}$ be uniformly random. Given a dataset $Z \sim \mathcal{D}^{2n}$ consisting of $2nH$ samples, let $\mathcal{A} : Z_U \mapsto \mathcal{W}$ be a randomized algorithm. Let $\ell : \mathcal{W} \times \mathcal{Z} \to \mathbb{R}$ be a loss function which is $\sigma$-subgaussian with respect to $Z$. Let $L : \mathcal{W} \times Z_U \to \mathbb{R}$ be the empirical risk. Given $\eta, \delta \in (0, 1)$ and a fixed $\alpha > 1$, if the number of samples $n$ satisfies
$n \ge \frac{2\sigma^{2}}{\eta^{2}}\Big( I_{\alpha}^{\mathcal{A}(Z_U)|Z}(\mathcal{A}(Z_U); U \mid Z) + \log 2 + \frac{\alpha}{\alpha - 1}\log\frac{1}{\delta} \Big),$
then we have
$\Pr\big( \big| L(\mathcal{A}(Z_U), Z_U) - L(\mathcal{A}(Z_U), \mathcal{D}) \big| \le \eta \big) \ge 1 - \delta.$
Proof. 
Let $(\mathcal{X} \times \mathcal{Y} \times \mathcal{Z}, \mathcal{F}, P_{XYZ})$ be a probability space, and let $\mathcal{Q}(\mathcal{X} \mid \mathcal{Z})$ be the set of conditional probability measures $Q_{X|Z}$ such that $P_{XYZ} \ll P_Z Q_{X|Z} P_{Y|Z}$. Given $E \in \mathcal{F}$ and $z \in \mathcal{Z}$, $x \in \mathcal{X}$, let $E_{z,x} = \{ y \in \mathcal{Y} : (x, y, z) \in E \}$. We first prove that, for a fixed $\alpha > 1$,
$P_{XYZ}(E) \le \mathbb{E}_Z\Big[ \operatorname*{ess\,sup}_{Q_{X|Z} \in \mathcal{Q}(\mathcal{X}|\mathcal{Z})} P_{Y|Z}(E_{Z,X}) \Big]^{\frac{\alpha-1}{\alpha}} \exp\Big( \frac{\alpha-1}{\alpha}\, I_{\alpha}^{X|Z}(X; Y \mid Z) \Big).$
Using the Radon–Nikodym derivative of $P_{XYZ}$ with respect to the product measure $P_Z Q_{X|Z} P_{Y|Z}$, we have
$P_{XYZ}(E) = \mathbb{E}_{P_Z Q_{X|Z} P_{Y|Z}}\Big[ \frac{dP_{XYZ}}{d(P_Z Q_{X|Z} P_{Y|Z})} \, \mathbb{1}_E \Big],$
where $\mathbb{1}_E$ is the indicator function of the event $E$. Next, we introduce exponents $\alpha, \alpha', \alpha''$ and $\gamma, \gamma', \gamma''$ such that
$\frac{1}{\alpha} + \frac{1}{\gamma} = \frac{1}{\alpha'} + \frac{1}{\gamma'} = \frac{1}{\alpha''} + \frac{1}{\gamma''} = 1.$
By applying Hölder’s inequality three times to separate the different components of the expectation, we derive
$\mathbb{E}_{P_Z Q_{X|Z} P_{Y|Z}}\Big[ \frac{dP_{XYZ}}{d(P_Z Q_{X|Z} P_{Y|Z})} \mathbb{1}_E \Big] \le \mathbb{E}_{P_Z}^{\frac{1}{\alpha}}\Big[ \mathbb{E}_{Q_{X|Z}}^{\frac{\alpha}{\alpha'}}\Big[ \mathbb{E}_{P_{Y|Z}}^{\frac{\alpha'}{\alpha''}}\Big[ \Big( \frac{dP_{XYZ}}{d(P_Z Q_{X|Z} P_{Y|Z})} \Big)^{\alpha''} \Big] \Big] \Big] \cdot \mathbb{E}_{P_Z}^{\frac{1}{\gamma}}\Big[ \mathbb{E}_{Q_{X|Z}}^{\frac{\gamma}{\gamma'}}\Big[ \mathbb{E}_{P_{Y|Z}}^{\frac{\gamma'}{\gamma''}}\Big[ \mathbb{1}_E^{\gamma''} \Big] \Big] \Big],$
where $\mathbb{E}_{P}^{c}[\,\cdot\,]$ denotes $(\mathbb{E}_{P}[\,\cdot\,])^{c}$. By setting $\alpha'' = \alpha$ and $\alpha' = 1$,
$\mathbb{E}_{P_Z}^{\frac{1}{\alpha}}\Big[ \mathbb{E}_{Q_{X|Z}}^{\frac{\alpha}{\alpha'}}\Big[ \mathbb{E}_{P_{Y|Z}}^{\frac{\alpha'}{\alpha''}}\Big[ \Big( \frac{dP_{XYZ}}{d(P_Z Q_{X|Z} P_{Y|Z})} \Big)^{\alpha''} \Big] \Big] \Big] \le \exp\Big( \frac{\alpha-1}{\alpha}\, I_{\alpha}^{X|Z}(X; Y \mid Z) \Big).$
Since $\alpha' = 1$ and $\frac{1}{\alpha'} + \frac{1}{\gamma'} = 1$, we have $\gamma' \to \infty$. As $\gamma' \to \infty$, $\mathbb{E}_{Q_{X|Z}}^{\frac{\gamma}{\gamma'}}\big[ \mathbb{E}_{P_{Y|Z}}^{\frac{\gamma'}{\gamma''}}[ \mathbb{1}_E^{\gamma''} ] \big]$ tends to the essential supremum
$\operatorname*{ess\,sup}_{Q_{X|Z} \in \mathcal{Q}(\mathcal{X}|\mathcal{Z})} P_{Y|Z}(E_{Z,X}).$
As $\frac{1}{\gamma} = \frac{\alpha-1}{\alpha}$, we have
$\mathbb{E}_{P_Z}^{\frac{1}{\gamma}}\Big[ \mathbb{E}_{Q_{X|Z}}^{\frac{\gamma}{\gamma'}}\Big[ \mathbb{E}_{P_{Y|Z}}^{\frac{\gamma'}{\gamma''}}\Big[ \mathbb{1}_E^{\gamma''} \Big] \Big] \Big] \le \mathbb{E}_Z\Big[ \operatorname*{ess\,sup}_{Q_{X|Z} \in \mathcal{Q}(\mathcal{X}|\mathcal{Z})} P_{Y|Z}(E_{Z,X}) \Big]^{\frac{\alpha-1}{\alpha}}.$
Thus, Equation (2) holds by combining all of the inequalities.
Now, let $X = \mathcal{A}(Z_U)$ and $Y = U$. Consider the event
$E = \Big\{ (X, Y, Z) : \big| L(X, Z_Y) - \mathbb{E}_Y[L(X, \mathcal{D})] \big| \ge \eta \Big\},$
where $L(X, Z_Y)$ denotes the empirical risk, defined as the average of $n$ loss functions, each of which is $\sigma$-subgaussian. We can express $E_{Z,X}$, the fiber of $E$ with respect to $Z$ and $X$, as
$E_{Z,X} = \Big\{ Y : \big| L(X, Z_Y) - \mathbb{E}_Y[L(X, \mathcal{D})] \big| \ge \eta \Big\}.$
For any fixed $Z$ and $X$, the random variable $Y$ remains independent of $Z$ and $X$ under any $Q_{X|Z} \in \mathcal{Q}(\mathcal{X}|\mathcal{Z})$. Now, using Hoeffding’s inequality, for every $X$ and $Z$,
$P_Y(E_{Z,X}) \le 2\exp\Big( -\frac{n\eta^{2}}{2\sigma^{2}} \Big).$
Therefore, from Equations (2) and (3),
$P(E) \le 2\exp\Big( -\frac{\alpha-1}{\alpha} \cdot \frac{n\eta^{2}}{2\sigma^{2}} \Big) \exp\Big( \frac{\alpha-1}{\alpha}\, I_{\alpha}^{\mathcal{A}(Z_U)|Z}(\mathcal{A}(Z_U); U \mid Z) \Big) = 2\exp\Big( \frac{\alpha-1}{\alpha}\Big( I_{\alpha}^{\mathcal{A}(Z_U)|Z}(\mathcal{A}(Z_U); U \mid Z) - \frac{n\eta^{2}}{2\sigma^{2}} \Big) \Big).$
Lastly, by setting
$n \ge \frac{2\sigma^{2}}{\eta^{2}}\Big( I_{\alpha}^{\mathcal{A}(Z_U)|Z}(\mathcal{A}(Z_U); U \mid Z) + \log 2 + \frac{\alpha}{\alpha-1}\log\frac{1}{\delta} \Big),$
we obtain the desired conclusion. □

4. Information-Theoretic Generalization Bounds for Batch RL

We now provide expected and high-probability generalization bounds for batch RL. The generalization bounds are derived from mutual information between the training data and the learned hypothesis. As mutual information bounds consider the data, algorithm, and hypothesis space comprehensively, they support the design of efficient learning algorithms and fine-grained theoretical analysis.
Theorem 5.
Given a dataset $Z \sim \mathcal{D}^{n}$ consisting of $nH$ samples, for any batch RL algorithm $\mathcal{A}$ with output $\mathcal{A}(Z) = f = (f_1, \ldots, f_H) \in \mathcal{F}$, the expected generalization error for the mean squared empirical Bellman error (MSBE) loss is upper bounded by
$\mathbb{E}_{Z \sim \mathcal{D}}\big[ L(\mathcal{A}(Z), Z) - L(\mathcal{A}(Z), \mathcal{D}) \big] \le \sqrt{\frac{2H^{2} \sum_{h=1}^{H} I(f_h; Z_h)}{n}}.$
Proof. 
We first recall the Donsker–Varadhan variational representation ([13]) of the KL-divergence between any two probability measures $\pi$ and $\rho$ on a common measurable space $(\Omega, \mathcal{F})$:
$D_{\mathrm{KL}}(\pi \,\|\, \rho) = \sup_{F} \Big\{ \int_{\Omega} F \, d\pi - \log \int_{\Omega} e^{F} \, d\rho \Big\},$
where the supremum is over all measurable functions $F : \Omega \to \mathbb{R}$ such that $e^{F} \in L^{1}(\rho)$.
Let $Z = Z_1 \cup \cdots \cup Z_H$ be a dataset where $Z_h = \{(s, a, r, s', h)\} \sim \mathcal{D}_h$. Let $\mathcal{A}(Z) = f = (f_1, \ldots, f_H) \in \mathcal{F}$ be the output of some batch RL algorithm $\mathcal{A}$. Let $\tilde{f}_h$ and $\tilde{Z}_h$ be independent copies of $f_h$ and $Z_h$. Let
$L(f, Z) = \frac{1}{H}\sum_{h=1}^{H} \ell(f_h, Z_h) = \frac{1}{H}\sum_{h=1}^{H} \frac{1}{n}\sum_{(s,a,r,s',h)\in Z_h} \big(f_h(s, a) - r - V_{f_{h+1}}(s')\big)^{2}.$
Now, we have
$I(f_h; Z_h) = D_{\mathrm{KL}}(P_{f_h, Z_h} \,\|\, P_{f_h} P_{Z_h}) = \sup_{g}\Big\{ \mathbb{E}_{f_h, Z_h}[g(f_h, Z_h)] - \log \mathbb{E}_{\tilde{f}_h, \tilde{Z}_h}\big[e^{g(\tilde{f}_h, \tilde{Z}_h)}\big] \Big\} \ge \lambda\, \mathbb{E}_{f_h, Z_h}[\ell(f_h, Z_h)] - \log \mathbb{E}_{\tilde{f}_h, \tilde{Z}_h}\big[e^{\lambda \ell(\tilde{f}_h, \tilde{Z}_h)}\big], \qquad \lambda \in \mathbb{R},$
where the equality uses the Donsker–Varadhan variational representation. As $\ell(f_h, Z_h) = \frac{1}{n}\sum_{(s,a,r,s',h)\in Z_h}(f_h(s, a) - r - V_{f_{h+1}}(s'))^{2}$ and $(f_h(s, a) - r - V_{f_{h+1}}(s'))^{2} \in [0, 4H^{2}]$ for any $h$, it follows that
$\log \mathbb{E}_{\tilde{f}_h, \tilde{Z}_h}\Big[ e^{\lambda\big(\ell(\tilde{f}_h, \tilde{Z}_h) - \mathbb{E}_{\tilde{f}_h, \tilde{Z}_h}[\ell(\tilde{f}_h, \tilde{Z}_h)]\big)} \Big] \le \frac{2\lambda^{2} H^{4}}{n}.$
Thus, we obtain
$I(f_h; Z_h) \ge \lambda\Big( \mathbb{E}_{f_h, Z_h}[\ell(f_h, Z_h)] - \mathbb{E}_{\tilde{f}_h, \tilde{Z}_h}[\ell(\tilde{f}_h, \tilde{Z}_h)] \Big) - \frac{2\lambda^{2} H^{4}}{n}, \qquad \frac{I(f_h; Z_h)}{\lambda} + \frac{2\lambda H^{4}}{n} \ge \mathbb{E}_{f_h, Z_h}[\ell(f_h, Z_h)] - \mathbb{E}_{\tilde{f}_h, \tilde{Z}_h}[\ell(\tilde{f}_h, \tilde{Z}_h)].$
By optimizing the above inequality over $\lambda > 0$ and $\lambda < 0$, respectively, we derive
$-H^{2}\sqrt{\frac{2 I(f_h; Z_h)}{n}} \le \mathbb{E}_{f_h, Z_h}[\ell(f_h, Z_h)] - \mathbb{E}_{\tilde{f}_h, \tilde{Z}_h}[\ell(\tilde{f}_h, \tilde{Z}_h)] \le H^{2}\sqrt{\frac{2 I(f_h; Z_h)}{n}},$
and thus,
$\mathbb{E}_{f_h, Z_h}[\ell(f_h, Z_h)] - \mathbb{E}_{\tilde{f}_h, \tilde{Z}_h}[\ell(\tilde{f}_h, \tilde{Z}_h)] \le H^{2}\sqrt{\frac{2 I(f_h; Z_h)}{n}}.$
Finally, we observe that
$\mathbb{E}_{Z \sim \mathcal{D}}\big[ L(\mathcal{A}(Z), Z) - L(\mathcal{A}(Z), \mathcal{D}) \big] = \mathbb{E}_{Z \sim \mathcal{D}}\Big[ \frac{1}{H}\sum_{h=1}^{H} \ell(f_h, Z_h) \Big] - \mathbb{E}_{Z \sim \mathcal{D}}\Big[ \frac{1}{H}\sum_{h=1}^{H} \ell(\tilde{f}_h, \tilde{Z}_h) \Big] = \frac{1}{H}\sum_{h=1}^{H} \Big( \mathbb{E}_{Z_h \sim \mathcal{D}_h}[\ell(f_h, Z_h)] - \mathbb{E}_{\tilde{Z}_h \sim \mathcal{D}_h}[\ell(\tilde{f}_h, \tilde{Z}_h)] \Big) = \frac{1}{H}\sum_{h=1}^{H} \Big( \mathbb{E}_{f_h, Z_h}[\ell(f_h, Z_h)] - \mathbb{E}_{\tilde{f}_h, \tilde{Z}_h}[\ell(\tilde{f}_h, \tilde{Z}_h)] \Big) \le \frac{1}{H}\sum_{h=1}^{H} H^{2}\sqrt{\frac{2 I(f_h; Z_h)}{n}} \quad (\text{by Equation (4)}) \quad = \sqrt{\frac{2 H^{2} \sum_{h=1}^{H} I(f_h; Z_h)}{n}}.$
The above result suggests that reducing the mutual information between the dataset $Z_h$ and the learned function $f_h$ at each step $h$ can improve the generalization performance. Note that when the input domain is infinite, mutual information can become unbounded. To address this limitation, an approach based on conditional mutual information was introduced [12]. CMI bounds not only address the issue by normalizing the information content of each data point, but also establish connections with various other generalization concepts, as we will discuss in the next section. We now present a generalization bound using conditional mutual information.
Theorem 6.
Let $U \in \{0,1\}^{n}$ be uniformly random. Given a dataset $Z \sim \mathcal{D}^{2n}$ consisting of $2nH$ samples, for any batch RL algorithm $\mathcal{A}$ with output $\mathcal{A}(Z_U) = f = (f_1, \ldots, f_H) \in \mathcal{F}$, the expected generalization error for the mean squared empirical Bellman error (MSBE) loss is upper bounded by
$\mathbb{E}_{Z \sim \mathcal{D}}\big[ L(\mathcal{A}(Z_U), Z_U) - L(\mathcal{A}(Z_U), \mathcal{D}) \big] \le \sqrt{\frac{2H^{2} \sum_{h=1}^{H} I(f_h; U \mid Z_h)}{n}}.$
Proof. 
Let $U \in \{0,1\}^{n}$ be uniformly random. Let $Z = Z_1 \cup \cdots \cup Z_H$ be a dataset where each $Z_h = \{(s, a, r, s', h)\} \sim \mathcal{D}_h$ consists of $2n$ samples. Define $Z_U = (Z_1)_U \cup \cdots \cup (Z_H)_U$. Let $\mathcal{A}(Z_U) = f = (f_1, \ldots, f_H) \in \mathcal{F}$ be the output of some batch RL algorithm $\mathcal{A}$. Let $\bar{f}_h = \mathcal{A}(Z_{\bar{U}})_h$, $\tilde{Z}_h = (Z_h)_U$ and $\bar{Z}_h = (Z_h)_{\bar{U}}$. Note that $Z_h = \tilde{Z}_h \cup \bar{Z}_h$. We define the disintegrated mutual information
$I^{Z}(X; Y) := D_{\mathrm{KL}}(P_{X,Y|Z} \,\|\, P_{X|Z} P_{Y|Z}).$
Note that $I(X; Y \mid Z) = \mathbb{E}_Z[I^{Z}(X; Y)]$. The rest of the proof is analogous to Theorem 5. We have
$I^{Z_h}(f_h; \tilde{Z}_h \mid Z_h) = D_{\mathrm{KL}}(P_{f_h, \tilde{Z}_h | Z_h} \,\|\, P_{f_h | Z_h} P_{\tilde{Z}_h | Z_h}) = \sup_{g}\Big\{ \mathbb{E}_{f_h, \tilde{Z}_h | Z_h}[g(f_h, \tilde{Z}_h)] - \log \mathbb{E}_{\bar{f}_h, \bar{Z}_h | Z_h}\big[e^{g(\bar{f}_h, \bar{Z}_h)}\big] \Big\} \ge \lambda\, \mathbb{E}_{f_h, \tilde{Z}_h | Z_h}[\ell(f_h, \tilde{Z}_h)] - \log \mathbb{E}_{\bar{f}_h, \bar{Z}_h | Z_h}\big[e^{\lambda \ell(\bar{f}_h, \bar{Z}_h)}\big], \qquad \lambda \in \mathbb{R},$
where the equality uses the Donsker–Varadhan variational representation. As $\ell(f_h, Z_h) = \frac{1}{n}\sum_{(s,a,r,s',h)\in Z_h}(f_h(s, a) - r - V_{f_{h+1}}(s'))^{2}$ and $(f_h(s, a) - r - V_{f_{h+1}}(s'))^{2} \in [0, 4H^{2}]$ for any $h$, it follows that
$\log \mathbb{E}_{\bar{f}_h, \bar{Z}_h | Z_h}\Big[ e^{\lambda\big(\ell(\bar{f}_h, \bar{Z}_h) - \mathbb{E}_{\bar{f}_h, \bar{Z}_h | Z_h}[\ell(\bar{f}_h, \bar{Z}_h)]\big)} \Big] \le \frac{2\lambda^{2} H^{4}}{n}.$
Thus, we obtain
$I^{Z_h}(f_h; \tilde{Z}_h \mid Z_h) \ge \lambda\Big( \mathbb{E}_{f_h, \tilde{Z}_h | Z_h}[\ell(f_h, \tilde{Z}_h)] - \mathbb{E}_{\bar{f}_h, \bar{Z}_h | Z_h}[\ell(\bar{f}_h, \bar{Z}_h)] \Big) - \frac{2\lambda^{2} H^{4}}{n}, \qquad \frac{I^{Z_h}(f_h; \tilde{Z}_h \mid Z_h)}{\lambda} + \frac{2\lambda H^{4}}{n} \ge \mathbb{E}_{f_h, \tilde{Z}_h | Z_h}[\ell(f_h, \tilde{Z}_h)] - \mathbb{E}_{\bar{f}_h, \bar{Z}_h | Z_h}[\ell(\bar{f}_h, \bar{Z}_h)].$
By optimizing the above inequality over $\lambda > 0$ and $\lambda < 0$, respectively, we derive
$-H^{2}\sqrt{\frac{2 I^{Z_h}(f_h; \tilde{Z}_h \mid Z_h)}{n}} \le \mathbb{E}_{f_h, \tilde{Z}_h | Z_h}[\ell(f_h, \tilde{Z}_h)] - \mathbb{E}_{\bar{f}_h, \bar{Z}_h | Z_h}[\ell(\bar{f}_h, \bar{Z}_h)] \le H^{2}\sqrt{\frac{2 I^{Z_h}(f_h; \tilde{Z}_h \mid Z_h)}{n}},$
and thus,
$\mathbb{E}_{f_h, \tilde{Z}_h | Z_h}[\ell(f_h, \tilde{Z}_h)] - \mathbb{E}_{\bar{f}_h, \bar{Z}_h | Z_h}[\ell(\bar{f}_h, \bar{Z}_h)] \le H^{2}\sqrt{\frac{2 I^{Z_h}(f_h; \tilde{Z}_h \mid Z_h)}{n}}.$
Finally, we conclude that
$\mathbb{E}_{Z \sim \mathcal{D}}\big[ L(\mathcal{A}(Z_U), Z_U) - L(\mathcal{A}(Z_U), \mathcal{D}) \big] = \mathbb{E}_{Z \sim \mathcal{D}}\Big[ \frac{1}{H}\sum_{h=1}^{H} \ell(f_h, \tilde{Z}_h) \Big] - \mathbb{E}_{Z \sim \mathcal{D}}\Big[ \frac{1}{H}\sum_{h=1}^{H} \ell(\bar{f}_h, \bar{Z}_h) \Big] = \frac{1}{H}\sum_{h=1}^{H} \mathbb{E}_{Z_h \sim \mathcal{D}_h}\Big[ \mathbb{E}_{f_h, \tilde{Z}_h | Z_h}[\ell(f_h, \tilde{Z}_h)] - \mathbb{E}_{\bar{f}_h, \bar{Z}_h | Z_h}[\ell(\bar{f}_h, \bar{Z}_h)] \Big] \le \frac{1}{H}\sum_{h=1}^{H} H^{2}\, \mathbb{E}_{Z_h \sim \mathcal{D}_h}\bigg[ \sqrt{\frac{2 I^{Z_h}(f_h; \tilde{Z}_h \mid Z_h)}{n}} \bigg] \quad (\text{by Equation (5)}) \quad \le \frac{1}{H}\sum_{h=1}^{H} H^{2} \sqrt{\frac{2\, \mathbb{E}_{Z_h \sim \mathcal{D}_h}[I^{Z_h}(f_h; \tilde{Z}_h \mid Z_h)]}{n}} = \sqrt{\frac{2 H^{2} \sum_{h=1}^{H} I(f_h; \tilde{Z}_h \mid Z_h)}{n}} = \sqrt{\frac{2 H^{2} \sum_{h=1}^{H} I(f_h; U \mid Z_h)}{n}}.$
Note that our setting is identical to that in [3], i.e., batch RL with value function approximation for episodic MDPs. They established a bound of the order $\tilde{O}\big( H^{2} \big( \frac{1}{\sqrt{n}} + \sum_{h=1}^{H} \mathcal{R}(\mathcal{F}_h) \big) \big)$, where $\mathcal{R}(\mathcal{F}_h)$ represents the Rademacher complexity of the function space $\mathcal{F}_h$. In contrast, our result yields an error bound of the order $O\big( H \sqrt{\sum_{h=1}^{H} I(f_h; Z_h)/n} \big)$. As demonstrated in the subsequent section, under structural assumptions like a finite pseudo-dimension or effective dimension $d$, this bound can be refined to $\tilde{O}\big( H^{2} \sqrt{d/n} \big)$.
Next, we proceed to derive the high-probability version of these generalization bounds using $\alpha$-mutual information.
Theorem 7.
Given a dataset $Z \sim \mathcal{D}^{n}$ consisting of $nH$ samples, for any batch RL algorithm $\mathcal{A}$ with output $\mathcal{A}(Z) = f = (f_1, \ldots, f_H) \in \mathcal{F}$, if
$n \ge \frac{2H^{4}}{\epsilon^{2}}\Big( I_{\alpha}(\mathcal{A}(Z); Z) + \log 2 + \frac{\alpha}{\alpha - 1}\log\frac{1}{\delta} \Big),$
then the generalization error for the mean squared empirical Bellman error (MSBE) loss is upper bounded by
$\big| L(\mathcal{A}(Z), Z) - L(\mathcal{A}(Z), \mathcal{D}) \big| \le \epsilon$
with probability at least $1 - \delta$.
Proof. 
Let $Z = Z_1 \cup \cdots \cup Z_H$ be a dataset where $Z_h = \{(s, a, r, s', h)\} \sim \mathcal{D}_h$. Let $\mathcal{A}(Z) = f = (f_1, \ldots, f_H) \in \mathcal{F}$ be the output of some batch RL algorithm $\mathcal{A}$. Let
$L(f, Z) = \frac{1}{H}\sum_{h=1}^{H} \ell(f_h, Z_h) = \frac{1}{H}\sum_{h=1}^{H} \frac{1}{n}\sum_{(s,a,r,s',h)\in Z_h} \big(f_h(s, a) - r - V_{f_{h+1}}(s')\big)^{2}.$
As $\ell(f, Z) \in [0, 4H^{2}]$ for every $f$, it is $2H^{2}$-sub-Gaussian. By Theorem 2, we have
$\big| \ell(f_h, Z_h) - \mathbb{E}_{Z_h \sim \mathcal{D}_h}[\ell(f_h, Z_h)] \big| \le \epsilon$
with probability at least $1 - \delta$ for
$n \ge \frac{8H^{4}}{\epsilon^{2}}\Big( I_{\alpha}(f_h; Z_h) + \log 2 + \frac{\alpha}{\alpha - 1}\log\frac{1}{\delta} \Big).$
As we have $n$ samples at each $h \in [H]$, we require
$n \ge \frac{8H^{4}}{\epsilon^{2}}\Big( \max_h I_{\alpha}(f_h; Z_h) + \log 2 + \frac{\alpha}{\alpha - 1}\log\frac{1}{\delta} \Big).$
The claim now follows from the union bound, replacing $\delta$ with $\delta/H$. □
Recall that conditional mutual information is defined as an expectation over the KL divergence. Thus, all prior works using the CMI framework have only provided bounds on the expected generalization error. We wish to establish generalization bounds with high-probability guarantees similar to Theorem 7.
Theorem 8.
Let $U \in \{0,1\}^{n}$ be uniformly random. Given a dataset $Z \sim \mathcal{D}^{2n}$ consisting of $2nH$ samples, for any batch RL algorithm $\mathcal{A}$ with output $\mathcal{A}(Z_U) = f = (f_1, \ldots, f_H) \in \mathcal{F}$, if
$n \ge \frac{8H^{4}}{\epsilon^{2}}\Big( \max_h I_{\alpha}^{f_h|Z_h}(f_h; U \mid Z_h) + \log 2 + \frac{\alpha}{\alpha - 1}\log\frac{H}{\delta} \Big),$
then the generalization error for the mean squared empirical Bellman error (MSBE) loss is upper bounded by
$\big| L(\mathcal{A}(Z_U), Z_U) - L(\mathcal{A}(Z_U), \mathcal{D}) \big| \le \epsilon$
with probability at least $1 - \delta$.
Proof. 
Replacing Theorem 2 with Theorem 4 in the proof of Theorem 7 yields the claim. □

5. Value Functions Under Structural Assumptions

Due to the challenges stemming from large state-action spaces, long horizons, and the temporal nature of data, there is increasing interest in identifying structural assumptions for RL with value function approximation. These works include, but are not limited to, Bellman rank [14], Witness rank [15], and Eluder dimension [16]. These structural conditions aim to develop a unified theory of generalization in RL. In this section, we demonstrate that if a function class satisfies certain structural conditions reflecting a manageable complexity, the mutual information can be effectively upper bounded.
Definition 11
(Covering number). The covering number of a function class $\mathcal{F} = \mathcal{F}_1 \times \cdots \times \mathcal{F}_H$ under the metric $\rho(f, g) = \max_h \| f_h - g_h \|_{\infty}$, denoted $N(\mathcal{F}, \epsilon)$, is the minimum integer $n$ such that there exists a subset $\mathcal{F}_{\epsilon} \subseteq \mathcal{F}$ with $|\mathcal{F}_{\epsilon}| = n$, and for any $f \in \mathcal{F}$, there exists $g \in \mathcal{F}_{\epsilon}$ such that $\rho(f, g) \le \epsilon$.
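To make Definition 11 concrete, here is a small greedy construction of an $\epsilon$-cover for a finite collection of functions represented by their values on a grid of state-action pairs; greedy selection is only an illustration (it returns a valid cover whose size upper-bounds $N(\mathcal{F}, \epsilon)$, not necessarily a minimum one), and all sizes are assumptions.

import numpy as np

def greedy_cover(functions, eps):
    """Return indices of a subset C such that every row of `functions` is within
    sup-norm distance eps of some member of C (a valid, possibly non-minimal cover)."""
    cover = []
    for i, f in enumerate(functions):
        if not any(np.max(np.abs(f - functions[j])) <= eps for j in cover):
            cover.append(i)
    return cover

rng = np.random.default_rng(0)
F = rng.uniform(0, 1, size=(200, 50))      # 200 functions evaluated at 50 state-action pairs
C = greedy_cover(F, eps=0.25)
print("Greedy cover size (an upper bound on N(F, 0.25)):", len(C))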
Lemma 2.
For discrete random variables $X$, $Y$, and $Z$, we have $I(X; Y \mid Z) \le \log |\mathcal{X}|$.
Proof. 
Denote by $H(X \mid Z)$ the conditional entropy of $X$ given $Z$. Then
$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z) \le H(X \mid Z) = \mathbb{E}_z[H(X \mid Z = z)] \le \mathbb{E}_z[\log |\mathcal{X}|] = \log |\mathcal{X}|,$
where the first inequality uses $H(X \mid Y, Z) \ge 0$.
Theorem 9.
Suppose the function class $\mathcal{F}$ has covering number $N(\mathcal{F}, \epsilon)$. Let $U \in \{0,1\}^{n}$ be uniformly random. Given a dataset $Z$ consisting of $2nH$ samples, for any batch RL algorithm $\mathcal{A}$ with output $\mathcal{A}(Z_U) = f = (f_1, \ldots, f_H) \in \mathcal{F}$, the expected generalization error for the mean squared empirical Bellman error (MSBE) loss is upper bounded by
$\mathbb{E}_{Z \sim \mathcal{D}}\big[ L(\mathcal{A}(Z_U), Z_U) - L(\mathcal{A}(Z_U), \mathcal{D}) \big] \le \sqrt{\frac{2H^{3} \log N(\mathcal{F}, \epsilon)}{n}} + 8\epsilon H + 2\epsilon^{2}.$
Proof. 
Let $\tilde{Z}_h = (Z_h)_U$. We first define an oracle algorithm $\mathcal{A}_o$ capable of outputting a function $\mathcal{A}_o(Z_U) = f^{*} = (f_1^{*}, \ldots, f_H^{*}) \in \mathcal{F}_{\epsilon}$ such that
$\rho(f, f^{*}) \le \epsilon.$
Note that $\mathcal{A}_o$ is only used for theoretical analysis. Observe that
$L(\mathcal{A}(Z_U), Z_U) = \frac{1}{H}\sum_{h=1}^{H} \frac{1}{n}\sum_{(s,a,r,s',h)\in\tilde{Z}_h} \big(f_h(s, a) - r - V_{f_{h+1}}(s')\big)^{2} = \frac{1}{H}\sum_{h=1}^{H} \frac{1}{n}\sum_{(s,a,r,s',h)\in\tilde{Z}_h} \big(f_h(s, a) - f_h^{*}(s, a) + f_h^{*}(s, a) - r - V_{f_{h+1}}(s')\big)^{2} \le \epsilon^{2} + \frac{1}{H}\sum_{h=1}^{H} \frac{1}{n}\sum_{(s,a,r,s',h)\in\tilde{Z}_h} \big(f_h^{*}(s, a) - r - V_{f_{h+1}}(s')\big)^{2} + 2\epsilon \cdot \frac{1}{H}\sum_{h=1}^{H} \frac{1}{n}\sum_{(s,a,r,s',h)\in\tilde{Z}_h} \big| f_h^{*}(s, a) - r - V_{f_{h+1}}(s') \big| \le \epsilon^{2} + L(\mathcal{A}_o(Z_U), Z_U) + 4\epsilon H.$
Thus,
$L(\mathcal{A}(Z_U), Z_U) - L(\mathcal{A}_o(Z_U), Z_U) \le 4\epsilon H + \epsilon^{2}.$
Bounding $| L(\mathcal{A}(Z_U), \mathcal{D}) - L(\mathcal{A}_o(Z_U), \mathcal{D}) |$ is similar. Now, we have
$L(\mathcal{A}(Z_U), Z_U) - L(\mathcal{A}(Z_U), \mathcal{D}) = L(\mathcal{A}(Z_U), Z_U) - L(\mathcal{A}_o(Z_U), Z_U) + L(\mathcal{A}_o(Z_U), Z_U) - L(\mathcal{A}_o(Z_U), \mathcal{D}) + L(\mathcal{A}_o(Z_U), \mathcal{D}) - L(\mathcal{A}(Z_U), \mathcal{D}).$
As $| L(\mathcal{A}(Z_U), Z_U) - L(\mathcal{A}_o(Z_U), Z_U) | \le 4\epsilon H + \epsilon^{2}$ and $| L(\mathcal{A}(Z_U), \mathcal{D}) - L(\mathcal{A}_o(Z_U), \mathcal{D}) | \le 4\epsilon H + \epsilon^{2}$, we have
$L(\mathcal{A}(Z_U), Z_U) - L(\mathcal{A}(Z_U), \mathcal{D}) \le L(\mathcal{A}_o(Z_U), Z_U) - L(\mathcal{A}_o(Z_U), \mathcal{D}) + 8\epsilon H + 2\epsilon^{2}.$
By Theorem 6,
$\mathbb{E}_{Z \sim \mathcal{D}}\big[ L(\mathcal{A}_o(Z_U), Z_U) - L(\mathcal{A}_o(Z_U), \mathcal{D}) \big] \le \sqrt{\frac{2H^{2} \sum_{h=1}^{H} I(f_h^{*}; U \mid Z_h)}{n}} \le \sqrt{\frac{2H^{2} \sum_{h=1}^{H} \log |\mathcal{F}_{\epsilon}|}{n}} \quad (\text{by Lemma 2}) \quad = \sqrt{\frac{2H^{3} \log |\mathcal{F}_{\epsilon}|}{n}} = \sqrt{\frac{2H^{3} \log N(\mathcal{F}, \epsilon)}{n}}.$
Therefore,
$\mathbb{E}_{Z \sim \mathcal{D}}\big[ L(\mathcal{A}(Z_U), Z_U) - L(\mathcal{A}(Z_U), \mathcal{D}) \big] \le \sqrt{\frac{2H^{3} \log N(\mathcal{F}, \epsilon)}{n}} + 8\epsilon H + 2\epsilon^{2}.$
Structural assumptions on the function space typically entail a finite covering number. Next, we consider the simplest case: the pseudo-dimension. The pseudo-dimension is a complexity measure of real-valued function classes, analogous to the VC dimension used for binary classification. Although the value function space may be infinite, it remains learnable if it has a finite pseudo-dimension.
Definition 12
(VC-Dimension [17]). Given a hypothesis class $\mathcal{H} \subseteq \{0,1\}^{\mathcal{X}}$, its VC-dimension $\mathrm{VCdim}(\mathcal{H})$ is defined as the maximal cardinality of a set $X = \{x_1, \ldots, x_{|X|}\} \subseteq \mathcal{X}$ that satisfies $|\mathcal{H}_X| = 2^{|X|}$ (i.e., $X$ is shattered by $\mathcal{H}$), where $\mathcal{H}_X$ is the restriction of $\mathcal{H}$ to $X$, namely $\{ (h(x_1), \ldots, h(x_{|X|})) : h \in \mathcal{H} \}$.
Definition 13
(Pseudo dimension [18]). Suppose $\mathcal{X}$ is a feature space. Given a hypothesis class $\mathcal{H} \subseteq \mathbb{R}^{\mathcal{X}}$, its pseudo dimension $\mathrm{Pdim}(\mathcal{H})$ is defined as $\mathrm{Pdim}(\mathcal{H}) = \mathrm{VCdim}(\mathcal{H}^{+})$, where $\mathcal{H}^{+} = \{ (x, \xi) \mapsto \mathbb{1}[h(x) > \xi] : h \in \mathcal{H} \} \subseteq \{0,1\}^{\mathcal{X} \times \mathbb{R}}$.
Lemma 3
(Bounding covering number by pseudo dimension [19]). Given a hypothesis class $\mathcal{H} \subseteq \mathbb{R}^{\mathcal{X}}$ with $\mathrm{Pdim}(\mathcal{H}) \le d$, we have
$\log N(\mathcal{H}, \epsilon) \le O(d \log(1/\epsilon)).$
Corollary 1.
Suppose each function class $\mathcal{F}_h$ in $\mathcal{F}$ has a finite pseudo dimension $\mathrm{Pdim}(\mathcal{F}_h) = d$. For any batch RL algorithm with $n$ training samples, the expected generalization error for the mean squared empirical Bellman error (MSBE) loss is upper bounded by $\tilde{O}(H^{2}\sqrt{d/n})$.
Proof. 
As $\mathrm{Pdim}(\mathcal{F}_h) = d$ and $\mathcal{F} = \mathcal{F}_1 \times \cdots \times \mathcal{F}_H$, we have $\log N(\mathcal{F}, \epsilon) \le O(dH\log(1/\epsilon))$. The claim follows from Theorem 9 by setting $\epsilon = H\sqrt{d/n}$. □
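For intuition, substituting this choice of $\epsilon$ together with $\log N(\mathcal{F}, \epsilon) \le O(dH\log(1/\epsilon))$ into Theorem 9 gives, up to constants and logarithmic factors (and assuming $d \le n$ so that the quadratic term is of lower order),
$\sqrt{\frac{2H^{3} \log N(\mathcal{F}, \epsilon)}{n}} + 8\epsilon H + 2\epsilon^{2} \;\lesssim\; H^{2}\sqrt{\frac{2d\log(1/\epsilon)}{n}} + 8H^{2}\sqrt{\frac{d}{n}} + \frac{2H^{2}d}{n} \;=\; \tilde{O}\Big( H^{2}\sqrt{\frac{d}{n}} \Big).$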
A prior study on finite sample guarantees for minimizing the Bellman error, using the pseudo-dimension, demonstrated a sample complexity with a dependence of $\tilde{O}(d^{2})$ [5]. In contrast, our sample complexity exhibits a dependence of $\tilde{O}(d)$ on the pseudo-dimension.
Now, we introduce another complexity measure known as the effective dimension [20], which admits a covering number bound similar to that of the pseudo-dimension. The effective dimension quantifies how the function class responds to data, indicating the minimum number of samples required to learn effectively.
Definition 14
($\epsilon$-effective dimension of a set [20]). The $\epsilon$-effective dimension of a set $\mathcal{X}$ is the minimum integer $d_{\mathrm{eff}}(\mathcal{X}, \epsilon) = n$ such that
$\sup_{x_1, \ldots, x_n \in \mathcal{X}} \frac{1}{n} \log\det\Big( I + \frac{1}{\epsilon^{2}} \sum_{i=1}^{n} x_i x_i^{\top} \Big) \le e^{-1}.$
Definition 15
($\epsilon$-effective dimension of a function class [20]). Given a function class $\mathcal{F}$ defined on $\mathcal{X}$, its $\epsilon$-effective dimension $d_{\mathrm{eff}}(\mathcal{F}, \epsilon) = n$ is the minimum integer $n$ such that there exists a separable Hilbert space $\mathcal{H}$ and a mapping $\phi : \mathcal{X} \to \mathcal{H}$, so that
  • for every $f \in \mathcal{F}$, there exists $\theta_f \in B_{\mathcal{H}}(1)$ satisfying $f(x) = \langle \theta_f, \phi(x) \rangle_{\mathcal{H}}$ for all $x \in \mathcal{X}$;
  • $d_{\mathrm{eff}}(\phi(\mathcal{X}), \epsilon) = n$, where $\phi(\mathcal{X}) = \{ \phi(x) : x \in \mathcal{X} \}$.
Definition 16
(Kernel MDPs [21]). In a kernel MDP of effective dimension $d$, for each step $h \in [H]$, there exist feature mappings $\phi_h : \mathcal{S} \times \mathcal{A} \to \mathcal{H}$ and $\psi_h : \mathcal{S} \to \mathcal{H}$, where $\mathcal{H}$ is a separable Hilbert space, so that the transition measure can be represented as the inner product of features, i.e.,
$P_h(s' \mid s, a) = \langle \phi_h(s, a), \psi_h(s') \rangle_{\mathcal{H}}.$
Besides, the reward function is linear in $\phi$, i.e.,
$r_h(s, a) = \langle \phi_h(s, a), \theta_h^{r} \rangle_{\mathcal{H}}$
for some $\theta_h^{r} \in \mathcal{H}$. Here, $\phi$ is known to the learner, while $\psi$ and $\theta_h^{r}$ are unknown. Moreover, a kernel MDP satisfies the following regularization conditions: for all $h$,
  • $\| \theta_h^{r} \|_{\mathcal{H}} \le 1$ and $\| \phi_h(s, a) \|_{\mathcal{H}} \le 1$ for all $s, a$;
  • $\big\| \sum_{s \in \mathcal{S}} V(s)\, \psi_h(s) \big\|_{\mathcal{H}} \le 1$ for any function $V : \mathcal{S} \to [0, 1]$;
  • $d_{\mathrm{eff}}(\mathcal{X}_h, \epsilon) \le d$ for all $h$, where $\mathcal{X}_h = \{ \phi_h(s, a) : (s, a) \in \mathcal{S} \times \mathcal{A} \}$.
Kernel MDPs are extensions of the traditional MDPs where the transition dynamics and rewards are represented in a Reproducing Kernel Hilbert Space (RKHS). In this setup, the value functions or Q-functions are approximated using kernel methods, allowing the model to capture more complex dependencies in the data compared to linear models. To learn kernel MDPs, it is necessary to construct a function class $\mathcal{F}$.
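As a rough illustration of the kind of function class $\mathcal{F}$ one might construct for a kernel MDP, the sketch below builds Q-functions that are linear in a random Fourier feature map approximating an RBF kernel; the feature map, its dimension, and the norm constraints are illustrative assumptions and not the construction used in [21].

import numpy as np

rng = np.random.default_rng(0)
d_in, d_feat = 3, 64                            # dimension of an (s, a) embedding and of the features (assumed)
W = rng.normal(size=(d_feat, d_in))
b = rng.uniform(0, 2 * np.pi, size=d_feat)

def phi(sa):
    """Random Fourier features; scaled so that ||phi(sa)|| <= 1."""
    return np.cos(W @ sa + b) / np.sqrt(d_feat)

def make_Q(theta):
    """A member of F: Q(s, a) = <theta, phi(s, a)> with ||theta|| <= 1."""
    theta = theta / max(1.0, np.linalg.norm(theta))   # project onto the unit ball
    return lambda sa: float(theta @ phi(sa))

Q = make_Q(rng.normal(size=d_feat))
print(Q(np.array([0.1, -0.3, 0.7])))            # evaluate at a sample (s, a) embedding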
Lemma 4
(Bounding covering number by effective dimension [21]). Let $M$ be a kernel MDP of effective dimension $d$. Then
$\log N(\mathcal{F}, \epsilon) \le O\big( H d \log(1 + dH/\epsilon) \big).$
Corollary 2.
Suppose the function class $\mathcal{F}$ has a finite effective dimension $d$. For any batch RL algorithm with $n$ training samples, the expected generalization error for the mean squared empirical Bellman error (MSBE) loss is upper bounded by $\tilde{O}(H^{2}\sqrt{d/n})$.
We showed that when a function class contains infinitely many elements, a finite covering number can be used to upper bound the generalization error. Just as a finite VC-dimension bounds the effective cardinality of a binary-valued class, various complexity measures for real-valued function classes, such as the pseudo-dimension and the effective dimension, yield a finite covering number, thereby ensuring efficient learning.

6. Discussion

In this paper, we analyzed the generalization properties of batch reinforcement learning within the framework of information theory. We established generalization bounds using both conditional and unconditional mutual information. In addition, we demonstrated how to leverage the structure of the function space to guarantee generalization. Owing to the merits of the information-theoretic approach, there are several appealing directions for future research.
The first interesting avenue is to extend the results to the online setting. It is noteworthy that in on-policy learning, the inputs (e.g., the reward and the next state) are influenced by the output (e.g., the policy or the model), which highlights a significant disparity compared to off-policy and supervised learning. In supervised learning, a small mutual information between the input and the output indicates that the model is not overfitting. In on-policy learning, analyzing the mutual information between the input and the output can be more complicated and insightful. For example, in model-based reinforcement learning, where the model is a part of the output, a small mutual information might indicate that the learned model focuses more on the goal of maximizing the cumulative reward rather than solely capturing the transition dynamics. How to learn an effective model beyond merely fitting the transition is the central theme in decision-aware model-based reinforcement learning [22,23,24,25,26,27,28].
As in the supervised learning setting, where various algorithms such as Stochastic Gradient Descent (SGD) [29] and Stochastic Gradient Langevin Dynamics (SGLD) have been studied [30], a promising future direction is to analyze information-theoretic generalization bounds for specific reinforcement learning algorithms such as stochastic policy gradient methods.
In addition, the information-theoretic approach has the potential to unify various concepts related to generalization, such as differential privacy and stability [12,31]. It would be interesting to explore how these notions in reinforcement learning can be leveraged to guarantee generalization.
Analyzing generalization for reinforcement learning is inherently more challenging than in supervised learning [32,33,34]. Therefore, we hope that the information-theoretic approach will provide more insights into understanding the generalization of reinforcement learning.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

No data were created or analyzed in this theoretical study. Data sharing is not applicable.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Related Work

Appendix A.1. Batch Reinforcement Learning

A body of literature focuses on finite sample guarantees for batch reinforcement learning with function approximation [35,36,37,38,39,40]. Common assumptions in batch RL, such as concentrability, realizability, and completeness, have also been examined in more recent studies [41,42,43]. The most relevant work to ours [3] investigates the generalization performance of batch RL under the same setting using Rademacher complexities.

Appendix A.2. Structural Conditions for Efficient RL

Analogous to complexity measures in supervised learning, several structural conditions have been studied to enable efficient reinforcement learning, including Bellman rank [14], Witness rank [15], Eluder dimension [16], Bellman Eluder dimension [21], and more [20,37,44]. Identifying structural conditions and classifying RL problems clarifies the limits of what can be learned and guides the design of efficient algorithms.

Appendix A.3. Information-Theoretic Study of Generalization

The information-theoretic approach was initially introduced by [1,2] and subsequently refined to derive tighter bounds [45,46,47]. Besides, various other information-theoretic bounds have been proposed, leveraging concepts such as conditional mutual information [12], f-divergence [11], the Wasserstein distance [48,49], and more [50,51]. Some studies have focused on analyzing specific algorithms [29,30,52,53,54,55] while others have examined particular settings such as deep learning [56], iterative semi-supervised learning [57], transfer learning [58], and meta-learning [59,60]. There are also works attempting to provide a unified framework for generalization from an information-theoretic perspective [31,61,62].

References

  1. Russo, D.; Zou, J. Controlling bias in adaptive data analysis using information theory. In Proceedings of the Artificial Intelligence and Statistics, Cadiz, Spain, 9–11 May 2016; pp. 1232–1240. [Google Scholar]
  2. Xu, A.; Raginsky, M. Information-theoretic analysis of generalization capability of learning algorithms. Adv. Neural Inf. Process. Syst. 2017, 30, 2521–2530. [Google Scholar]
  3. Duan, Y.; Jin, C.; Li, Z. Risk bounds and rademacher complexity in batch reinforcement learning. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 2892–2902. [Google Scholar]
  4. Sutton, R.S.; Barto, A.G. Reinforcement learning: An introduction. Robotica 1999, 17, 229–235. [Google Scholar] [CrossRef]
  5. Antos, A.; Szepesvári, C.; Munos, R. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Mach. Learn. 2008, 71, 89–129. [Google Scholar] [CrossRef]
  6. Cover, T.M. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 1999. [Google Scholar]
  7. Verdú, S. α-mutual information. In Proceedings of the 2015 Information Theory and Applications Workshop (ITA), San Diego, CA, USA, 1–6 February 2015; pp. 1–6. [Google Scholar]
  8. Van Erven, T.; Harremos, P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820. [Google Scholar] [CrossRef]
  9. Csiszár, I. Generalized cutoff rates and Rényi’s information measures. IEEE Trans. Inf. Theory 1995, 41, 26–34. [Google Scholar] [CrossRef]
  10. Mironov, I. Rényi differential privacy. In Proceedings of the 2017 IEEE 30th Computer Security Foundations Symposium (CSF), Santa Barbara, CA, USA, 21–25 August 2017; pp. 263–275. [Google Scholar]
  11. Esposito, A.R.; Gastpar, M.; Issa, I. Generalization error bounds via Rényi-, f-divergences and maximal leakage. IEEE Trans. Inf. Theory 2021, 67, 4986–5004. [Google Scholar] [CrossRef]
  12. Steinke, T.; Zakynthinou, L. Reasoning about generalization via conditional mutual information. In Proceedings of the Conference on Learning Theory, Graz, Austria, 9–12 July 2020; pp. 3437–3452. [Google Scholar]
  13. Boucheron, S.; Lugosi, G.; Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence; Oxford University Press: Oxford, UK, 2013. [Google Scholar]
  14. Jiang, N.; Krishnamurthy, A.; Agarwal, A.; Langford, J.; Schapire, R.E. Contextual decision processes with low bellman rank are pac-learnable. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1704–1713. [Google Scholar]
  15. Sun, W.; Jiang, N.; Krishnamurthy, A.; Agarwal, A.; Langford, J. Model-based rl in contextual decision processes: Pac bounds and exponential improvements over model-free approaches. In Proceedings of the Conference on Learning Theory, Phoenix, AZ, USA, 25–28 June 2019; pp. 2898–2933. [Google Scholar]
  16. Wang, R.; Salakhutdinov, R.R.; Yang, L. Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension. Adv. Neural Inf. Process. Syst. 2020, 33, 6123–6135. [Google Scholar]
  17. Shalev-Shwartz, S.; Ben-David, S. Understanding Machine Learning: From Theory to Algorithms; Cambridge University Press: Cambridge, UK, 2014. [Google Scholar]
  18. Haussler, D. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Inf. Comput. 1992, 100, 78–150. [Google Scholar] [CrossRef]
  19. Haussler, D. Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik-Chervonenkis dimension. J. Comb. Theory Ser. A 1995, 69, 217–232. [Google Scholar] [CrossRef]
  20. Du, S.; Kakade, S.; Lee, J.; Lovett, S.; Mahajan, G.; Sun, W.; Wang, R. Bilinear classes: A structural framework for provable generalization in rl. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 2826–2836. [Google Scholar]
  21. Jin, C.; Liu, Q.; Miryoosefi, S. Bellman eluder dimension: New rich classes of rl problems, and sample-efficient algorithms. Adv. Neural Inf. Process. Syst. 2021, 34, 13406–13418. [Google Scholar]
  22. Wei, R.; Lambert, N.; McDonald, A.; Garcia, A.; Calandra, R. A Unified View on Solving Objective Mismatch in Model-Based Reinforcement Learning. arXiv 2023, arXiv:2310.06253. [Google Scholar]
  23. Farahmand, A.M.; Barreto, A.; Nikovski, D. Value-aware loss function for model-based reinforcement learning. In Proceedings of the Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1486–1494. [Google Scholar]
  24. Farahmand, A.M. Iterative value-aware model learning. Adv. Neural Inf. Process. Syst. 2018, 31, 9090–9101. [Google Scholar]
  25. Abachi, R. Policy-Aware Model Learning for Policy Gradient Methods; University of Toronto (Canada): Toronto, ON, Canada, 2020. [Google Scholar]
  26. Janner, M.; Fu, J.; Zhang, M.; Levine, S. When to trust your model: Model-based policy optimization. Adv. Neural Inf. Process. Syst. 2018, 32, 12519–12530. [Google Scholar]
  27. Ji, T.; Luo, Y.; Sun, F.; Jing, M.; He, F.; Huang, W. When to update your model: Constrained model-based reinforcement learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23150–23163. [Google Scholar]
  28. Wang, X.; Zheng, R.; Sun, Y.; Jia, R.; Wongkamjan, W.; Xu, H.; Huang, F. Coplanner: Plan to roll out conservatively but to explore optimistically for model-based RL. arXiv 2023, arXiv:2310.07220. [Google Scholar]
  29. Neu, G.; Dziugaite, G.K.; Haghifam, M.; Roy, D.M. Information-theoretic generalization bounds for stochastic gradient descent. In Proceedings of the Conference on Learning Theory, Boulder, CO, USA, 15–19 August 2021; pp. 3526–3545. [Google Scholar]
  30. Negrea, J.; Haghifam, M.; Dziugaite, G.K.; Khisti, A.; Roy, D.M. Information-theoretic generalization bounds for SGLD via data-dependent estimates. Adv. Neural Inf. Process. Syst. 2019, 32, 11015–11025. [Google Scholar]
  31. Haghifam, M.; Dziugaite, G.K.; Moran, S.; Roy, D. Towards a unified information-theoretic framework for generalization. Adv. Neural Inf. Process. Syst. 2021, 34, 26370–26381. [Google Scholar]
  32. Du, S.S.; Kakade, S.M.; Wang, R.; Yang, L.F. Is a good representation sufficient for sample efficient reinforcement learning? arXiv 2019, arXiv:1910.03016. [Google Scholar]
  33. Weisz, G.; Amortila, P.; Szepesvári, C. Exponential lower bounds for planning in mdps with linearly-realizable optimal action-value functions. In Proceedings of the Algorithmic Learning Theory, Virtual, 16–19 March 2021; pp. 1237–1264. [Google Scholar]
  34. Wang, Y.; Wang, R.; Kakade, S. An exponential lower bound for linearly realizable mdp with constant suboptimality gap. Adv. Neural Inf. Process. Syst. 2021, 34, 9521–9533. [Google Scholar]
  35. Munos, R.; Szepesvári, C. Finite-Time Bounds for Fitted Value Iteration. J. Mach. Learn. Res. 2008, 9, 815–857. [Google Scholar]
  36. Farahmand, A.; Ghavamzadeh, M.; Mannor, S.; Szepesvári, C. Regularized policy iteration. Adv. Neural Inf. Process. Syst. 2008, 21, 441–448. [Google Scholar]
  37. Zanette, A.; Lazaric, A.; Kochenderfer, M.; Brunskill, E. Learning near optimal policies with low inherent bellman error. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 10978–10989. [Google Scholar]
  38. Lazaric, A.; Ghavamzadeh, M.; Munos, R. Finite-sample analysis of least-squares policy iteration. J. Mach. Learn. Res. 2012, 13, 3041–3074. [Google Scholar]
  39. Farahmand, A.M.; Ghavamzadeh, M.; Szepesvári, C.; Mannor, S. Regularized policy iteration with nonparametric function spaces. J. Mach. Learn. Res. 2016, 17, 1–66. [Google Scholar]
  40. Le, H.; Voloshin, C.; Yue, Y. Batch policy learning under constraints. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 3703–3712. [Google Scholar]
  41. Chen, J.; Jiang, N. Information-theoretic considerations in batch reinforcement learning. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 1042–1051. [Google Scholar]
  42. Wang, R.; Foster, D.P.; Kakade, S.M. What are the statistical limits of offline RL with linear function approximation? arXiv 2020, arXiv:2010.11895. [Google Scholar]
  43. Xie, T.; Jiang, N. Batch value-function approximation with only realizability. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 11404–11413. [Google Scholar]
  44. Foster, D.J.; Kakade, S.M.; Qian, J.; Rakhlin, A. The statistical complexity of interactive decision making. arXiv 2021, arXiv:2112.13487. [Google Scholar]
  45. Asadi, A.; Abbe, E.; Verdú, S. Chaining mutual information and tightening generalization bounds. Adv. Neural Inf. Process. Syst. 2018, 31, 7245–7254. [Google Scholar]
  46. Hafez-Kolahi, H.; Golgooni, Z.; Kasaei, S.; Soleymani, M. Conditioning and processing: Techniques to improve information-theoretic generalization bounds. Adv. Neural Inf. Process. Syst. 2020, 33, 16457–16467. [Google Scholar]
  47. Bu, Y.; Zou, S.; Veeravalli, V.V. Tightening mutual information-based bounds on generalization error. IEEE J. Sel. Areas Inf. Theory 2020, 1, 121–130. [Google Scholar] [CrossRef]
  48. Lopez, A.T.; Jog, V. Generalization error bounds using Wasserstein distances. In Proceedings of the 2018 IEEE Information Theory Workshop (ITW), Guangzhou, China, 25–29 November 2018; pp. 1–5. [Google Scholar]
  49. Wang, H.; Diaz, M.; Santos Filho, J.C.S.; Calmon, F.P. An information-theoretic view of generalization via Wasserstein distance. In Proceedings of the 2019 IEEE International Symposium on Information Theory (ISIT), Paris, France, 7–12 July 2019; pp. 577–581. [Google Scholar]
  50. Aminian, G.; Toni, L.; Rodrigues, M.R. Information-theoretic bounds on the moments of the generalization error of learning algorithms. In Proceedings of the 2021 IEEE International Symposium on Information Theory (ISIT), Virtual, 12–20 July 2021; pp. 682–687. [Google Scholar]
  51. Aminian, G.; Masiha, S.; Toni, L.; Rodrigues, M.R. Learning algorithm generalization error bounds via auxiliary distributions. IEEE J. Sel. Areas Inf. Theory 2024, 5, 273–284. [Google Scholar] [CrossRef]
  52. Pensia, A.; Jog, V.; Loh, P.L. Generalization error bounds for noisy, iterative algorithms. In Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA, 17–22 June 2018; pp. 546–550. [Google Scholar]
  53. Haghifam, M.; Negrea, J.; Khisti, A.; Roy, D.M.; Dziugaite, G.K. Sharpened generalization bounds based on conditional mutual information and an application to noisy, iterative algorithms. Adv. Neural Inf. Process. Syst. 2020, 33, 9925–9935. [Google Scholar]
  54. Harutyunyan, H.; Raginsky, M.; Ver Steeg, G.; Galstyan, A. Information-theoretic generalization bounds for black-box learning algorithms. Adv. Neural Inf. Process. Syst. 2021, 34, 24670–24682. [Google Scholar]
  55. Wang, H.; Gao, R.; Calmon, F.P. Generalization bounds for noisy iterative algorithms using properties of additive noise channels. J. Mach. Learn. Res. 2023, 24, 1–43. [Google Scholar]
  56. He, H.; Yu, C.L.; Goldfeld, Z. Information-Theoretic Generalization Bounds for Deep Neural Networks. arXiv 2024, arXiv:2404.03176. [Google Scholar]
  57. He, H.; Hanshu, Y.; Tan, V. Information-theoretic generalization bounds for iterative semi-supervised learning. In Proceedings of the The Tenth International Conference on Learning Representations, Virtual, 25 April 2022. [Google Scholar]
  58. Wu, X.; Manton, J.H.; Aickelin, U.; Zhu, J. On the generalization for transfer learning: An information-theoretic analysis. IEEE Trans. Inf. Theory 2024, 70, 7089–7124. [Google Scholar] [CrossRef]
  59. Jose, S.T.; Simeone, O. Information-theoretic generalization bounds for meta-learning and applications. Entropy 2021, 23, 126. [Google Scholar] [CrossRef]
  60. Chen, Q.; Shui, C.; Marchand, M. Generalization bounds for meta-learning: An information-theoretic analysis. Adv. Neural Inf. Process. Syst. 2021, 34, 25878–25890. [Google Scholar]
  61. Chu, Y.; Raginsky, M. A unified framework for information-theoretic generalization bounds. Adv. Neural Inf. Process. Syst. 2023, 36, 79260–79278. [Google Scholar]
  62. Alabdulmohsin, I. Towards a unified theory of learning and information. Entropy 2020, 22, 438. [Google Scholar] [CrossRef]
Figure 1. Directed graph representing the training process in Batch RL under episodic MDP.