Information-Theoretic Generalization Bounds for Meta-Learning and Applications

Meta-learning, or “learning to learn”, refers to techniques that infer an inductive bias from data corresponding to multiple related tasks with the goal of improving the sample efficiency for new, previously unobserved, tasks. A key performance measure for meta-learning is the meta-generalization gap, that is, the difference between the average loss measured on the meta-training data and on a new, randomly selected task. This paper presents novel information-theoretic upper bounds on the meta-generalization gap. Two broad classes of meta-learning algorithms are considered that use either separate within-task training and test sets, like model agnostic meta-learning (MAML), or joint within-task training and test sets, like reptile. Extending the existing work for conventional learning, an upper bound on the meta-generalization gap is derived for the former class that depends on the mutual information (MI) between the output of the meta-learning algorithm and its input meta-training data. For the latter, the derived bound includes an additional MI between the output of the per-task learning procedure and corresponding data set to capture within-task uncertainty. Tighter bounds are then developed for the two classes via novel individual task MI (ITMI) bounds. Applications of the derived bounds are finally discussed, including a broad class of noisy iterative algorithms for meta-learning.


Introduction
As formalized by the "no free lunch theorem", any effective learning procedure must be based on prior assumptions on the task of interest [1]. These include the selection of a model class and of the hyperparameters of a learning algorithm, such as weight initialization and learning rate. In conventional single-task learning, these assumptions, collectively known as inductive bias, are fixed a priori relying on domain knowledge or validation [1][2][3]. Fixing a suitable inductive bias can significantly reduce the sample complexity of the learning process, and is thus crucial to any learning procedure. The goal of meta-learning is to automatically infer the inductive bias, thereby learning to learn from past experiences via the observation of a number of related tasks, so as to speed up learning a new and unseen task [4][5][6][7][8].
In this work, we consider the meta-learning problem of inferring the hyperparameters of a learning algorithm. The learning algorithm (henceforth, called base-learning algorithm or base-learner) is defined as a stochastic mapping P W|Z m ,u from the input training set Z m = (Z 1 , . . . , Z m ) of m samples to a model parameter W ∈ W for a fixed hyperparameter vector u. The meta-learning algorithm (or meta-learner) infers the hyperparameter vector u, which defines the inductive bias, by observing a finite number of related tasks.
For example, consider the well-studied algorithm of biased regularization for supervised learning [9,10]. Let us denote each data point $Z = (X, Y)$ as a tuple of input features $X \in \mathbb{R}^d$ and label $Y \in \mathbb{R}$. The loss function $l : \mathcal{W} \times \mathcal{Z} \to \mathbb{R}$ is given as the quadratic measure $l(w, z) = (\langle w, x \rangle - y)^2$, which quantifies the loss accrued by the inferred model parameter $w$ on a data sample $z$. Corresponding to each per-task data set $Z^m$, the biased regularization algorithm $P_{W|Z^m,u}$ is a Kronecker delta function centered at the minimizer of the following optimization problem, which corresponds to an empirical risk minimization problem with a biased regularizer:
$$\min_{w \in \mathbb{R}^d} \; \frac{1}{m}\sum_{j=1}^{m} l(w, Z_j) + \frac{\lambda}{2}\|w - u\|_2^2. \tag{1}$$
Here, λ > 0 is a regularization constant that weighs the deviation of the model parameter w from a bias vector u. The bias vector u can be then thought of as a common "mean" among related tasks. In the context of meta-learning, the objective then is to infer the bias vector u by observing data sets from a number of similar related tasks. Different meta-learning algorithms have been developed for this problem [11,12].
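As a minimal sketch of the biased regularization base-learner, the snippet below solves the regularized problem in closed form for scalar features ($d = 1$). It assumes the objective $(1/m)\sum_j (w x_j - y_j)^2 + (\lambda/2)(w - u)^2$; the scaling of the regularizer and the function names are illustrative assumptions, not taken from [9,10].

```python
def biased_ridge_1d(xs, ys, u, lam):
    """Minimizer over scalar w of
    (1/m) * sum_j (w * x_j - y_j)**2 + (lam / 2) * (w - u)**2.
    The (lam / 2) scaling is an assumption of this sketch."""
    m = len(xs)
    sxx = sum(x * x for x in xs) / m
    sxy = sum(x * y for x, y in zip(xs, ys)) / m
    # Stationarity condition: 2*sxx*w - 2*sxy + lam*(w - u) = 0
    return (2.0 * sxy + lam * u) / (2.0 * sxx + lam)


def objective_grad(w, xs, ys, u, lam):
    """Derivative of the regularized objective with respect to w."""
    m = len(xs)
    data_term = sum(2.0 * x * (w * x - y) for x, y in zip(xs, ys)) / m
    return data_term + lam * (w - u)


xs = [0.5, 1.0, 1.5, 2.0]
ys = [1.1, 1.9, 3.2, 3.9]
u = 0.0  # bias vector (here a scalar) supplied by the meta-learner

w_small = biased_ridge_1d(xs, ys, u, lam=1e-6)  # close to least squares
w_mid = biased_ridge_1d(xs, ys, u, lam=1.0)     # in between
w_large = biased_ridge_1d(xs, ys, u, lam=1e6)   # pulled to the bias u
```

A large $\lambda$ pins the solution to the bias $u$, while a small $\lambda$ recovers ordinary least squares; this is the sense in which $u$ acts as an inductive bias shared across tasks.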
In the meta-learning problem under study, we follow the standard setting of Baxter [13] and assume that the learning tasks belong to a task environment, which is defined by a probability distribution P T on the space of learning tasks T , and per-task data distributions {P Z|T=τ } τ∈T . The data set Z m for a task τ is then generated i.i.d. according to the distribution P Z|T=τ . The meta-learner observes the performance of the base-learner on the meta-training data from a finite number of meta-training tasks, which are sampled independently from the task environment, and infers the hyperparameter U such that it can learn a new task, drawn from the same task environment, from fewer data samples.
The quality of the inferred hyperparameter $U$ is measured by the meta-generalization loss, $L_g(U)$, which is the average loss incurred on the data set $Z^m \sim P_{Z^m|T}$ of a new, previously unseen task $T$ sampled from the task distribution $P_T$. The notation will be formally introduced in Section 2.2. While the goal of meta-learning is to infer a hyperparameter $U$ that minimizes the meta-generalization loss $L_g(U)$, this loss is not computable, since the underlying task and data distributions are unknown. Instead, the meta-learner can evaluate an empirical estimate of the loss, $L_t(U|Z^m_{1:N})$, using the meta-training set $Z^m_{1:N}$ of data from $N$ tasks, which is referred to as the meta-training loss. The difference between the meta-generalization loss and the meta-training loss is the meta-generalization gap,
$$\Delta L(U|Z^m_{1:N}) = L_g(U) - L_t(U|Z^m_{1:N}), \tag{2}$$
which measures how well the inferred hyperparameter $U$ generalizes to a new, previously unseen task. In particular, if the meta-generalization gap is small, on average or with high probability, then the performance of the meta-learner on the meta-training set can be taken as a reliable estimate of the meta-generalization loss.
In this paper, we study information-theoretic upper bounds on the average meta-generalization gap $\mathbb{E}_{P_{Z^m_{1:N}}}\mathbb{E}_{P_{U|Z^m_{1:N}}}[\Delta L(U|Z^m_{1:N})]$. Specifically, we extend the recent line of work initiated by Russo and Zou [14], and Xu and Raginsky [15], which obtain mutual information (MI)-based bounds on the average generalization gap for conventional learning, to meta-learning. To the best of our knowledge, this is the first work that studies information-theoretic bounds for meta-learning.
The bounds on the average meta-generalization gap studied in this work are distinct from other well-known bounds on the meta-generalization gap in the literature. Broadly speaking, existing bounds on the meta-generalization gap can be grouped into two classes: high-probability, probably approximately correct (PAC) bounds, and high-probability PAC-Bayesian bounds. These upper bounds take the general form $\mathbb{E}_{P_{U|Z^m_{1:N}}}[\Delta L(U|Z^m_{1:N})] \le \epsilon$ and hold with probability at least $1 - \delta$, for $\delta \in (0, 1)$, over the meta-training set $Z^m_{1:N}$. In contrast, our work focuses on bounding $\mathbb{E}_{P_{Z^m_{1:N}}}\mathbb{E}_{P_{U|Z^m_{1:N}}}[\Delta L(U|Z^m_{1:N})]$, that is, the gap averaged also over the meta-training set. Notable PAC bounds on the meta-generalization gap include the bound of Baxter [13], obtained using the framework of Vapnik-Chervonenkis (VC) dimension, and that of Maurer [16], which employs algorithmic stability [17,18]. In contrast, the PAC-Bayesian bounds also incorporate prior beliefs on the base-learner and the meta-learner posteriors via an auxiliary data-independent prior distribution $Q_{W|U}$ and a hyper-prior distribution $Q_U$, respectively. Most notably, PAC-Bayesian bounds include that of Pentina and Lampert [19], the tighter bound of Amit and Meir [20], and most recently, the bounds of Rothfuss et al. [21]. While the high-probability bounds are agnostic to task and data distributions, our information-theoretic bounds depend explicitly on the task and per-task data distributions, on the loss function, and on the meta-training algorithm, in accordance with prior work on information-theoretic generalization bounds.
Another general property inherited from the information-theoretic approach adopted in this paper is that the bounds on the average meta-generalization gap under study are designed to hold for arbitrary base-learners and meta-learners. As such, they generally do not result in tighter bounds as compared to non-information theoretic generalization guarantees obtained for specific meta-learning problems, such as the ridge regression problem with meta-learned bias vector mentioned above [22]. In contrast, the general purpose of the bounds in this paper is to provide insights into the number of tasks, and the number of samples per task required to ensure that the training-based metrics are a good approximation to their population counterparts.

Main Contributions
The derivation of bounds on the average meta-generalization gap differs from conventional learning owing to two levels of uncertainty: environment-level uncertainty and within-task uncertainty. While within-task uncertainty results from observing a finite number m of data samples per task, as in conventional learning, environment-level uncertainty results from observing a finite number N of tasks from the task environment. The relative importance of these two forms of uncertainty depends on the use made by the meta-learner of the meta-training data. In fact, depending on how the meta-training data are used by the meta-learner, we identify two main classes of meta-training algorithms: those with separate within-task training and test sets, and those with joint within-task training and test sets. The former class includes state-of-the-art meta-learning algorithms, such as model agnostic meta-learning (MAML) [23], that split the training data corresponding to each task into training and test sets, with the latter reserved for within-task validation. In contrast, the second class of algorithms, such as reptile [24], uses the entire per-task data both for training and testing. Our main contributions are as follows.

• In Theorem 1, we show that, for the case with separate within-task training and test sets, the average meta-generalization gap contains only the contribution of environment-level uncertainty. This is captured by a ratio of the mutual information (MI) between the output of the meta-learner $U$ and the meta-training set $Z^m_{1:N}$, and the number of tasks $N$, as
$$\Big|\mathbb{E}_{P_{Z^m_{1:N},U}}\big[\Delta L^{\mathrm{sep}}(U|Z^m_{1:N})\big]\Big| \le \sqrt{\frac{2\sigma^2}{N}\, I(U; Z^m_{1:N})}, \tag{3}$$
where $\sigma^2$ is the sub-Gaussianity variance factor of the meta-loss function. This is a direct parallel of the MI-based bounds for single-task learning [25].
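• In Theorem 3, we then show that, for the case with joint within-task training and test sets, the bound on the average meta-generalization gap also contains a contribution due to the within-task uncertainty via the ratio of the MI between the output of the base-learner and within-task training data and the per-task data sample size $m$. Specifically, we have the following bound
$$\Big|\mathbb{E}_{P_{Z^m_{1:N},U}}\big[\Delta L^{\mathrm{joint}}(U|Z^m_{1:N})\big]\Big| \le \sqrt{\frac{2\sigma^2}{N}\, I(U; Z^m_{1:N})} + \mathbb{E}_{P_T}\left[\sqrt{\frac{2\delta_T^2}{m}\, I(W; Z^m | T=\tau)}\right], \tag{4}$$
where $\delta_T^2$ is the sub-Gaussianity variance factor of the loss function $l(w, z)$ for task $T$.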
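• In Theorems 2 and 4, we extend the individual sample MI (ISMI) bound of [26] to obtain novel individual task MI (ITMI)-based bounds on the meta-generalization gap for both separate and joint within-task training and test sets as
$$\Big|\mathbb{E}_{P_{Z^m_{1:N},U}}\big[\Delta L^{\mathrm{sep}}(U|Z^m_{1:N})\big]\Big| \le \frac{1}{N}\sum_{i=1}^{N}\sqrt{2\sigma^2\, I(U; Z^m_i)} \tag{5}$$
and
$$\Big|\mathbb{E}_{P_{Z^m_{1:N},U}}\big[\Delta L^{\mathrm{joint}}(U|Z^m_{1:N})\big]\Big| \le \frac{1}{N}\sum_{i=1}^{N}\sqrt{2\sigma^2\, I(U; Z^m_i)} + \mathbb{E}_{P_T}\left[\frac{1}{m}\sum_{j=1}^{m}\sqrt{2\delta_T^2\, I(W; Z_j | T=\tau)}\right]. \tag{6}$$
These bounds can be seen to be tighter than the MI-based bounds in (3) and (4), respectively.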
• Finally, we study the applications of the derived bounds to two meta-learning problems. The first is a parameter estimation setup that involves one-shot meta-learning and base-learning procedures, for which a closed-form expression for the meta-generalization gap can be derived. The second application covers a broad range of noisy iterative meta-learning algorithms and is inspired by the work of Pensia et al. [27] for conventional learning.

Related Work
For conventional learning, there exists a rich literature on diverse frameworks for deriving upper bounds on the generalization gap, i.e., on the difference between generalization and training losses. Classical bounds from statistical learning theory quantify the generalization gap in terms of measures of complexity of the model class, most notably VC dimension [28] and Rademacher complexity [29]. This approach obtains high-probability, probably approximately correct (PAC) bounds on the generalization gap with respect to the training set. An alternate line of high-probability bounding techniques relies on the notion of algorithmic stability, which measures the sensitivity of the output of a learning algorithm to the replacement of individual samples from the training data set. The pioneering work [30] has been extended to include various notions of algorithmic stability [31][32][33]. As a notable example, a distributional notion of stability in terms of differential privacy, which quantifies the sensitivity of the distribution of the algorithm's output to the data set, has been studied in [34,35]. The high-probability PAC-Bayesian bounds rely on change of measure arguments and use the Kullback-Leibler (KL) divergence between the algorithm and a data-independent prior to quantify the algorithmic sensitivity [36][37][38].
Following the initial work of Russo and Zou [14], information-theoretic bounds on the average generalization gap for conventional learning have been widely investigated in recent years. Xu and Raginsky [25] showed that the MI between the output of the learning algorithm and its training data set yields an upper bound in expectation on the generalization gap. The bound has been shown to offer computable generalization guarantees for noisy iterative algorithms, including stochastic gradient Langevin dynamics (SGLD), in [27]. Various refinements of the MI-based bound have since been analyzed to obtain tighter bounds. In particular, the bounds in [39] employ chaining mutual information techniques to tighten the bounds in [25], while the bound in [26] depends on the MI between the output of the algorithm and an individual data sample. The MI between the output of the algorithm and a random subset of the data set appears in the bounds introduced in [40]. The total variation information between the joint distribution of the training data and algorithmic output and the product of marginals was shown in [41] to yield a bound on the generalization gap for any bounded loss function. Subsequent works in [42][43][44] consider other information-theoretic measures, such as maximum leakage and lautum information. Most recently, a conditional mutual information (CMI)-based approach has been proposed in [45] to develop generalization bounds.

Notation
Throughout this paper, upper case letters, e.g., X, denote random variables and lower case letters, e.g., x, their realizations. We use P (·) to denote the set of all probability distributions on the argument set or vector space. For a discrete or continuous random variable X taking values in a set or vector space X , P X ∈ P (X ) denotes its probability distribution, with P X (x) being the probability mass or density value at x ∈ X . We denote as P X n the n-fold product distribution induced by P X . The conditional distribution of a random variable X given random variable Y is similarly defined as P X|Y , with P X|Y (x|y) representing the probability mass or density at X = x conditioned on the event Y = y. We use || · || 2 to denote the Euclidean norm of the argument vector, and I d to denote a d-dimensional identity matrix. We define the Kronecker delta δ(x − x 0 ) = 1 if x = x 0 and δ(x − x 0 ) = 0 otherwise.

Problem Definition
In this section, we define the problem of interest by introducing the key definitions of generalization gap for conventional, or single-task, learning and for meta-learning.

Generalization Gap for Single-Task Learning
Consider first the conventional problem of learning a task $\tau \in \mathcal{T}$. As illustrated in Figure 1, each task $\tau \in \mathcal{T}$ is associated with an underlying unknown data distribution, $P_{Z|T=\tau} \in \mathcal{P}(\mathcal{Z})$, defined in a subset or vector space $\mathcal{Z}$. Henceforth, we use $P_{Z|\tau}$ to denote $P_{Z|T=\tau}$ for notational convenience. The training procedure, which is referred to as the base-learner, has access to a training data set $Z^m = (Z_1, Z_2, \ldots, Z_m) \sim P_{Z^m|\tau}$ of $m$ independent and identically distributed (i.i.d.) samples drawn from distribution $P_{Z|\tau}$. The base-learner uses this data set to choose a model, or hypothesis, $W$ from the model class $\mathcal{W}$ by using a randomized training procedure defined by a conditional distribution $P_{W|Z^m,u}$ as
$$W \sim P_{W|Z^m,u}. \tag{7}$$
The conditional distribution $P_{W|Z^m,u}$ defines a stochastic mapping from the training data set $Z^m$ to the model class $\mathcal{W}$. The training procedure (7) is parameterized by a vector $u \in \mathcal{U}$ of hyperparameters, which defines the inductive bias. As an example, the base-learner $P_{W|Z^m,u}$ may follow stochastic gradient descent (SGD) updates with hyperparameters $u$, including the learning rate and the initialization point.
The performance of a parameter vector $w \in \mathcal{W}$ on a data sample $z \in \mathcal{Z}$ is measured by a loss function $l : \mathcal{W} \times \mathcal{Z} \to \mathbb{R}_+$. The generalization loss for a model parameter vector $w \in \mathcal{W}$ is the average
$$L_g(w|\tau) = \mathbb{E}_{P_{Z|\tau}}[l(w, Z)] \tag{8}$$
over a test example $Z$ independently drawn from the data distribution $P_{Z|\tau}$. The subscript g is used to distinguish the generalization loss from the training loss defined below. The generalization loss cannot be computed by the learner, given that the data distribution $P_{Z|\tau}$ is unknown. Instead, the learner can evaluate the training loss on the data set $Z^m$, which is defined as the empirical average
$$L_t(w|Z^m) = \frac{1}{m}\sum_{i=1}^{m} l(w, Z_i). \tag{9}$$
The subscript t specifies that the loss is the empirical training loss. The difference between generalization loss (8) and training loss (9),
$$\Delta L(w|Z^m, \tau) = L_g(w|\tau) - L_t(w|Z^m), \tag{10}$$
is known as the generalization gap, and is a key metric that quantifies the level of uncertainty at the learner regarding the data distribution $P_{Z|\tau}$ (this type of uncertainty is known as epistemic). The average generalization gap for the data distribution $P_{Z|\tau}$ and base-learner $P_{W|Z^m,u}$ is defined as
$$\mathbb{E}_{P_{Z^m,W|\tau,u}}[\Delta L(W|Z^m, \tau)], \tag{11}$$
where the expectation is taken with respect to the joint distribution $P_{Z^m,W|\tau,u} = P_{Z^m|\tau} P_{W|Z^m,u}$. A summary of the variables involved in the definition of the generalization gap (11) can be found in Figure 1. Intuitively, if the generalization gap is small, on average or with high probability, then the base-learner can take the performance (9) on the training set $Z^m$ as a reliable measure of the generalization loss (8) of the trained model $W$. Furthermore, data-dependent bounds on the generalization gap can be used as regularization terms to avoid overfitting, yielding generalized Bayesian inference problems [46,47].
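To make the definitions of training loss, generalization loss, and average generalization gap concrete, the following sketch estimates the average gap by Monte Carlo for a toy problem chosen here purely for illustration: samples $Z \sim \mathcal{N}(\mu, \sigma^2)$, squared loss $l(w, z) = (w - z)^2$, and a base-learner that outputs the sample mean. For this toy setup, the average gap equals $2\sigma^2/m$ in closed form. The setup is an assumption of this sketch and is only meant to illustrate the definitions; the squared loss under Gaussian data is not sub-Gaussian, so it falls outside the assumptions of the bounds reviewed later.

```python
import random


def training_loss(w, data):
    """Empirical average of the squared loss over the training set."""
    return sum((w - z) ** 2 for z in data) / len(data)


def generalization_loss(w, mu, sigma):
    """Closed form of E[(w - Z)^2] for Z ~ N(mu, sigma^2)."""
    return (w - mu) ** 2 + sigma ** 2


def average_gap(m, mu=0.0, sigma=1.0, trials=20000, seed=0):
    """Monte Carlo estimate of E[L_g(W) - L_t(W | Z^m)] when W is the sample mean."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        data = [rng.gauss(mu, sigma) for _ in range(m)]
        w = sum(data) / m  # deterministic base-learner: sample mean
        total += generalization_loss(w, mu, sigma) - training_loss(w, data)
    return total / trials


gap_5 = average_gap(m=5)    # closed form: 2 / 5 = 0.4
gap_50 = average_gap(m=50)  # closed form: 2 / 50 = 0.04
```

The gap shrinks as the number of samples $m$ grows, which is the behavior that the bounds in this paper aim to capture for general learners.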

Generalization Gap for Meta-Learning
As discussed, in single-task learning, the inductive bias u, defining the hyperparameters of the training procedure, must be selected a priori, i.e., without having access to task-specific data. The inductive bias determines the training data set size m needed to ensure a small generalization loss (8), since, generally speaking, richer models require more data to be trained [1]. The sample complexity can be generally reduced if one selects a suitable inductive bias based on prior information. Such prior information is typically obtained from domain knowledge on the problem under study. In contrast, meta-learning aims at automatically inferring an effective inductive bias based on data from related tasks.
To elaborate, we follow the setting of [13], in which a meta-learner observes data from a number of tasks, known as meta-training tasks, from the same task environment. A task environment is defined by a task distribution P T ∈ P (T ), supported on the space T of tasks, and by a per-task data distribution P Z|τ for each task τ ∈ T . Using the meta-training data drawn from a randomly selected subset of tasks, the meta-learner infers a hyperparameter vector u ∈ U defining the inductive bias. This is done with the goal of ensuring that, using hyperparameter u, the base-learner P W|Z m ,u can efficiently learn on a new task, referred to as the meta-test task, drawn independently from the same task distribution P T .
Specifically, the meta-training data consist of $N$ data sets $Z^m_{1:N} = (Z^m_1, \ldots, Z^m_N)$. Each $i$th data set is generated independently by first drawing a task $T_i \sim P_T$ from the task environment and then a task-specific training data set $Z^m_i \sim P_{Z^m|T_i}$. The meta-learner uses the meta-training data set $Z^m_{1:N}$ to infer a hyperparameter vector $u \in \mathcal{U}$. To this end, we consider a randomized meta-learner
$$U \sim P_{U|Z^m_{1:N}}, \tag{12}$$
where $P_{U|Z^m_{1:N}}$ is a stochastic mapping from the meta-training set $Z^m_{1:N}$ to the space $\mathcal{U}$ of hyperparameters. We distinguish two different formulations of meta-learning that are often considered in the literature. In the first, the per-task data set $Z^m$ is split into training, or support, and test, or query, subsets [23,48]; while, in the second, the entire data set $Z^m$ is used for both within-task training and testing [13,19,20].

Separate Within-Task Training and Test Sets
As seen in Figure 2, in this first approach to meta-learning, each meta-training data subset $Z^m_i$ is split into a training set and a test set as $Z^m_i = (Z^{m_{tr}}_i, Z^{m_{te}}_i)$, with $m = m_{tr} + m_{te}$. The within-task base-learner $P_{W|Z^{m_{tr}}_i,u} \in \mathcal{P}(\mathcal{W})$ maps the per-task training subset $Z^{m_{tr}}_i$ to a random model parameter $W_i \sim P_{W|Z^{m_{tr}}_i,u}$ for a given hyperparameter $U = u$. The test subset is used to evaluate the empirical training loss of a model $w$ for task $T_i$ as
$$L_t(w|Z^{m_{te}}_i) = \frac{1}{m_{te}}\sum_{j=1}^{m_{te}} l(w, Z^{m_{te}}_{i,j}), \tag{13}$$
where $Z^{m_{te}}_{i,j}$ denotes the $j$th example of the test subset $Z^{m_{te}}_i$. Furthermore, the overall empirical meta-training loss for a hyperparameter $u$ is computed by averaging over all meta-training tasks as
$$L^{\mathrm{sep}}_t(u|Z^m_{1:N}) = \frac{1}{N}\sum_{i=1}^{N} L^{\mathrm{sep}}_t(u|Z^m_i), \tag{14}$$
where
$$L^{\mathrm{sep}}_t(u|Z^m_i) = \mathbb{E}_{P_{W|Z^{m_{tr}}_i,u}}\big[L_t(W|Z^{m_{te}}_i)\big] \tag{15}$$
is the average per-task training loss over the base-learner. We emphasize that the meta-training loss (14) can be computed by the meta-learner and used as a criterion to select the meta-learning procedure (12), since it is obtained from the meta-training data $Z^m_{1:N}$. We also note that the rationale for splitting training and test sets is that the average training loss $L^{\mathrm{sep}}_t(u|Z^m_i)$ is an unbiased estimate of the corresponding average generalization loss $\mathbb{E}_{P_{W|Z^{m_{tr}}_i,u}}[L_g(W|T_i)]$.
The true goal of the meta-learner is to minimize the meta-generalization loss,
$$L^{\mathrm{sep}}_g(u) = \mathbb{E}_{P_{T,Z^{m_{tr}}}}\Big[\mathbb{E}_{P_{W|Z^{m_{tr}},u}}\big[L_g(W|T)\big]\Big], \tag{16}$$
where $P_{T,Z^{m_{tr}}} = P_T P_{Z^{m_{tr}}|T}$ and $L_g(W|T)$ is as defined in (8). Unlike the meta-training loss (14), the meta-generalization loss is evaluated on a new, meta-test task $T$ and on the corresponding training data $Z^{m_{tr}}$. We distinguish the meta-generalization loss and meta-training loss by the subscripts g and t, respectively, in (16) and (14). The difference between the meta-generalization loss (16) and the meta-training loss (14), known as the meta-generalization gap, is defined as
$$\Delta L^{\mathrm{sep}}(u|Z^m_{1:N}) = L^{\mathrm{sep}}_g(u) - L^{\mathrm{sep}}_t(u|Z^m_{1:N}). \tag{17}$$
The quantity of interest to us is the average meta-generalization gap, defined as
$$\mathbb{E}_{P_{Z^m_{1:N},U}}\big[\Delta L^{\mathrm{sep}}(U|Z^m_{1:N})\big], \tag{18}$$
where the expectation is with respect to the joint distribution $P_{Z^m_{1:N},U} = P_{Z^m_{1:N}} P_{U|Z^m_{1:N}}$ of the meta-training set $Z^m_{1:N}$ and of the hyperparameter $U$.
Note that $P_{Z^m_{1:N}}$ is the marginal of the joint distribution $\prod_{i=1}^{N} P_{T=T_i} P_{Z^m|T=T_i}$. Intuitively, if the meta-generalization gap is small, on average or with high probability, the meta-learner can take the performance (14) on the meta-training data as a reliable measure of the accuracy of the inferred hyperparameter vector in terms of the meta-generalization loss (16). Furthermore, data-dependent bounds on the meta-generalization gap can be used as regularization terms to avoid meta-overfitting. Meta-overfitting occurs when the meta-trained hyperparameter yields a small meta-training loss but a large meta-test loss, due to an excessive dependence on the meta-training set [13].
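The following sketch illustrates how a meta-training loss with separate within-task training and test sets can be computed and used to select a hyperparameter. It assumes a toy environment of Gaussian mean-estimation tasks, squared loss, and a hypothetical deterministic base-learner that shrinks the sample mean of the training split toward a scalar bias $u$; all names and parameter values are illustrative assumptions.

```python
import random


def base_learner(train, u, alpha=0.5):
    """Hypothetical base-learner: shrink the training-split mean toward the bias u."""
    return alpha * sum(train) / len(train) + (1.0 - alpha) * u


def per_task_loss_sep(data, u, m_tr):
    """Train on the first m_tr samples, evaluate the squared loss on the rest."""
    train, test = data[:m_tr], data[m_tr:]
    w = base_learner(train, u)
    return sum((w - z) ** 2 for z in test) / len(test)


def meta_training_loss_sep(datasets, u, m_tr):
    """Average of the per-task test-split losses over the meta-training tasks."""
    return sum(per_task_loss_sep(d, u, m_tr) for d in datasets) / len(datasets)


# Toy task environment: task means are drawn around 3.0.
rng = random.Random(1)
task_means = [rng.gauss(3.0, 0.5) for _ in range(200)]
datasets = [[rng.gauss(mu, 1.0) for _ in range(10)] for mu in task_means]

candidates = [0.0, 1.0, 2.0, 3.0, 4.0]
losses = {u: meta_training_loss_sep(datasets, u, m_tr=5) for u in candidates}
best_u = min(losses, key=losses.get)  # meta-learner: pick the best candidate bias
```

Because the test split is not seen by the base-learner, each per-task loss is an unbiased estimate of the per-task generalization loss, so the only remaining source of error in the selected bias is the finite number of observed tasks.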

Joint Within-Task Training and Test Sets
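In the second formulation of meta-learning, as illustrated in Figure 3, the entire data set $Z^m_i$ is used for within-task training and testing. Accordingly, the meta-learner computes the meta-training loss
$$L^{\mathrm{joint}}_t(u|Z^m_{1:N}) = \frac{1}{N}\sum_{i=1}^{N} L^{\mathrm{joint}}_t(u|Z^m_i), \tag{19}$$
where
$$L^{\mathrm{joint}}_t(u|Z^m_i) = \mathbb{E}_{P_{W|Z^m_i,u}}\big[L_t(W|Z^m_i)\big] \tag{20}$$
is the average per-task training loss. Note here that, in evaluating the meta-training loss in (19), the data set $Z^m_i$ is used both to infer the model parameters $W$ and to evaluate the per-task training loss. The expectation in (20) is taken over the output of the base-learner $W$ given the hyperparameter vector $u$. As discussed, the meta-generalization loss for hyperparameter $u \in \mathcal{U}$ is computed by randomly selecting a novel task $T \sim P_T$ as
$$L^{\mathrm{joint}}_g(u) = \mathbb{E}_{P_{T,Z^m}}\Big[\mathbb{E}_{P_{W|Z^m,u}}\big[L_g(W|T)\big]\Big], \tag{21}$$
where $P_{T,Z^m} = P_T P_{Z^m|T}$ and $L_g(W|T)$ is as defined in (8). In a manner similar to (17), the meta-generalization gap for a task distribution $P_T$, data distribution $P_{Z^m|T}$, meta-learning algorithm $P_{U|Z^m_{1:N}}$, and base-learner $P_{W|Z^m,U}$ is defined as
$$\Delta L^{\mathrm{joint}}(u|Z^m_{1:N}) = L^{\mathrm{joint}}_g(u) - L^{\mathrm{joint}}_t(u|Z^m_{1:N}). \tag{22}$$
The average meta-generalization gap is then given as $\mathbb{E}_{P_{Z^m_{1:N},U}}[\Delta L^{\mathrm{joint}}(U|Z^m_{1:N})]$, where the expectation is taken over all meta-training sets and over the output of the meta-learner.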
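The sketch below illustrates why reusing the same per-task data for training and testing introduces a within-task contribution to the gap. It uses the same kind of toy setup as an assumption (Gaussian mean-estimation tasks, squared loss, a hypothetical shrinkage base-learner) and keeps the hyperparameter $u$ fixed, so that only the within-task effect is visible. For this toy setup, the expected difference between the per-task generalization loss and the joint training loss works out to $2\alpha/m$, where $\alpha$ is the hypothetical shrinkage weight.

```python
import random


def base_learner(data, u, alpha=0.5):
    """Hypothetical base-learner: shrink the sample mean toward the bias u."""
    return alpha * sum(data) / len(data) + (1.0 - alpha) * u


def joint_training_loss(data, u):
    """Train and evaluate on the same per-task data set."""
    w = base_learner(data, u)
    return w, sum((w - z) ** 2 for z in data) / len(data)


def within_task_gap(u=3.0, m=10, n_tasks=20000, seed=0):
    """Monte Carlo estimate of E[L_g(W | T) - L_t(W | Z^m)] for a fixed u."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_tasks):
        mu = rng.gauss(3.0, 0.5)                 # draw a task
        data = [rng.gauss(mu, 1.0) for _ in range(m)]
        w, train = joint_training_loss(data, u)
        gen = (w - mu) ** 2 + 1.0                # closed-form E[(w - Z)^2]
        total += gen - train
    return total / n_tasks


gap = within_task_gap()  # closed form for this toy setup: 2 * 0.5 / 10 = 0.1
```

The joint training loss is optimistically biased even when $u$ is fixed, which is why bounds for this class of algorithms must account for within-task uncertainty in addition to environment-level uncertainty.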

Information-Theoretic Generalization Bounds for Single-Task Learning
In this section, we review two information-theoretic bounds on the generalization gap (11) for conventional learning derived in [25,26]. The material covered in this section provides the necessary background for the analysis of the meta-generalization gap to be studied in the rest of the paper. Throughout this section, we fix a task τ ∈ T . Since the generalization and meta-generalization gaps measure the deviation of empirical-mean random variables representing training and meta-training losses from reference values, we will make use of tools and definitions from large-deviation theory (see, e.g., [49]). We discuss the key definitions below.

Preliminaries
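To start, the cumulant generating function (CGF) of a random variable $X \sim P_X$ is defined as $\Lambda_X(\lambda) = \log \mathbb{E}_{P_X}\big[e^{\lambda (X - \mathbb{E}_{P_X}[X])}\big]$ for $\lambda \in \mathbb{R}$. A random variable $X$ with finite mean is said to be $\sigma^2$-sub-Gaussian if its CGF satisfies $\Lambda_X(\lambda) \le \lambda^2\sigma^2/2$ for all $\lambda \in \mathbb{R}$. As a special case, if $X$ is bounded in the interval $[a, b]$, i.e., if the inequality $-\infty < a \le X \le b < \infty$ holds for some constants $a$ and $b$, then $X$ is $(b - a)^2/4$-sub-Gaussian.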
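As a quick numerical check of the bounded-variable special case, the sketch below evaluates the CGF $\Lambda_X(\lambda) = \log \mathbb{E}[e^{\lambda(X - \mathbb{E}[X])}]$ of a two-point random variable on $\{a, b\}$ and compares it with the sub-Gaussian envelope $\lambda^2\sigma^2/2$ for $\sigma^2 = (b-a)^2/4$ (Hoeffding's lemma). The two-point distribution and the grid of $\lambda$ values are illustrative choices.

```python
import math


def cgf_two_point(lam, p, a, b):
    """CGF of the centered variable X - E[X], where X = b w.p. p and X = a w.p. 1 - p."""
    mean = p * b + (1.0 - p) * a
    return math.log(p * math.exp(lam * (b - mean))
                    + (1.0 - p) * math.exp(lam * (a - mean)))


def sub_gaussian_envelope(lam, sigma2):
    """Upper envelope lam^2 * sigma^2 / 2 that a sigma^2-sub-Gaussian CGF must satisfy."""
    return lam ** 2 * sigma2 / 2.0


a, b = -1.0, 2.0
sigma2 = (b - a) ** 2 / 4.0                    # variance factor for a bounded variable
lams = [k / 10.0 for k in range(-50, 51)]      # lambda grid on [-5, 5]

holds = all(
    cgf_two_point(lam, p, a, b) <= sub_gaussian_envelope(lam, sigma2) + 1e-12
    for p in (0.1, 0.5, 0.9)
    for lam in lams
)
```

The check passes for every tested value, consistent with the statement that any variable supported on $[a, b]$ is $(b-a)^2/4$-sub-Gaussian.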

Mutual Information (MI) Bound
We first present the mutual information (MI)-based upper bound obtained in [25]. Key to this result is the following assumption.
Assumption 1. The loss function $l(w, Z)$ is $\delta_\tau^2$-sub-Gaussian under $Z \sim P_{Z|\tau}$ for all model parameters $w \in \mathcal{W}$.
In particular, if the loss function is bounded, i.e., if the inequalities $-\infty < a \le l(w, z) \le b < \infty$ hold for all $w \in \mathcal{W}$ and $z \in \mathcal{Z}$, Assumption 1 is satisfied with $\delta_\tau^2 = (b - a)^2/4$. The main result is as follows.
Lemma 1 ([25]). Under Assumption 1, the following bound on the generalization gap holds for any base-learner $W \sim P_{W|Z^m,u}$:
$$\Big|\mathbb{E}_{P_{Z^m,W|\tau,u}}\big[\Delta L(W|Z^m, \tau)\big]\Big| \le \sqrt{\frac{2\delta_\tau^2}{m}\, I(W; Z^m)}. \tag{24}$$
The proof of Lemma 1 is based on a decoupling estimate lemma, which is reported for completeness in Lemma A1. We also note that the result in Lemma 1 can be extended to account for a loss function $l(w, Z)$ with bounded CGF [14].
The bound (24) on the generalization gap is in terms of the mutual information $I(W; Z^m)$, which quantifies the overall dependence between the base-learner output $W$ and the input training data set $Z^m$. The mutual information in (24) is hence a measure of the sensitivity of the base-learner output to the data set. Using the terminology in [25], if $I(W; Z^m) \le \epsilon$, the base-learner $P_{W|Z^m,u}$ is said to be $(\epsilon, P_{Z|\tau})$-MI stable, in which case the bound in (24) evaluates to $\sqrt{2\delta_\tau^2 \epsilon/m}$. The relationship between generalization and stability of a training algorithm is well-established [1], and the result (24) amounts to a formulation of this link in information-theoretic terms.
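The MI-based bound can be checked exactly on a small discrete example. The sketch below assumes the bound takes the form $\sqrt{2\delta_\tau^2 I(W; Z^m)/m}$ with MI measured in nats, and uses a toy problem chosen for illustration: $m = 3$ i.i.d. Bernoulli samples, a deterministic majority-vote base-learner, and the 0-1 loss, which is bounded in $[0, 1]$ so that $\delta_\tau^2 = 1/4$. Since the learner is deterministic and discrete, $I(W; Z^m) = H(W)$.

```python
import itertools
import math


def majority(bits):
    """Deterministic base-learner: majority vote (odd number of samples assumed)."""
    return 1 if 2 * sum(bits) > len(bits) else 0


def exact_gap_and_mi_bound(m=3, p=0.7, delta2=0.25):
    """Enumerate all data sets to get the exact average gap and the MI-based bound."""
    p_w = {0: 0.0, 1: 0.0}
    expected_train = 0.0
    for bits in itertools.product((0, 1), repeat=m):
        prob = math.prod(p if b else 1.0 - p for b in bits)
        w = majority(bits)
        p_w[w] += prob
        expected_train += prob * sum(b != w for b in bits) / m  # 0-1 training loss
    gen_loss = {1: 1.0 - p, 0: p}                # P(Z != w) for each hypothesis
    expected_gen = sum(p_w[w] * gen_loss[w] for w in p_w)
    # Deterministic learner on a discrete space: I(W; Z^m) = H(W), in nats.
    mi = -sum(q * math.log(q) for q in p_w.values() if q > 0)
    bound = math.sqrt(2.0 * delta2 * mi / m)
    return expected_gen - expected_train, mi, bound


gap, mi, bound = exact_gap_and_mi_bound()
```

For these values the exact average gap is 0.1764, which sits below the MI-based bound of roughly 0.295.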
The traditional notion of algorithmic stability measures how much the base-learner output changes with the replacement of an individual training sample [30,50]. In the next section, we review the bound in [26] that translates this per-sample stability concept within an information-theoretic framework.

Individual Sample MI (ISMI) Bound
The MI-based bound in Lemma 1 has the disadvantage of being vacuous, i.e., I(W; Z m ) = ∞, for deterministic base-learning algorithms P W|Z m ,u defined on a continuous parameter space W. An individual sample MI (ISMI)-based bound that addresses this shortcoming was introduced in [26]. The ISMI bound borrows the standard algorithmic stability notion of sensitivity of the base-learner output to the replacement of any individual training sample [17,18]. Accordingly, the resulting bound is in terms of the MI between the trained parameter W and each data point Z i of the training data set Z m . The bound, summarized in Lemma 2, applies under the following assumption.
Assumption 2. The loss function l(w, z) satisfies either of the following two conditions:
We note that, in general, Assumption 1 does not imply Assumption 2(b) (see ([40], Appendix C)), and vice versa (see [26]). There are, however, loss functions l(w, z) and relevant distributions for which both assumptions hold, including the case of loss functions l(·, ·) that take values in a bounded interval [a, b].

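Lemma 2 ([26]). Under Assumption 2, the following bound on the average generalization gap holds for any base-learner $P_{W|Z^m,u}$:
$$\Big|\mathbb{E}_{P_{Z^m,W|\tau,u}}\big[\Delta L(W|Z^m, \tau)\big]\Big| \le \frac{1}{m}\sum_{i=1}^{m}\sqrt{2\delta_\tau^2\, I(W; Z_i)}. \tag{25}$$
For a loss function satisfying Assumption 1, the ISMI bound (25) is tighter than (24), i.e.,
$$\frac{1}{m}\sum_{i=1}^{m}\sqrt{2\delta_\tau^2\, I(W; Z_i)} \le \sqrt{\frac{2\delta_\tau^2}{m}\, I(W; Z^m)}. \tag{26}$$
The inequality in (26) follows from the chain rule of mutual information and Jensen's inequality [26].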
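The relation between the ISMI and MI bounds can also be verified exactly on a toy discrete problem. The sketch below assumes the ISMI bound has the form $(1/m)\sum_i \sqrt{2\delta_\tau^2 I(W; Z_i)}$ and the MI bound the form $\sqrt{2\delta_\tau^2 I(W; Z^m)/m}$, both in nats, and uses an illustrative setup: three i.i.d. Bernoulli(0.7) samples, a majority-vote learner, and the 0-1 loss with $\delta_\tau^2 = 1/4$. The helper names are hypothetical.

```python
import itertools
import math


def mutual_information(joint):
    """Mutual information in nats from a joint pmf given as {(a, b): prob}."""
    pa, pb = {}, {}
    for (a, b), q in joint.items():
        pa[a] = pa.get(a, 0.0) + q
        pb[b] = pb.get(b, 0.0) + q
    return sum(q * math.log(q / (pa[a] * pb[b]))
               for (a, b), q in joint.items() if q > 0)


def ismi_and_mi_bounds(m=3, p=0.7, delta2=0.25):
    """Exact ISMI and MI bounds for a majority-vote learner on Bernoulli(p) data."""
    joint_full = {}                               # joint pmf of (W, Z^m)
    joint_single = [dict() for _ in range(m)]     # joint pmfs of (W, Z_i)
    for bits in itertools.product((0, 1), repeat=m):
        prob = math.prod(p if b else 1.0 - p for b in bits)
        w = 1 if 2 * sum(bits) > m else 0
        joint_full[(w, bits)] = prob
        for i, b in enumerate(bits):
            key = (w, b)
            joint_single[i][key] = joint_single[i].get(key, 0.0) + prob
    mi_bound = math.sqrt(2.0 * delta2 * mutual_information(joint_full) / m)
    ismi_bound = sum(math.sqrt(2.0 * delta2 * mutual_information(j))
                     for j in joint_single) / m
    return ismi_bound, mi_bound


ismi_bound, mi_bound = ismi_and_mi_bounds()
```

In this example the exact average generalization gap is 0.1764, and the ISMI bound falls between this value and the MI bound, illustrating the tightening that the individual-sample approach provides.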

Information-Theoretic Generalization Bounds for Meta-Learning
In this section, we first derive novel MI-based bounds on the meta-generalization gap with separate within-task training and test sets, as introduced in Section 4.1, and then we consider joint within-task training and test sets, as described in Section 4.2.

Bounds on Meta-Generalization Gap with Separate Within-Task Training and Test Sets
In this section, we present two novel MI-based bounds on the meta-generalization gap (18) for the setup with separate within-task training and testing sets. The first is an MI-based bound, which is akin to Lemma 1, and the second is an individual task MI (ITMI) bound, which resembles Lemma 2 for conventional learning.

MI-Based Bound
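In order to derive the MI-based bound, we make the following assumption on $L^{\mathrm{sep}}_t(u|Z^m)$ in (15). Throughout, we use $P_{Z^m}$ to denote the marginal of the joint distribution $P_{T,Z^m} = P_T P_{Z^m|T}$.
Assumption 3. The average per-task training loss $L^{\mathrm{sep}}_t(u|Z^m)$ is $\sigma^2$-sub-Gaussian under $Z^m \sim P_{Z^m}$ for all $u \in \mathcal{U}$.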
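Distinct from the assumptions in Section 3 on the loss function $l(w, z)$, we note that Assumption 3 is on the average per-task training loss $L^{\mathrm{sep}}_t(u|Z^m)$. This is because a loss function $l(w, z)$ satisfying Assumption 1 does not in general guarantee the sub-Gaussianity of $L^{\mathrm{sep}}_t(u|Z^m)$ with respect to $Z^m \sim P_{Z^m}$. However, if the loss function is bounded, Assumption 3 can be easily verified to hold, as given in the following lemma.
Lemma 3. If the loss function $l(w, z)$ takes values in a bounded interval $[a, b]$ for all $w \in \mathcal{W}$ and $z \in \mathcal{Z}$, then $L^{\mathrm{sep}}_t(u|Z^m)$ also takes values in $[a, b]$ for all $u \in \mathcal{U}$, and Assumption 3 holds with $\sigma^2 = (b - a)^2/4$.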
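Under Assumption 3, the following theorem presents an upper bound on the meta-generalization gap (18).
Theorem 1. Under Assumption 3, the following bound on the average meta-generalization gap holds for any meta-learner $P_{U|Z^m_{1:N}}$:
$$\Big|\mathbb{E}_{P_{Z^m_{1:N},U}}\big[\Delta L^{\mathrm{sep}}(U|Z^m_{1:N})\big]\Big| \le \sqrt{\frac{2\sigma^2}{N}\, I(U; Z^m_{1:N})}. \tag{27}$$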
Proof. See Appendix B.
The technical lemmas required for the proof of Theorem 1 and the theorems that follow are included in Appendix A.
In order to prove Theorem 1, one needs to overcome an additional challenge as compared to the derivation of bounds for learning reviewed in Section 3. In fact, the meta-generalization gap is caused by two distinct sources of uncertainty: (a) environment-level uncertainty due to a finite number N of observed tasks, and (b) within-task uncertainty resulting from the finite number m of per-task data samples. Our proof approach involves applying the single-task MI-based bound in Lemma 1 to bound the effect of both sources of uncertainty.
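Towards this, we start by introducing the average training loss for the randomly selected meta-test task as
$$L^{\mathrm{sep}}_{g,t}(u) = \mathbb{E}_{P_{T,Z^m}}\big[L^{\mathrm{sep}}_t(u|Z^m)\big]. \tag{28}$$
The subscript g, t denotes that the loss is generalization (g) with expectation over $P_{T,Z^m}$ at the environment level, and training (t) at the task level with $L^{\mathrm{sep}}_t(u|Z^m)$. Note that this differs from the meta-test loss $L^{\mathrm{sep}}_g(u)$ in (16) in that the per-task loss is evaluated in (28) on the training set. With this definition, the meta-generalization gap can be decomposed as
$$\Delta L^{\mathrm{sep}}(u|Z^m_{1:N}) = \big(L^{\mathrm{sep}}_{g,t}(u) - L^{\mathrm{sep}}_t(u|Z^m_{1:N})\big) + \big(L^{\mathrm{sep}}_g(u) - L^{\mathrm{sep}}_{g,t}(u)\big). \tag{29}$$
In (29), the first difference $L^{\mathrm{sep}}_{g,t}(u) - L^{\mathrm{sep}}_t(u|Z^m_{1:N})$ captures the environment-level uncertainty, while the second difference $L^{\mathrm{sep}}_g(u) - L^{\mathrm{sep}}_{g,t}(u)$ captures the within-task uncertainty. With separate within-task training and test sets, the second difference is zero: given the task, the test subset is independent of the base-learner output, so the average test loss equals the average generalization loss (compare (16) and (28)). This is no longer true for joint within-task training and test sets, as we discuss in Section 4.2.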
The decomposition approach adopted here follows the main steps of the bounding techniques introduced in ([16], Equation (6)). In contrast, the PAC-Bayesian bounds in [20,21] rely on a different decomposition of the meta-generalization gap. There, the environment-level and within-task generalization gaps are separately bounded with high probability, and are then combined via the union bound to obtain the required PAC-Bayesian bounds.
The bound (27) relates the meta-generalization gap to the information-theoretic stability of the meta-training procedure. As first introduced here, this stability is measured by the MI I(U; Z m 1:N ) between the hyperparameter U and the meta-training data set Z m 1:N , in a manner similar to the MI-based bounds in Lemma 1 for conventional learning. Importantly, as we will discuss in Section 4.2, this direct parallel between learning and meta-learning no longer applies with joint within-task training and test data sets.
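One way to read the MI-based bound is as a rough guide to the number of meta-training tasks needed for a target gap. The sketch below assumes the bound has the form $\sqrt{2\sigma^2 I(U; Z^m_{1:N})/N}$ and considers the special case of a finite hyperparameter space, for which $I(U; Z^m_{1:N}) \le H(U) \le \log|\mathcal{U}|$ always holds. The function names and numerical values are illustrative assumptions.

```python
import math


def mi_bound_sep(mi_nats, sigma2, n_tasks):
    """Assumed form of the MI-based bound: sqrt(2 * sigma^2 * I / N), with I in nats."""
    return math.sqrt(2.0 * sigma2 * mi_nats / n_tasks)


def tasks_needed(num_hyperparams, sigma2, target_gap):
    """Smallest N for which the bound is at most target_gap,
    using I(U; Z) <= log|U| for a finite hyperparameter space."""
    return math.ceil(2.0 * sigma2 * math.log(num_hyperparams) / target_gap ** 2)


# Example: 100 candidate hyperparameters, loss bounded in [0, 1] (sigma^2 = 1/4).
n = tasks_needed(num_hyperparams=100, sigma2=0.25, target_gap=0.05)
bound_at_n = mi_bound_sep(math.log(100), 0.25, n)
```

With these values, 922 tasks suffice for the bound to drop below 0.05. The estimate is conservative, since the actual MI of a given meta-learner may be much smaller than $\log|\mathcal{U}|$.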

ITMI Bound
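We now present the ITMI bound, which holds under the following assumption.

Assumption 4. Either of the following assumptions holds: (a) Assumption 3 holds, or (b) the average per-task training loss $L^{\mathrm{sep}}_t(U|Z^m)$ is $\sigma^2$-sub-Gaussian when $(U, Z^m) \sim P_U P_{Z^m}$.

Theorem 2. Under Assumption 4, the average meta-generalization gap for separate within-task training and test sets satisfies

$$\Big|\mathbb{E}_{P_{Z^m_{1:N},U}}\big[\Delta L^{\mathrm{sep}}(U|Z^m_{1:N})\big]\Big| \le \frac{1}{N}\sum_{i=1}^{N}\sqrt{2\sigma^2\, I(U;Z^m_i)}, \tag{30}$$

where the MI $I(U;Z^m_i)$ is computed with respect to the joint distribution $P_{Z^m_i,U}$ obtained by marginalizing the probability distribution $P_{Z^m_{1:N},U}$.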
Proof. See Appendix B.
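As can be seen from (30), the ITMI bound on the meta-generalization gap is in terms of the MI $I(U;Z^m_i)$ between the output $U$ of the meta-learner and each per-task data set $Z^m_i$. This, in turn, quantifies the sensitivity of the meta-learner output to the replacement of a single per-task data set. Moreover, under Assumption 3, the ITMI bound (30) yields a tighter bound than the MI-based bound (27). This can be seen from the following sequence of relations

$$\begin{aligned}
\sqrt{\frac{2\sigma^2}{N}\, I(U;Z^m_{1:N})} &= \sqrt{2\sigma^2\,\frac{1}{N}\sum_{i=1}^{N} I\big(U;Z^m_i \,\big|\, Z^m_{(i-1)}\big)} && \text{(31a)}\\
&\overset{(a)}{\ge} \sqrt{2\sigma^2\,\frac{1}{N}\sum_{i=1}^{N} I(U;Z^m_i)} && \text{(31b)}\\
&\overset{(b)}{\ge} \frac{1}{N}\sum_{i=1}^{N}\sqrt{2\sigma^2\, I(U;Z^m_i)}, && \text{(31c)}
\end{aligned}$$

where $Z^m_{(i-1)} = (Z^m_1, \ldots, Z^m_{i-1})$; the equality in (31a) is the chain rule of mutual information; (a) follows since $Z^m_i$ is independent of $Z^m_{(i-1)}$; and (b) follows from Jensen's inequality.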
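The ordering between the two bounds can be checked numerically. The sketch below is only illustrative: it assumes the forms $\sqrt{2\sigma^2 I(U;Z^m_{1:N})/N}$ for the MI-based bound (27) and $\frac{1}{N}\sum_i \sqrt{2\sigma^2 I(U;Z^m_i)}$ for the ITMI bound (30). The function names `mi_bound` and `itmi_bound` and the numerical values of the information terms are hypothetical placeholders, chosen only so that the per-task terms sum to at most the joint MI, as is the case for independent per-task data sets.

```python
from math import sqrt


def mi_bound(sigma2, mi_total, n_tasks):
    """MI-based bound: sqrt(2 * sigma^2 * I(U; Z_{1:N}) / N)."""
    return sqrt(2.0 * sigma2 * mi_total / n_tasks)


def itmi_bound(sigma2, mi_per_task):
    """ITMI bound: average over tasks of sqrt(2 * sigma^2 * I(U; Z_i))."""
    return sum(sqrt(2.0 * sigma2 * mi) for mi in mi_per_task) / len(mi_per_task)


# Hypothetical information values in nats (not taken from the paper).
sigma2 = 0.25
mi_per_task = [0.10, 0.05, 0.20, 0.15]   # I(U; Z_i), i = 1..4
mi_total = 0.60                          # I(U; Z_{1:N}) >= sum of the above

print(itmi_bound(sigma2, mi_per_task), mi_bound(sigma2, mi_total, len(mi_per_task)))
```

Whenever the per-task terms sum to no more than the joint MI, the first printed value does not exceed the second, which is the content of steps (a) and (b).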

Bounds on Generalization Gap with Joint Within-Task Training and Test Sets
We now derive MI- and ITMI-based bounds on the meta-generalization gap in (22) for the case with joint within-task training and test sets. As we will see, the key difference with respect to the case with separate within-task training and test sets is that the uncertainty due to the finite number of per-task samples, measured by the first term in the decomposition (29), contributes in a non-negligible way to the meta-generalization gap. Since there is no split into separate within-task training and test sets, the average training loss with respect to the learning algorithm is given by L joint t (u|Z m ) in (20).

MI-Based Bound
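In order to derive the MI-based bound, we make the following assumptions.

Assumption 5. We consider the following assumptions: (a) for each task $\tau \in \mathcal{T}$, the loss function $l(w, Z)$ is $\delta^2_\tau$-sub-Gaussian under $Z \sim P_{Z|T=\tau}$ for all $w \in \mathcal{W}$; and (b) the average per-task training loss $L^{\mathrm{joint}}_t(u|Z^m)$ is $\sigma^2$-sub-Gaussian under $Z^m \sim P_{Z^m}$ for all $u \in \mathcal{U}$.

An easily verifiable sufficient condition for the above assumption to hold is the boundedness of the loss function $l(w, z)$, which follows in a manner similar to Lemma 3.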
where the MI I(W; Z m |T = τ) is evaluated with respect to the distribution P Z m ,W|T=τ obtained by marginalizing the joint distribution P W|Z m ,U P Z m 1:N ,U P Z m |T=τ .

Proof. See Appendix C.
With joint within-task training and test sets, the bound (32) on the meta-generalization gap contains the contributions of two mutual informations. The first, I(U; Z m 1:N ), quantifies the sensitivity of the meta learner output U to the meta-training data set Z m 1:N . This term also appeared in the bound (27) with separate within-task training and test sets. Decomposing the meta-generalization gap in a manner analogous to (29), it corresponds to a bound on the average of the second difference. The second contribution, I(W; Z m |T = τ), quantifies the sensitivity of the output of the base-learner P W|Z m ,U to the data set Z m of the meta-test task T, when the hyperparameter is randomly selected by the meta-learner P U|Z m 1:N using the meta-training set Z m 1:N . This second term is in line with the single-task generalization gap bounds (24), and it bounds the corresponding first difference in the decomposition (29).
We finally note that the dependence of the bound in (32) on the number of tasks N and per-task samples m is of the order 1/√N + 1/√m. Meta-generalization bounds with similar dependence have been derived in [20] using PAC-Bayesian arguments. The bounds on excess risk for representation learning also follow a similar order of dependence on N and m (cf. [51], Theorem 2).

ITMI Bound on (22)
For deriving the ITMI bound on the meta-generalization gap (22), we assume the following.

Assumption 6. Either of the following assumptions holds: (a) Assumption 5 holds, or (b) for each task $\tau \in \mathcal{T}$, the loss function $l(W, Z)$ is $\delta^2_\tau$-sub-Gaussian when $(W, Z) \sim P_{W|\tau} P_{Z|\tau}$, where $P_{W|\tau}$ is the marginal of the joint distribution $P_{W|Z^m,U}\, P_{Z^m_{1:N},U}\, P_{Z^m|\tau}$, and the average per-task training loss $L^{\mathrm{joint}}_t(U|Z^m)$ is $\sigma^2$-sub-Gaussian when $(U, Z^m) \sim P_U P_{Z^m}$.
As in Section 4.1.2, Assumption 6 can be seen to be implied by the sufficient conditions in Lemma 3.
where the MI I(U; Z m i ) is evaluated with respect to P Z m i ,U obtained by marginalizing P Z m 1:N ,U , and the MI I(W; Z j |T = τ) is with respect to P Z j ,W|T=τ obtained by marginalizing P Z m ,W|T=τ .

Proof. See Appendix C.
Similar to the bound in (32), the bounds on the meta-generalization gap in (33) are in terms of two types of mutual information, the first describing the sensitivity of the meta-learner and the second the sensitivity of the base-learner. Specifically, the MI I(U; Z m i ) quantifies the sensitivity of the output of the meta-learner to the per-task data set Z m i , and the MI I(W; Z j |T = τ) measures the sensitivity of the output of the base-learner P W|Z m ,U to each data sample Z j within the training set Z m of the meta-test task T. Moreover, it can be shown, in a manner similar to (31c), that, under Assumption 5, the ITMI bound in (33) is tighter than the MI bound in (32).

Discussion on Bounds
The bounds on the average meta-generalization gap obtained in this section generalize the bounds for conventional single-task learning in Section 3. To see this, consider the task distribution $P_T = \delta(T - \tau)$ to be centered at some task $\tau \in \mathcal{T}$. Recall that in conventional learning, the hyperparameter $u$ is fixed a priori. As such, the mutual information terms $I(U;Z^m_{1:N})$ (for MI-based bounds) and $I(U;Z^m_i)$ (for ITMI-based bounds) vanish. For the separate within-task training and test sets, this implies that the average generalization gap is zero, which follows since the per-task test loss $L_t(W|Z^{m_{\mathrm{te}}}_i)$ is an unbiased estimate of the per-task generalization loss $L_g(W|T_i)$. The MI- and ITMI-based bounds for the joint within-task training and test sets then reduce to

$$\Big|\mathbb{E}_{P_{W,Z^m|\tau,u}}\big[\Delta L(W|Z^m, T=\tau)\big]\Big| \le \sqrt{\frac{2\delta^2_\tau}{m}\, I(W;Z^m)} \tag{34}$$

and

$$\Big|\mathbb{E}_{P_{W,Z^m|\tau,u}}\big[\Delta L(W|Z^m, T=\tau)\big]\Big| \le \frac{1}{m}\sum_{j=1}^{m}\sqrt{2\delta^2_\tau\, I(W;Z_j)}, \tag{35}$$

respectively, where $I(W;Z^m)$ is evaluated with respect to the joint distribution $P_{W,Z^m|\tau,u}$ and $I(W;Z_j)$ with respect to $P_{W,Z_j|\tau,u}$.

The MI- and ITMI-based bounds derived in this section indicate that a smaller correlation between hyperparameters and meta-training set, and thus a small mutual information $I(U;Z^m_{1:N})$, improves the meta-generalization gap, although this seems deleterious to performance. To clarify this apparent contradiction, we would like to emphasize that these bounds quantify the difference between the meta-generalization loss and the empirical training loss, which in turn depends on the sensitivity of the meta-learner and base-learner to their input meta-training set and per-task training set, respectively. The mutual information terms in our bounds capture these sensitivities. Consequently, our bounds suggest that a meta-learner that is highly correlated to the input meta-training set (i.e., when $I(U;Z^m_{1:N})$ is large) does not generalize well (i.e., yields a large meta-generalization gap). This property aligns with a previous information-theoretic analysis for generalization in conventional learning [25].
To the best of our knowledge, the MI- and ITMI-based bounds studied here are the first bounds on the average meta-generalization gap. As discussed in the introduction, these bounds are distinct from the high-probability PAC and PAC-Bayesian bounds on the meta-generalization gap studied in prior work on meta-learning. Consequently, the bounds studied in this work are not directly comparable with the existing high-probability bounds.
Finally, we note that similarity between tasks is crucial to meta-learning. If the per-task data distributions P Z|T=τ in the task environment are 'closer' to each other, a meta-learner can efficiently learn the shared characteristics of tasks, and can generalize well to new tasks from the task environment. In our setting, the statistical properties of the task environment (P T , {P Z|T=τ } τ∈T ) dictate this similarity. Although our MI- and ITMI-based bounds do not explicitly capture this, we note that the properties of the task environment are implicitly accounted for by the mutual information terms I(U; Z m 1:N ) and I(U; Z m i ), where the meta-training data set Z m 1:N is generated from the task environment, and also by the sub-Gaussianity considerations in Assumptions 3-6. From preliminary studies, we believe that information-theoretic bounds that explicitly capture the impact of task similarity require a different performance metric than the average meta-generalization gap considered here; this is left to future work.

Applications
In this section, we consider two applications of the information-theoretic bounds proposed in Section 4.1. The first, simpler, example concerns a parameter estimation problem for which an optimized meta-learner can be obtained in closed form. In contrast, the second application covers a broad class of iterative meta-training schemes.

Parameter Estimation
To illustrate the bounds on the meta-generalization gap derived in Section 4.1, we first consider the problem of prediction for a Bernoulli process with a 'soft' predictor that uses only a few samples from the process, as well as meta-training data. Towards this, we consider an arbitrary discrete finite set of tasks T = {τ 1 , . . . , τ M }. The data distribution P Z|T=τ k for each task τ k ∈ T , k ∈ {1, . . . , M}, is given as Bernoulli(µ τ k ) with mean parameter µ τ k . The task distribution P T is then defined over the finite set of mean parameters {µ τ 1 , . . . , µ τ M }. The base-learner uses training data, distributed i.i.d. from Bernoulli(µ τ k ) to determine the parameter W k , which is used as a predictor of new observation Z ∼ Bernoulli(µ τ k ) at test time. The loss function is defined as l(w, z) = (w − z) 2 , measuring the quadratic error between prediction and realized test input z. Note that the optimal (Bayes) predictor, computable in the ideal case of known distribution P Z|T=τ k , is given as W = µ τ k . We now distinguish the two cases with separate and joint within-task training and test sets.

Separate Within-Task Training and Test Sets
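The base-learner $P_{W|Z^{m_{\mathrm{tr}}}_k,u}$ for task $\tau_k \in \mathcal{T}$ deterministically selects the prediction

$$W_k = \alpha D^{m_{\mathrm{tr}}}_k + (1-\alpha)\,u, \tag{36}$$

where $D^{m_{\mathrm{tr}}}_k = \frac{1}{m_{\mathrm{tr}}}\sum_{j=1}^{m_{\mathrm{tr}}} Z^{m_{\mathrm{tr}}}_{k,j}$ is an empirical average over the training set $Z^{m_{\mathrm{tr}}}_k$, $u$ is a hyperparameter defining a bias that can be meta-trained, and $\alpha \in [0,1)$ is a fixed scalar. Here, $Z^{m_{\mathrm{tr}}}_{k,j}$ denotes the $j$th data sample in the training set $Z^{m_{\mathrm{tr}}}_k$ of task $\tau_k$. The bias term in (36) may help approximate the ideal Bayes predictor in the presence of limited data $Z^{m_{\mathrm{tr}}}_k$.

The objective of the meta-learner is to infer the hyperparameter $u$. For a given meta-training data set $Z^m_{1:N}$, comprising data sets from $N$ tasks sampled according to $P_T$, the meta-learner can compute the empirical meta-training loss as

$$L^{\mathrm{sep}}_t(u|Z^m_{1:N}) = \frac{1}{N}\sum_{k=1}^{N}\frac{1}{m_{\mathrm{te}}}\sum_{j=1}^{m_{\mathrm{te}}}\Big(\alpha D^{m_{\mathrm{tr}}}_k + (1-\alpha)u - Z^{m_{\mathrm{te}}}_{k,j}\Big)^2, \tag{37}$$

where $Z^{m_{\mathrm{te}}}_{k,j}$ denotes the $j$th example in the test set of $Z^m_k$, the $k$th sub-data set of $Z^m_{1:N}$. The meta-learner $P_{U|Z^m_{1:N}}$ then deterministically selects the minimizing hyperparameter $u$ of the meta-training empirical loss function in (37). This optimization yields

$$U = \frac{1}{N(1-\alpha)}\sum_{k=1}^{N}\Big(D^{m_{\mathrm{te}}}_k - \alpha D^{m_{\mathrm{tr}}}_k\Big), \tag{38}$$

where $D^{m_{\mathrm{te}}}_k = \sum_{j=1}^{m_{\mathrm{te}}} Z^{m_{\mathrm{te}}}_{k,j}/m_{\mathrm{te}}$. Note that $D^{m_{\mathrm{te}}}_k$ and $D^{m_{\mathrm{tr}}}_k$ are (scaled) binomial random variables and, by (38), $U$ takes finitely many discrete values and is bounded as

$$-\frac{\alpha}{1-\alpha} \le U \le \frac{1}{1-\alpha}.$$

The meta-test loss can be explicitly computed as

$$L^{\mathrm{sep}}_g(u) = \mathbb{E}_{P_T}\Big[\frac{\alpha^2}{m_{\mathrm{tr}}}\,\mu_T\bar{\mu}_T + (1-\alpha)^2(\mu_T - u)^2 + \mu_T\bar{\mu}_T\Big], \tag{39}$$

where $\bar{\mu}_T = 1 - \mu_T$, and the average meta-generalization gap (40) can likewise be evaluated in closed form, as detailed in Appendix D.

To compute the MI- and ITMI-based bounds on the meta-generalization gap (40), it is easy to verify that the average training loss $L^{\mathrm{sep}}_t(\cdot|Z^m)$ is bounded, i.e., $0 \le L^{\mathrm{sep}}_t(u|Z^m) \le (1+\alpha)^2$ for all $u \in \mathcal{U}$ and $Z^m \in \mathcal{Z}^m$. Thus, Assumption 3 for the MI bound and also Assumption 4 for the ITMI bound hold with $\sigma^2 = (1+\alpha)^4/4$. For the MI bound, we note that, since the meta-learner is deterministic, we have that $I(U;Z^m_{1:N}) = H(U)$. The ITMI bound (30) is given as

$$\Big|\mathbb{E}_{P_{Z^m_{1:N},U}}\big[\Delta L^{\mathrm{sep}}(U|Z^m_{1:N})\big]\Big| \le \frac{1}{N}\sum_{i=1}^{N}\sqrt{\frac{(1+\alpha)^4}{2}\, I(U;Z^m_i)}. \tag{41}$$

The information-theoretic measures in (41) can be evaluated numerically as discussed in Appendix D.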
For a numerical illustration, Figure 4 plots the average of the meta-generalization loss (39) and the average meta-training loss (A16) along with the ITMI bound in (41) and the MI bound in (27). It can be seen that the ITMI bound is tighter than the MI bound and correctly predicts the decrease in the meta-generalization gap as the number N of tasks increases.
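The information measures for this example can be computed by exact enumeration when the sizes are small. The following Python sketch is a hedged illustration, not the authors' implementation and not a reproduction of Figure 4. It assumes a uniform task distribution over a hypothetical list of Bernoulli means `mus`, and it assumes that the meta-learner output is a one-to-one affine function of the sum of per-task statistics $V_k = D^{m_{\mathrm{te}}}_k - \alpha D^{m_{\mathrm{tr}}}_k$. Under these assumptions, $I(U;Z^m_{1:N}) = H(U) = H(S_N)$ and $I(U;Z^m_i) = H(S_N) - H(S_{N-1})$, where $S_n$ is the sum of $n$ i.i.d. copies of $V$. All function names are illustrative, and entropies are in nats.

```python
from collections import defaultdict
from fractions import Fraction
from math import comb, log, sqrt


def per_task_pmf(mus, m_tr, m_te, alpha):
    """PMF of V = D_te - alpha * D_tr for a task mean drawn uniformly from `mus`."""
    pmf = defaultdict(float)
    for mu in mus:
        for a in range(m_tr + 1):          # number of ones in the training set
            p_a = comb(m_tr, a) * mu**a * (1 - mu) ** (m_tr - a)
            for b in range(m_te + 1):      # number of ones in the test set
                p_b = comb(m_te, b) * mu**b * (1 - mu) ** (m_te - b)
                v = Fraction(b, m_te) - alpha * Fraction(a, m_tr)  # exact key
                pmf[v] += p_a * p_b / len(mus)
    return dict(pmf)


def convolve(p, q):
    """PMF of the sum of two independent discrete random variables."""
    out = defaultdict(float)
    for x, px in p.items():
        for y, qy in q.items():
            out[x + y] += px * qy
    return dict(out)


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(v * log(v) for v in p.values() if v > 0)


def bernoulli_bounds(mus, n_tasks, m_tr, m_te, alpha):
    """Return (MI bound, ITMI bound) with sigma^2 = (1 + alpha)^4 / 4."""
    v = per_task_pmf(mus, m_tr, m_te, alpha)
    s_prev, s = None, {Fraction(0): 1.0}
    for _ in range(n_tasks):               # s: sum of N terms, s_prev: N - 1 terms
        s_prev, s = s, convolve(s, v)
    h_u = entropy(s)                        # I(U; Z_{1:N}) = H(U)
    i_single = max(h_u - entropy(s_prev), 0.0)   # I(U; Z_i), equal for all i
    sigma2 = float((1 + alpha) ** 4) / 4
    return sqrt(2 * sigma2 * h_u / n_tasks), sqrt(2 * sigma2 * i_single)


# Hypothetical small configuration (alpha is passed as an exact fraction).
mi, itmi = bernoulli_bounds([0.2, 0.5, 0.8], n_tasks=3, m_tr=3, m_te=3,
                            alpha=Fraction(11, 20))
print(mi, itmi)
```

Exact rational keys are used so that sums reaching the same value through different paths are merged without floating-point rounding issues; the enumeration grows quickly with N and m, so this approach is only practical for small sizes.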

Joint Within-Task Training and Test Sets
We now consider the case with joint within-task training and test sets. The base-learner $P_{W|Z^m_k,U}$ for task $\tau_k \in \mathcal{T}$ still uses the predictor (36), but now the empirical average over the training set is given as $D_k = \sum_{j=1}^{m} Z^m_{k,j}/m$. As before, the meta-learner $P_{U|Z^m_{1:N}}$ deterministically selects the minimizing hyperparameter $u$ of the meta-training empirical loss function $L^{\mathrm{joint}}_t(u|Z^m_{1:N})$. As discussed in Appendix D, the meta-generalization loss for this example can also be explicitly computed and the meta-generalization gap bounds in (32) and (33) can be evaluated numerically. Figure 5 plots the average meta-generalization loss and average meta-training loss along with the MI bound in (32) and the ITMI bound in (A18), as a function of the number $m$ of per-task data samples. The ITMI bound is seen to better reflect the decrease of the meta-training loss as a function of $m$.

Figure 5. Comparison of the MI- and ITMI-based bounds obtained in (A18) with the meta-generalization gap for meta-learning with joint within-task training and test sets, as a function of the number $m$ of per-task data samples for $N = 5$ and $\alpha = 0.55$. The task environment is defined by $M = 9$ tasks.

Noisy Iterative Meta-Learning Algorithms
Most meta-learning algorithms are built around a nested loop structure, with the inner loop applying the base-learner on the meta-training set and the outer loop updating the hyperparameters U. In this section, we focus on a vast class of such meta-learning algorithms in which the inner loop applies training procedures dependent on the current iterate of the hyperparameter, while the outer loop updates the hyperparameter using a stochastic rule. This class includes stochastic variants of state-of-the-art algorithms such as MAML [23] and reptile [24]. We apply the derived information-theoretic bounds to study the meta-generalization performance of the mentioned class of meta-training iterative stochastic rules by focusing on the case of separate within-task training and test sets here, which is assumed e.g., by MAML. The analysis for the setup with joint within-task training and test sets can also be carried out at the cost of a more cumbersome notation.
To start, let $U_j \in \mathbb{R}^d$ denote the hyperparameter vector at outer iteration $j$, with $U_0 \in \mathbb{R}^d$ being an arbitrary initialization. For example, in MAML, the hyperparameter $U$ defines the initial iterate used by each base-learner in the inner loop to update the model parameter $W_\tau$ corresponding to task $\tau$. At each iteration $j \ge 1$, we randomly select a mini-batch of task indices $K_j \subseteq \{1, \ldots, N\}$ from the meta-training data $Z^m_{1:N}$, obtaining the corresponding data set $Z^m_{K_j} = (Z^{m_{\mathrm{tr}}}_{K_j}, Z^{m_{\mathrm{te}}}_{K_j}) \subseteq Z^m_{1:N}$, where $Z^{m_{\mathrm{tr}}}_{K_j} = \{Z^{m_{\mathrm{tr}}}_k\}_{k \in K_j}$ and $Z^{m_{\mathrm{te}}}_{K_j} = \{Z^{m_{\mathrm{te}}}_k\}_{k \in K_j}$ are the separate training and test sets for the selected tasks. For each index $k \in K_j$, in the inner loop, the base-learner selects the model parameter $W^j_k$ as a possibly stochastic function

$$W^j_k = g\big(U_{j-1}, Z^{m_{\mathrm{tr}}}_k\big). \tag{42}$$

For instance, in MAML, the function $g(U_{j-1}, Z^{m_{\mathrm{tr}}}_k) \in \mathbb{R}^d$ in (42) represents the output of an SGD procedure that starts from initialization $U_{j-1}$ and uses the task training data $Z^{m_{\mathrm{tr}}}_k$ to iteratively update the model parameters, producing the final iterate $W^j_k$. We denote as $W_{K_j} = \{W^j_k\}_{k \in K_j}$ the collection of the base-learners' outputs for all task indices $k \in K_j$ at outer iteration $j$.
In the outer loop, the meta-learner uses the task-specific adapted parameters $W_{K_j}$ from the inner loop and the meta-test set $Z^{m_{\mathrm{te}}}_{K_j}$ to update the past iterate $U_{j-1}$ according to the general update rule

$$U_j = F(U_{j-1}) - \beta_j\, G\big(U_{j-1}, W_{K_j}, Z^{m_{\mathrm{te}}}_{K_j}\big) + \xi_j, \tag{43}$$

where $F(\cdot)$ and $G(\cdot,\cdot,\cdot)$ are arbitrary deterministic functions; $\beta_j$ is the step-size; and $\xi_j \sim \mathcal{N}(0, \gamma_j^2 I_d)$ is an isotropic Gaussian noise, independently drawn for $j = 1, 2, \ldots$. As an example, in MAML, the function $F(\cdot)$ is the identity function and the function $G(\cdot,\cdot,\cdot)$ equals the gradient of the empirical loss in (14) with respect to $U_{j-1}$. Note, however, that MAML does not add noise, i.e., $\gamma_j^2 = 0$ for all $j$.

We now derive an upper bound on the meta-generalization gap for the general class of iterative meta-learning algorithms satisfying (42) and (43) under the following assumptions.
Proof. See Appendix E.
The bound in (45) has the same form as the generalization gap derived in [27] for conventional learning. From (45), the generalization gap can be reduced by increasing the variance γ 2 j of the injected Gaussian noise. In particular, the meta-generalization gap depends on the ratios β 2 j /γ 2 j between squared step size β 2 j and variance γ 2 j . For example, SGLD sets γ j = β j , and a step size β j decaying over time according to the standard Robbins-Monro conditions, in order to ensure convergence of the output samples to the generalized posterior distribution of the hyperparameters [52].
Example: To illustrate bound (45), we now consider a simple logistic regression problem that generalizes the example studied in Section 5.1. Accordingly, each data point $Z$ corresponds to labelled data $Z = (X, Y)$, where $X \in \{0,1\}^d$ represents the input vector and $Y \in \{0,1\}$ represents the corresponding binary label. The data distribution $P_{Z|\tau_k} = P_{X|\tau_k} P_{Y|X,\tau_k}$ for each task $\tau_k \in \mathcal{T} = \{\tau_1, \ldots, \tau_M\}$ is such that $X \sim P_{X|\tau_k}$ is a $d$-dimensional Bernoulli vector obtained via $d$ independent draws from Bernoulli($\nu$), and $Y|X = x \sim \mathrm{Bernoulli}(\phi(\mu_{\tau_k}^T x))$, where $\phi(a) = 1/(1+\exp(-a))$ is the sigmoid function and $\mu_{\tau_k} \in \mathbb{R}^d$, with $\|\mu_{\tau_k}\|_2 \le 1$. The task distribution $P_T$ then defines a distribution over the parameter vectors $\{\mu_{\tau_1}, \ldots, \mu_{\tau_M}\}$. The base-learner uses training data generated i.i.d. from $P_{Z|\tau_k}$ to obtain a prediction $w$ of the parameter vector $\mu_{\tau_k}$ for task $\tau_k \in \mathcal{T}$. The loss function is taken as the quadratic error $l(w, z) = (\phi(w^T x) - y)^2$.
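At each iteration $j$, starting from initialization point $U_{j-1}$, the base-learner in (42) uses a one-step projected gradient descent algorithm on the training data set $Z^{m_{\mathrm{tr}}}_k$ to obtain the prediction $W^j_k$ as

$$W^j_k = \mathrm{proj}_{\mathcal{W}}\Big(U_{j-1} - \alpha \nabla_w L_t\big(w|Z^{m_{\mathrm{tr}}}_k\big)\big|_{w=U_{j-1}}\Big), \tag{46}$$

where $\alpha > 0$ is the step-size, $\mathcal{W} = \{w \in \mathbb{R}^d : \|w\|_2 \le 1\}$ is the set of feasible model parameters, and $\mathrm{proj}_{\mathcal{A}}(b) = \arg\min_{a \in \mathcal{A}} \|a - b\|_2^2$ is the projection operator. The meta-learner (43) updates the initialization vector according to the noisy gradient descent rule

$$U_j = U_{j-1} - \beta_j \frac{1}{|K_j|}\sum_{k \in K_j} \nabla_w L_t\big(w|Z^{m_{\mathrm{te}}}_k\big)\big|_{w=W^j_k} + \xi_j, \tag{47}$$

where $\beta_j$ is the step-size, and $\xi_j \sim \mathcal{N}(0, \gamma_j^2 I_d)$ is isotropic Gaussian noise. This update rule corresponds to performing first-order MAML (FOMAML) [23] with the addition of noise.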
For this problem, it is easy to verify that Assumption 7 is satisfied, since the loss function l(·, ·) is bounded in the interval [0, 1], whereby L sep t (u|Z m ) is also [a, b]-bounded. We also have the inequality The MI bound in (45) then evaluates to We now evaluate the meta-training and meta-test loss, along with the bound (49) as a function of the ratio γ 2 j /β 2 j in Figure 7. For the experiment, we considered a task environment of M = 20 tasks with ν = 0.4, d = 3, N = 4 meta-training tasks with m tr = 10 training data samples and m te = 5 test data samples. For the inner-loop (46), we fixed step-size α = 10 −4 and for the outer-loop (47), we set |K t | = N, β j = 0.25 and T = 200 iterations.
As suggested by Lemma 4, the meta-generalization gap decreases with the addition of noise. While the MI bound (45) is generally loose, it correctly quantifies the dependence of the meta-generalization loss on the ratio γ 2 j /β 2 j , and it can hence serve as a useful meta-training criterion [20,48].
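The inner and outer updates of this example can be sketched in a few lines of NumPy. The code below is a minimal illustration under stated assumptions, not the authors' experimental code: it takes the inner step to be one projected gradient step from the current initialization and the outer step to be a first-order, full-batch gradient step with added Gaussian noise (a descent direction is assumed). The task parameters are drawn at random here rather than from a fixed set of M tasks, and all function names (`grad_loss`, `project_unit_ball`, `sample_task`, `noisy_fomaml`) are hypothetical.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def grad_loss(w, x, y):
    """Gradient w.r.t. w of the average quadratic loss (sigmoid(x @ w) - y)^2."""
    p = sigmoid(x @ w)
    return ((2.0 * (p - y) * p * (1.0 - p))[:, None] * x).mean(axis=0)


def project_unit_ball(w):
    """Euclidean projection onto {w : ||w||_2 <= 1}."""
    return w / max(1.0, float(np.linalg.norm(w)))


def sample_task(mu, n, nu, rng):
    """Draw n samples: X ~ Bernoulli(nu)^d, Y | X ~ Bernoulli(sigmoid(mu^T X))."""
    x = rng.binomial(1, nu, size=(n, mu.size)).astype(float)
    y = rng.binomial(1, sigmoid(x @ mu)).astype(float)
    return x, y


def noisy_fomaml(tasks, u0, alpha, beta, gamma, n_iters, rng):
    """Noisy first-order MAML using all tasks in every outer iteration."""
    u = np.array(u0, dtype=float)
    for _ in range(n_iters):
        g = np.zeros_like(u)
        for x_tr, y_tr, x_te, y_te in tasks:
            # Inner loop: one projected gradient step on the task training set.
            w = project_unit_ball(u - alpha * grad_loss(u, x_tr, y_tr))
            # First-order approximation: test-set gradient evaluated at w.
            g += grad_loss(w, x_te, y_te)
        # Outer loop: gradient step plus isotropic Gaussian noise of std gamma.
        u = u - beta * g / len(tasks) + gamma * rng.standard_normal(u.shape)
    return u


rng = np.random.default_rng(0)
d, nu = 3, 0.4
tasks = []
for _ in range(4):  # N = 4 meta-training tasks with m_tr = 10 and m_te = 5
    mu = project_unit_ball(rng.standard_normal(d))  # hypothetical task parameter
    tasks.append(sample_task(mu, 10, nu, rng) + sample_task(mu, 5, nu, rng))

u_final = noisy_fomaml(tasks, np.zeros(d), alpha=1e-4, beta=0.25,
                       gamma=0.1, n_iters=200, rng=rng)
print(u_final)
```

Sweeping `gamma` for a fixed `beta` and comparing the resulting training and test losses is one way to explore the dependence on the noise-to-step-size ratio discussed above.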

Conclusions
This work has presented novel information-theoretic upper bounds on the average generalization gap of meta-learning algorithms, thereby extending the well-studied information-theoretic approaches in conventional learning to meta-learning. The proposed bounds capture two sources of uncertainty, namely environment-level uncertainty and within-task uncertainty, and bound them via separate mutual information terms. Applications were also discussed, with the aim of elucidating the use of the bounds to quantify meta-overfitting and guide the choice of the meta-inductive bias, i.e., the class of inductive biases. The derived bounds are amenable to further refinements, such as those along the lines of [39,40,45]. It would also be interesting to study the meta-generalization bounds on noisy iterative meta-learning algorithms using tighter information-theoretic bounds such as those in [26,40].

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Decoupling Estimate Lemmas
The proofs of the main results rely on the following decoupling estimate lemmas, which bound the difference in expectations under a change of measure from the joint P X,Y to the product of the marginals P X P Y . In order to state the most general form of decoupling estimate lemmas, we first define a generalized sub-Gaussian random variable.
Lemma A1 (Decoupling Estimate [14]). Let X ∈ X and Y ∈ Y be two jointly distributed random variables with joint distribution P X,Y , and let f (X, Y) be a real valued function such that f (x, Y) is (Ψ + , Ψ − , ∞, −∞)-generalized sub-Gaussian for all x ∈ X when Y ∼ P Y . Then we have the following inequalities where ( X, Y) ∼ P X P Y .
Lemma A2 (General Decoupling Estimate [26]). Let X ∈ X and Y ∈ Y be two jointly distributed random variables with joint distribution P X,Y , and let f (X, Y) be a real valued function such that f (X, Y) is a (Ψ + , Ψ − , b + , b − )-generalized sub-Gaussian when (X, Y) ∼ P X P Y . Then, we have the inequality (A5).
Note that in Lemma A2, the random variables X, Y are jointly distributed according to P X,Y . Assuming that the function f (X, Y) is generalized sub-Gaussian under X ∼ P X and Y ∼ P Y with P X and P Y being the marginals of P X,Y , the lemma provides an upper bound on the difference between average of f (X, Y) when (X, Y) is jointly distributed according to P X,Y and the average of f (X, Y) when (X, Y) is independent with X ∼ P X and Y ∼ P Y . The resultant bound thus provides an estimate of the effect of decoupling of the joint distribution to its marginals with respect to function f (X, Y).
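A small numerical check can make the decoupling estimate concrete. The Python sketch below considers only the standard sub-Gaussian special case, in which the bound takes the form $\sqrt{2\sigma^2 I(X;Y)}$; this specialization, the randomly drawn joint distribution, and the function names are assumptions made for illustration. The function $f$ takes values in $[0,1]$, so that $f(x, Y)$ is $\sigma^2$-sub-Gaussian with $\sigma^2 = 1/4$ for every $x$ by Hoeffding's lemma.

```python
from math import sqrt

import numpy as np


def mutual_information(p_xy):
    """I(X; Y) in nats for a joint PMF given as a 2-D array."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    prod = p_x @ p_y
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / prod[mask])))


def decoupling_gap_and_bound(p_xy, f, sigma2):
    """Return |E_joint[f] - E_product[f]| and sqrt(2 * sigma^2 * I(X; Y))."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    joint_mean = float(np.sum(p_xy * f))            # (X, Y) ~ P_{X,Y}
    product_mean = float(np.sum((p_x @ p_y) * f))   # (X, Y) ~ P_X P_Y
    bound = sqrt(2.0 * sigma2 * mutual_information(p_xy))
    return abs(joint_mean - product_mean), bound


rng = np.random.default_rng(1)
p_xy = rng.dirichlet(np.ones(12)).reshape(3, 4)   # random joint PMF on a 3 x 4 grid
f = rng.uniform(0.0, 1.0, size=(3, 4))            # bounded function, values in [0, 1]
gap, bound = decoupling_gap_and_bound(p_xy, f, sigma2=0.25)
print(gap, bound)
```

The printed gap does not exceed the printed bound, and both vanish when the joint PMF factorizes into its marginals.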

Appendix B. Proofs of Theorems 1 and 2
For the proof of Theorem 1, we use the decomposition (29) of the meta-generalization gap into average environment-level and within-task generalization gaps as where (A6) follows since the average within-task generalization gap for a random meta-test task E P Z m  We now evaluate the first difference in (A6). It can be seen that for a fixed task τ ∈ T , the average within-task uncertainty evaluates to where (a) follows, since W and Z m te are conditionally independent given Z m tr , whereby E P Z m tr |T=τ E P W|Z m tr [E P Z m te |T=τ L t (W|Z m te )] = E P Z m tr |T=τ E P W|Z m tr [L g (W|T = τ)]. Substituting (A7) and (A8) in (A6), then concludes the proof.
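For Theorem 2, the proof follows along the same lines, bounding the average environment-level uncertainty $\mathbb{E}_{P_{Z^m_{1:N},U}}\big[L^{\mathrm{sep}}_{g,t}(U) - L^{\mathrm{sep}}_t(U|Z^m_{1:N})\big]$. Towards this, we note that the environment-level uncertainty can be equivalently written as

$$\mathbb{E}_{P_{Z^m_{1:N},U}}\big[L^{\mathrm{sep}}_{g,t}(U) - L^{\mathrm{sep}}_t(U|Z^m_{1:N})\big] = \frac{1}{N}\sum_{i=1}^{N}\Big(\mathbb{E}_{P_{Z^m}P_U}\big[L^{\mathrm{sep}}_t(U|Z^m)\big] - \mathbb{E}_{P_{Z^m_i,U}}\big[L^{\mathrm{sep}}_t(U|Z^m_i)\big]\Big), \tag{A9}$$

where $Z^m$ and $U$ in the first term are independent random variables distributed as $(Z^m, U) \sim P_{Z^m}P_U$, while, in the second term, they are jointly distributed according to $P_{Z^m_i,U}$, which is obtained by marginalizing the joint distribution $P_{Z^m_{1:N},U}$. Under Assumption 4(a), we can bound each difference in (A9) by resorting to Lemma A1 with $X = U$, $Y = Z^m_i$ and $f(X,Y) = L^{\mathrm{sep}}_t(U|Z^m_i)$. Since $L^{\mathrm{sep}}_t(u|Z^m)$ is $\sigma^2$-sub-Gaussian under $Z^m \sim P_{Z^m}$ for all $u \in \mathcal{U}$, Lemma A1 yields the following bound

$$\Big|\mathbb{E}_{P_{Z^m}P_U}\big[L^{\mathrm{sep}}_t(U|Z^m)\big] - \mathbb{E}_{P_{Z^m_i,U}}\big[L^{\mathrm{sep}}_t(U|Z^m_i)\big]\Big| \le \sqrt{2\sigma^2\, I(U;Z^m_i)}. \tag{A10}$$

The bound in (A10) can also be obtained using Assumption 4(b) by resorting to the general decoupling estimate in Lemma A2 by fixing $X = U$, $Y = Z^m_i$ and $f(X,Y) = L^{\mathrm{sep}}_t(U|Z^m_i)$. Substituting the bound in (A10) in (A9) then yields the required bound in (30).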

Appendix C. Proofs of Theorems 3 and 4
For Theorem 3, we start from the decomposition (A11) of the average meta-generalization gap, which is analogous to (A6). The main difference between the separate and joint within-task training and test sets scenarios is that, while the average within-task uncertainty vanishes in the former scenario, this is not the case for joint within-task training and test sets. Consequently, we now bound the average within-task generalization gap denoted by the first difference in (A11). For a given task $\tau \in \mathcal{T}$, to bound the within-task generalization gap $\mathbb{E}_{P_{Z^m|T=\tau}P_{W|Z^m}}[\Delta L(W|Z^m, T=\tau)]$, we resort to Lemma A1 with $X = W$, $Y = Z^m$ and $f(X,Y) = L_t(W|Z^m)$, so that $\mathbb{E}_{P_{X,Y}}[f(X,Y)] = \mathbb{E}_{P_{W,Z^m|T=\tau}}[L_t(W|Z^m)]$. It can then be verified that $\mathbb{E}_{P_XP_Y}[f(\tilde{X},\tilde{Y})] = \mathbb{E}_{P_{W|T=\tau}}\mathbb{E}_{P_{Z^m|T=\tau}}[L_t(W|Z^m)] = \mathbb{E}_{P_{W|T=\tau}}[L_g(W|T=\tau)] = \mathbb{E}_{P_{W,Z^m|T=\tau}}[L_g(W|T=\tau)]$, where $P_{W,Z^m|T=\tau} = P_{W|Z^m}P_{Z^m|T=\tau}$. Since $L_t(w|Z^m)$ is the average of i.i.d. $\delta^2_\tau$-sub-Gaussian random variables $l(w, Z_i)$ (from Assumption 5(a)), we have that $L_t(w|Z^m)$ is $\delta^2_\tau/m$-sub-Gaussian under $Z^m \sim P_{Z^m|T=\tau}$ for all $w \in \mathcal{W}$ [53]. Consequently, Lemma A1 yields the following bound

$$\mathbb{E}_{P_{Z^m|T=\tau}P_{W|Z^m}}\big[\Delta L(W|Z^m, T=\tau)\big] \le \sqrt{\frac{2\delta^2_\tau}{m}\, I(W;Z^m|T=\tau)}. \tag{A12}$$
Averaging with respect to P T on both sides of (A12), and combining with the bound on average environment-level uncertainty yields the required bound in (32) via Jensen's inequality.
For Theorem 4, the proof follows along the same line. The ITMI bound on the expected environment-level uncertainty can be obtained along the lines of (A10), using the assumption on L joint t (u|Z m ) in either Assumption 6(a) or Assumption 6(b). We now show that we can similarly bound the within-task uncertainty using the assumption on loss function l(w, z) in either Assumption 6(a) or Assumption 6(b). Towards this, for fixed task τ ∈ T , we write the average within-task uncertainty equivalently as where W and Z j in the second term are jointly distributed according to P W,Z j |T=τ , which is the marginal of the joint distribution P W,Z m |T=τ . In contrast, W and Z j in the first term are conditionally independent random variables distributed as (W, Z j ) ∼ P W|T=τ P Z j |T=τ where P W|T=τ is the marginal distribution of P W,Z j |T=τ . Now, fixing X = W, Y = Z j and Averaging with respect to P T on both sides of (A14), and combining with the bound on average environment-level uncertainty yields the required bound in (33).

Appendix D. Details of Example
We first give details of the derivation of the meta-generalization gap for the case with separate within-task training and test sets. The average meta-generalization loss can be computed in closed form as given in (A15), with U defined as in (38), and the average meta-training loss is given in (A16). The meta-generalization gap in (40) then results by taking the difference of (A15) and (A16). We now evaluate the mutual information terms I(U; Z m 1:N ) and I(U; Z m i ). For the first MI, note that, since the meta-learner is deterministic (see (38)), we have I(U; Z m 1:N ) = H(U). Similarly, the second MI can be written as I(U; Z m i ) = H(U) − H(U|Z m i ). It can be seen that the random variables U and U|Z m i = z m are mixtures of probability distributions, whose entropies can be evaluated following standard methods [54].
For the case with joint within-task training and test sets, the meta-generalization gap can be obtained in a similar way as E P Z m All information measures can be easily evaluated numerically [54].

Appendix E. Proof of Lemma 4
From the update rule of the meta-learner in (43), we get the Markov dependency where U (j−1) = {U 1 , . . . , U j−1 } is the history vector of hyperparameters. The sampling strategy in (44) together with (A19) then implies the following relation where, the inequality in (a) follows from data processing inequality on Markov chain Z m 1:N → U (J) → U; (b) follows from the Markov chain Z m 1:N → {Z m K i } J i=1 → U (J) ; and the equality in (c) follows from U (j−2) → U j−1 → U j and (A20). Finally, the computation of bound in (A24) follows similar to Lemma 5 in [27].