Abstract
Meta-learning, or “learning to learn”, refers to techniques that infer an inductive bias from data corresponding to multiple related tasks with the goal of improving the sample efficiency for new, previously unobserved, tasks. A key performance measure for meta-learning is the meta-generalization gap, that is, the difference between the average loss measured on the meta-training data and on a new, randomly selected task. This paper presents novel information-theoretic upper bounds on the meta-generalization gap. Two broad classes of meta-learning algorithms are considered, which use either separate within-task training and test sets, like model-agnostic meta-learning (MAML), or joint within-task training and test sets, like Reptile. Extending the existing work for conventional learning, an upper bound on the meta-generalization gap is derived for the former class that depends on the mutual information (MI) between the output of the meta-learning algorithm and its input meta-training data. For the latter, the derived bound includes an additional MI term between the output of the per-task learning procedure and the corresponding data set, which captures within-task uncertainty. Tighter bounds are then developed for the two classes via novel individual task MI (ITMI) bounds. Applications of the derived bounds are finally discussed, including a broad class of noisy iterative algorithms for meta-learning.
1. Introduction
As formalized by the “no free lunch theorem”, any effective learning procedure must be based on prior assumptions on the task of interest [1]. These include the selection of a model class and of the hyperparameters of a learning algorithm, such as weight initialization and learning rate. In conventional single-task learning, these assumptions, collectively known as inductive bias, are fixed a priori relying on domain knowledge or validation [1,2,3]. Fixing a suitable inductive bias can significantly reduce the sample complexity of the learning process, and is thus crucial to any learning procedure. The goal of meta-learning is to automatically infer the inductive bias, thereby learning to learn from past experiences via the observation of a number of related tasks, so as to speed up learning a new and unseen task [4,5,6,7,8].
In this work, we consider the meta-learning problem of inferring the hyperparameters of a learning algorithm. The learning algorithm (henceforth, called base-learning algorithm or base-learner) is defined as a stochastic mapping from the input training set of m samples to a model parameter for a fixed hyperparameter vector u. The meta-learning algorithm (or meta-learner) infers the hyperparameter vector u, which defines the inductive bias, by observing a finite number of related tasks.
For example, consider the well-studied algorithm of biased regularization for supervised learning [9,10]. Let us denote each data point as a tuple $z = (x, y)$ of input features $x$ and label $y$. The loss function is given as the quadratic measure $\ell(w, z) = (y - \langle w, x \rangle)^2$, which quantifies the loss accrued by the inferred model parameter $w$ on a data sample $z$. Corresponding to each per-task data set $Z^m = (Z_1, \ldots, Z_m)$, the biased regularization algorithm is a Kronecker delta function centered at the minimizer of the following optimization problem
$$\min_{w} \; \frac{1}{m} \sum_{j=1}^{m} \ell(w, Z_j) + \lambda \| w - u \|^2,$$
which corresponds to an empirical risk minimization problem with a biased regularizer. Here, $\lambda > 0$ is a regularization constant that weighs the deviation of the model parameter $w$ from a bias vector $u$. The bias vector $u$ can then be thought of as a common “mean” among related tasks. In the context of meta-learning, the objective is to infer the bias vector $u$ by observing data sets from a number of similar related tasks. Different meta-learning algorithms have been developed for this problem [11,12].
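To make the above concrete, the following minimal Python sketch computes the minimizer of the biased-regularization objective in closed form; the quadratic loss, data layout, and function names are illustrative assumptions rather than part of the original formulation.

```python
import numpy as np

def biased_ridge(X, y, u, lam):
    """Closed-form minimizer of (1/m) * ||y - X w||^2 + lam * ||w - u||^2.

    Setting the gradient to zero gives the normal equations
    (X^T X / m + lam * I) w = X^T y / m + lam * u,
    so the solution shrinks toward the bias vector u as lam grows.
    """
    m, d = X.shape
    A = X.T @ X / m + lam * np.eye(d)
    b = X.T @ y / m + lam * u
    return np.linalg.solve(A, b)

# Illustrative usage: with large lam, the solution stays close to u.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(20)
w_hat = biased_ridge(X, y, u=np.zeros(3), lam=0.1)
```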
In the meta-learning problem under study, we follow the standard setting of Baxter [13] and assume that the learning tasks belong to a task environment, which is defined by a probability distribution $P_T$ on the space $\mathcal{T}$ of learning tasks, and by per-task data distributions $P_{Z|T=\tau}$. The data set $Z^m_\tau$ for a task $\tau$ is then generated i.i.d. according to the distribution $P_{Z|T=\tau}$. The meta-learner observes the performance of the base-learner on the meta-training data from a finite number of meta-training tasks, which are sampled independently from the task environment, and infers the hyperparameter $U$ such that it can learn a new task, drawn from the same task environment, from fewer data samples.
The quality of the inferred hyperparameter $U$ is measured by the meta-generalization loss, $\mathcal{L}_g(U)$, which is the average loss incurred on the data set of a new, previously unseen task $T$ sampled from the task distribution $P_T$. The notation will be formally introduced in Section 2.2. While the goal of meta-learning is to infer a hyperparameter $U$ that minimizes the meta-generalization loss $\mathcal{L}_g(U)$, this loss is not computable, since the underlying task and data distributions are unknown. Instead, the meta-learner can evaluate an empirical estimate of the loss, $\mathcal{L}_t(U|Z^m_{1:N})$, using the meta-training set $Z^m_{1:N}$ of data from $N$ tasks, which is referred to as the meta-training loss. The difference between the meta-generalization loss and the meta-training loss is the meta-generalization gap,
$$\Delta \mathcal{L}(U|Z^m_{1:N}) = \mathcal{L}_g(U) - \mathcal{L}_t(U|Z^m_{1:N}),$$
and measures how well the inferred hyperparameter $U$ generalizes to a new, previously unseen task. In particular, if the meta-generalization gap is small, on average or with high probability, then the performance of the meta-learner on the meta-training set can be taken as a reliable estimate of the meta-generalization loss.
In this paper, we study information-theoretic upper bounds on the average meta-generalization gap $\mathbb{E}[\Delta \mathcal{L}(U|Z^m_{1:N})]$, where the average is taken with respect to the meta-training set $Z^m_{1:N}$ and the meta-learner defined by the stochastic kernel $P_{U|Z^m_{1:N}}$. Specifically, we extend the recent line of work initiated by Russo and Zou [14], and Xu and Raginsky [15], which obtains mutual information (MI)-based bounds on the average generalization gap for conventional learning, to meta-learning. To the best of our knowledge, this is the first work that studies information-theoretic bounds for meta-learning.
The bounds on the average meta-generalization gap studied in this work are distinct from other well-known bounds on the meta-generalization gap in the literature. Broadly speaking, existing bounds can be grouped into two classes: high-probability, probably approximately correct (PAC) bounds, and high-probability PAC-Bayesian bounds. These upper bounds take the general form $\epsilon(\delta)$, holding with probability at least $1 - \delta$, for $\delta \in (0, 1)$, over the meta-training set $Z^m_{1:N}$. In contrast, our work focuses on bounding the meta-generalization gap on average over the meta-training set. Notable PAC bounds on the meta-generalization gap include the bound of Baxter [13], obtained using the framework of Vapnik–Chervonenkis (VC) dimensions, and that of Maurer [16], which employs algorithmic stability [17,18] properties. In contrast, the PAC-Bayesian bounds also incorporate prior beliefs on the base-learner and the meta-learner posteriors via an auxiliary data-independent prior distribution and a hyper-prior distribution, respectively. Most notably, PAC-Bayesian bounds include that of Pentina and Lampert [19], the tighter bound of Amit and Meir [20], and, most recently, the bounds of Rothfuss et al. [21]. While the high-probability bounds are agnostic to task and data distributions, our information-theoretic bounds depend explicitly on the task and per-task data distributions, on the loss function, and on the meta-training algorithm, in accordance with prior work on information-theoretic generalization bounds.
Another general property inherited from the information-theoretic approach adopted in this paper is that the bounds on the average meta-generalization gap under study are designed to hold for arbitrary base-learners and meta-learners. As such, they generally do not result in bounds tighter than non-information-theoretic generalization guarantees obtained for specific meta-learning problems, such as the ridge regression problem with a meta-learned bias vector mentioned above [22]. Rather, the general purpose of the bounds in this paper is to provide insights into the number of tasks and the number of samples per task required to ensure that the training-based metrics are a good approximation of their population counterparts.
1.1. Main Contributions
The derivation of bounds on the average meta-generalization gap differs from conventional learning owing to two levels of uncertainty: environment-level uncertainty and within-task uncertainty. While within-task uncertainty results from observing a finite number $m$ of data samples per task, as in conventional learning, environment-level uncertainty results from observing a finite number $N$ of tasks from the task environment. The relative importance of these two forms of uncertainty depends on the use the meta-learner makes of the meta-training data. In fact, depending on how the meta-training data are used by the meta-learner, we identify two main classes of meta-training algorithms: with separate within-task training and test sets, and with joint within-task training and test sets. The former class includes state-of-the-art meta-learning algorithms, such as model-agnostic meta-learning (MAML) [23], which split the training data corresponding to each task into training and test sets, with the latter reserved for within-task validation. In contrast, the second class of algorithms, such as Reptile [24], uses the entire per-task data both for training and testing. Our main contributions are as follows.
- In Theorem 1, we show that, for the case with separate within-task training and test sets, the average meta-generalization gap contains only the contribution of environment-level uncertainty. This is captured by a ratio of the mutual information (MI) between the output of the meta-learner $U$ and the meta-training set $Z^m_{1:N}$, and the number of tasks $N$, as
$$\big| \mathbb{E}[\Delta \mathcal{L}(U|Z^m_{1:N})] \big| \leq \sqrt{\frac{2\sigma^2}{N} I(U; Z^m_{1:N})},$$
where $\sigma^2$ is the sub-Gaussianity variance factor of the meta-loss function. This is a direct parallel of the MI-based bounds for single-task learning [25].
- In Theorem 3, we then show that, for the case with joint within-task training and test sets, the bound on the average meta-generalization gap also contains a contribution due to the within-task uncertainty via the ratio of the MI between the output of the base-learner and the within-task training data and the per-task data sample size $m$. Specifically, we have the following bound
$$\big| \mathbb{E}[\Delta \mathcal{L}(U|Z^m_{1:N})] \big| \leq \sqrt{\frac{2\sigma^2}{N} I(U; Z^m_{1:N})} + \mathbb{E}_{P_T}\left[ \sqrt{\frac{2\delta_T^2}{m} I(W; Z^m_T)} \right],$$
where $\delta_T^2$ is the sub-Gaussianity variance factor of the loss function for task $T$.
- In Theorems 2 and 4, we extend the individual sample MI (ISMI) bound of [26] to obtain novel individual task MI (ITMI)-based bounds on the meta-generalization gap for separate and joint within-task training and test sets as
$$\big| \mathbb{E}[\Delta \mathcal{L}(U|Z^m_{1:N})] \big| \leq \frac{1}{N} \sum_{i=1}^{N} \sqrt{2\sigma^2 I(U; Z^m_i)}$$
and
$$\big| \mathbb{E}[\Delta \mathcal{L}(U|Z^m_{1:N})] \big| \leq \frac{1}{N} \sum_{i=1}^{N} \sqrt{2\sigma^2 I(U; Z^m_i)} + \mathbb{E}_{P_T}\left[ \frac{1}{m} \sum_{j=1}^{m} \sqrt{2\delta_T^2 I(W; Z_j)} \right],$$
respectively.
- Finally, we study the applications of the derived bounds to two meta-learning problems. The first is a parameter estimation setup that involves one-shot meta-learning and base-learning procedures, for which a closed form expression for meta-generalization gap can be derived. The second application covers a broad range of noisy iterative meta-learning algorithms and is inspired by the work of Pensia et al. [27] for conventional learning.
1.2. Related Work
For conventional learning, there exists a rich literature on diverse frameworks for deriving upper bounds on the generalization gap, i.e., on the difference between generalization and training losses. Classical bounds from statistical learning theory quantify the generalization gap in terms of measures of complexity of the model class, most notably the VC dimension [28] and Rademacher complexity [29]. This approach obtains high-probability, probably approximately correct (PAC) bounds on the generalization gap with respect to the training set. An alternate line of high-probability bounding techniques relies on the notion of algorithmic stability, which measures the sensitivity of the output of a learning algorithm to the replacement of individual samples in the training data set. The pioneering work [30] has been extended to include various notions of algorithmic stability [31,32,33]. As a notable example, a distributional notion of stability in terms of differential privacy, which quantifies the sensitivity of the distribution of the algorithm’s output to the data set, has been studied in [34,35]. The high-probability PAC-Bayesian bounds rely on change-of-measure arguments and use the Kullback–Leibler (KL) divergence between the algorithm and a data-independent prior to quantify the algorithmic sensitivity [36,37,38].
Following the initial work of Russo and Zou [14], information-theoretic bounds on the average generalization gap for conventional learning have been widely investigated in recent years. Xu and Raginsky [25] showed that the MI between the output of the learning algorithm and its training data set yields an upper bound, in expectation, on the generalization gap. The bound has been shown to offer computable generalization guarantees for noisy iterative algorithms, including stochastic gradient Langevin dynamics (SGLD), in [27]. Various refinements of the MI-based bound have since been analyzed to obtain tighter bounds. In particular, the bounds in [39] employ chaining mutual information techniques to tighten the bounds in [25], while the bound in [26] depends on the MI between the output of the algorithm and an individual data sample. The MI between the output of the algorithm and a random subset of the data set appears in the bounds introduced in [40]. The total variation information between the joint distribution of the training data and algorithmic output and the product of its marginals was shown in [41] to yield a bound on the generalization gap for any bounded loss function. Subsequent works [42,43,44] consider other information-theoretic measures, such as maximum leakage and lautum information. Most recently, a conditional mutual information (CMI)-based approach has been proposed in [45] to develop generalization bounds.
1.3. Notation
Throughout this paper, upper case letters, e.g., $X$, denote random variables and lower case letters, e.g., $x$, their realizations. We use $\mathcal{P}(\mathcal{X})$ to denote the set of all probability distributions on the argument set or vector space $\mathcal{X}$. For a discrete or continuous random variable $X$ taking values in a set or vector space $\mathcal{X}$, $P_X \in \mathcal{P}(\mathcal{X})$ denotes its probability distribution, with $P_X(x)$ being the probability mass or density value at $x \in \mathcal{X}$. We denote by $P_X^{\otimes n}$ the $n$-fold product distribution induced by $P_X$. The conditional distribution of a random variable $X$ given random variable $Y$ is similarly defined as $P_{X|Y}$, with $P_{X|Y=y}(x)$ representing the probability mass or density at $X = x$ conditioned on the event $Y = y$. We use $\|\cdot\|$ to denote the Euclidean norm of the argument vector, and $\mathbf{I}_d$ to denote the $d$-dimensional identity matrix. We define the Kronecker delta $\delta(x) = 1$ if $x = 0$ and $\delta(x) = 0$ otherwise.
2. Problem Definition
In this section, we define the problem of interest by introducing the key definitions of generalization gap for conventional, or single-task, learning and for meta-learning.
2.1. Generalization Gap for Single-Task Learning
Consider first the conventional problem of learning a single task $\tau$.
As illustrated in Figure 1, each task $\tau$ is associated with an underlying unknown data distribution $P_{Z|T=\tau}$, defined on a subset or vector space $\mathcal{Z}$. Henceforth, we use $P_{Z|\tau}$ to denote $P_{Z|T=\tau}$ for notational convenience.
Figure 1.
Directed graph representing the variables involved in the definition of generalization gap (11) for single-task learning.
The training procedure, which is referred to as the base-learner, has access to a training data set $Z^m = (Z_1, \ldots, Z_m) \in \mathcal{Z}^m$ of $m$ independent and identically distributed (i.i.d.) samples drawn from distribution $P_{Z|\tau}$. The base-learner uses this data set to choose a model, or hypothesis, $W$ from the model class $\mathcal{W}$ by using a randomized training procedure defined by a conditional distribution $P_{W|Z^m, u}$ as
$$W \sim P_{W|Z^m, u}. \quad (7)$$
The conditional distribution $P_{W|Z^m, u}$ defines a stochastic mapping from the training data set $Z^m$ to the model class $\mathcal{W}$. The training procedure (7) is parameterized by a vector $u \in \mathcal{U}$ of hyperparameters, which defines the inductive bias. As an example, the base-learner may follow stochastic gradient descent (SGD) updates with hyperparameters $u$, including the learning rate and the initialization point.
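As an illustration of such a base-learner, the following Python sketch implements SGD whose inductive bias $u$ consists of the initialization and the learning rate; the interface and names are hypothetical.

```python
import numpy as np

def sgd_base_learner(z_train, grad_loss, u_init, lr, n_epochs=5, seed=0):
    """A randomized base-learner P_{W|Z^m, u} realized by SGD.

    The hyperparameter u = (u_init, lr) defines the inductive bias;
    randomness enters through the shuffling of the training samples.
    """
    rng = np.random.default_rng(seed)
    w = np.array(u_init, dtype=float)
    for _ in range(n_epochs):
        for j in rng.permutation(len(z_train)):
            w -= lr * grad_loss(w, z_train[j])  # per-sample gradient step
    return w
```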
The performance of a parameter vector $w \in \mathcal{W}$ on a data sample $z \in \mathcal{Z}$ is measured by a loss function $\ell(w, z)$. The generalization loss for a model parameter vector $w$ is the average
$$\mathcal{L}_g(w|\tau) = \mathbb{E}_{Z \sim P_{Z|\tau}}[\ell(w, Z)], \quad (8)$$
over a test example $Z$ independently drawn from the data distribution $P_{Z|\tau}$. The subscript $g$ is used to distinguish the generalization loss from the training loss defined below. The generalization loss cannot be computed by the learner, given that the data distribution $P_{Z|\tau}$ is unknown. Instead, the learner can evaluate the training loss on the data set $Z^m$, which is defined as the empirical average
$$\mathcal{L}_t(w|Z^m) = \frac{1}{m} \sum_{j=1}^{m} \ell(w, Z_j). \quad (9)$$
The subscript t specifies that the loss is the empirical training loss.
The difference between the generalization loss (8) and the training loss (9) is known as the generalization gap,
$$\Delta \mathcal{L}(W|Z^m, \tau) = \mathcal{L}_g(W|\tau) - \mathcal{L}_t(W|Z^m), \quad (10)$$
and is a key metric that quantifies the level of uncertainty (this type of uncertainty is known as epistemic) at the learner regarding the data distribution $P_{Z|\tau}$. The average generalization gap for the data distribution $P_{Z|\tau}$ and base-learner $P_{W|Z^m, u}$ is defined as
$$\mathbb{E}_{P_{Z^m|\tau} P_{W|Z^m, u}}\big[\Delta \mathcal{L}(W|Z^m, \tau)\big], \quad (11)$$
where the expectation is taken with respect to the joint distribution $P_{Z^m|\tau} P_{W|Z^m, u}$, with $P_{Z^m|\tau} = P_{Z|\tau}^{\otimes m}$. A summary of the variables involved in the definition of the generalization gap (11) can be found in Figure 1.
Intuitively, if the generalization gap is small, on average or with high probability, then the base-learner can take the performance (9) on the training set as a reliable measure of the generalization loss (8) of the trained model W. Furthermore, data-dependent bounds on the generalization gap can be used as regularization terms to avoid overfitting, yielding generalized Bayesian inference problems [46,47].
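To illustrate the quantities (8)–(11), the following Python sketch estimates the average generalization gap by Monte-Carlo simulation for a toy scalar task; the Gaussian data model and the biased-regularization base-learner are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def avg_generalization_gap(mu, u, lam=0.5, m=10, n_trials=20000):
    """Monte-Carlo estimate of the average generalization gap (11) for a
    toy scalar task: Z ~ N(mu, 1), quadratic loss l(w, z) = (w - z)^2,
    and a biased-regularization base-learner w = (zbar + lam * u) / (1 + lam).
    """
    gaps = np.empty(n_trials)
    for t in range(n_trials):
        z = rng.normal(mu, 1.0, size=m)
        w = (z.mean() + lam * u) / (1.0 + lam)
        train_loss = np.mean((w - z) ** 2)     # training loss (9)
        gen_loss = (w - mu) ** 2 + 1.0         # E[(w - Z)^2] for Z ~ N(mu, 1)
        gaps[t] = gen_loss - train_loss
    return gaps.mean()
```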
2.2. Generalization Gap for Meta-Learning
As discussed, in single-task learning, the inductive bias u, defining the hyperparameters of the training procedure, must be selected a priori, i.e., without having access to task-specific data. The inductive bias determines the training data set size m needed to ensure a small generalization loss (8), since, generally speaking, richer models require more data to be trained [1]. The sample complexity can be generally reduced if one selects a suitable inductive bias based on prior information. Such prior information is typically obtained from domain knowledge on the problem under study. In contrast, meta-learning aims at automatically inferring an effective inductive bias based on data from related tasks.
To elaborate, we follow the setting of [13], in which a meta-learner observes data from a number of tasks, known as meta-training tasks, from the same task environment. A task environment is defined by a task distribution $P_T$, supported on the space of tasks $\mathcal{T}$, and by a per-task data distribution $P_{Z|T=\tau}$ for each task $\tau \in \mathcal{T}$. Using the meta-training data drawn from a randomly selected subset of tasks, the meta-learner infers a hyperparameter vector $u$ defining the inductive bias. This is done with the goal of ensuring that, using hyperparameter $u$, the base-learner can efficiently learn a new task, referred to as the meta-test task, drawn independently from the same task distribution $P_T$.
To elaborate, the meta-training data consist of $N$ data sets $Z^m_{1:N} = (Z^m_1, \ldots, Z^m_N)$. Each $i$th data set $Z^m_i$ is generated independently by first drawing a task $T_i \sim P_T$ from the task environment and then a task-specific training data set $Z^m_i \sim P_{Z|T_i}^{\otimes m}$. The meta-learner uses the meta-training data set $Z^m_{1:N}$ to infer a hyperparameter vector $U \in \mathcal{U}$. To this end, we consider a randomized meta-learner
$$U \sim P_{U|Z^m_{1:N}}, \quad (12)$$
where $P_{U|Z^m_{1:N}}$ is a stochastic mapping from the meta-training set to the space $\mathcal{U}$ of hyperparameters. We distinguish two different formulations of meta-learning that are often considered in the literature. In the first, the per-task data set is split into a training, or support, subset and a test, or query, subset [23,48]; in the second, the entire data set is used for both within-task training and testing [13,19,20].
2.2.1. Separate Within-Task Training and Test Sets
As seen in Figure 2, in this first approach to meta-learning, each meta-training sub-data set $Z^m_i$ is split into a training set $Z^{tr}_i$ and a test set $Z^{te}_i$ as $Z^m_i = (Z^{tr}_i, Z^{te}_i)$, where $Z^{tr}_i$ contains $m^{tr}$ i.i.d. training examples and $Z^{te}_i$ contains $m^{te}$ i.i.d. test examples, with $m = m^{tr} + m^{te}$. The within-task base-learner $P_{W|Z^{tr}_i, u}$ maps the per-task training subset $Z^{tr}_i$ to a random model parameter $W$ for a given hyperparameter $u$. The test subset $Z^{te}_i$ is used to evaluate the empirical training loss of a model $w$ for task $T_i$ as
$$\mathcal{L}_t(w|Z^{te}_i) = \frac{1}{m^{te}} \sum_{j=1}^{m^{te}} \ell\big(w, Z^{te}_{i,j}\big), \quad (13)$$
where $Z^{te}_{i,j}$ denotes the $j$th example of the test subset $Z^{te}_i$. Furthermore, the overall empirical meta-training loss for a hyperparameter $u$ is computed by averaging over all $N$ meta-training tasks as
$$\mathcal{L}_t(u|Z^m_{1:N}) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_t(u|Z^m_i), \quad (14)$$
where
$$\mathcal{L}_t(u|Z^m_i) = \mathbb{E}_{W \sim P_{W|Z^{tr}_i, u}}\big[\mathcal{L}_t(W|Z^{te}_i)\big] \quad (15)$$
is the average per-task training loss over the base-learner.
Figure 2.
Directed graph representing the variables involved in the definition of meta-generalization gap (18) for separate within-task training and testing sets.
We emphasize that the meta-training loss (14) can be computed by the meta-learner and used as a criterion to select the meta-learning procedure (12), since it is obtained from the meta-training data $Z^m_{1:N}$. We also note that the rationale for splitting training and test sets is that the average per-task training loss (15), being evaluated on the held-out test subset, is an unbiased estimate of the corresponding average per-task generalization loss (8).
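The following Python sketch makes explicit how the meta-training loss (14) is computed from the split data; the data structures and callable interfaces are hypothetical.

```python
import numpy as np

def meta_training_loss(meta_data, base_learner, loss, u):
    """Empirical meta-training loss (14) with separate train/test subsets.

    meta_data:    list of (z_train, z_test) pairs, one per meta-training task.
    base_learner: maps (z_train, u) to a model parameter w (possibly random).
    loss:         per-sample loss l(w, z).
    """
    per_task_losses = []
    for z_train, z_test in meta_data:
        w = base_learner(z_train, u)
        # Per-task training loss (13), evaluated on the held-out test subset.
        per_task_losses.append(np.mean([loss(w, z) for z in z_test]))
    # Average over the N meta-training tasks, cf. (14).
    return float(np.mean(per_task_losses))
```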
The true goal of the meta-learner is to minimize the meta-generalization loss,
$$\mathcal{L}_g(u) = \mathbb{E}_{T \sim P_T} \, \mathbb{E}_{Z^{tr} \sim P_{Z|T}^{\otimes m^{tr}}} \, \mathbb{E}_{W \sim P_{W|Z^{tr}, u}}\big[\mathcal{L}_g(W|T)\big], \quad (16)$$
where $\mathcal{L}_g(W|T)$ is as defined in (8). Unlike the meta-training loss (14), the meta-generalization loss is evaluated on a new, meta-test task $T$ and on the corresponding training data $Z^{tr}$. We distinguish the meta-generalization loss and the meta-training loss by the subscripts $g$ and $t$ in (16) and (14), respectively. The difference between the meta-generalization loss (16) and the meta-training loss (14), known as the meta-generalization gap, is defined as
$$\Delta \mathcal{L}(u|Z^m_{1:N}) = \mathcal{L}_g(u) - \mathcal{L}_t(u|Z^m_{1:N}). \quad (17)$$
The quantity of interest to us is the average meta-generalization gap, defined as
$$\mathbb{E}_{P_{Z^m_{1:N}} P_{U|Z^m_{1:N}}}\big[\Delta \mathcal{L}(U|Z^m_{1:N})\big], \quad (18)$$
where the expectation is with respect to the joint distribution $P_{Z^m_{1:N}} P_{U|Z^m_{1:N}}$ of the meta-training set $Z^m_{1:N}$ and of the hyperparameter $U$. Note that $P_{Z^m_{1:N}}$ is the marginal of the joint distribution $P_T^{\otimes N} \prod_{i=1}^{N} P_{Z|T_i}^{\otimes m}$ over the tasks $T_{1:N}$.
Intuitively, if the meta-generalization gap is small, on average or with high probability, the meta-learner can take the performance (14) on the meta-training data as a reliable measure of the accuracy of the inferred hyperparameter vector in terms of the meta-generalization loss (16). Furthermore, data-dependent bounds on the meta-generalization gap can be used as regularization terms to avoid meta-overfitting. Meta-overfitting occurs when the meta-trained hyperparameter yields a small meta-training loss but a large meta-test loss, due to an excessive dependence on the meta-training set [13].
2.2.2. Joint Within-Task Training and Test Sets
In the second formulation of meta-learning, as illustrated in Figure 3, the entire data set $Z^m_i$ is used for within-task training and testing. Accordingly, the meta-learner computes the meta-training loss
$$\mathcal{L}_t(u|Z^m_{1:N}) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_t(u|Z^m_i), \quad (19)$$
where
$$\mathcal{L}_t(u|Z^m_i) = \mathbb{E}_{W \sim P_{W|Z^m_i, u}}\big[\mathcal{L}_t(W|Z^m_i)\big] \quad (20)$$
is the average per-task training loss. Note that, in evaluating the meta-training loss in (19), the data set $Z^m_i$ is used both to infer the model parameters $W$ and to evaluate the per-task training loss. The expectation in (20) is taken over the output of the base-learner $W$ given the hyperparameter vector $u$. As discussed, the meta-generalization loss for hyperparameter $u$ is computed by randomly selecting a novel task $T \sim P_T$ as
$$\mathcal{L}_g(u) = \mathbb{E}_{T \sim P_T} \, \mathbb{E}_{Z^m \sim P_{Z|T}^{\otimes m}} \, \mathbb{E}_{W \sim P_{W|Z^m, u}}\big[\mathcal{L}_g(W|T)\big], \quad (21)$$
where $\mathcal{L}_g(W|T)$ is as defined in (8). In a manner similar to (17), the meta-generalization gap for a task distribution $P_T$, data distribution $P_{Z|T}$, meta-learning algorithm $P_{U|Z^m_{1:N}}$, and base-learner $P_{W|Z^m, u}$ is defined as
$$\Delta \mathcal{L}(u|Z^m_{1:N}) = \mathcal{L}_g(u) - \mathcal{L}_t(u|Z^m_{1:N}). \quad (22)$$
Figure 3.
Directed graph representing the variables involved in the definition of meta-generalization gap (22) for joint within-task training and testing sets.
The average meta-generalization gap is then given as $\mathbb{E}_{P_{Z^m_{1:N}} P_{U|Z^m_{1:N}}}\big[\Delta \mathcal{L}(U|Z^m_{1:N})\big]$, where the expectation is taken over all meta-training sets and over the output of the meta-learner.
3. Information-Theoretic Generalization Bounds for Single-Task Learning
In this section, we review two information-theoretic bounds on the generalization gap (11) for conventional learning derived in [25,26]. The material covered in this section provides the necessary background for the analysis of the meta-generalization gap studied in the rest of the paper. Throughout this section, we fix a task $\tau \in \mathcal{T}$. Since the generalization and meta-generalization gaps measure the deviation of empirical-mean random variables representing training and meta-training losses from reference values, we make use of tools and definitions from large-deviation theory (see, e.g., [49]). We discuss the key definitions below.
3.1. Preliminaries
To start, the cumulant generating function (CGF) of a random variable $X$ is defined as $\Lambda_X(\lambda) = \log \mathbb{E}\big[e^{\lambda(X - \mathbb{E}[X])}\big]$. If it is well-defined, the CGF is convex and satisfies the equalities $\Lambda_X(0) = \Lambda'_X(0) = 0$. A random variable $X$ with finite mean, i.e., with $\mathbb{E}[X] < \infty$, is said to be $\sigma$-sub-Gaussian if its CGF is bounded as
$$\Lambda_X(\lambda) \leq \frac{\lambda^2 \sigma^2}{2} \quad \text{for all } \lambda \in \mathbb{R}. \quad (23)$$
As a special case, if $X$ is bounded in the interval $[a, b]$, i.e., if the inequality $a \leq X \leq b$ holds for some constants $a$ and $b$, then $X$ is $\frac{b-a}{2}$-sub-Gaussian.
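This special case is Hoeffding’s lemma, which, in the notation above, states that
$$\Lambda_X(\lambda) = \log \mathbb{E}\big[e^{\lambda(X - \mathbb{E}[X])}\big] \leq \frac{\lambda^2 (b - a)^2}{8} = \frac{\lambda^2 \sigma^2}{2} \quad \text{for all } \lambda \in \mathbb{R}, \text{ with } \sigma = \frac{b - a}{2},$$
so that (23) holds with the stated variance factor.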
3.2. Mutual Information (MI) Bound
We first present the mutual information (MI)-based upper bound obtained in [25]. Key to this result is the following Assumption.
Assumption 1.
The loss function $\ell(w, Z)$ is $\sigma$-sub-Gaussian under $Z \sim P_{Z|\tau}$ for all model parameters $w \in \mathcal{W}$.

In particular, if the loss function is bounded, i.e., if the inequalities $a \leq \ell(w, z) \leq b$ hold for all $w \in \mathcal{W}$ and $z \in \mathcal{Z}$, Assumption 1 is satisfied with $\sigma = \frac{b-a}{2}$. The main result is as follows.
Lemma 1
([25]). Under Assumption 1, the following bound on the generalization gap holds for any base-learner $P_{W|Z^m, u}$:
$$\big| \mathbb{E}[\Delta \mathcal{L}(W|Z^m, \tau)] \big| \leq \sqrt{\frac{2\sigma^2}{m} I(W; Z^m)}. \quad (24)$$
The proof of Lemma 1 is based on a decoupling estimate lemma, which is reported for completeness as Lemma A1. We also note that the result in Lemma 1 can be extended to account for loss functions with bounded CGF [14].
The bound (24) on the generalization gap is in terms of the mutual information $I(W; Z^m)$, which quantifies the overall dependence between the base-learner output $W$ and the input training data set $Z^m$. The mutual information in (24) is hence a measure of the sensitivity of the base-learner output to the data set. Using the terminology in [25], if $I(W; Z^m) \leq \epsilon$, the base-learner is said to be $\epsilon$-MI stable, in which case the bound in (24) evaluates to $\sqrt{2\sigma^2 \epsilon / m}$. The relationship between generalization and stability of a training algorithm is well established [1], and the result (24) amounts to a formulation of this link in information-theoretic terms.
The traditional notion of algorithmic stability measures how much the base-learner output changes with the replacement of an individual training sample [30,50]. In the next section, we review the bound in [26] that translates this per-sample stability concept within an information-theoretic framework.
3.3. Individual Sample MI (ISMI) Bound
The MI-based bound in Lemma 1 has the disadvantage of being vacuous, i.e., $I(W; Z^m) = \infty$, for deterministic base-learning algorithms defined on a continuous parameter space $\mathcal{W}$. An individual sample MI (ISMI)-based bound that addresses this shortcoming was introduced in [26]. The ISMI bound borrows the standard algorithmic stability notion of sensitivity of the base-learner output to the replacement of any individual training sample [17,18]. Accordingly, the resulting bound is in terms of the MI between the trained parameter $W$ and each data point $Z_j$ of the training data set $Z^m$. The bound, summarized in Lemma 2, applies under the following assumption.
Assumption 2.
The loss function $\ell(w, Z)$ satisfies either of the following two conditions:
- (a)
- Assumption 1, or
- (b)
- $\ell(W, Z)$ is a $\sigma$-sub-Gaussian random variable when $(W, Z) \sim P_W \otimes P_{Z|\tau}$, where $P_W$ is the marginal of the joint distribution $P_{Z^m|\tau} P_{W|Z^m, u}$.
We note that, in general, Assumption 1 does not imply Assumption 2 (see ([40], Appendix C)), and vice versa (see [26]). There are, however, loss functions and relevant distributions for which both assumptions hold, including the case of loss functions taking values in a bounded interval $[a, b]$.
Lemma 2
([26]). Under Assumption 2, the following bound on the average generalization gap holds for any base-learner $P_{W|Z^m, u}$:
$$\big| \mathbb{E}[\Delta \mathcal{L}(W|Z^m, \tau)] \big| \leq \frac{1}{m} \sum_{j=1}^{m} \sqrt{2\sigma^2 I(W; Z_j)} \quad (25)$$
$$\leq \sqrt{\frac{2\sigma^2}{m} I(W; Z^m)}. \quad (26)$$
The inequality in (26) follows from the chain rule of mutual information and Jensen’s inequality [26].
4. Information-Theoretic Generalization Bounds for Meta-Learning
In this section, we first derive novel MI-based bounds on the meta-generalization gap with separate within-task training and test sets, as introduced in Section 4.1, and then we consider joint within-task training and test sets, as described in Section 4.2.
4.1. Bounds on Meta-Generalization Gap with Separate Within-Task Training and Test Sets
In this section, we present two novel MI-based bounds on the meta-generalization gap (18) for the setup with separate within-task training and testing sets. The first is an MI-based bound, which is akin to Lemma 1, and the second is an individual task MI (ITMI) bound, which resembles Lemma 2 for conventional learning.
4.1.1. MI-Based Bound
In order to derive the MI-based bound, we make the following assumption on the average per-task training loss $\mathcal{L}_t(u|Z^m)$ in (15). Throughout, we use $P_{Z^m}$ to denote the marginal of the joint distribution $P_T P_{Z|T}^{\otimes m}$.
Assumption 3.
For all $u \in \mathcal{U}$, the average per-task training loss $\mathcal{L}_t(u|Z^m)$ is $\sigma$-sub-Gaussian under $Z^m \sim P_{Z^m}$.
Distinct from the assumptions in Section 3 on the loss function $\ell(w, z)$, we note that Assumption 3 concerns the average per-task training loss $\mathcal{L}_t(u|Z^m)$. This is because a loss function satisfying Assumption 1 does not in general guarantee the sub-Gaussianity of $\mathcal{L}_t(u|Z^m)$ with respect to $P_{Z^m}$. However, if the loss function is bounded, Assumption 3 can be easily verified to hold, as given in the following lemma.
Lemma 3.
If the loss function is bounded, i.e., $\ell(w, z) \in [a, b]$ for all $w \in \mathcal{W}$ and $z \in \mathcal{Z}$, then $\mathcal{L}_t(u|Z^m)$ is also bounded in $[a, b]$ for all $u \in \mathcal{U}$. Consequently, $\mathcal{L}_t(u|Z^m)$ is $\frac{b-a}{2}$-sub-Gaussian under $P_{Z^m}$ for all $u \in \mathcal{U}$.
Under Assumption 3, the following theorem presents an upper bound on the meta-generalization gap (18).
Theorem 1.
Let Assumption 3 hold for the base-learner $P_{W|Z^{tr}, u}$. Then, for any meta-learner $P_{U|Z^m_{1:N}}$ such that the inequality $I(U; Z^m_{1:N}) < \infty$ holds, we have the following bound on the average meta-generalization gap:
$$\big| \mathbb{E}[\Delta \mathcal{L}(U|Z^m_{1:N})] \big| \leq \sqrt{\frac{2\sigma^2}{N} I(U; Z^m_{1:N})}. \quad (27)$$
Proof.
See Appendix B. □
The technical lemmas required for the proof of Theorem 1 and the theorems that follow are included in Appendix A.
In order to prove Theorem 1, one needs to overcome an additional challenge as compared to the derivation of bounds for conventional learning reviewed in Section 3. In fact, the meta-generalization gap is caused by two distinct sources of uncertainty: environment-level uncertainty due to a finite number $N$ of observed tasks, and within-task uncertainty resulting from the finite number $m$ of per-task data samples. Our proof approach involves applying the single-task MI-based bound in Lemma 1 to bound the effect of both sources of uncertainty.
Towards this, we start by introducing the average training loss for the randomly selected meta-test task as a function of the hyperparameter $u$:
$$\mathcal{L}_{g,t}(u) = \mathbb{E}_{T \sim P_T} \, \mathbb{E}_{Z^m \sim P_{Z|T}^{\otimes m}}\big[\mathcal{L}_t(u|Z^m)\big]. \quad (28)$$
The subscript $g,t$ denotes that the loss is a generalization ($g$) loss, with an expectation over $P_T$ at the environment level, and a training ($t$) loss at the task level, with $Z^m \sim P_{Z|T}^{\otimes m}$. Note that this differs from the meta-generalization loss in (16) in that the per-task loss is evaluated in (28) on the training set. With this definition, the meta-generalization gap can be decomposed as
$$\Delta \mathcal{L}(u|Z^m_{1:N}) = \big(\mathcal{L}_g(u) - \mathcal{L}_{g,t}(u)\big) + \big(\mathcal{L}_{g,t}(u) - \mathcal{L}_t(u|Z^m_{1:N})\big). \quad (29)$$
In (29), the second difference corresponds to the environment-level uncertainty and arises from the observation of a finite number $N$ of tasks. In fact, as $N$ increases, the meta-training loss $\mathcal{L}_t(u|Z^m_{1:N})$ almost surely tends to $\mathcal{L}_{g,t}(u)$ by the law of large numbers. However, the average difference is not equal to zero in general for finite values of $N$. The within-task generalization gap is instead measured by the first difference $\mathcal{L}_g(u) - \mathcal{L}_{g,t}(u)$. In the setup under study with separate within-task training and test sets, this term equals zero, since, as we discussed, the average empirical loss is an unbiased estimate of the corresponding average test loss (cf. (28)). This is no longer true for joint within-task training and test sets, as we discuss in Section 4.2.
The decomposition approach adopted here follows the main steps of the bounding techniques introduced in ([16], Equation (6)). In contrast, the PAC-Bayesian bounds in [20,21] rely on a different decomposition of the meta-generalization gap: the environment-level and within-task generalization gaps are separately bounded with high probability, and are then combined via the union bound to obtain the desired PAC-Bayesian bounds.
The bound (27) relates the meta-generalization gap to the information-theoretic stability of the meta-training procedure. As first introduced here, this stability is measured by the MI $I(U; Z^m_{1:N})$ between the hyperparameter $U$ and the meta-training data set $Z^m_{1:N}$, in a manner similar to the MI-based bound in Lemma 1 for conventional learning. Importantly, as we will discuss in Section 4.2, this direct parallel between learning and meta-learning no longer applies with joint within-task training and test data sets.
4.1.2. ITMI Bound
We now present the ITMI bound, which holds under the following assumption.
Assumption 4.
Either of the following assumptions on the average per-task training loss $\mathcal{L}_t(u|Z^m)$ holds:
- (a)
- $\mathcal{L}_t(u|Z^m)$ satisfies Assumption 3, or
- (b)
- $\mathcal{L}_t(U|Z^m)$ is $\sigma$-sub-Gaussian when $(U, Z^m) \sim P_U \otimes P_{Z^m}$, where $P_U$ is the marginal of the joint distribution $P_{Z^m_{1:N}} P_{U|Z^m_{1:N}}$ and $P_{Z^m}$ is the marginal of the joint distribution $P_T P_{Z|T}^{\otimes m}$.
Assumption 4 can be seen to be implied by the sufficient conditions in Lemma 3.
Theorem 2.
Let Assumption 4 hold for the base-learner $P_{W|Z^{tr}, u}$. Then, for any meta-learner $P_{U|Z^m_{1:N}}$, the following bound on the meta-generalization gap (18) holds:
$$\big| \mathbb{E}[\Delta \mathcal{L}(U|Z^m_{1:N})] \big| \leq \frac{1}{N} \sum_{i=1}^{N} \sqrt{2\sigma^2 I(U; Z^m_i)}, \quad (30)$$
where the MI $I(U; Z^m_i)$ is computed with respect to the joint distribution $P_{U, Z^m_i}$ obtained by marginalizing the probability distribution $P_{Z^m_{1:N}} P_{U|Z^m_{1:N}}$.
Proof.
See Appendix B. □
As can be seen from (30), the ITMI bound on the meta-generalization gap is in terms of the MI between the output $U$ of the meta-learner and each per-task data set $Z^m_i$. This, in turn, quantifies the sensitivity of the meta-learner output to the replacement of a single per-task data set. Moreover, under Assumption 3, the ITMI bound (30) is tighter than the MI-based bound (27). This can be seen from the following sequence of relations:
$$\frac{1}{N} \sum_{i=1}^{N} \sqrt{2\sigma^2 I(U; Z^m_i)} \overset{(a)}{\leq} \sqrt{\frac{2\sigma^2}{N} \sum_{i=1}^{N} I(U; Z^m_i)} \overset{(b)}{\leq} \sqrt{\frac{2\sigma^2}{N} I(U; Z^m_{1:N})}, \quad (31)$$
where $(a)$ follows from Jensen’s inequality; and $(b)$ follows since $Z^m_i$ is independent of the remaining data sets $Z^m_{1:N \setminus i}$, which implies $\sum_{i=1}^{N} I(U; Z^m_i) \leq I(U; Z^m_{1:N})$.
4.2. Bounds on Generalization Gap with Joint Within-Task Training and Test Sets
We now derive MI- and ITMI-based bounds on the meta-generalization gap in (22) for the case with joint within-task training and test sets. As we will see, the key difference with respect to the case with separate within-task training and test sets is that the uncertainty due to the finite number of per-task samples, measured by the first difference in the decomposition (29), contributes in a non-negligible way to the meta-generalization gap. Since there is no split into separate within-task training and test sets, the average training loss with respect to the learning algorithm is given by $\mathcal{L}_t(u|Z^m)$ in (20).
4.2.1. MI-Based Bound
In order to derive the MI-based bound, we make the following assumptions.
Assumption 5.
We consider the following assumptions.
- (a)
- For each task $\tau \in \mathcal{T}$, the loss function $\ell(w, z)$ satisfies Assumption 1 with variance factor $\delta_\tau^2$, and
- (b)
- for all $u \in \mathcal{U}$, the average per-task training loss $\mathcal{L}_t(u|Z^m)$ is $\sigma$-sub-Gaussian under $Z^m \sim P_{Z^m}$.
An easily verifiable sufficient condition for the above assumptions to hold is the boundedness of the loss function $\ell(w, z)$, which follows in a manner similar to Lemma 3.
Theorem 3.
Let Assumption 5 hold for a base-learner $P_{W|Z^m, u}$. Then, for any meta-learner $P_{U|Z^m_{1:N}}$, we have the following bound on the meta-generalization gap (22):
$$\big| \mathbb{E}[\Delta \mathcal{L}(U|Z^m_{1:N})] \big| \leq \sqrt{\frac{2\sigma^2}{N} I(U; Z^m_{1:N})} + \mathbb{E}_{P_T}\left[ \sqrt{\frac{2\delta_T^2}{m} I(W; Z^m_T)} \right], \quad (32)$$
where the MI $I(W; Z^m_T)$ is evaluated with respect to the joint distribution of $(W, Z^m_T)$ given $T$, obtained by marginalizing over the meta-training set and the meta-learner output.
Proof.
See Appendix C. □
With joint within-task training and test sets, the bound (32) on the meta-generalization gap contains the contributions of two mutual information terms. The first, $I(U; Z^m_{1:N})$, quantifies the sensitivity of the meta-learner output $U$ to the meta-training data set $Z^m_{1:N}$. This term also appeared in the bound (27) with separate within-task training and test sets. Decomposing the meta-generalization gap in a manner analogous to (29), it corresponds to a bound on the average of the second difference. The second contribution, $I(W; Z^m_T)$, quantifies the sensitivity of the output $W$ of the base-learner to the data set $Z^m_T$ of the meta-test task $T$, when the hyperparameter is randomly selected by the meta-learner using the meta-training set $Z^m_{1:N}$. This second term is in line with the single-task generalization gap bound (24), and it bounds the corresponding first difference in the decomposition (29).
We finally note that the dependence of the bound in (32) on the number of tasks $N$ and per-task samples $m$ is of the order $O(1/\sqrt{N} + 1/\sqrt{m})$. Meta-generalization bounds with a similar dependence have been derived in [20] using PAC-Bayesian arguments. The bounds on the excess risk for representation learning also follow a similar order of dependence on $N$ and $m$ (cf. ([51], Theorem 2)).
4.2.2. ITMI Bound on (22)
For deriving the ITMI bound on the meta-generalization gap (22), we assume the following.
Assumption 6.
Either of the following assumptions holds:
- (a)
- Assumption 5 holds, or
- (b)
- For each task $\tau \in \mathcal{T}$, the loss function $\ell(W, Z)$ is $\delta_\tau$-sub-Gaussian when $(W, Z) \sim P_{W|T=\tau} \otimes P_{Z|T=\tau}$, where $P_{W|T=\tau}$ is the marginal of the joint distribution $P_{Z^m|T=\tau} P_{W|Z^m, u}$. The average per-task training loss $\mathcal{L}_t(U|Z^m)$ is $\sigma$-sub-Gaussian when $(U, Z^m) \sim P_U \otimes P_{Z^m}$.
As in Section 4.1.2, Assumption 6 can be seen to be implied by the sufficient conditions in Lemma 3.
Theorem 4.
Under Assumption 6, for any meta-learner $P_{U|Z^m_{1:N}}$, the following bound holds on the average meta-generalization gap:
$$\big| \mathbb{E}[\Delta \mathcal{L}(U|Z^m_{1:N})] \big| \leq \frac{1}{N} \sum_{i=1}^{N} \sqrt{2\sigma^2 I(U; Z^m_i)} + \mathbb{E}_{P_T}\left[ \frac{1}{m} \sum_{j=1}^{m} \sqrt{2\delta_T^2 I(W; Z_j)} \right], \quad (33)$$
where the MI $I(U; Z^m_i)$ is evaluated with respect to $P_{U, Z^m_i}$ obtained by marginalizing $P_{Z^m_{1:N}} P_{U|Z^m_{1:N}}$, and the MI $I(W; Z_j)$ is evaluated with respect to the joint distribution of $(W, Z_j)$ given $T$, obtained by marginalizing over the meta-learner output $U$.
Proof.
See Appendix C. □
Similar to the bound in (32), the bound on the meta-generalization gap in (33) is in terms of two types of mutual information, the first describing the sensitivity of the meta-learner and the second the sensitivity of the base-learner. Specifically, the MI $I(U; Z^m_i)$ quantifies the sensitivity of the output $U$ of the meta-learner to the per-task data set $Z^m_i$, and the MI $I(W; Z_j)$ measures the sensitivity of the output $W$ of the base-learner to each data sample $Z_j$ within the training set of the meta-test task $T$. Moreover, it can be shown, in a manner similar to (31), that, under Assumption 5, the ITMI bound in (33) is tighter than the MI bound in (32).
4.3. Discussion on Bounds
The bounds on the average meta-generalization gap obtained in this section generalize the bounds for conventional single-task learning in Section 3. To see this, consider a task distribution $P_T$ concentrated on some task $\tau$. Recall that, in conventional learning, the hyperparameter $u$ is fixed a priori. As such, the mutual information $I(U; Z^m_{1:N})$ (for MI-based bounds) and $I(U; Z^m_i)$ (for ITMI-based bounds) vanish. For separate within-task training and test sets, this implies that the average meta-generalization gap is zero, which follows since the per-task test loss is an unbiased estimate of the per-task generalization loss. The MI- and ITMI-based bounds for the joint within-task training and test sets then reduce to
$$\big| \mathbb{E}[\Delta \mathcal{L}] \big| \leq \sqrt{\frac{2\delta_\tau^2}{m} I(W; Z^m)} \quad (34)$$
and
$$\big| \mathbb{E}[\Delta \mathcal{L}] \big| \leq \frac{1}{m} \sum_{j=1}^{m} \sqrt{2\delta_\tau^2 I(W; Z_j)}, \quad (35)$$
respectively, where $I(W; Z^m)$ is evaluated with respect to the joint distribution $P_{Z^m|\tau} P_{W|Z^m, u}$ and $I(W; Z_j)$ with respect to its marginal $P_{W, Z_j}$.
The MI- and ITMI-based bounds derived in this section indicate that a smaller correlation between the hyperparameters and the meta-training set, and thus a smaller mutual information, improves the meta-generalization gap, although reducing this dependence may seem deleterious to performance. To clarify this apparent contradiction, we emphasize that these bounds quantify the difference between the meta-generalization loss and the empirical training loss, which in turn depends on the sensitivity of the meta-learner and base-learner to their input meta-training set and per-task training set, respectively. The mutual information terms in our bounds capture these sensitivities. Consequently, our bounds suggest that a meta-learner that is highly correlated with the input meta-training set (i.e., when $I(U; Z^m_{1:N})$ is large) does not generalize well (i.e., yields a large meta-generalization gap). This property aligns with previous information-theoretic analyses of generalization in conventional learning [25].
To the best of our knowledge, the MI- and ITMI-based bounds studied here are the first bounds on the average meta-generalization gap. As discussed in the introduction, these bounds are distinct from the high-probability PAC and PAC-Bayesian bounds on the meta-generalization gap studied previously on meta-learning. Consequently, the bounds studied in this work are not directly comparable with the existing high-probability bounds.
Finally, we note that similarity between tasks is crucial to meta-learning. If the per-task data distributions in the task environment are ‘closer’ to each other, a meta-learner can efficiently learn the shared characteristics of the tasks and can generalize well to new tasks from the task environment. In our setting, the statistical properties of the task environment dictate this similarity. Although our MI- and ITMI-based bounds do not explicitly capture this, we note that the properties of the task environment are implicitly accounted for by the mutual information terms $I(U; Z^m_{1:N})$ and $I(U; Z^m_i)$, where the meta-training data set is generated from the task environment, as well as by the sub-Gaussianity requirements in Assumptions 3–6. From preliminary studies, we believe that information-theoretic bounds that explicitly capture the impact of task similarity require a different performance metric than the average meta-generalization gap considered here; this is left for future work.
5. Applications
In this section, we consider two applications of the information-theoretic bounds proposed in Section 4.1. The first, simpler, example concerns a parameter estimation problem for which an optimized meta-learner can be obtained in closed form. In contrast, the second application covers a broad class of iterative meta-training schemes.
5.1. Parameter Estimation
To illustrate the bounds on the meta-generalization gap derived in Section 4.1, we first consider the problem of prediction for a Bernoulli process with a ‘soft’ predictor that uses only a few samples from the process, as well as meta-training data. Towards this, we consider an arbitrary discrete finite set of tasks $\mathcal{T}$. The data distribution for each task $\tau \in \mathcal{T}$ is given as $P_{Z|\tau} = \text{Bern}(\mu_\tau)$ with mean parameter $\mu_\tau \in [0, 1]$. The task distribution $P_T$ is then defined over the finite set of mean parameters $\{\mu_\tau\}_{\tau \in \mathcal{T}}$. The base-learner uses training data, distributed i.i.d. from $P_{Z|\tau}$, to determine the parameter $W \in [0, 1]$, which is used as a predictor of a new observation $Z$ at test time. The loss function is defined as $\ell(w, z) = (w - z)^2$, measuring the quadratic error between the prediction $w$ and the realized test input $z$. Note that the optimal (Bayes) predictor, computable in the ideal case of a known distribution $P_{Z|\tau}$, is given as $w = \mu_\tau$. We now distinguish the two cases with separate and joint within-task training and test sets.
5.1.1. Separate Within-Task Training and Test Sets
The base-learner for task $\tau$ deterministically selects the prediction
$$W = (1 - \gamma)\, \bar{Z}^{tr}_\tau + \gamma u, \quad (36)$$
where $\bar{Z}^{tr}_\tau = \frac{1}{m^{tr}} \sum_{j=1}^{m^{tr}} Z^{tr}_{\tau, j}$ is an empirical average over the training set $Z^{tr}_\tau$, $u$ is a hyperparameter defining a bias that can be meta-trained, and $\gamma \in (0, 1]$ is a fixed scalar. Here, $Z^{tr}_{\tau, j}$ denotes the $j$th data sample in the training set of task $\tau$. The bias term in (36) may help approximate the ideal Bayes predictor in the presence of limited data $Z^{tr}_\tau$.
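The role of the Bayes predictor can be verified with a short calculation: for $Z \sim \text{Bern}(\mu_\tau)$ and the quadratic loss, the standard bias–variance decomposition gives
$$\mathbb{E}_{Z \sim \text{Bern}(\mu_\tau)}\big[(w - Z)^2\big] = (w - \mu_\tau)^2 + \mu_\tau (1 - \mu_\tau),$$
which is minimized at $w = \mu_\tau$, with minimum loss equal to the Bernoulli variance $\mu_\tau(1 - \mu_\tau)$.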
The objective of the meta-learner is to infer the hyperparameter $u$. For a given meta-training data set $Z^m_{1:N}$, comprising data sets from $N$ tasks sampled according to $P_T$, the meta-learner can compute the empirical meta-training loss as
$$\mathcal{L}_t(u|Z^m_{1:N}) = \frac{1}{N} \sum_{k=1}^{N} \frac{1}{m^{te}} \sum_{j=1}^{m^{te}} \big( (1 - \gamma)\, \bar{Z}^{tr}_k + \gamma u - Z^{te}_{k,j} \big)^2, \quad (37)$$
where $Z^{te}_{k,j}$ denotes the $j$th example in the test set of $Z^m_k$, the $k$th sub-data set of $Z^m_{1:N}$. The meta-learner then deterministically selects the hyperparameter $u$ minimizing the meta-training empirical loss in (37). This optimization yields
$$U = \frac{1}{\gamma N} \sum_{k=1}^{N} \big( \bar{Z}^{te}_k - (1 - \gamma)\, \bar{Z}^{tr}_k \big), \quad (38)$$
where $\bar{Z}^{te}_k = \frac{1}{m^{te}} \sum_{j=1}^{m^{te}} Z^{te}_{k,j}$. Note that $m^{tr} \bar{Z}^{tr}_k$ and $m^{te} \bar{Z}^{te}_k$ are binomial random variables, and, by (38), $U$ takes finitely many discrete values and is bounded. The meta-test loss (39) can be explicitly computed in closed form, and the average meta-generalization gap (40) admits an explicit expression that decays with the number of tasks $N$ and depends on the variance of the mean parameter $\mu_T$ under the task distribution $P_T$ (see Appendix D).
To compute the MI- and ITMI-based bounds on the meta-generalization gap (40), it is easy to verify that the average training loss $\mathcal{L}_t(u|Z^m)$ is bounded for all $u$ and $Z^m$, so that Assumption 3 for the MI bound and Assumption 4 for the ITMI bound hold with an appropriate sub-Gaussianity factor $\sigma$. For the MI bound, we note that, since the meta-learner is deterministic, we have that $I(U; Z^m_{1:N}) = H(U)$. The ITMI bound (30) is given as
$$\big| \mathbb{E}[\Delta \mathcal{L}(U|Z^m_{1:N})] \big| \leq \frac{1}{N} \sum_{i=1}^{N} \sqrt{2\sigma^2 I(U; Z^m_i)}. \quad (41)$$
The information-theoretic measures in (41) can be evaluated numerically as discussed in Appendix D.
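As one possible way to carry out this numerical evaluation, the following Python sketch draws samples of the meta-learner output and forms plug-in entropy estimates; it assumes the convex-combination form of the predictor (36) reconstructed above, and all names and parameter values are illustrative.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(2)

def plug_in_entropy(samples):
    """Plug-in (empirical) entropy estimate, in nats, for a discrete variable."""
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def sample_u(mus, N, m_tr, m_te, gamma):
    """One draw of the deterministic meta-learner output U in (38),
    under the convex-combination predictor assumed in (36)."""
    mu = rng.choice(mus, size=N)                  # sample N tasks
    zbar_tr = rng.binomial(m_tr, mu) / m_tr       # per-task training averages
    zbar_te = rng.binomial(m_te, mu) / m_te       # per-task test averages
    return round(float(np.mean((zbar_te - (1 - gamma) * zbar_tr) / gamma)), 10)

# I(U; Z_{1:N}) = H(U) for a deterministic meta-learner with discrete U.
us = [sample_u(np.array([0.2, 0.8]), N=5, m_tr=4, m_te=4, gamma=0.5)
      for _ in range(50000)]
print("H(U) estimate (nats):", plug_in_entropy(us))
```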
For a numerical illustration, Figure 4 plots the average meta-generalization loss (39) and the average meta-training loss (A16), along with the ITMI bound in (41) and the MI bound in (27). It can be seen that the ITMI bound is tighter than the MI bound and correctly predicts the decrease in the meta-generalization gap as the number $N$ of tasks increases.
5.1.2. Joint Within-Task Training and Testing Sets
We now consider the case with joint within-task training and test sets. The base-learner for task $\tau$ still uses the predictor (36), but the empirical average is now computed over the entire data set $Z^m_\tau$. As before, the meta-learner deterministically selects the hyperparameter $u$ minimizing the meta-training empirical loss function. As discussed in Appendix D, the meta-generalization loss for this example can also be explicitly computed, and the meta-generalization gap bounds in (32) and (33) can be evaluated numerically. Figure 5 plots the average meta-generalization loss and average meta-training loss along with the MI bound in (32) and the ITMI bound in (A18), as a function of the number of per-task data samples $m$. The ITMI bound is seen to better reflect the decrease of the meta-training loss as a function of $m$.
Figure 5.
Comparison of the MI- and ITMI-based bounds obtained in (A18) with the meta-generalization gap for meta-learning with joint within-task training and test sets, as a function of the number of per-task data samples $m$, for a task environment defined by a finite set of tasks.
5.2. Noisy Iterative Meta-Learning Algorithms
Most meta-learning algorithms are built around a nested loop structure, with the inner loop applying the base-learner on the meta-training set and the outer loop updating the hyperparameters $U$. In this section, we focus on a vast class of such meta-learning algorithms in which the inner loop applies training procedures that depend on the current iterate of the hyperparameter, while the outer loop updates the hyperparameter using a stochastic rule. This class includes stochastic variants of state-of-the-art algorithms such as MAML [23] and Reptile [24]. We apply the derived information-theoretic bounds to study the meta-generalization performance of this class of iterative stochastic meta-training rules, focusing on the case of separate within-task training and test sets, as assumed, e.g., by MAML. The analysis for the setup with joint within-task training and test sets can also be carried out at the cost of a more cumbersome notation.
To start, let $U_j$ denote the hyperparameter vector at outer iteration $j$, with $U_0$ being an arbitrary initialization. For example, in MAML, the hyperparameter $U$ defines the initial iterate used by each base-learner in the inner loop to update the model parameter corresponding to task $\tau$. At each iteration $j$, we randomly select a mini-batch $\mathcal{K}_j$ of task indices from the meta-training data $Z^m_{1:N}$, obtaining the corresponding data set $Z^m_{\mathcal{K}_j} = (Z^{tr}_{\mathcal{K}_j}, Z^{te}_{\mathcal{K}_j})$, where $Z^{tr}_{\mathcal{K}_j}$ and $Z^{te}_{\mathcal{K}_j}$ are the separate training and test sets for the selected tasks. For each index $k \in \mathcal{K}_j$, in the inner loop, the base-learner selects the model parameter $W_{j,k}$ as a possibly stochastic function
$$W_{j,k} = g\big(U_{j-1}, Z^{tr}_k\big). \quad (42)$$
For instance, in MAML, the function $g(\cdot)$ in (42) represents the output of an SGD procedure that starts from the initialization $U_{j-1}$ and uses the task training data $Z^{tr}_k$ to iteratively update the model parameters, producing the final iterate $W_{j,k}$. We denote by $W_j = \{W_{j,k}\}_{k \in \mathcal{K}_j}$ the collection of the base-learners’ outputs for all task indices $k \in \mathcal{K}_j$ at outer iteration $j$.
In the outer loop, the meta-learner uses the task-specific adapted parameters $W_j$ from the inner loop and the meta-test set $Z^{te}_{\mathcal{K}_j}$ to update the past iterate $U_{j-1}$ according to the general update rule
$$U_j = F\big(U_{j-1}\big) - \eta_j \, G\big(U_{j-1}, W_j, Z^{te}_{\mathcal{K}_j}\big) + \xi_j, \quad (43)$$
where $F(\cdot)$ and $G(\cdot)$ are arbitrary deterministic functions; $\eta_j$ is the step-size; and $\xi_j \sim \mathcal{N}(0, \sigma_j^2 \mathbf{I}_d)$ is isotropic Gaussian noise, independently drawn for $j \geq 1$. As an example, in MAML, the function $F(\cdot)$ is the identity function and the function $G(\cdot)$ equals the gradient of the empirical loss in (14) with respect to $U_{j-1}$. Note, however, that MAML does not add noise, i.e., $\sigma_j = 0$ for all $j$.
The final output $U$ of the meta-learning algorithm is then defined as an arbitrary function $f(\cdot)$ of all iterates. Examples of the function $f(\cdot)$ include the last update, $U = U_J$, and the average of the updates, $U = \frac{1}{J} \sum_{j=1}^{J} U_j$, where $J$ is the number of outer iterations. A graphical model representation of the variables involved is shown in Figure 6.
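A schematic Python sketch of the nested-loop structure in (42) and (43) is given below; the callable interfaces, the choice of the last iterate as the output $f$, and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def noisy_meta_training(meta_data, inner_fn, meta_grad, d,
                        n_iters=100, batch=4, eta=0.1, noise_std=0.05):
    """Schematic noisy iterative meta-learner, cf. the updates (42)-(43).

    meta_data: list of (z_train, z_test) pairs, one per meta-training task.
    inner_fn:  base-learner, maps (u, z_train) to adapted parameters w.
    meta_grad: maps (u, w_list, z_test_list) to a meta-gradient vector.
    """
    u = np.zeros(d)                                     # initialization U_0
    for _ in range(n_iters):
        idx = rng.choice(len(meta_data), size=batch, replace=False)
        ws = [inner_fn(u, meta_data[k][0]) for k in idx]      # inner loop (42)
        g = meta_grad(u, ws, [meta_data[k][1] for k in idx])
        u = u - eta * g + noise_std * rng.standard_normal(d)  # noisy update (43)
    return u                                            # output f = last iterate
```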
Figure 6.
A graphical model representation of the variables involved in the definition of noisy iterative algorithms.
We now derive an upper bound on the meta-generalization gap for the general class of iterative meta-learning algorithms satisfying (42) and (43) under the following assumptions.
Assumption 7.
- (1)
- The average per-task training loss $\mathcal{L}_t(u|Z^m)$ is $\sigma$-sub-Gaussian under $Z^m \sim P_{Z^m}$ for all $u \in \mathcal{U}$ (cf. Assumption 3).
- (2)
- The meta-training data set sampled at each iteration $j$ is conditionally independent of the history of model-parameter vectors and hyperparameters given the current iterate, i.e., $Z^m_{\mathcal{K}_j}$ is conditionally independent of $(W_{1:j-1}, U_{0:j-2})$ given $U_{j-1}$.
- (3)
- The meta-parameter update function $G(\cdot)$ is uniformly bounded, i.e., $\|G(\cdot)\| \leq L$ for some constant $L < \infty$.
Lemma 4.
Under Assumption 7, the following bound holds on the average meta-generalization gap of the noisy iterative meta-learning algorithm defined by (42) and (43) with $J$ outer iterations:
$$\big| \mathbb{E}[\Delta \mathcal{L}(U|Z^m_{1:N})] \big| \leq \sqrt{\frac{2\sigma^2}{N} \sum_{j=1}^{J} \frac{d}{2} \log\left(1 + \frac{\eta_j^2 L^2}{d\, \sigma_j^2}\right)}. \quad (45)$$
Proof.
See Appendix E. □
The bound in (45) has the same form as the generalization gap bound derived in [27] for conventional learning. From (45), the meta-generalization gap can be reduced by increasing the variance of the injected Gaussian noise. In particular, the meta-generalization gap depends on the ratios $\eta_j^2 / \sigma_j^2$ between the squared step size and the noise variance. For example, SGLD sets the noise variance proportional to the step size, with a step size decaying over time according to the standard Robbins–Monro conditions, in order to ensure convergence of the output samples to the generalized posterior distribution of the hyperparameters [52].
Example: To illustrate the bound (45), we now consider a simple logistic regression problem that generalizes the example studied in Section 5.1. Accordingly, each data point $Z = (X, Y)$ corresponds to labelled data, where $X$ represents the input vector and $Y \in \{0, 1\}$ represents the corresponding binary label. The data distribution $P_{Z|\tau}$ for each task $\tau$ is such that $X$ is a $d$-dimensional Bernoulli vector obtained via $d$ independent draws and $Y$ is distributed as $\text{Bern}\big(\varsigma(w_\tau^{\top} X)\big)$, where $\varsigma(a) = 1/(1 + e^{-a})$ is the sigmoid function and $w_\tau$ is a task-specific parameter vector with bounded norm. The task distribution $P_T$ then defines a distribution over the parameter vectors $\{w_\tau\}_{\tau \in \mathcal{T}}$. The base-learner uses training data generated i.i.d. from $P_{Z|\tau}$ to obtain a prediction $w$ of the parameter vector for task $\tau$. The loss function is taken as the quadratic error
$$\ell(w, z) = \big(y - \varsigma(w^{\top} x)\big)^2.$$
At each iteration $j$, starting from the initialization point $U_{j-1}$, the base-learner in (42) uses a one-step projected gradient descent algorithm on the training data set $Z^{tr}_k$ to obtain the prediction as
$$W_{j,k} = \Pi_{\mathcal{W}}\Big( U_{j-1} - \alpha \, \nabla_w \mathcal{L}_t\big(w|Z^{tr}_k\big)\big|_{w = U_{j-1}} \Big), \quad (46)$$
where $\alpha$ is the step-size, $\mathcal{W}$ is the set of feasible model parameters, and $\Pi_{\mathcal{W}}$ is the projection operator onto $\mathcal{W}$. The meta-learner (43) updates the initialization vector according to the noisy gradient descent rule
$$U_j = U_{j-1} - \eta \, \frac{1}{|\mathcal{K}_j|} \sum_{k \in \mathcal{K}_j} \nabla_w \mathcal{L}_t\big(w|Z^{te}_k\big)\big|_{w = W_{j,k}} + \xi_j, \quad (47)$$
where $\eta$ is the step-size and $\xi_j \sim \mathcal{N}(0, \sigma^2 \mathbf{I}_d)$ is isotropic Gaussian noise. This update rule corresponds to performing first-order MAML (FOMAML) [23] with the addition of noise.
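A minimal Python sketch of this noisy FOMAML procedure, under the quadratic loss and the one-step projected inner update assumed above, is given below; sampling one task per outer iteration and all constants are illustrative simplifications.

```python
import numpy as np

rng = np.random.default_rng(4)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def quad_loss_grad(w, X, y):
    """Gradient of the empirical quadratic loss (1/n) sum_i (y_i - sigmoid(x_i^T w))^2."""
    p = sigmoid(X @ w)
    return X.T @ (2.0 * (p - y) * p * (1.0 - p)) / len(y)

def inner_step(u, X, y, alpha=0.1, radius=1.0):
    """One-step projected gradient descent (46), projecting onto a norm ball W."""
    w = u - alpha * quad_loss_grad(u, X, y)
    norm = np.linalg.norm(w)
    return w if norm <= radius else radius * w / norm

def noisy_fomaml(tasks, d, n_iters=250, eta=0.05, noise_std=0.01):
    """Noisy first-order MAML outer loop (47); tasks hold train/test splits."""
    u = np.zeros(d)
    for _ in range(n_iters):
        (X_tr, y_tr), (X_te, y_te) = tasks[rng.integers(len(tasks))]
        w = inner_step(u, X_tr, y_tr)
        g = quad_loss_grad(w, X_te, y_te)       # first-order meta-gradient
        u = u - eta * g + noise_std * rng.standard_normal(d)
    return u
```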
For this problem, it is easy to verify that Assumption 7 is satisfied, since the loss function is bounded in the interval $[0, 1]$, whereby the average per-task training loss is also bounded. We also have the inequality
$$\big\| G(\cdot) \big\| \leq L, \quad (48)$$
for a constant $L$ determined by the bounded feasible set $\mathcal{W}$ and the bounded data. The MI bound in (45) then evaluates to
$$\big| \mathbb{E}[\Delta \mathcal{L}(U|Z^m_{1:N})] \big| \leq \sqrt{\frac{2\sigma^2}{N} \sum_{j=1}^{J} \frac{d}{2} \log\left(1 + \frac{\eta^2 L^2}{d\, \sigma^2}\right)}. \quad (49)$$
We now evaluate the meta-training and meta-test losses, along with the bound (49), as a function of the ratio $\eta^2 / \sigma^2$ in Figure 7. For the experiment, we considered a task environment with a finite number of tasks, a fixed number of meta-training tasks, and fixed numbers of per-task training and test data samples; the inner-loop step-size in (46), the outer-loop step-size in (47), and the number of iterations were also kept fixed.
Figure 7.
Comparison of the meta-generalization gap with the MI-based bound in (49) as a function of the ratio $\eta^2 / \sigma^2$.
As suggested by Lemma 4, the meta-generalization gap decreases with the addition of noise. While the MI bound (45) is generally loose, it correctly quantifies the dependence of the meta-generalization loss on the ratio $\eta^2 / \sigma^2$, and it can hence serve as a useful meta-training criterion [20,48].
6. Conclusions
This work has presented novel information-theoretic upper bounds on the average generalization gap of meta-learning algorithms, thereby extending the well-studied information-theoretic approaches for conventional learning to meta-learning. The proposed bounds capture two sources of uncertainty, namely environment-level uncertainty and within-task uncertainty, and bound their contributions via separate mutual information terms. Applications were also discussed, with the aim of elucidating the use of the bounds to quantify meta-overfitting and to guide the choice of the meta-inductive bias, i.e., the class of inductive biases. The derived bounds are amenable to further refinements, such as those along the lines of [39,40,45]. It would also be interesting to study meta-generalization bounds for noisy iterative meta-learning algorithms using tighter information-theoretic bounds, such as those of [26,40].
Author Contributions
Conceptualization, formal analysis, writing—original draft, S.T.J.; conceptualization, funding acquisition and supervision, O.S. Both authors have read and agreed to the published version of the manuscript.
Funding
The authors have received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 Research and Innovation Programme (Grant Agreement No. 725731).
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. Decoupling Estimate Lemmas
The proofs of the main results rely on the following decoupling estimate lemmas, which bound the difference in expectations under a change of measure from the joint distribution $P_{X,Y}$ to the product of the marginals $P_X P_Y$. In order to state the most general form of the decoupling estimate lemmas, we first define a generalized sub-Gaussian random variable.
Definition A1.
A random variable $X$ is said to be $(\psi_+, \psi_-)$-generalized sub-Gaussian if there exist convex functions $\psi_+$ and $\psi_-$ that satisfy the equalities $\psi_\pm(0) = \psi'_\pm(0) = 0$ and bound the CGF of $X$ as
$$\Lambda_X(\lambda) \leq \psi_+(\lambda) \text{ for } \lambda \in [0, b_+) \quad \text{and} \quad \Lambda_X(-\lambda) \leq \psi_-(\lambda) \text{ for } \lambda \in [0, b_-), \quad (A1)$$
for some constants $b_+ > 0$ and $b_- > 0$.
For a $(\psi_+, \psi_-)$-generalized sub-Gaussian random variable, we also introduce the following standard definitions. First, the Legendre dual of the function $\psi$ is defined as
$$\psi^*(x) = \sup_{\lambda \in [0, b)} \big( \lambda x - \psi(\lambda) \big). \quad (A2)$$
It can be easily seen that $\psi^*$ is a non-negative, convex, and non-decreasing function on $[0, \infty)$ with $\psi^*(0) = 0$. Second, the inverse Legendre dual of the function $\psi^*$ is defined as $\psi^{*-1}(y) = \inf\{x \geq 0 : \psi^*(x) \geq y\}$. This function is concave, and it can be equivalently written as [26]
$$\psi^{*-1}(y) = \inf_{\lambda \in [0, b)} \frac{y + \psi(\lambda)}{\lambda}. \quad (A3)$$
Similar definitions and results apply for $\psi_-$.

A $\sigma$-sub-Gaussian random variable $X$ is a generalized sub-Gaussian variable with $\psi_+(\lambda) = \psi_-(\lambda) = \frac{\lambda^2 \sigma^2}{2}$ and $b_+ = b_- = \infty$. Furthermore, the Legendre dual functions are given as $\psi^*_\pm(x) = \frac{x^2}{2\sigma^2}$, and the inverse Legendre dual functions evaluate to
$$\psi^{*-1}_\pm(y) = \sqrt{2\sigma^2 y}. \quad (A4)$$
We are now ready to state the decoupling estimate lemmas.
Lemma A1
(Decoupling Estimate [14]). Let $X$ and $Y$ be two jointly distributed random variables with joint distribution $P_{X,Y}$, and let $f(x, y)$ be a real-valued function such that $f(x, Y)$ is $(\psi_+, \psi_-)$-generalized sub-Gaussian for all $x$ when $Y \sim P_Y$. Then, we have the following inequalities:
$$\mathbb{E}_{P_{X,Y}}\big[f(X, Y)\big] - \mathbb{E}_{P_X \otimes P_Y}\big[f(\bar{X}, \bar{Y})\big] \leq \psi^{*-1}_-\big(I(X; Y)\big),$$
$$\mathbb{E}_{P_X \otimes P_Y}\big[f(\bar{X}, \bar{Y})\big] - \mathbb{E}_{P_{X,Y}}\big[f(X, Y)\big] \leq \psi^{*-1}_+\big(I(X; Y)\big), \quad (A5)$$
where $(\bar{X}, \bar{Y}) \sim P_X \otimes P_Y$ denotes a pair of independent random variables with the same marginals as $(X, Y)$.
Lemma A2
(General Decoupling Estimate [26]). Let $X$ and $Y$ be two jointly distributed random variables with joint distribution $P_{X,Y}$, and let $f(x, y)$ be a real-valued function such that $f(\bar{X}, \bar{Y})$ is $(\psi_+, \psi_-)$-generalized sub-Gaussian when $(\bar{X}, \bar{Y}) \sim P_X \otimes P_Y$. Then, we have the inequality (A5).
Note that, in Lemma A2, the random variables $(X, Y)$ are jointly distributed according to $P_{X,Y}$. Assuming that the function $f(\bar{X}, \bar{Y})$ is generalized sub-Gaussian under $\bar{X} \sim P_X$ and $\bar{Y} \sim P_Y$, with $P_X$ and $P_Y$ being the marginals of $P_{X,Y}$, the lemma provides an upper bound on the difference between the average of $f(X, Y)$ when $(X, Y)$ is jointly distributed according to $P_{X,Y}$ and the average of $f(\bar{X}, \bar{Y})$ when $\bar{X}$ and $\bar{Y}$ are independent, with $\bar{X} \sim P_X$ and $\bar{Y} \sim P_Y$. The resulting bound thus provides an estimate of the effect of decoupling the joint distribution into its marginals with respect to the function $f$.
Appendix B. Proofs of Theorems 1 and 2
For the proof of Theorem 1, we use the decomposition (29) of the meta-generalization gap into average environment-level and within-task generalization gaps as
$$\mathbb{E}\big[\Delta \mathcal{L}(U|Z^m_{1:N})\big] = \mathbb{E}\big[\mathcal{L}_g(U) - \mathcal{L}_{g,t}(U)\big] + \mathbb{E}\big[\mathcal{L}_{g,t}(U) - \mathcal{L}_t(U|Z^m_{1:N})\big], \quad (A6)$$
where (A6) follows since the average within-task generalization gap for a random meta-test task can be equivalently written as $\mathbb{E}[\mathcal{L}_g(U) - \mathcal{L}_{g,t}(U)]$, with $\mathcal{L}_g(u) - \mathcal{L}_{g,t}(u)$ denoting the within-task generalization gap of the meta-test task $T$ for hyperparameter $u$, and the underlying joint distribution factorizes as $P_{Z^m_{1:N}} P_{U|Z^m_{1:N}} P_T P_{Z^m|T}$. To obtain an upper bound on the average meta-generalization gap $\mathbb{E}[\Delta \mathcal{L}(U|Z^m_{1:N})]$, we bound each of the two differences in (A6) separately.
We first bound the second difference in (A6), $\mathbb{E}[\mathcal{L}_{g,t}(U) - \mathcal{L}_t(U|Z^m_{1:N})]$, which represents the expected environment-level uncertainty measured using the average per-task training loss defined in (15). To this end, we extend the single-task learning generalization bound of Lemma 1 by resorting to the decoupling estimate in Lemma A1 with $X = U$, $Y = Z^m_{1:N}$, and $f(X, Y) = \mathcal{L}_t(X|Y)$, so that $\mathbb{E}_{P_{X,Y}}[f(X, Y)] = \mathbb{E}[\mathcal{L}_t(U|Z^m_{1:N})]$ and $\mathbb{E}_{P_X \otimes P_Y}[f(\bar{X}, \bar{Y})] = \mathbb{E}[\mathcal{L}_{g,t}(U)]$.

Recall from Assumption 3 that, for a given $u$, $\mathcal{L}_t(u|Z^m_{1:N})$ is the average of $N$ $\sigma$-sub-Gaussian i.i.d. random variables $\mathcal{L}_t(u|Z^m_i)$ under $Z^m_i \sim P_{Z^m}$ for all $u \in \mathcal{U}$. It then follows that $\mathcal{L}_t(u|Z^m_{1:N})$ is $\sigma/\sqrt{N}$-sub-Gaussian under $P_{Z^m}^{\otimes N}$ for all $u \in \mathcal{U}$ [53]. Applying Lemma A1, with $\psi^{*-1}_\pm$ specialized to the $\sigma/\sqrt{N}$-sub-Gaussian case in (A4), gives that
$$\big| \mathbb{E}[\mathcal{L}_{g,t}(U)] - \mathbb{E}[\mathcal{L}_t(U|Z^m_{1:N})] \big| \leq \sqrt{\frac{2\sigma^2}{N} I(U; Z^m_{1:N})}. \quad (A7)$$
We now evaluate the first difference in (A6). It can be seen that, for a fixed task $\tau$, the average within-task uncertainty evaluates to
$$\mathbb{E}\big[\mathcal{L}_g(W|\tau) - \mathcal{L}_t(W|Z^{te})\big] \overset{(a)}{=} 0, \quad (A8)$$
where $(a)$ follows since $W$ and $Z^{te}$ are conditionally independent given $Z^{tr}$, whereby $\mathbb{E}[\mathcal{L}_t(W|Z^{te})] = \mathbb{E}[\mathcal{L}_g(W|\tau)]$. Substituting (A7) and (A8) in (A6) then concludes the proof.
For Theorem 2, the proof follows along the same lines, bounding the average environment-level uncertainty $\mathbb{E}[\mathcal{L}_{g,t}(U) - \mathcal{L}_t(U|Z^m_{1:N})]$. Towards this, we note that the environment-level uncertainty can be equivalently written as
$$\mathbb{E}\big[\mathcal{L}_{g,t}(U) - \mathcal{L}_t(U|Z^m_{1:N})\big] = \frac{1}{N} \sum_{i=1}^{N} \Big( \mathbb{E}_{P_U \otimes P_{Z^m}}\big[\mathcal{L}_t(\bar{U}|\bar{Z}^m_i)\big] - \mathbb{E}_{P_{U, Z^m_i}}\big[\mathcal{L}_t(U|Z^m_i)\big] \Big), \quad (A9)$$
where $\bar{Z}^m_i$ and $\bar{U}$ in the first term are independent random variables distributed as $\bar{U} \sim P_U$ and $\bar{Z}^m_i \sim P_{Z^m}$, while, in the second term, $(U, Z^m_i)$ are jointly distributed according to $P_{U, Z^m_i}$, which is obtained by marginalizing the joint distribution $P_{Z^m_{1:N}} P_{U|Z^m_{1:N}}$. Under Assumption 4(a), we can bound each difference by resorting to the decoupling estimate in Lemma A1 with $X = U$, $Y = Z^m_i$, and $f(X, Y) = \mathcal{L}_t(X|Y)$. Since $\mathcal{L}_t(u|Z^m_i)$ is $\sigma$-sub-Gaussian under Assumption 4(a) for all $u \in \mathcal{U}$, Lemma A1 yields the following bound:
$$\big| \mathbb{E}[\mathcal{L}_{g,t}(U)] - \mathbb{E}[\mathcal{L}_t(U|Z^m_{1:N})] \big| \leq \frac{1}{N} \sum_{i=1}^{N} \sqrt{2\sigma^2 I(U; Z^m_i)}. \quad (A10)$$
Appendix C. Proofs of Theorems 3 and 4
For Theorem 3, we start from the decomposition of the average meta-generalization gap in (A11), analogous to (A6),
where the first term involves the average training loss of the randomly selected meta-test task as a function of the hyperparameter u, and the second term involves the generalization gap of the meta-test task T. The MI bound on the expected environment-level uncertainty can be obtained by using Lemma A1 and Assumption 5(b), as in (A7).
The main difference between the separate and joint within-task training and test set scenarios is that, while the average within-task uncertainty vanishes in the former, this is not the case with joint within-task training and test sets. Consequently, we now bound the average within-task generalization gap, given by the first difference in (A11). For a given task, we resort to Lemma A1, identifying the two random variables in the lemma with the model parameter W and the per-task data set, and the function f with the per-task training loss. It can then be verified that the relevant mutual information is that between the model parameter and the per-task data set. Since the per-task training loss is the average of m i.i.d. sub-Gaussian random variables (from Assumption 5(a)), its sub-Gaussianity parameter is reduced by a factor of $\sqrt{m}$ [53]. Consequently, Lemma A1 yields the bound in (A12).
Averaging both sides of (A12) over the task distribution, and combining with the bound on the average environment-level uncertainty, yields the required bound in (32) via Jensen’s inequality.
For Theorem 4, the proof follows along the same lines. The ITMI bound on the expected environment-level uncertainty can be obtained along the lines of (A10), using the assumption on the training loss in either Assumption 6(a) or Assumption 6(b). We now show that the within-task uncertainty can be similarly bounded using the assumption on the loss function in either Assumption 6(a) or Assumption 6(b). Towards this, for a fixed task, we write the average within-task uncertainty equivalently as the per-sample decomposition in (A13),
where, in the second term, W and the individual data sample are jointly distributed according to the corresponding marginal of the joint distribution of the model parameter and the full per-task data set. In contrast, in the first term, W and the data sample are conditionally independent random variables distributed according to the product of their marginals. Applying Lemma A1 under the assumption on the loss function in Assumption 6(a), or Lemma A2 under the assumption in Assumption 6(b), then yields the bound in (A14).
Appendix D. Details of Example
We first give details of the derivation of the meta-generalization gap for the case with separate within-task training and test sets. The average meta-generalization loss can be computed as in (A15),
where the equality in (a) follows by evaluating the relevant expectations under the distributions defining the example. In a similar manner, the average meta-training loss can be computed as in (A16),
with U defined as in (38). The meta-generalization gap in (40) then results by taking the difference of (A15) and (A16) and simplifying.
We now evaluate the two mutual information terms. For the first MI, note that, since the meta-learner (38) is deterministic, the conditional entropy of U given the meta-training data vanishes, and the MI thus reduces to the entropy of U. The second MI can similarly be written as a difference of entropies. It can be seen that the random variables involved, including U, are mixtures of probability distributions, whose entropies can be evaluated following standard methods [54].
For the case with joint within-task training and test sets, the meta-generalization gap can be obtained in a similar way, as given in (A17).
For the MI- and ITMI-based bounds, note that the loss function is bounded, and that the average training loss under the deterministic base-learner in (36) is also bounded. Since a random variable taking values in a bounded interval is sub-Gaussian with parameter equal to half the length of the interval, Assumptions 5 and 6 hold with the corresponding parameters. For the MI bound in (32), the two MI terms can be computed directly. For the ITMI bound (33), the required per-sample information terms are given in (A18).
All information measures can be easily evaluated numerically [54].
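As an illustration of such a numerical evaluation, the following Python sketch (ours; the joint pmf is a hypothetical stand-in, not the one induced by the example) computes the mutual information between two discrete random variables directly from their joint probability table, which is all that is needed when the variables take finitely many values:

```python
import numpy as np

def mutual_information(p_joint):
    """I(U;Z) in nats, computed from a joint pmf matrix p_joint[u, z]."""
    p_joint = np.asarray(p_joint, dtype=float)
    p_u = p_joint.sum(axis=1, keepdims=True)   # marginal of U
    p_z = p_joint.sum(axis=0, keepdims=True)   # marginal of Z
    mask = p_joint > 0                          # skip zero-probability terms
    return float(np.sum(p_joint[mask] * np.log(p_joint[mask] / (p_u @ p_z)[mask])))

# Hypothetical joint pmf, for illustration only
p = np.array([[0.4, 0.1],
              [0.1, 0.4]])
print(mutual_information(p))  # approx. 0.1927 nats
```

The same routine applies to any of the discrete information measures above once the relevant joint distribution has been tabulated.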
Appendix E. Proof of Lemma 4
From the update rule of the meta-learner in (43), we get the Markov dependency stated in (A19),
where the history vector collects the hyperparameter iterates produced up to the current iteration. The sampling strategy in (44), together with (A19), then implies the relation in (A20).
Using the collection of all hyperparameter updates across iterations, we have the chain of relations concluding with (A24),
where the inequality in (a) follows from the data processing inequality applied to the Markov chain in (A19); (b) follows from the Markov chain implied by the sampling strategy in (44); and the equality in (c) follows from (A20). Finally, the computation of the bound in (A24) follows similarly to Lemma 5 in [27].
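To give a sense of the final computation, recall the flavor of Lemma 5 in [27]: for noisy iterative updates with bounded update directions and isotropic Gaussian noise, each iteration’s contribution to the MI can be bounded in closed form. Under these assumptions, an update rule of the form (our paraphrase with hypothetical symbols $\eta_j$, $g_j$, $\sigma_j$, not a restatement of (A24))
$U^{(j)} = U^{(j-1)} - \eta_j \, g_j + \xi_j, \qquad \xi_j \sim \mathcal{N}(0, \sigma_j^2 I_d), \qquad \|g_j\| \leq L,$
leads, after J iterations, to a bound of the form
$I\big(U^{(J)}; Z^m_{1:N}\big) \leq \sum_{j=1}^{J} \frac{d}{2} \log\Big(1 + \frac{\eta_j^2 L^2}{d \, \sigma_j^2}\Big),$
where each summand bounds the information leaked by a single noisy update.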
References
- Shalev-Shwartz, S.; Ben-David, S. Understanding Machine Learning: From Theory to Algorithms; Cambridge University Press: Cambridge, UK, 2014.
- Bishop, C.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006.
- Simeone, O. A Brief Introduction to Machine Learning for Engineers. Found. Trends Signal Process. 2018, 12, 200–431.
- Schmidhuber, J. Evolutionary Principles in Self-Referential Learning, or On Learning How to Learn: The Meta-meta-... Hook. Ph.D. Thesis, Technische Universität München, München, Germany, 1987.
- Thrun, S.; Pratt, L. Learning to Learn: Introduction and Overview. In Learning to Learn; Springer: Berlin/Heidelberg, Germany, 1998; pp. 3–17.
- Thrun, S. Is Learning the N-th Thing Any Easier Than Learning the First? In Proceedings of the Advances in Neural Information Processing Systems (NIPS); MIT Press: Cambridge, MA, USA, 1996; pp. 640–646.
- Vilalta, R.; Drissi, Y. A Perspective View and Survey of Meta-Learning. Artif. Intell. Rev. 2002, 18, 77–95.
- Simeone, O.; Park, S.; Kang, J. From Learning to Meta-Learning: Reduced Training Overhead and Complexity for Communication Systems. arXiv 2020, arXiv:2001.01227.
- Kuzborskij, I.; Orabona, F. Fast rates by transferring from auxiliary hypotheses. Mach. Learn. 2017, 106, 171–195.
- Kienzle, W.; Chellapilla, K. Personalized handwriting recognition via biased regularization. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 457–464.
- Denevi, G.; Ciliberto, C.; Grazzi, R.; Pontil, M. Learning-to-learn stochastic gradient descent with biased regularization. arXiv 2019, arXiv:1903.10399.
- Denevi, G.; Pontil, M.; Ciliberto, C. The advantage of conditional meta-learning for biased regularization and fine-tuning. arXiv 2020, arXiv:2008.10857.
- Baxter, J. A Model of Inductive Bias Learning. J. Artif. Intell. Res. 2000, 12, 149–198.
- Russo, D.; Zou, J. Controlling Bias in Adaptive Data Analysis Using Information Theory. In Proceedings of the Artificial Intelligence and Statistics (AISTATS), Cadiz, Spain, 9–11 May 2016; pp. 1232–1240.
- Xu, H.; Farajtabar, M.; Zha, H. Learning Granger causality for Hawkes processes. In Proceedings of the International Conference on Machine Learning, New York City, NY, USA, 20–22 June 2016; pp. 1717–1726.
- Maurer, A. Algorithmic Stability and Meta-Learning. J. Mach. Learn. Res. 2005, 6, 967–994.
- Devroye, L.; Wagner, T. Distribution-Free Performance Bounds for Potential Function Rules. IEEE Trans. Inf. Theory 1979, 25, 601–604.
- Rogers, W.H.; Wagner, T.J. A Finite Sample Distribution-Free Performance Bound for Local Discrimination Rules. Ann. Stat. 1978, 6, 506–514.
- Pentina, A.; Lampert, C. A PAC-Bayesian Bound for Lifelong Learning. In Proceedings of the International Conference on Machine Learning (ICML), Beijing, China, 21–26 June 2014; pp. 991–999.
- Amit, R.; Meir, R. Meta-Learning by Adjusting Priors Based on Extended PAC-Bayes Theory. In Proceedings of the International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 205–214.
- Rothfuss, J.; Fortuin, V.; Krause, A. PACOH: Bayes-Optimal Meta-Learning with PAC-Guarantees. arXiv 2020, arXiv:2002.05551.
- Denevi, G.; Ciliberto, C.; Stamos, D.; Pontil, M. Incremental learning-to-learn with statistical guarantees. arXiv 2018, arXiv:1803.08089.
- Finn, C.; Abbeel, P.; Levine, S. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the International Conference on Machine Learning (ICML), Sydney, NSW, Australia, 6–11 August 2017; Volume 70, pp. 1126–1135.
- Nichol, A.; Achiam, J.; Schulman, J. On First-Order Meta-Learning Algorithms. arXiv 2018, arXiv:1803.02999.
- Xu, A.; Raginsky, M. Information-Theoretic Analysis of Generalization Capability of Learning Algorithms. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 2524–2533.
- Bu, Y.; Zou, S.; Veeravalli, V.V. Tightening Mutual Information Based Bounds on Generalization Error. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Paris, France, 7–12 July 2019; pp. 587–591.
- Pensia, A.; Jog, V.; Loh, P.L. Generalization Error Bounds for Noisy, Iterative Algorithms. In Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA, 17–22 June 2018; pp. 546–550.
- Vapnik, V.N.; Chervonenkis, A.Y. On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities. In Theory of Probability and Its Applications; SIAM: Philadelphia, PA, USA, 1971; Volume 16, pp. 264–280.
- Koltchinskii, V.; Panchenko, D. Rademacher Processes and Bounding the Risk of Function Learning. In High Dimensional Probability II; Springer: Berlin/Heidelberg, Germany, 2000; Volume 47, pp. 443–457.
- Bousquet, O.; Elisseeff, A. Stability and Generalization. J. Mach. Learn. Res. 2002, 2, 499–526.
- Kearns, M.; Ron, D. Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation. Neural Comput. 1999, 11, 1427–1453.
- Poggio, T.; Rifkin, R.; Mukherjee, S.; Niyogi, P. General Conditions for Predictivity in Learning Theory. Nature 2004, 428, 419–422.
- Kutin, S.; Niyogi, P. Almost-Everywhere Algorithmic Stability and Generalization Error. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI’02); Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2002; pp. 275–282.
- Dwork, C.; Feldman, V.; Hardt, M.; Pitassi, T.; Reingold, O.; Roth, A.L. Preserving Statistical Validity in Adaptive Data Analysis. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing (STOC), Portland, OR, USA, 15–17 June 2015; pp. 117–126.
- Bassily, R.; Nissim, K.; Smith, A.; Steinke, T.; Stemmer, U.; Ullman, J. Algorithmic Stability for Adaptive Data Analysis. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing (STOC), Cambridge, MA, USA, 19–21 June 2016; pp. 1046–1059.
- McAllester, D.A. PAC-Bayesian Model Averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory (COLT), Santa Cruz, CA, USA, 7–9 July 1999; pp. 164–170.
- Seeger, M. PAC-Bayesian Generalization Error Bounds for Gaussian Process Classification. J. Mach. Learn. Res. 2002, 3, 233–269.
- Alquier, P.; Ridgway, J.; Chopin, N. On the Properties of Variational Approximations of Gibbs Posteriors. J. Mach. Learn. Res. 2016, 17, 8374–8414.
- Asadi, A.; Abbe, E.; Verdú, S. Chaining Mutual Information and Tightening Generalization Bounds. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 2–8 December 2018; pp. 7234–7243.
- Negrea, J.; Haghifam, M.; Dziugaite, G.K.; Khisti, A.; Roy, D.M. Information-Theoretic Generalization Bounds for SGLD via Data-Dependent Estimates. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 8–14 December 2019; pp. 11013–11023.
- Alabdulmohsin, I. Towards a Unified Theory of Learning and Information. Entropy 2020, 22, 438.
- Jiao, J.; Han, Y.; Weissman, T. Dependence Measures Bounding the Exploration Bias for General Measurements. In Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017; pp. 1475–1479.
- Issa, I.; Gastpar, M. Computable Bounds on the Exploration Bias. In Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA, 17–22 June 2018; pp. 576–580.
- Issa, I.; Esposito, A.R.; Gastpar, M. Strengthened Information-Theoretic Bounds on the Generalization Error. In Proceedings of the 2019 IEEE International Symposium on Information Theory (ISIT), Paris, France, 7–12 July 2019; pp. 582–586.
- Steinke, T.; Zakynthinou, L. Reasoning About Generalization via Conditional Mutual Information. arXiv 2020, arXiv:2001.09122.
- Knoblauch, J.; Jewson, J.; Damoulas, T. Generalized Variational Inference. arXiv 2019, arXiv:1904.02063.
- Bissiri, P.G.; Holmes, C.C.; Walker, S.G. A General Framework for Updating Belief Distributions. J. R. Stat. Soc. Ser. B Stat. Methodol. 2016, 78, 1103–1130.
- Yin, M.; Tucker, G.; Zhou, M.; Levine, S.; Finn, C. Meta-Learning Without Memorization. arXiv 2019, arXiv:1912.03820.
- Wainwright, M.J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint; Cambridge University Press: Cambridge, UK, 2019; Volume 48.
- Shalev-Shwartz, S.; Shamir, O.; Srebro, N.; Sridharan, K. Learnability, Stability and Uniform Convergence. J. Mach. Learn. Res. 2010, 11, 2635–2670.
- Maurer, A.; Pontil, M.; Romera-Paredes, B. The Benefit of Multitask Representation Learning. J. Mach. Learn. Res. 2016, 17, 2853–2884.
- Welling, M.; Teh, Y.W. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML), Bellevue, WA, USA, 28 June–2 July 2011; pp. 681–688.
- Boucheron, S.; Lugosi, G.; Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence; Oxford University Press: Oxford, UK, 2013.
- Michalowicz, J.V.; Nichols, J.M.; Bucholtz, F. Handbook of Differential Entropy; CRC Press: New York, NY, USA, 2013.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).