Meta-Strategy for Learning Tuning Parameters with Guarantees

Online learning methods, such as the online gradient algorithm (OGA) and exponentially weighted aggregation (EWA), often depend on tuning parameters that are difficult to set in practice. We consider an online meta-learning scenario and propose a meta-strategy to learn these parameters from past tasks. Our strategy is based on the minimization of a regret bound. It allows us to learn the initialization and the step size in OGA with guarantees, as well as the prior or the learning rate in EWA. We provide a regret analysis of the strategy that identifies settings where meta-learning indeed improves on learning each task in isolation.


Introduction
In many applications of modern supervised learning, such as medical imaging or robotics, a large number of tasks is available, but many of them are associated with a small amount of data. With few datapoints per task, learning each task in isolation gives poor results. In this paper, we consider the problem of learning from a (large) sequence of regression or classification tasks with small sample sizes. By exploiting their similarities, we seek to design algorithms that can utilize previous experience to rapidly learn new skills or adapt to new environments.
Inspired by human ingenuity in solving new problems by leveraging prior experience, meta-learning is a subfield of machine learning whose goal is to automatically adapt a learning mechanism from past experiences to rapidly learn new tasks with little available data. Since it "learns the learning mechanism", it is also referred to as learning-to-learn [1]. It is seen as a critical problem for the future of machine learning [2]. Numerous formulations exist for meta-learning, and we focus on the problem of online meta-learning, where the tasks arrive one at a time and the goal is to efficiently transfer information from the previous tasks to the new ones, so that we learn each new task as efficiently as possible (this has also been referred to as lifelong learning). Each task is in turn processed online. To sum up, we have a stream of tasks and, for each task, a stream of observations.
To solve online tasks, diverse well-established strategies exist: the perceptron, the online gradient algorithm (OGA), online mirror descent, follow-the-regularized-leader, and exponentially weighted aggregation (EWA, also referred to as generalized Bayes). We refer the reader to [3][4][5][6] for introductions to these algorithms and to the so-called regret bounds that control their generalization errors. We refer to these algorithms as the within-task strategies. The big challenge is to design a meta-strategy that uses past experience to adapt a within-task strategy so that it performs better on the next tasks.
In this paper, we propose a new meta-learning strategy. The main idea is to learn the tuning parameters by minimizing a regret bound. We provide a meta-regret analysis for our strategy. We illustrate our results in the cases where the within-task strategy is the online gradient algorithm or exponentially weighted aggregation. For OGA, the tuning parameters considered are the initialization and the gradient steps. For EWA, they are the prior and the learning rate.

Organization of the Paper
In Section 2, we introduce the formalism of meta-learning and the notations that will be used throughout the paper. In Section 3, we introduce our meta-learning strategy and its theoretical analysis. In Section 4, we provide the details of our method in the case of meta-learning the initialization and the step size in the online gradient algorithm. Our theoretical results also exhibit explicit situations where meta-learning indeed improves on learning the tasks independently; this is confirmed by the experiments reported in this section. In Section 5, we provide the details of our methodology when the algorithm used within tasks is a generalized Bayesian algorithm: EWA. We show how our meta-strategy can be used to tune the learning rate; we also discuss how it can be used to learn priors. The proofs of the main results are given in Section 6.

Notations and Preliminaries
By convention, vectors $v \in \mathbb{R}^d$ are seen as $d \times 1$ matrices (columns). Let $\|v\|$ denote the Euclidean norm of $v$. Let $A^T$ denote the transpose of any $d \times k$ matrix $A$, and $I_d$ the $d \times d$ identity matrix. For two real numbers $a$ and $b$, let $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$. For $z \in \mathbb{R}$, $z_+$ is its positive part, $z_+ = z \vee 0$. Given a finite set $S$, we let $\mathrm{card}(S)$ denote the cardinality of $S$.
The learner has to solve tasks $t = 1, \dots, T$ sequentially. Each task $t$ consists in $n$ rounds $i = 1, \dots, n$. At each round $i$ of task $t$, the learner has to take a decision $\theta_{t,i}$ in a decision space $\Theta \subseteq \mathbb{R}^d$ for some $d > 0$. Then, a convex loss function $\ell_{t,i} : \Theta \to \mathbb{R}$ is revealed to the learner, who incurs the loss $\ell_{t,i}(\theta_{t,i})$. Classical examples with $\Theta \subset \mathbb{R}^d$ include regression tasks, where $\ell_{t,i}(\theta) = (y_{t,i} - x_{t,i}^T \theta)^2$ for some $x_{t,i} \in \mathbb{R}^d$ and $y_{t,i} \in \mathbb{R}$, and classification tasks, where $\ell_{t,i}(\theta) = (1 - y_{t,i} x_{t,i}^T \theta)_+$ for some $x_{t,i} \in \mathbb{R}^d$, $y_{t,i} \in \{-1, +1\}$. Throughout the paper, we will assume that the learner uses, for each task, an online decision strategy called the within-task strategy, parametrized by a tuning parameter $\lambda \in \Lambda$, where $\Lambda$ is a closed, convex subset of $\mathbb{R}^p$ for some $p > 0$. Examples of such strategies include the online gradient algorithm, given by $\theta_{t,i} = \theta_{t,i-1} - \gamma \nabla \ell_{t,i-1}(\theta_{t,i-1})$. In this case, the tuning parameters are the initialization, or starting point, $\theta_{t,1} = \vartheta$, and the learning rate, or step size, $\gamma$. That is, $\lambda = (\vartheta, \gamma)$, so $p = d + 1$. The parameter $\lambda$ is kept fixed during the whole task. It is of course possible to use the same parameter $\lambda$ in all the tasks. However, we will be interested here in defining meta-strategies that allow us to improve $\lambda$ task after task, based on the information available so far. In Section 3, we will define such strategies. For now, let $\lambda_t$ denote the tuning parameter used by the learner all along task $t$. Figure 1 provides a recap of all the notations. Let $\theta_{t,i}^{\lambda}$ denote the decision at round $i$ of task $t$ when the online strategy is used with parameter $\lambda$. We will assume that a regret bound is available for the within-task strategy. By this, we mean that there is a set $\Theta_0 \subseteq \Theta$ of parameters of interest, and that the learner knows a function $B_n : \Theta \times \Lambda \to \mathbb{R}$ such that, for any task $t$, any $\lambda \in \Lambda$ and any $\theta \in \Theta_0$, $\sum_{i=1}^n \ell_{t,i}(\theta_{t,i}^{\lambda}) \le \sum_{i=1}^n \ell_{t,i}(\theta) + B_n(\theta, \lambda)$. (1) For OGA, regret bounds can be found, for example, in [4,6] (in this case, $\Theta_0 = \Theta$).
Other examples include exponentially weighted aggregation (bounds in [3]; here, $\Theta_0$ is a finite set of predictors, while the decisions $\Theta$ are probability distributions on $\Theta_0$). More examples will be discussed in the paper. For a fixed parameter $\theta$, the quantity $\sum_{i=1}^n \ell_{t,i}(\theta_{t,i}^{\lambda}) - \sum_{i=1}^n \ell_{t,i}(\theta)$ measures the difference between the total loss suffered during task $t$ and the loss one would have suffered using the parameter $\theta$. It is thus called "the regret with respect to parameter $\theta$", and $B_n(\theta, \lambda)$ is usually referred to as a "regret bound". We will call $L_t(\lambda) = \inf_{\theta \in \Theta_0} [\sum_{i=1}^n \ell_{t,i}(\theta) + B_n(\theta, \lambda)]$ the "meta-loss". In [29], the authors study a meta-strategy that minimizes the meta-loss of OGA. Indeed, if (1) is tight, minimizing the right-hand side is a good way to ensure that the left-hand side, that is, the cumulated loss, is small. In this work, we focus on meta-strategies minimizing the meta-loss in a more general context.
The simplest meta-strategy is learning in isolation: we keep $\lambda_t = \lambda_0 \in \Lambda$ for all tasks. By (1), the total loss after task $T$ is then bounded as $\sum_{t=1}^T \sum_{i=1}^n \ell_{t,i}(\theta_{t,i}^{\lambda_0}) \le \sum_{t=1}^T L_t(\lambda_0)$. (2) However, when the learner uses a meta-strategy to improve the tuning parameter at the end of each task, the total loss is given by $\sum_{t=1}^T \sum_{i=1}^n \ell_{t,i}(\theta_{t,i}^{\lambda_t})$. We will, in this paper, investigate strategies with meta-regret bounds, that is, bounds of the form $\sum_{t=1}^T \sum_{i=1}^n \ell_{t,i}(\theta_{t,i}^{\lambda_t}) \le \inf_{\lambda \in \Lambda} [\sum_{t=1}^T L_t(\lambda)] + \text{remainder terms}$. (3) Of course, such bounds will be relevant only if the right-hand side of (3) is not larger than the right-hand side of (2), and is significantly smaller in some favourable settings. We show when this is the case in Section 4.

Meta-Learning Algorithms
In this section, we provide two meta-strategies to update $\lambda$ at the end of each task. The first one is a direct application of OGA to meta-learning. It is computationally simpler, but feasible only in the special case where we have an explicit formula for the (sub-)gradient of each $L_t(\lambda)$; in Section 4, we provide an example where this is the case. The second one is an application of implicit online learning to our setting, and it can be used without this assumption. In both cases, we provide a regret bound of the form (3), under the following condition.

Assumption 1. For any $t$, the meta-loss $L_t$ is convex and $L$-Lipschitz on $\Lambda$.

Special Case: The Gradient of the Meta-Loss Is Available in Closed Form
As each $L_t$ is convex, its subdifferential at each point of $\Lambda$ is non-empty. For the sake of simplicity, we will use the notation $\nabla L_t(\lambda)$ in the following formulas to denote any element of the subdifferential of $L_t$ at $\lambda$. We define the online gradient meta-strategy (OGMS) with step $\alpha > 0$ and starting point $\lambda_1 \in \Lambda$: for any $t > 1$, $\lambda_t = \Pi_{\Lambda}(\lambda_{t-1} - \alpha \nabla L_{t-1}(\lambda_{t-1}))$, (4) where $\Pi_{\Lambda}$ denotes the orthogonal projection on $\Lambda$.
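As a concrete illustration, one OGMS step can be sketched in a few lines. The meta-loss, its gradient, and the choice of $\Lambda$ below are hypothetical placeholders used only to exercise the update, not objects from the paper:

```python
import numpy as np

def ogms_step(lam_prev, grad_L_prev, alpha, proj):
    """One OGMS update: a (sub)gradient step on the previous meta-loss,
    followed by orthogonal projection back onto Lambda."""
    return proj(lam_prev - alpha * grad_L_prev(lam_prev))

# Hypothetical choice of Lambda: the Euclidean ball of radius 1.
def proj_ball(v, radius=1.0):
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

# Hypothetical quadratic meta-loss L(lam) = ||lam - target||^2 / 2,
# whose gradient is lam - target (a stand-in for the real L_t).
target = np.array([0.3, -0.2])
grad_L = lambda lam: lam - target

lam = np.zeros(2)
for _ in range(200):
    lam = ogms_step(lam, grad_L, alpha=0.1, proj=proj_ball)
# lam converges to the minimizer of L over Lambda (here, target itself)
```

With a fixed step, the iterates contract toward the constrained minimizer of the surrogate meta-loss; in the actual meta-learning setting, each step would instead use the subgradient of the most recent task's meta-loss.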

The General Case
We now cover the general case, where a formula for the gradient of $L_t(\lambda)$ might not be available. We propose to apply a strategy that was first defined in [32] for online learning and studied under the name "implicit online learning" (we refer the reader to [33] and the references therein). In the meta-learning context, this gives the online proximal meta-strategy (OPMS) with step $\alpha > 0$ and starting point $\lambda_1 \in \Lambda$, defined by: for any $t > 1$, $\lambda_t = \arg\min_{\lambda \in \Lambda} [L_{t-1}(\lambda) + \frac{\|\lambda - \lambda_{t-1}\|^2}{2\alpha}]$. (5) Using classical notations, e.g., [34], we can rewrite this definition with the proximal operator (hence the name of the method). Indeed, $\lambda_t = \mathrm{prox}_{\alpha L_{t-1}}(\lambda_{t-1})$, where the proximal operator is given, for any $x \in \Lambda$ and any convex function $f : \Lambda \to \mathbb{R}$, by $\mathrm{prox}_{\alpha f}(x) = \arg\min_{y \in \Lambda} [f(y) + \frac{\|y - x\|^2}{2\alpha}]$. This strategy is feasible in practice in the regime we are interested in, that is, when $n$ is small or moderately large and $T \to \infty$. The learner has to store all the losses of the current task, $\ell_{t-1,1}, \dots, \ell_{t-1,n}$. At the end of the task, the learner can use any convex optimization algorithm to minimize, with respect to $(\theta, \lambda) \in \Theta \times \Lambda$, the function $F_t(\theta, \lambda) = \sum_{i=1}^n \ell_{t-1,i}(\theta) + B_n(\theta, \lambda) + \frac{\|\lambda - \lambda_{t-1}\|^2}{2\alpha}$. We can use a (projected) gradient descent on $F_t$ or its accelerated variants [35].
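A minimal sketch of one OPMS step, approximating the proximal update by inner gradient descent. The quadratic meta-loss below is a hypothetical stand-in, chosen because its proximal operator has a known closed form against which the inner optimization can be checked:

```python
import numpy as np

def opms_step(lam_prev, grad_L, alpha, inner_steps=500, lr=0.05):
    """Approximate the proximal update lam_t = prox_{alpha L}(lam_prev) by
    plain gradient descent on F(lam) = L(lam) + ||lam - lam_prev||^2 / (2*alpha)."""
    lam = lam_prev.copy()
    for _ in range(inner_steps):
        grad_F = grad_L(lam) + (lam - lam_prev) / alpha
        lam = lam - lr * grad_F
    return lam

# Hypothetical meta-loss L(lam) = ||lam - target||^2 / 2; its prox has the
# closed form (lam_prev + alpha * target) / (1 + alpha).
target = np.array([1.0, -1.0])
grad_L = lambda lam: lam - target

lam_prev = np.zeros(2)
alpha = 1.0
lam_next = opms_step(lam_prev, grad_L, alpha)
```

In the actual method, the inner minimization would run over $(\theta, \lambda)$ jointly on $F_t$, with projections onto $\Theta$ and $\Lambda$; the sketch only illustrates the proximal structure in $\lambda$.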

Regret Analysis
A direct application of known results to the setting of this paper leads to the following proposition. For the sake of completeness, we provide the proofs in Section 6. Proposition 1. Under Assumption 1, using either OGMS or OPMS with step $\alpha > 0$ and starting point $\lambda_1 \in \Lambda$, we have, for any $\lambda \in \Lambda$, $\sum_{t=1}^T L_t(\lambda_t) \le \sum_{t=1}^T L_t(\lambda) + \frac{\|\lambda - \lambda_1\|^2}{2\alpha} + \frac{\alpha T L^2}{2}$.

Example: Learning the Tuning Parameters of Online Gradient Descent
Throughout this section, we work under the following condition.

Assumption 2. For any $(t, i)$, the loss $\ell_{t,i}$ is convex and $\Gamma$-Lipschitz on $\Theta = \{\theta \in \mathbb{R}^d : \|\theta\| \le C\}$.

Explicit Meta-Regret Bound
We study the situation where the learner uses (projected) OGA as a within-task strategy; that is, $\Theta = \{\theta \in \mathbb{R}^d : \|\theta\| \le C\}$, $\theta_{t,1} = \vartheta$ and, for any $i > 1$, $\theta_{t,i} = \Pi_{\Theta}(\theta_{t,i-1} - \gamma \nabla \ell_{t,i-1}(\theta_{t,i-1}))$. With such a strategy, we already mentioned that $\lambda = (\vartheta, \gamma) \in \Lambda \subseteq \Theta \times \mathbb{R}_+$ contains an initialization and a step size. An application of the results in Chapter 11 in [3] gives $B_n(\theta, \lambda) = B_n(\theta, (\vartheta, \gamma)) = \frac{\gamma \Gamma^2 n}{2} + \frac{\|\theta - \vartheta\|^2}{2\gamma}$. It is quite direct to check Assumption 1. We summarize this in the following proposition.
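The within-task projected OGA just described can be sketched as follows; the toy regression task is hypothetical and serves only to illustrate the projected update:

```python
import numpy as np

def run_oga_task(grads, theta0, gamma, C):
    """Projected OGA within one task. grads[i](theta) returns a (sub)gradient
    of the loss revealed at round i+1; decisions stay in {||theta|| <= C}."""
    def proj(v):
        norm = np.linalg.norm(v)
        return v if norm <= C else v * (C / norm)
    thetas = [proj(theta0)]
    for i in range(len(grads) - 1):
        thetas.append(proj(thetas[-1] - gamma * grads[i](thetas[-1])))
    return thetas

# Hypothetical noise-free regression task: ell_i(theta) = (y_i - x_i . theta)^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
theta_star = np.array([0.5, -0.5])
y = X @ theta_star
grads = [lambda th, x=X[i], yi=y[i]: -2.0 * (yi - x @ th) * x for i in range(30)]
thetas = run_oga_task(grads, theta0=np.zeros(2), gamma=0.05, C=2.0)
# the decisions drift from the initialization toward theta_star
```

The meta-strategies of Section 3 then tune the pair $(\vartheta, \gamma)$ passed to this inner loop across tasks.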

Proposition 2.
Under Assumption 2, assume that the learner uses OGA as an inner algorithm.
So, when the learner uses one of the meta-strategies OGMS or OPMS, we can apply Proposition 1. This leads to the following theorem.

Theorem 1.
Under the assumptions of Proposition 2, with $\underline{\gamma} = 1/n^{\beta}$ for some $\beta > 0$ and $\bar{\gamma} = C^2$, when the learner uses either OGMS or OPMS with $\alpha = \frac{C}{L}\sqrt{\frac{4 + C^2}{T}}$ (where $L$ is given by (11)), we have: where $C(\Gamma, C) > 0$ depends only on $(\Gamma, C)$ and where: Let us compare this result with learning in isolation, as defined in (2); that is, solving the sequence of tasks with a constant hyperparameter $\lambda = (\vartheta, \gamma)$. For the usual choice $\vartheta = 0$ and $\gamma = c/\sqrt{n}$, where $c$ is a constant that depends on neither $n$ nor $T$, OGA leads to a regret in $O(\sqrt{n})$. After $T$ tasks, learning in isolation thus leads to a regret in $T\sqrt{n}$. Consider now our strategies with $\beta = 1$. The term $n^2\sqrt{T}$ in the resulting bound is the price to pay for meta-learning; in the regime we are interested in (small $n$, large $T$), it is smaller than $T\sqrt{n}$. Consider the leading term. In the worst-case scenario, it is also $T\sqrt{n}$. However, when there are good predictors $\theta_1, \dots, \theta_T$ for tasks $1, \dots, T$, respectively, such that $\sigma(\theta_1^T)$ is small, we see the improvement with respect to learning in isolation. The extreme case is when there is a single good predictor $\theta^*$ that predicts well for all tasks. In this case, the regret with respect to $\theta_1 = \dots = \theta_T = \theta^*$ is in $n^2\sqrt{T} + T$, which improves significantly on learning in isolation. Note, however, that using a different meta-strategy specifically designed for OGA, the authors of [29] obtain a better dependence on $T$ when $\sigma(\theta_1^T) = 0$. Let us now discuss the implementation of our meta-strategy. We first remark that, under the quadratic loss, it is possible to derive a formula for $L_t$, which allows the use of OGMS. We then discuss OPMS for the general case.

Special Case: Quadratic Loss
First, consider $\ell_{t,i}(\theta) = (y_{t,i} - x_{t,i}^T \theta)^2$ for some $y_{t,i} \in \mathbb{R}$ and $x_{t,i} \in \mathbb{R}^d$. Assumption 2 is satisfied if we assume, moreover, that all $|y_{t,i}| \le c$ and $\|x_{t,i}\| \le b$, with $\Gamma = 2bc + 2b^2 C$. In this case, the minimizer with respect to $\theta$ is known as the ridge regression estimator. It also coincides with the minimizer in the right-hand side of (16) on condition that $\|\hat{\theta}_t\| \le C$. In this case, by plugging $\hat{\theta}_t$ into (16), we have a closed-form formula for $L_t((\vartheta, \gamma))$ and an explicit (but cumbersome) formula for its gradient. It is thus possible to use the OGMS strategy to update $\lambda = (\vartheta, \gamma)$.
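A sketch of this closed-form minimization in $\theta$, under the assumption (made explicit here, since the displayed formula is not reproduced above) that the penalty coming from the OGA regret bound is $\|\theta - \vartheta\|^2/(2\gamma)$; the data are arbitrary, and the projection onto $\|\theta\| \le C$ is omitted:

```python
import numpy as np

def ridge_like_minimizer(X, y, vartheta, gamma):
    """Minimizer in theta of sum_i (y_i - x_i.theta)^2 + ||theta - vartheta||^2/(2*gamma).
    Setting the gradient to zero gives the ridge-like linear system below."""
    d = X.shape[1]
    A = X.T @ X + np.eye(d) / (2.0 * gamma)
    b = X.T @ y + vartheta / (2.0 * gamma)
    return np.linalg.solve(A, b)

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
y = rng.normal(size=30)
vartheta = np.ones(3)
gamma = 0.1
theta_hat = ridge_like_minimizer(X, y, vartheta, gamma)
```

First-order optimality can be checked directly: the gradient $-2X^T(y - X\hat{\theta}) + (\hat{\theta} - \vartheta)/\gamma$ vanishes at the returned solution.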

The General Case
In the general case, we minimize $F_t$ with respect to $\theta, \vartheta, \gamma$. Any efficient minimization procedure can be used. In our experiments, we used a projected gradient descent. Note that even though we do not stricto sensu obtain the minimizer of $F_t$, we can get arbitrarily close to it by taking a large enough number of steps. The main difference between this algorithm and the strategy suggested in [29] is that ours is obtained by applying the general proximal update introduced in Equation (7), while they decoupled the updates of the initialization and the learning rate.

Experimental Study
In this section, we use simulated data to compare the numerical performance of OPMS with learning the tasks in isolation with online gradient descent (I-OGA). To measure the impact of learning the gradient step $\gamma$, we also introduce mean-OPMS, which uses the same strategy as OPMS but only learns the starting point $\vartheta$ (it is thus close to [27]). We present the results for regression tasks with the mean-squared-error loss, and then for classification with the hinge loss. The notebooks of the experiments can be found online: https://dimitri-meunier.github.io/ (accessed on 26 September 2021).

Synthetic Regression
At each round $t = 1, \dots, T$, the meta-learner sequentially receives a regression task that corresponds to a dataset $(x_{t,i}, y_{t,i})_{i=1,\dots,n}$ generated as a linear model $y_{t,i} = x_{t,i}^T \theta_t + \epsilon_{t,i}$, where the noise terms $\epsilon_{t,i}$ are all independent and the inputs are uniformly sampled. We take $d = 20$, $n = 30$, $T = 200$, $\sigma^2 = 0.5$, and $\theta_0$ with all components equal to 5. In this setting, $\theta_0$ is a common bias between the tasks, $\sigma^2$ is the inter-task variance, and $r$ characterizes the task similarity. We experiment with different values of $r \in \{0, 5, 10, 30\}$ to observe the impact of task similarity on the meta-learning process. The smaller $r$, the closer the tasks; in the extreme case $r = 0$, the tasks are identical, in the sense that the parameters $\theta_t$ of the tasks are all the same. We draw attention to the fact that a cross-validation procedure to select $\alpha$ (the parameter of OGMS or OPMS, see Equation (5)) or $\gamma$ is not valid in the online setting, as it would require knowledge of several tasks in advance for the former, and of several datapoints in advance within each task for the latter. Moreover, the theoretical values are based on a worst-case analysis and lead in practice to slow learning. In practice, setting these values to the correct order of magnitude without adjusting the constants led to better results. So, for mean-OPMS and OPMS we set $\alpha = 1/\sqrt{T}$; for OPMS and I-OGA we set $\gamma = 1/\sqrt{n}$. Instead of cross-validation, one can launch several online learners in parallel with different parameter values and pick the best one (or aggregate them). That is the strategy we use to select $\Gamma$ for OPMS. Note that the exact value of $\Gamma$ is usually unknown in practice; its automatic calibration is an important open question. To solve (18), after each task we use the exact solution for mean-OPMS and a projected Newton descent with 10 steps for OPMS. We observed that not reaching the exact solution of (18) does not harm the performance of the algorithm, and 10 steps are sufficient to reach convergence.
The results are displayed in Table 1 and Figure 2. In Figure 2, for each task $t = 1, \dots, T$, we report the average end-of-task loss $\mathrm{MSE}_t = \sum_{i=1}^n \ell_{t,i}(\theta_{t,n})/n$, averaged over 50 independent runs (with confidence intervals). Table 1 reports $\mathrm{MSE}_t$ averaged over the 100 most recent tasks. The results confirm our theoretical findings: learning $\gamma$ can bring a substantial benefit over just learning the starting point, which in turn brings a considerable benefit with respect to learning the tasks in isolation. Learning the gradient step makes the meta-learner more robust to task dissimilarities (i.e., when $r$ increases), as shown in Figure 2. In the regime where $r$ is low, learning the gradient step does not help the meta-learner, as it takes more steps to reach convergence. Overall, both meta-learners are consistently better than learning the tasks in isolation, since the number of observations per task is low.

Figure 2. Performance of learning in isolation with OGA (I-OGA), OPMS to learn the initialization (mean-OPMS), and OPMS to learn the initialization and step size (OPMS). We report the average end-of-task MSE losses at the end of each task, for different values of the task-similarity index $r \in \{0, 5, 10, 30\}$. The results are averaged over 50 independent runs to get confidence intervals.

Synthetic Classification

At each round $t = 1, \dots, T$, the meta-learner sequentially receives a binary classification task with the hinge loss that corresponds to a dataset $(x_{t,i}, y_{t,i})_{i=1,\dots,n}$. The binary labels in $\{-1, 1\}$ are generated from a logistic model, $P(y_{t,i} = 1) = (1 + \exp(-x_{t,i}^T \theta_t))^{-1}$. The task parameters $\theta_t$ and the inputs are generated as in the regression setting. To add some noise, we shuffle 10% of the labels. We take $d = 10$, $n = 100$, $T = 500$, $r = 2$. For mean-OPMS and OPMS we set $\alpha = 1/\sqrt{T}$; for OPMS and I-OGA we set $\gamma = 1/\sqrt{n}$. For the optimisation of $F_t$ in (18), with both OPMS and mean-OPMS, we use a projected gradient descent with 50 steps.
In Figure 3, for each task $t = 1, \dots, T$, we report the regret on the end-of-task losses, $R(t) = \frac{1}{nt}\sum_{k=1}^t \sum_{i=1}^n \ell_{k,i}(\theta_{k,n})$, averaged over 10 independent runs (with confidence intervals). As for the regression setting, the results confirm our theoretical findings: by learning $\gamma$ (OPMS), we reach a better overall performance than by just learning the initialization (mean-OPMS), and a substantially better one than with independent task learning (I-OGA). Note that, in the classification setting, there is no known closed-form expression for the meta-gradient; therefore, OGMS cannot be used. Figure 3. Performance of learning in isolation with OGA (I-OGA), OPMS to learn the initialization (mean-OPMS), and OPMS to learn the initialization and step size (OPMS) on a sequence of classification tasks with the hinge loss. We report the meta-regret of the hinge loss. The results are averaged over 10 independent runs (dataset generation) to get confidence intervals.

Second Example: Learning the Prior or the Learning Rate in Exponentially Weighted Aggregation
In this section, we study a generalized Bayesian method: exponentially weighted aggregation. Consider a finite set $\Theta_0 = \{\theta_1, \dots, \theta_M\} \subset \mathbb{R}^d$. EWA depends on a prior distribution $\pi$ on $\Theta_0$ and on a learning rate $\eta > 0$, and returns a decision in $\Theta = \mathrm{conv}(\theta_1, \dots, \theta_M)$, the convex hull of $\Theta_0$. In this section, we work under the following condition.
We will sometimes use a stronger assumption.
Examples of situations in which Assumption 4 is satisfied are provided in [3]. Note that Assumption 4 implies Assumption 3.

Reminder on EWA
The update in EWA is given by $\theta_{t,i} = \sum_{j=1}^M p_{t,i}(\theta_j)\,\theta_j$, where the weights $p_{t,i}$ are defined by $p_{t,i}(\theta_j) \propto \pi(\theta_j)\exp(-\eta \sum_{k=1}^{i-1} \ell_{t,k}(\theta_j))$. The strategy is studied in detail in [3]. We refer the reader to [36] and the references therein for connections to Bayesian inference. We recall the following regret bounds from [3]. First, under Assumption 3, $\sum_{i=1}^n \ell_{t,i}(\theta_{t,i}) \le \min_{\theta \in \Theta_0} [\sum_{i=1}^n \ell_{t,i}(\theta) + \frac{1}{\eta}\log\frac{1}{\pi(\theta)}] + \frac{\eta n B^2}{8}$. (24) Moreover, under the stronger Assumption 4, $\sum_{i=1}^n \ell_{t,i}(\theta_{t,i}) \le \min_{\theta \in \Theta_0} [\sum_{i=1}^n \ell_{t,i}(\theta) + C\log\frac{1}{\pi(\theta)}]$. (25) In Section 5.2, we work in the general setting (Assumption 3), and we use our meta-strategy OPMS or OGMS to learn $\eta$. In Section 5.3, we use OPMS or OGMS to learn $\pi$ under Assumption 4.
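The EWA weight computation can be sketched as follows; the losses below are hypothetical, and the log-space stabilization is a standard implementation detail rather than part of the method's description:

```python
import numpy as np

def ewa_weights(cum_losses, eta, prior):
    """EWA weights over M predictors: p_j proportional to
    prior_j * exp(-eta * cumulative loss of predictor j)."""
    log_w = np.log(prior) - eta * cum_losses
    log_w -= log_w.max()          # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

# Hypothetical example: 4 predictors, uniform prior, made-up cumulative losses.
prior = np.full(4, 0.25)
cum_losses = np.array([3.0, 1.0, 4.0, 1.5])
p = ewa_weights(cum_losses, eta=1.0, prior=prior)
# the decision is then the mixture sum_j p_j * theta_j, in the convex hull of Theta_0
```

Larger learning rates concentrate the weights on the predictor with the smallest cumulative loss, which is the behavior the regret bounds above quantify.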

Learning the Rate η
Consider the uniform prior, $\pi(\theta) = 1/M$ for any $\theta \in \Theta_0$. Then, the regret bound (24) becomes $\sum_{i=1}^n \ell_{t,i}(\theta_{t,i}) \le \min_{\theta \in \Theta_0} \sum_{i=1}^n \ell_{t,i}(\theta) + \frac{\log M}{\eta} + \frac{\eta n B^2}{8}$, (26) and it is then possible to optimize it explicitly with respect to $\eta$. The value minimizing the bound is $\eta = (2/B)\sqrt{2\log(M)/n}$ (27), and the minimized regret bound is $B\sqrt{n\log(M)/2}$. In practice, however, while it is often reasonable to assume that the loss function is bounded (as in Assumption 3), very often one does not know a tight upper bound. Thus, one may use a constant $B$ that satisfies Assumption 3 but is far too large. Even though one does not know a better upper bound than $B$, one would like a regret bound that depends on the tightest possible upper bound.
In the meta-learning framework, define $L_t(\eta) = \inf_{\theta \in \Theta_0}[\sum_{i=1}^n \ell_{t,i}(\theta)] + \frac{\log M}{\eta} + \frac{\eta n B^2}{8}$ (28) for $\eta \in \Lambda = [1/n, 1]$. It is immediate to prove that this function is convex and $L$-Lipschitz with $L = n^2 \log(M) + nB^2/8$. So, Assumption 1 is satisfied, allowing the use of the OPMS or OGMS strategy without needing a tight upper bound on the losses. Note that, in this context, the OGMS strategy is given by: $\eta_t = \Pi_{[1/n, 1]}(\eta_{t-1} - \alpha(-\frac{\log M}{\eta_{t-1}^2} + \frac{n B^2}{8}))$.
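OGMS on $\eta$ is particularly simple to sketch, assuming (as suggested by the Lipschitz constant $L = n^2\log(M) + nB^2/8$) that the $\eta$-dependent part of $L_t(\eta)$ is $\log(M)/\eta + \eta nB^2/8$, so that the within-task infimum drops out of the gradient. The constants $M$, $n$, $B$ below are arbitrary illustrations:

```python
import math

def grad_meta_loss_eta(eta, M, n, B):
    """Gradient in eta of log(M)/eta + eta*n*B^2/8 (the within-task
    infimum does not depend on eta, so it does not appear here)."""
    return -math.log(M) / eta**2 + n * B**2 / 8.0

def ogms_eta_step(eta, alpha, M, n, B):
    """One OGMS step on eta, projected back onto Lambda = [1/n, 1]."""
    eta = eta - alpha * grad_meta_loss_eta(eta, M, n, B)
    return min(1.0, max(1.0 / n, eta))

# Arbitrary illustrative constants (not from the paper's experiments).
M, n, B = 16, 100, 1.0
eta = 1.0
for _ in range(2000):
    eta = ogms_eta_step(eta, alpha=1e-3, M=M, n=n, B=B)
# eta approaches (2/B)*sqrt(2*log(M)/n), the minimizer of the bound
```

In this toy run the iterates converge to the same value that explicit optimization of the bound gives, which is the point: the meta-strategy recovers the optimal rate without knowing a tight bound in advance.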

Theorem 2.
Under Assumption 3, using OGMS or OPMS on $L_t(\eta)$, as in (28), with $\eta_1 = 1$ and $L = n^2\log(M) + nB^2/8$: Let us compare learning in isolation with meta-learning in this context. When learning in isolation, the hyperparameter $\eta$ is fixed (as in (2)). If we fix it to the value $\eta_0 = (2/B)\sqrt{2\log(M)/n}$, as in (27), the meta-regret is in $BT\sqrt{n\log(M)/2}$. On the other hand, meta-learning leads to a meta-regret in $bT\sqrt{n\log(M)/2} + n^2\log(M)\sqrt{2T} + O(nB^2\sqrt{T} + T)$. In other words, we replace the potentially loose upper bound $B$ by the tightest possible bound $b$, at the cost of an additional $n^2\log(M)\sqrt{2T} + O(nB^2\sqrt{T} + T)$ term. Here again, when $T$ is large enough with respect to $n$, this term is negligible.

Learning the Prior π
Under Assumption 4, we have the regret bound in (25). Without any information on $\Theta_0$, it seems natural to use the uniform prior $\pi$ on $\Theta_0 = \{\theta_1, \dots, \theta_M\}$, which leads to $\sum_{i=1}^n \ell_{t,i}(\theta_{t,i}) \le \min_{\theta \in \Theta_0} \sum_{i=1}^n \ell_{t,i}(\theta) + C\log(M)$. If some additional information were available, such as, for example, "the best $\theta$ is always either $\theta_1$ or $\theta_2$", one would rather choose the uniform prior on $\{\theta_1, \theta_2\}$ and obtain the bound $\sum_{i=1}^n \ell_{t,i}(\theta_{t,i}) \le \min_{\theta \in \{\theta_1, \theta_2\}} \sum_{i=1}^n \ell_{t,i}(\theta) + C\log(2)$. Unfortunately, such information is generally not available. However, in the context of meta-learning, we can take advantage of the previous tasks to learn such information.
Thus, let us define, for any task $t$, $L_t(\pi) = \inf_{\theta \in \Theta_0}[\sum_{i=1}^n \ell_{t,i}(\theta) + C\log\frac{1}{\pi(\theta)}]$ for $\pi = (\pi(\theta_1), \dots, \pi(\theta_M)) \in \Lambda$, where $\Lambda$ is the set of probability distributions on $\Theta_0$ with $\pi(\theta_j) \ge 1/(2M)$ for all $j$. One can check that $L_t$ is convex and $L$-Lipschitz with $L = 2CM$ on $\Lambda$; this allows us to use OPMS (or OGMS).
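A sketch of evaluating this meta-loss and one of its subgradients, assuming the bound takes the form $B_n(\theta, \pi) = C\log(1/\pi(\theta))$ (the cumulative within-task losses below are hypothetical):

```python
import math

def meta_loss_prior(cum_losses, pi, C):
    """Meta-loss L_t(pi) = min_j [S_t(theta_j) + C*log(1/pi_j)] together with
    one subgradient in pi (nonzero only at a minimizing index j*)."""
    values = [s + C * math.log(1.0 / p) for s, p in zip(cum_losses, pi)]
    j_star = min(range(len(values)), key=values.__getitem__)
    grad = [0.0] * len(pi)
    grad[j_star] = -C / pi[j_star]   # d/dpi_j of C*log(1/pi_j) at j = j*
    return values[j_star], grad

# Hypothetical task: 4 experts with uniform prior; expert 2 performed best.
pi = [0.25, 0.25, 0.25, 0.25]
value, grad = meta_loss_prior([5.0, 4.0, 1.0, 6.0], pi, C=1.0)
```

Each meta-step thus pushes mass toward the experts that were optimal on recent tasks, which is exactly how the $\log(2m^*)$ improvement below arises.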
Define $I^* = \{\theta_1^*, \dots, \theta_T^*\}$, where each $\theta_t^*$ is as in (34), and let $m^* = \mathrm{card}(I^*)$. We have: When learning in isolation with a uniform prior, the meta-regret is in $TC\log(M)$. On the other hand, if $m^*$ is small (that is, many of the $\theta_t^*$ coincide), meta-learning leads to a meta-regret in $CT\log(2m^*) + 2CM\sqrt{T}$. For $T$ large enough, this is an important improvement.

Discussion on the Continuous Case
Let us now discuss the possibility of meta-learning for generalized Bayesian methods when $\Theta_0$ is no longer a finite set. There is a general formula for EWA, given by the update $\rho_{t,i}$, where the minimum is taken over all probability distributions that are absolutely continuous with respect to $\pi$, $\pi$ is a prior distribution, $\eta > 0$ a learning rate, and $K$ is the Kullback-Leibler (KL) divergence. Meta-learning for such an update rule was studied in [10,37], but it usually does not lead to feasible strategies. Online variational inference [38,39] consists in replacing the minimization over the set of all probability distributions by a minimization over a smaller set, in order to define a feasible approximation of $\rho_{t,i}$. For example, let $(q_{\mu})_{\mu \in M}$ be a parametric family of probability distributions; we then restrict the minimization to this family. It is discussed in [40] that, generally, when $\mu$ is a location-scale parameter and $\ell_{t,i}$ is $\Gamma$-Lipschitz and convex, then $\bar{\ell}_{t,i}(\mu) := E_{\theta \sim q_{\mu}}[\ell_{t,i}(\theta)]$ is $2\Gamma$-Lipschitz and convex. In this case, under the assumption that $K(q_{\mu}, \pi)$ is $\alpha$-strongly convex in $\mu$, a regret bound for such strategies was derived in [39]. A complete study of meta-learning of the rate $\eta > 0$ and of the prior $\pi$ in this context is an important objective (possibly with the restriction that $\pi \in \{q_{\mu}, \mu \in M\}$). However, this raises many problems. For example, the KL divergence $K(q_{\mu}, q_{\mu'})$ is not always convex with respect to the parameter $\mu'$. In this case, it might help to replace it by a convex relaxation that would allow the use of OGMS or OPMS. This relates to [41,42], who advocate going beyond the KL divergence in (39); see also [36] and the references therein. This will be the object of future works.

Proofs
We start with a preliminary lemma that will be used in the proof of Proposition 1. Lemma 1. Let $a, b, c$ be three vectors in $\mathbb{R}^p$. Then: $2(a-b)^T(b-c) = \|a-c\|^2 - \|a-b\|^2 - \|b-c\|^2$. Proof. Expand $\|a-c\|^2 = \|a\|^2 + \|c\|^2 - 2a^Tc$ in the right-hand side, as well as $\|a-b\|^2$ and $\|b-c\|^2$. Then simplify.
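Lemma 1 is the classical three-point identity $2(a-b)^T(b-c) = \|a-c\|^2 - \|a-b\|^2 - \|b-c\|^2$; a quick numerical sanity check:

```python
import numpy as np

# Numerical check of the three-point identity of Lemma 1:
#   2*(a - b).(b - c) = ||a - c||^2 - ||a - b||^2 - ||b - c||^2
rng = np.random.default_rng(42)
a, b, c = rng.normal(size=(3, 5))
lhs = 2.0 * (a - b) @ (b - c)
rhs = np.sum((a - c) ** 2) - np.sum((a - b) ** 2) - np.sum((b - c) ** 2)
```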
We now prove Proposition 1, first for the OPMS strategy and then for OGMS.
Proof of Proposition 1 for OPMS. As mentioned earlier, this strategy is an application to the meta-learning setting of implicit online learning [32,33]. We follow here a proof from Chapter 11 in [3]. We refer the reader to [43] and the references therein for tighter bounds under stronger assumptions.
First, $\lambda_t$ is defined as the minimizer of a convex function in (5). So, the subdifferential of this function at $\lambda_t$ contains 0. In other words, there is a $z_t \in \partial L_{t-1}(\lambda_t)$ such that By convexity, for any $\lambda$, for any $z \in \partial L_{t-1}(\lambda_t)$, The choice $z = z_t$ gives: that is, where we used Lemma 1. Then, note that Combining this inequality with (46) gives Now, for any $x \in \mathbb{R}$, $-x^2/2 + xL - L^2/2 \le 0$. In particular, $\|z_t\| L - \|z_t\|^2/2 \le L^2/2$, and so the above can be rewritten: Summing the inequality for $t = 2$ to $T + 1$ leads to: This ends the proof.
Proof of Proposition 1 for OGMS. The beginning of the proof follows the proof of Theorem 11.1 in [3].
Note that we can rewrite (4) as Rearranging the first line, we obtain: By convexity, for any $\lambda$, that is, Lemma 1 gives: the last step being justified by: for any $\lambda \in \Lambda$. Plugging (56) into (54), we get: and the Lipschitz assumption gives: Summing the inequality for $t = 2$ to $T + 1$, we get: This ends the proof of the statement for OGMS.
We now provide a lemma that will be useful for the proof of Proposition 2.

Lemma 2.
Let G(u, v) be a convex function of (u, v) ∈ U × V. Define g(u) = inf v∈V G(u, v). Then g is convex.
Proof. Indeed, let $\lambda \in [0, 1]$ and $(x, y) \in U^2$, where the last two inequalities hold for any $(x', y') \in V^2$. Let us now take the infimum with respect to $(x', y') \in V^2$ on both sides; this gives: that is, $g$ is convex.
Proof of Proposition 2. Apply Lemma 2 to $u = (\vartheta, \gamma)$, $v = \theta$, $U = \Lambda$, $V = \Theta$ and This shows that $g(u) = L_t((\vartheta, \gamma))$ is convex with respect to $(\vartheta, \gamma)$. Additionally, $G$ is differentiable w.r.t. $u = (\vartheta, \gamma)$, so Proof of Theorem 1. Thanks to Assumption 2, we can apply Proposition 2. That is, Assumption 1 is satisfied, and we can apply Proposition 1. This gives: We use direct bounds for the last two terms: $\|\vartheta - \vartheta_1\|^2 \le 4C^2$ and $|\gamma - \gamma_1|^2 \le \bar{\gamma}^2 = C^4$. Then note that Upper bounding the infimum on $\vartheta$ in (71) by $\vartheta = \frac{1}{T}\sum_{s=1}^T \theta_s$ leads to The right-hand side of (74) is minimized with respect to $\alpha$ at $\alpha = \frac{C}{L}\sqrt{\frac{4 + C^2}{T}}$, which is the value proposed in the theorem, and we obtain: The infimum with respect to $\gamma$ in the right-hand side is reached for First, note that using $\underline{\gamma} = n^{-\beta}$. Then, using $\bar{\gamma} = C^2$ and $\sigma(\theta_1^T) \le 2C$. Plugging (77), (80) and the definition of $L$ into (75) gives where we took This ends the proof. Thus, we have Now, plugging into the right-hand side, we obtain: We see that the value $\alpha = \sqrt{2/(TL^2)}$ leads to: Rearranging terms and replacing $L$ by its value yields the statement of the theorem.

Conclusions
We proposed two simple meta-learning strategies together with their theoretical analysis. Our results clearly show an improvement on learning in isolation when the tasks are similar enough. These theoretical findings are confirmed by our numerical experiments. Important questions remain open. In [27], a purely online method is proposed, in the sense that it does not require storing all the information of the current task. In the case of OGA, this method allows one to learn the starting point; however, its application to learning the step size is not direct [28]. An important question is then: is there a purely online method that would provably improve on learning in isolation in this case? Another important question is the automatic calibration of $\Gamma$. Finally, as mentioned in Section 5, we believe that a very general and efficient meta-learning method for learning priors in Bayesian statistics (or in generalized Bayesian inference) would be extremely valuable in practice.