Phase Transitions in Transfer Learning for High-Dimensional Perceptrons

Oussama Dhifallah; Yue M. Lu

doi:10.3390/e23040400

John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA

* Author to whom correspondence should be addressed.

Entropy 2021, 23(4), 400; https://doi.org/10.3390/e23040400

This article belongs to the Special Issue The Role of Signal Processing and Information Theory in Modern Machine Learning

Abstract

Transfer learning seeks to improve the generalization performance of a target task by exploiting the knowledge learned from a related source task. Central questions include deciding what information one should transfer and when transfer can be beneficial. The latter question is related to the so-called negative transfer phenomenon, where the transferred source information actually reduces the generalization performance of the target task. This happens when the two tasks are sufficiently dissimilar. In this paper, we present a theoretical analysis of transfer learning by studying a pair of related perceptron learning tasks. Despite the simplicity of our model, it reproduces several key phenomena observed in practice. Specifically, our asymptotic analysis reveals a phase transition from negative transfer to positive transfer as the similarity of the two tasks moves past a well-defined threshold.

Keywords:

transfer learning; statistics; phase transitions

1. Introduction

Transfer learning [1,2,3,4,5] is a promising approach to improving the performance of machine learning tasks. It does so by exploiting the knowledge gained from a previously learned model, referred to as the source task, to improve the generalization performance of a related learning problem, referred to as the target task. One particular challenge in transfer learning is to avoid so-called negative transfer [6,7,8,9], where the transferred source information reduces the generalization performance of the target task. Recent literature [6,7,8,9] shows that negative transfer is closely related to the similarity between the source and target tasks. Transfer learning may hurt the generalization performance if the tasks are sufficiently dissimilar.

In this paper, we present a theoretical analysis of transfer learning by studying a pair of related perceptron learning tasks. Despite the simplicity of our model, it reproduces several key phenomena observed in practice. Specifically, the model reveals a sharp phase transition from negative transfer to positive transfer (i.e., when transfer becomes helpful) as a function of the model similarity.

1.1. Models and Learning Formulations

We start by describing the models for our theoretical study. We assume that the source task has a collection of training data

{(a_{s, i}, y_{s, i})}_{i = 1}^{n_{s}}

, where

a_{s, i} \in R^{p}

is the source feature vector and

y_{s, i} \in R

denotes the label corresponding to

a_{s, i}

. Following the standard teacher–student paradigm, we assume that the labels

{y_{s, i}}_{i = 1}^{n_{s}}

are generated according to the following model:

\begin{matrix} y_{s, i} = φ (a_{s, i}^{⊤} ξ_{s}), \forall i \in {1, \dots, n_{s}}, \end{matrix}

(1)

where

φ (\cdot)

is a scalar deterministic or probabilistic function and

ξ_{s} \in R^{p}

is an unknown source teacher vector.

Similar to the source task, the target task has access to a different collection of training data

{(a_{t, i}, y_{t, i})}_{i = 1}^{n_{t}}

, generated according to

\begin{matrix} y_{t, i} = φ (a_{t, i}^{⊤} ξ_{t}), \forall i \in {1, \dots, n_{t}}, \end{matrix}

(2)

where

ξ_{t} \in R^{p}

is an unknown target teacher vector. We measure the similarity of the two tasks using

\begin{matrix} ρ \overset{def}{=} \frac{ξ_{t}^{⊤} ξ_{s}}{∥ ξ_{t} ∥ ∥ ξ_{s} ∥}, \end{matrix}

(3)

with

ρ = 0

indicating two uncorrelated tasks whereas

ρ = 1

means that the tasks are perfectly aligned.

For the source task, we learn the optimal weight vector

{\hat{w}}_{s}

by solving a convex optimization problem:

\begin{matrix} {\hat{w}}_{s} = \underset{w \in R^{p}}{argmin} \frac{1}{p} \sum_{i = 1}^{n_{s}} ℓ (y_{s, i}; a_{s, i}^{⊤} w) + \frac{λ}{2} {∥ w ∥}^{2}, \end{matrix}

(4)

where

λ \geq 0

is a regularization parameter and

ℓ (.; .)

denotes some general loss function that can take one of the following two forms:

\begin{cases} ℓ (y; x) = \hat{ℓ} (y - x), & \text{for regression tasks}, \\ ℓ (y; x) = \hat{ℓ} (y x), & \text{for classification tasks}, \end{cases}

(5)

where

\hat{ℓ} (.)

is a convex function.
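For the squared loss, the source problem (4) reduces to a linear system, which is convenient for small-scale experiments. The following NumPy sketch illustrates this under stated assumptions: the function name `fit_source_squared_loss`, the problem sizes, the unit-norm teacher, and the ReLU choice for φ are illustrative and not prescribed by the model.

```python
import numpy as np

def fit_source_squared_loss(A, y, lam):
    """Solve (4) for the squared loss l(y; x) = 0.5 * (y - x)**2.

    Setting the gradient to zero gives (A^T A / p + lam * I) w = A^T y / p.
    Assumes this matrix is nonsingular (e.g., lam > 0 or n_s >= p).
    """
    p = A.shape[1]
    gram = A.T @ A / p + lam * np.eye(p)
    return np.linalg.solve(gram, A.T @ y / p)

rng = np.random.default_rng(0)
p, n_s, lam = 50, 200, 0.1               # illustrative sizes
xi_s = rng.standard_normal(p)
xi_s /= np.linalg.norm(xi_s)             # unit-norm source teacher (assumption)
A_s = rng.standard_normal((n_s, p))      # Gaussian feature vectors
y_s = np.maximum(A_s @ xi_s, 0.0)        # labels from model (1) with phi = ReLU
w_s = fit_source_squared_loss(A_s, y_s, lam)

# First-order optimality: the gradient of the objective in (4) vanishes at w_s.
grad = -A_s.T @ (y_s - A_s @ w_s) / p + lam * w_s
```

For losses without a closed form, such as the logistic loss, a generic convex solver would be needed instead.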

In this paper, we consider a common strategy in transfer learning [4], which consists of transferring the optimal source vector, i.e.,

{\hat{w}}_{s}

, to the target task. One popular approach is to fix a (random) subset of the target weights to values of the corresponding optimal weights learned during the source training process [10]. In our learning model, this amounts to the following target learning formulation:

\begin{matrix} {\hat{w}}_{t} = & \underset{w \in R^{p}}{argmin} \frac{1}{p} \sum_{i = 1}^{n_{t}} ℓ (y_{t, i}; a_{t, i}^{⊤} w) + \frac{λ}{2} {∥ w ∥}^{2} \end{matrix}

(6)

\begin{matrix} s . t . Q w = Q {\hat{w}}_{s} . \end{matrix}

(7)

The vector

{\hat{w}}_{s}

is the optimal solution of the source learning problem, and

Q \in R^{p \times p}

is a diagonal matrix with diagonal entries drawn independently from a Bernoulli distribution with probability

δ = m / p \leq 1

. Here, m denotes the number of transferred components. Thus, on average, we retain

δ p

entries from the optimal source vector

{\hat{w}}_{s}

. In addition to a possible improvement in the generalization performance, this approach can considerably lower the computational complexity of the target learning task by reducing the number of free optimization variables. In what follows, we refer to

δ

as the transfer rate and call (6) the hard transfer formulation.
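To make the constraint (7) concrete, the sketch below solves the hard transfer problem for the squared loss: the coordinates selected by Q are copied from the source vector, and a reduced ridge problem is solved over the remaining ones. The function name, the linear choice of φ, and the random stand-in for the source solution are assumptions made only for illustration.

```python
import numpy as np

def hard_transfer_squared_loss(A, y, w_s, mask, lam):
    """Solve (6)-(7) for the squared loss l(y; x) = 0.5 * (y - x)**2.

    mask[j] = True means Q_jj = 1, i.e., coordinate j is copied from w_s.
    Assumes the reduced linear system is nonsingular (e.g., lam > 0).
    """
    p = A.shape[1]
    free = ~mask
    w = np.where(mask, w_s, 0.0)
    if free.any():
        # Part of the labels not explained by the frozen coordinates.
        residual = y - A[:, mask] @ w_s[mask]
        A_free = A[:, free]
        gram = A_free.T @ A_free / p + lam * np.eye(int(free.sum()))
        w[free] = np.linalg.solve(gram, A_free.T @ residual / p)
    return w

rng = np.random.default_rng(1)
p, n_t, lam, delta = 40, 120, 0.1, 0.5        # illustrative sizes
A_t = rng.standard_normal((n_t, p))
xi_t = rng.standard_normal(p) / np.sqrt(p)
y_t = A_t @ xi_t                              # linear phi, for illustration only
w_s = rng.standard_normal(p) / np.sqrt(p)     # hypothetical stand-in for the source solution
mask = rng.random(p) < delta                  # Bernoulli(delta) diagonal of Q
w_t = hard_transfer_squared_loss(A_t, y_t, w_s, mask, lam)
```

Only the free coordinates enter the linear solve, which reflects the reduction in the number of optimization variables mentioned above.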

Another popular approach in transfer learning is to search for target weight vectors in the vicinity of the optimal source weight vector

{\hat{w}}_{s}

. This can be achieved by adding a regularization term to the target formulation [11,12], which in our model becomes

{\hat{w}}_{t} = \underset{w \in R^{p}}{argmin} \frac{1}{p} \sum_{i = 1}^{n_{t}} ℓ (y_{t, i}; a_{t, i}^{⊤} w) + \frac{λ}{2} {∥ w ∥}^{2} + \frac{1}{2} {∥ Σ (w - {\hat{w}}_{s}) ∥}^{2},

(8)

with

Σ \in R^{p \times p}

denoting some weighting matrix. In what follows, we refer to (8) as the soft transfer formulation, since it relaxes the strict equality in (6). In fact, the hard transfer in (6) is just a special case of the soft transfer formulation, if we set

Σ

to be a diagonal matrix in which the diagonal entries are either

+ \infty

(with probability

δ

) or 0 (with probability

1 - δ

).
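For the squared loss, the soft transfer problem (8) also admits a closed-form solution, which the following sketch uses. The function name, the data model, and the random stand-in for the source solution are illustrative assumptions; the choice Σ = I_p / √5 mirrors the setting reported for Figure 1a.

```python
import numpy as np

def soft_transfer_squared_loss(A, y, w_s, Sigma, lam):
    """Solve (8) for the squared loss l(y; x) = 0.5 * (y - x)**2.

    Stationarity gives
    (A^T A / p + lam * I + Sigma^T Sigma) w = A^T y / p + Sigma^T Sigma w_s.
    """
    p = A.shape[1]
    Lam = Sigma.T @ Sigma
    lhs = A.T @ A / p + lam * np.eye(p) + Lam
    rhs = A.T @ y / p + Lam @ w_s
    return np.linalg.solve(lhs, rhs)

rng = np.random.default_rng(2)
p, n_t, lam = 40, 120, 0.1                   # illustrative sizes
A_t = rng.standard_normal((n_t, p))
xi_t = rng.standard_normal(p) / np.sqrt(p)
y_t = A_t @ xi_t                             # linear phi, for illustration only
w_s = rng.standard_normal(p) / np.sqrt(p)    # hypothetical stand-in for the source solution

Sigma = np.eye(p) / np.sqrt(5.0)
w_soft = soft_transfer_squared_loss(A_t, y_t, w_s, Sigma, lam)

# A very large weighting matrix pins the solution to w_s (the hard limit with delta = 1).
w_pinned = soft_transfer_squared_loss(A_t, y_t, w_s, 1e4 * np.eye(p), lam)
```

Setting Σ to the zero matrix recovers standard ridge learning without transfer.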

To measure the performance of the transfer learning methods, we use the generalization error of the target task. Given a new data sample

(a_{t, new}, y_{t, new})

with

y_{t, new} = φ (ξ_{t}^{⊤} a_{t, new})

, we assume that the target task predicts the corresponding label as

{\hat{y}}_{t, new} = \hat{φ} [{\hat{w}}_{t}^{⊤} a_{t, new}],

(9)

where

\hat{φ} (\cdot)

is a predefined scalar function that might be different from

φ (\cdot)

. We then calculate the generalization error of the target task as

\begin{matrix} E_{test} = \frac{1}{4^{υ}} E [{(y_{t, new} - \hat{φ} ({\hat{w}}_{t}^{⊤} a_{t, new}))}^{2}], \end{matrix}

(10)

where the expectation is taken with respect to the new data

(a_{t, new}, y_{t, new})

. The variable

υ

allows us to write a more compact formula:

υ

is taken to be 0 for a regression problem and

υ = 1

for a binary classification problem. Finally, we use the training error

\begin{matrix} E_{train} = \frac{1}{p} \sum_{i = 1}^{n_{t}} ℓ (y_{t, i}; a_{t, i}^{⊤} {\hat{w}}_{t}) + \frac{1}{2} {∥ Σ ({\hat{w}}_{t} - {\hat{w}}_{s}) ∥}^{2}, \end{matrix}

to quantify the performance of the training process. Here, the training error is measured on the training data and excludes the ridge regularization term \frac{λ}{2} {∥ w ∥}^{2}.
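In simulations, the expectation in (10) can be approximated by averaging over fresh Gaussian samples. The sketch below does this for deterministic φ and φ̂; the function name and the sample sizes are assumptions, and a probabilistic φ would require its own source of randomness.

```python
import numpy as np

def monte_carlo_test_error(w_t, xi_t, phi, phi_hat, upsilon, n_new, rng):
    """Estimate E_test in (10) with n_new fresh Gaussian feature vectors.

    upsilon = 0 for regression and upsilon = 1 for binary classification.
    """
    p = xi_t.shape[0]
    A_new = rng.standard_normal((n_new, p))
    y_new = phi(A_new @ xi_t)          # true labels from the target teacher
    y_hat = phi_hat(A_new @ w_t)       # predictions of the learned vector
    return float(np.mean((y_new - y_hat) ** 2) / 4 ** upsilon)

rng = np.random.default_rng(3)
p = 30
xi_t = rng.standard_normal(p)
xi_t /= np.linalg.norm(xi_t)

# Sanity checks with the sign function: a perfectly aligned vector makes no
# errors, and a sign-flipped vector misclassifies every sample.
err_aligned = monte_carlo_test_error(xi_t, xi_t, np.sign, np.sign, 1, 10000, rng)
err_flipped = monte_carlo_test_error(-xi_t, xi_t, np.sign, np.sign, 1, 10000, rng)
```

With υ = 1 and sign-valued labels, the normalization by 4 makes the estimate equal to the fraction of misclassified samples.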

1.2. Main Contributions

The main contributions of this paper are two-fold, as summarized below:

1.2.1. Precise Asymptotic Analysis

We present a precise asymptotic analysis of the transfer learning approaches introduced in (6) and (8) for Gaussian feature vectors and under regularity conditions on the eigenvalue distribution of the weighting matrix

Σ

. Specifically, we show that, as the dimensions

p, n_{s}, n_{t}

grow to infinity with the ratios

α_{s} = n_{s} / p, α_{t} = n_{t} / p

fixed, the generalization errors of the hard and soft formulations can be exactly characterized by the solutions of two low-dimensional deterministic optimization problems. (See Theorem 1 and Corollary 1 for details.) Our asymptotic predictions hold for any convex loss functions used in the training process, including the squared loss for regression problems and logistic loss commonly used for binary classification problems.

As illustrated in Figure 1, our theoretical predictions (drawn as solid lines in the figures) are in excellent agreement with the actual performance (shown as circles) of the transfer learning problem. Figure 1a considers a binary classification setting with logistic loss, and we plot the generalization errors of different transfer approaches as a function of the target data/dimension ratio

α_{t} = n_{t} / p

. We can see that the hard transfer formulation (6) is only useful when

α_{t}

is small. In fact, we encounter negative transfer (i.e., hard transfer performing worse than no transfer) when

α_{t}

becomes sufficiently large. Moreover, the soft transfer formulation (8) seems to achieve more favorable generalization errors compared to the hard formulation. In Figure 1b, we consider a regression setting with a squared loss and explore the impact of different weighting schemes on the performance of the soft formulation. We can see that the soft formulation indeed considerably improves the generalization performance of the standard learning method (i.e., learning the target task without any knowledge transfer).

Figure 1. Theoretical predictions vs. numerical simulations obtained by averaging over 100 independent Monte Carlo trials with dimension

p = 2500

. (a) Binary classification with logistic loss. We take

α_{s} = 10 α_{t}

,

λ = 0.3

,

Σ = I_{p} / \sqrt{5}

, and

ρ = 0.85

, where

α_{s} = n_{s} / p

and

α_{t} = n_{t} / p

. The functions

φ (\cdot)

and

\hat{φ} (\cdot)

are both the sign function. For hard transfer, we set the transfer rate to be

δ = 0.5

. Full source transfer corresponds to

δ = 1.0

, whereas no transfer corresponds to

δ = 0

. (b) Nonlinear regression using quadratic loss, where

φ (\cdot)

is the ReLU function and

\hat{φ} (\cdot)

is the identity function. Soft identity, beta, and uniform matrices refer to different choices of the weighting matrix in (8). Soft Identity Matrix:

Σ

is an identity matrix. Soft Uniform Matrix:

Σ

is a random matrix with diagonal elements drawn from the uniform distribution. Soft Beta Matrix:

Σ

is a random matrix with diagonal elements drawn from the beta distribution. We scale all diagonal elements of

Σ

to have the same mean. We also take

α_{s} = 10 α_{t}

,

λ = 0.1

, and

ρ = 0.8

.

1.2.2. Phase Transitions

Our asymptotic characterizations reveal a phase transition phenomenon in the hard transfer formulation. Let

δ^{★} = \underset{0 \leq δ \leq 1}{argmin} E_{test} (δ),

be the optimal transfer rate that minimizes the generalization error of the target task. Clearly,

δ^{★} = 0

corresponds to the negative transfer regime, where transferring the knowledge of the source task will actually hurt the performance of the target task. In contrast,

δ^{★} > 0

signifies that we have entered the positive transfer regime, where transfer becomes helpful.

Figure 2a illustrates the phase transition from negative to positive transfer regimes in a binary classification setting, as the similarity

ρ

between the two tasks moves past a critical threshold. Similar phase transition phenomena also appear in nonlinear regression, as shown in Figure 2b. Interestingly, for this setting, the optimal transfer rate jumps from

δ^{★} = 0

to

δ^{★} = 1

at the transition threshold.

Figure 2. Phase transitions of the hard transfer formulation. When the similarity

ρ

between the two tasks is small, we are in the negative transfer regime, where we should not transfer the knowledge from the source task. However, as

ρ

moves past a critical threshold, we enter the positive transfer regime. (a) Binary classification with squared loss, with parameters

α_{t} = 2

,

α_{s} = 2 α_{t}

, and

λ = 0

. Both

φ (\cdot)

and

\hat{φ} (\cdot)

are the sign function. (b) Nonlinear regression with squared loss, with parameters

α_{t} = 2

,

α_{s} = 2 α_{t}

, and

λ = 0

.

φ (.)

is the ReLU function and

\hat{φ} (.)

is the identity function.

For general loss functions, the exact locations of the phase transitions can only be found numerically by solving the deterministic optimization problems in our asymptotic characterizations. For the special case of squared loss with no regularization, however, we are able to obtain the following simple analytical characterization for the phase transition threshold: We are in the positive transfer regime if and only if

ρ > ρ_{c} (α_{s}, α_{t}) = 1 - \frac{E [φ^{2} (z)] - E^{2} [z φ (z)]}{2 E^{2} [z φ (z)]} (\frac{1}{α_{t} - 1} - \frac{1}{α_{s} - 1}),

(11)

where z is a standard Gaussian random variable. This result is shown in Proposition 1.

By the Cauchy–Schwarz inequality,

E [φ^{2} (z)] \geq E^{2} [z φ (z)]

. It follows that

ρ_{c} (α_{s}, α_{t})

is an increasing function of

α_{t}

and a decreasing function of

α_{s}

. This property is consistent with our intuition: As we increase

α_{t}

, the target task has more training data to work with, and thus, we should set a higher bar in terms of when to transfer knowledge. As we increase

α_{s}

, the quality of the optimal source vector becomes better, in which case, we can start the transfer at a lower similarity level. In particular, when

α_{t} > α_{s}

, we have

ρ_{c} (α_{s}, α_{t}) > 1

and, thus, the inequality in (11) is never satisfied (because

| ρ | \leq 1

by definition). This indicates that no transfer should be done when the target task has more training data than the source task.
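The threshold in (11) depends on φ only through two Gaussian moments, so it is straightforward to evaluate numerically. The sketch below approximates these moments with a Riemann sum on a fine grid; the helper names and the grid parameters are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_expectation(f, half_width=10.0, num_points=20001):
    """Approximate E[f(z)] for z ~ N(0, 1) by a Riemann sum on a uniform grid."""
    z = np.linspace(-half_width, half_width, num_points)
    dz = z[1] - z[0]
    density = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return float(np.sum(f(z) * density) * dz)

def rho_c(phi, alpha_s, alpha_t):
    """Phase transition threshold (11); requires alpha_s > 1 and alpha_t > 1."""
    second_moment = gaussian_expectation(lambda z: phi(z) ** 2)   # E[phi^2(z)]
    correlation = gaussian_expectation(lambda z: z * phi(z))      # E[z phi(z)]
    ratio = (second_moment - correlation ** 2) / (2.0 * correlation ** 2)
    return 1.0 - ratio * (1.0 / (alpha_t - 1.0) - 1.0 / (alpha_s - 1.0))

def relu(z):
    return np.maximum(z, 0.0)

# Parameters of Figure 2b: alpha_t = 2 and alpha_s = 2 * alpha_t.
threshold = rho_c(relu, alpha_s=4.0, alpha_t=2.0)
# Target task with more data than the source task: the threshold exceeds 1.
threshold_reversed = rho_c(relu, alpha_s=2.0, alpha_t=4.0)
```

For the ReLU function, E[φ^2(z)] = E[z φ(z)] = 1/2, so the prefactor in (11) equals 1/2 and the threshold for α_t = 2, α_s = 4 evaluates to 2/3.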

1.3. Related Work

The idea of transferring information between different domains or different tasks was first proposed in [1] and further developed in [2]. It has been attracting significant interest in recent literature [4,5,6,7,8,9,11,12]. While most work focuses on the practical aspects of transfer learning, there have been several studies (e.g., [13,14]) that seek to provide analytical understandings of transfer learning in simplified models. Our work is particularly related to [14], which considers a transfer learning model similar to ours but for the special case of linear regression. The analysis in this paper is more general as it considers arbitrary convex loss functions. We would also like to mention an interesting recent work that studies a different but related setting referred to as knowledge distillation [15].

In terms of technical tools, our asymptotic predictions are derived using the convex Gaussian min–max theorem (CGMT). The CGMT was first introduced in [16] and further developed in [17]. It extends a Gaussian comparison inequality first introduced in [18]. In particular, it uses convexity properties to show the equivalence between two Gaussian processes. The CGMT has been successfully used to analyze convex regression formulations [17,19,20] and convex classification formulations [21,22,23,24].

1.4. Organization

The rest of this paper is organized as follows. Section 2 states the technical assumptions under which our results are obtained. Section 3 provides an asymptotic characterization of the soft transfer formulation. Precise analysis of the hard transfer formulation is presented in Section 4. We provide remarks about our approach in Section 5. Our theoretical predictions hold for general convex loss functions. We specialize these results to the settings of nonlinear regression and binary classification in Section 6, where we also provide additional numerical results to validate our predictions. Section 7 provides detailed proofs of the technical statements introduced in Section 3 and Section 4. Section 8 concludes the paper. The Appendix provides additional technical details.

2. Technical Assumptions

The theoretical analysis of this paper is carried out under the following assumptions.

Assumption 1

(Gaussian Feature Vectors). The feature vectors

{a_{s, i}}_{i = 1}^{n_{s}}

and

{a_{t, i}}_{i = 1}^{n_{t}}

are drawn independently from a standard Gaussian distribution. The vector

ξ_{s} \in R^{p}

can be expressed as

ξ_{s} = ρ ξ_{t} + \sqrt{1 - ρ^{2}} ξ_{r}

, where the vectors

ξ_{t} \in R^{p}

and

ξ_{r} \in R^{p}

are independent from the feature vectors, and they are generated independently from a uniform distribution on the unit sphere.
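The construction of the teacher vectors in Assumption 1 can be sketched as follows. Since the two vectors are independent and uniform on the unit sphere, they are nearly orthogonal in high dimensions, so the empirical similarity (3) is close to ρ. The function name and the dimension are illustrative assumptions.

```python
import numpy as np

def sample_teachers(p, rho, rng):
    """Draw (xi_s, xi_t) with xi_s = rho * xi_t + sqrt(1 - rho^2) * xi_r."""
    xi_t = rng.standard_normal(p)
    xi_t /= np.linalg.norm(xi_t)      # uniform on the unit sphere
    xi_r = rng.standard_normal(p)
    xi_r /= np.linalg.norm(xi_r)      # independent, uniform on the unit sphere
    xi_s = rho * xi_t + np.sqrt(1.0 - rho ** 2) * xi_r
    return xi_s, xi_t

rng = np.random.default_rng(4)
rho = 0.8
xi_s, xi_t = sample_teachers(p=20000, rho=rho, rng=rng)

# Empirical similarity as defined in (3); it approaches rho as p grows.
similarity = xi_t @ xi_s / (np.linalg.norm(xi_t) * np.linalg.norm(xi_s))
```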

Moreover, our results are valid in a high-dimensional asymptotic setting, where the dimensions p,

n_{s}

,

n_{t}

, and m grow to infinity at fixed ratios.

Assumption 2

(High-dimensional Asymptotic). The number of samples and the number of transferred components in hard transfer satisfy

n_{s} = n_{s} (p)

,

n_{t} = n_{t} (p)

, and

m = m (p)

, with

α_{s, p} = n_{s} (p) / p \to α_{s} > 0

,

α_{t, p} = n_{t} (p) / p \to α_{t} > 0

, and

δ_{p} = m (p) / p \to δ > 0

as

p \to \infty

.

The CGMT framework makes specific assumptions about the loss function and the feasibility sets. To guarantee these assumptions, this paper considers a family of loss functions that satisfy the following conditions. Note that the assumption is stated for the target task, but we assume that it is also valid for the source task.

Assumption 3

(Loss Function). If

λ > 0

, the loss function

ℓ (y; .)

defined in (5) is a proper convex function in

R

. If

λ = 0

, the loss function

ℓ (y; .)

defined in (5) is a proper strongly convex function in

R

, where the constant

S > 0

is a strong convexity parameter. In this case, we only consider the case when

α_{t} > 1

. Define a random function

L (x) = \sum_{i = 1}^{n_{t}} ℓ (y_{i}; x_{i})

, where

y_{i} \sim φ (z_{i})

, with

\{z_{i}\}

being a collection of independent standard normal random variables and ∼ denoting equality in distribution. Denote by

\partial L

the sub-differential set of

L (x)

. Then, for any constant

C > 0

, there exists a constant

R > 0

such that

\begin{cases} P (\sup_{∥ v ∥ \leq C \sqrt{n_{t}}} \sup_{s \in \partial L (v)} ∥ s ∥ \leq R \sqrt{n_{t}}) \overset{p \to \infty}{\to} 1, \\ P (\sup_{∥ v ∥ \leq C \sqrt{n_{t}}} | L (v) | \leq R n_{t}) \overset{p \to \infty}{\to} 1 . \end{cases}

(12)

Furthermore, we consider the following assumption to guarantee that the generalization error defined in (10) concentrates in the large system limit.

Assumption 4

(Regularity Conditions). The data-generating function

φ (\cdot)

is independent from the feature vectors. Moreover, the following conditions are satisfied.

$φ (\cdot)$ and $\hat{φ} (\cdot)$ are continuous almost everywhere in $R$. For every $h > 0$ and $z \sim N (0, h)$, we have $0 < E [φ^{2} (z)] < + \infty$ and $0 < E [{\hat{φ}}^{2} (z)] < + \infty$.
For any compact interval $[c, C]$, there exists a function $g (\cdot)$ such that $\sup_{h \in [c, C]} {| \hat{φ} (h x) |}^{2} \leq g (x)$ for all $x \in R$. Additionally, the function $g (\cdot)$ satisfies $E [g^{2} (z)] < + \infty$, where $z \sim N (0, 1)$.

Finally, we introduce the following assumption to guarantee that the training and generalization errors of the soft formulation can be asymptotically characterized by deterministic optimization problems.

Assumption 5

(Weighting Matrix). Let

Λ = Σ^{⊤} Σ

, where Σ is the weighting matrix in the soft transfer formulation. Let

σ_{\min, 1} (Λ)

and

σ_{\min, 2} (Λ)

denote its two smallest eigenvalues. There exists a constant

μ_{\min} \geq 0

such that

\begin{cases} σ_{\min, 1} (Λ) \overset{p}{\to} μ_{\min}, \\ | σ_{\min, 1} (Λ) - σ_{\min, 2} (Λ) | \overset{p}{\to} 0 . \end{cases}

(13)

Moreover, we assume that the empirical distribution of the eigenvalues of the matrix Λ converges weakly to a probability distribution

P_{μ} (\cdot)

.

The above assumptions are essential to show that the soft formulation in (8) concentrates in the large system limit. We provide more details about these assumptions in Appendix A.

3. Sharp Asymptotic Analysis of Soft Transfer Formulation

In this section, we study the asymptotic properties of the soft transfer formulation. Specifically, we provide a precise characterization of the training and generalization errors corresponding to (8).

The asymptotic performance of the source formulation defined in (4) has been studied in the literature [24]. In particular, it has been shown that the asymptotic limit of the source formulation in (4) can be quantified by the following deterministic optimization problem:

\min_{q_{s}, r_{s} \geq 0} \sup_{σ_{s} > 0} α_{s} E [M_{ℓ (Y_{s}, .)} (r_{s} H_{s} + q_{s} S_{s}; \frac{r_{s}}{σ_{s}})] - \frac{r_{s} σ_{s}}{2} + \frac{λ}{2} (q_{s}^{2} + r_{s}^{2}),

(14)

where

Y_{s} = φ (S_{s})

and

H_{s}

and

S_{s}

are two independent standard Gaussian random variables. Furthermore, the function

M_{ℓ (Y_{s}, .)}

introduced in the scalar optimization problem (14) is the Moreau envelope function defined as

\begin{matrix} M_{ℓ (y, .)} (a; b) = min_{c \in R} ℓ (y; c) + \frac{1}{2 b} {(c - a)}^{2} . \end{matrix}

(15)

The expectation in (14) is taken over the random variables

H_{s}

and

S_{s}

.
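The Moreau envelope in (15) is a one-dimensional minimization and can be evaluated numerically for any convex loss. The sketch below uses a brute-force grid search and checks it against the closed form available for the squared loss, M(a; b) = (y - a)^2 / (2 (1 + b)). The function names, the grid, and the standard form log(1 + exp(-y x)) of the logistic loss are assumptions made for illustration.

```python
import numpy as np

def moreau_envelope(loss, y, a, b, grid):
    """Evaluate (15) by brute-force minimization over a one-dimensional grid."""
    values = loss(y, grid) + (grid - a) ** 2 / (2.0 * b)
    return float(values.min())

def squared_loss(y, x):
    return 0.5 * (y - x) ** 2

def logistic_loss(y, x):
    # log(1 + exp(-y x)), computed in a numerically stable way.
    return np.logaddexp(0.0, -y * x)

grid = np.linspace(-10.0, 10.0, 200001)
y, a, b = 1.0, -0.5, 2.0                       # illustrative arguments

numeric = moreau_envelope(squared_loss, y, a, b, grid)
closed_form = (y - a) ** 2 / (2.0 * (1.0 + b))  # squared-loss closed form

# The logistic loss has no closed-form envelope, but the same routine applies.
logistic_value = moreau_envelope(logistic_loss, y, a, b, grid)
```

A grid search is used only for transparency; a scalar convex solver would be more efficient when the envelope has to be evaluated inside the expectations in (14) and (16).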

In our work, we focus on the target problem with soft transfer, as formulated in (8). It turns out that the asymptotic performance of the target problem can also be characterized by a deterministic optimization problem:

\begin{aligned} \min_{q_{t}, r_{t} \geq 0} \sup_{σ_{t} > - μ_{\min}} \; & - \frac{σ_{t} r_{t}^{2}}{2} + \frac{1}{2} ((1 - ρ^{2}) {(q_{s}^{★})}^{2} + {(r_{s}^{★})}^{2}) T_{2} (σ_{t}) \\ & + α_{t} E [M_{ℓ (Y_{t}, .)} (r_{t} H_{t} + q_{t} S_{t}; T_{1} (σ_{t}))] + \frac{λ}{2} (q_{t}^{2} + r_{t}^{2}) \\ & - \frac{1}{2} {(q_{t} - ρ q_{s}^{★})}^{2} (σ_{t} - 1 / T_{1} (σ_{t})), \end{aligned}

(16)

where

Y_{t} = φ (S_{t})

, and

H_{t}

and

S_{t}

are independent standard Gaussian random variables. Additionally,

μ_{\min}

represents the minimum value of the random variable with distribution

P_{μ} (.)

as defined in Assumption 5. In the formulation (16), the constants

q_{s}^{★}

and

r_{s}^{★}

are optimal solutions of the asymptotic formulation given in (14). Moreover, the functions

T_{1} (.)

and

T_{2} (.)

are defined as follows:

\begin{matrix} T_{1} (σ_{t}) = E_{μ} [1 / (μ + σ_{t})], T_{2} (σ_{t}) = E_{μ} [μ σ_{t} / (μ + σ_{t})], \end{matrix}

where the expectations are taken over the probability distribution

P_{μ} (.)

defined in Assumption 5.
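In practice, the expectations defining T_1 and T_2 can be approximated by averaging over the eigenvalues of Λ = Σ^⊤ Σ. The sketch below does so for the weighting matrix Σ = I_p / √5 used in Figure 1a, for which every eigenvalue equals 1/5; the function names and the value of σ_t are illustrative assumptions.

```python
import numpy as np

def T1(sigma_t, mu):
    """Empirical version of E_mu[1 / (mu + sigma_t)]."""
    return float(np.mean(1.0 / (mu + sigma_t)))

def T2(sigma_t, mu):
    """Empirical version of E_mu[mu * sigma_t / (mu + sigma_t)]."""
    return float(np.mean(mu * sigma_t / (mu + sigma_t)))

p = 1000
sigma_diag = np.full(p, 1.0 / np.sqrt(5.0))   # diagonal of Sigma = I_p / sqrt(5)
mu = sigma_diag ** 2                          # eigenvalues of Lambda = Sigma^T Sigma

sigma_t = 1.0
t1 = T1(sigma_t, mu)   # equals 1 / (0.2 + sigma_t) for this Sigma
t2 = T2(sigma_t, mu)   # equals 0.2 * sigma_t / (0.2 + sigma_t) for this Sigma
```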

Theorem 1

(Precise Analysis of the Soft Transfer). Suppose that Assumptions 1–5 are satisfied. Then, the training error corresponding to the soft transfer formulation in (8) converges in probability as follows:

\begin{matrix} E_{train} \overset{p \to \infty}{\to} C_{t}^{★} - \frac{λ}{2} ({(q_{t}^{★})}^{2} + {(r_{t}^{★})}^{2}), \end{matrix}

(17)

where

C_{t}^{★}

denotes the minimum value achieved by the scalar formulation introduced in (16), and

q_{t}^{★}

and

r_{t}^{★}

are optimal solutions of the scalar formulation in (16). Moreover, the generalization error introduced in (10) corresponding to the soft transfer formulation converges in probability as follows:

\begin{matrix} E_{test} \overset{p \to \infty}{\to} \frac{1}{4^{υ}} E [{(φ (ν_{1}) - \hat{φ} (ν_{2}))}^{2}], \end{matrix}

(18)

where

ν_{1}

and

ν_{2}

are two jointly Gaussian random variables with zero mean and a covariance matrix given by

\begin{bmatrix} 1 & q_{t}^{★} \\ q_{t}^{★} & {(q_{t}^{★})}^{2} + {(r_{t}^{★})}^{2} \end{bmatrix} .
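As an illustration of how (18) is evaluated, consider the case where φ and φ̂ are both the sign function and υ = 1. The limit is then the probability that ν_1 and ν_2 have opposite signs, which for zero-mean jointly Gaussian variables equals arccos(c) / π, with c their correlation coefficient. This is a standard Gaussian identity rather than a formula quoted from the theorem. The sketch below compares it with a Monte Carlo estimate; the values of q and r are hypothetical placeholders, not solutions of (16).

```python
import numpy as np

def sign_test_error(q, r):
    """Limit (18) for phi = phi_hat = sign and upsilon = 1 (requires r >= 0, (q, r) != 0)."""
    correlation = q / np.hypot(q, r)     # corr(nu_1, nu_2) from the covariance matrix
    return float(np.arccos(correlation) / np.pi)

rng = np.random.default_rng(5)
q, r = 0.8, 0.6                          # hypothetical values of q_t^* and r_t^*
S = rng.standard_normal(200000)
H = rng.standard_normal(200000)
nu_1 = S                                 # variance 1
nu_2 = q * S + r * H                     # variance q^2 + r^2, covariance q with nu_1

monte_carlo = float(np.mean((np.sign(nu_1) - np.sign(nu_2)) ** 2) / 4.0)
closed_form = sign_test_error(q, r)
```

For other choices of φ and φ̂, the same sampling scheme for (ν_1, ν_2) can be reused and only the last two lines change.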

The proof of Theorem 1 is based on the CGMT framework [17] (Theorem 6.1). A detailed proof is provided in Section 7.3. The statements in Theorem 1 are valid for a general convex loss function and general learning models that can be expressed as in (1) and (2). The analysis in Section 7.3 shows that the deterministic problems in (14) and (16) are the asymptotic limits of the source and target formulations given in (4) and (8), respectively. Moreover, it shows that the deterministic problems (14) and (16) are strictly convex in the minimization variables. This implies the uniqueness of the optimal solutions of the minimization problems.

Remark 1.

The results of the theorem show that the training and generalization errors corresponding to the soft transfer formulation can be fully characterized using the optimal solutions of the scalar formulation in (16). Moreover, from its definition, (16) depends on the optimal solutions of the scalar formulation in (14) of the source task. This shows that the precise asymptotic performance of the soft transfer formulation can be characterized after solving two scalar deterministic problems.

4. Sharp Asymptotic Analysis of Hard Transfer Formulation

In this section, we study the asymptotic properties of the hard transfer formulation. We then use these predictions to rigorously prove the existence of phase transitions from negative to positive transfer.

4.1. Asymptotic Predictions

As mentioned earlier, the hard transfer formulation can be recovered from (8) as a special case where the eigenvalues of the matrix

Λ

are

+ \infty

with probability

δ

and 0 otherwise. Thus, we obtain the following result as a simple consequence of Theorem 1.

Corollary 1.

Suppose that Assumptions 1–4 are satisfied. Then, the asymptotic limit of the hard formulation defined in (6) is given by the following deterministic formulation:

\begin{aligned} \min_{q_{t}, r_{t} \geq 0} \sup_{σ > 0} \; & \frac{λ}{2} (q_{t}^{2} + r_{t}^{2}) + \frac{σ δ}{2} [(1 - ρ^{2}) {(q_{s}^{★})}^{2} + {(r_{s}^{★})}^{2}] \\ & + α_{t} E [M_{ℓ (Y_{t}, .)} (r_{t} H_{t} + q_{t} S_{t}; \frac{1 - δ}{σ})] - \frac{σ r_{t}^{2}}{2} \\ & + \frac{σ δ}{2 (1 - δ)} {(q_{t} - ρ q_{s}^{★})}^{2} . \end{aligned}

(19)

Additionally, the training and generalization errors associated with the hard formulation converge in probability to the limits given in (17) and (18), respectively.
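The reduction from (16) to (19) can be checked term by term. The following worked computation is a formal sketch: it treats the eigenvalue +∞ as a limit and assumes δ < 1, with μ equal to +∞ with probability δ and 0 with probability 1 - δ, so that μ_min = 0 and the constraint on σ becomes σ > 0.

```latex
\begin{align*}
T_{1}(\sigma) &= \mathbb{E}_{\mu}\!\left[\frac{1}{\mu + \sigma}\right]
  = \delta \cdot 0 + (1 - \delta)\,\frac{1}{\sigma}
  = \frac{1 - \delta}{\sigma}, \\
T_{2}(\sigma) &= \mathbb{E}_{\mu}\!\left[\frac{\mu \sigma}{\mu + \sigma}\right]
  = \delta \cdot \sigma + (1 - \delta) \cdot 0
  = \delta \sigma, \\
\sigma - \frac{1}{T_{1}(\sigma)} &= \sigma - \frac{\sigma}{1 - \delta}
  = -\frac{\sigma \delta}{1 - \delta}, \\
-\frac{1}{2}\,(q_{t} - \rho q_{s}^{\star})^{2}
  \left(\sigma - \frac{1}{T_{1}(\sigma)}\right)
  &= \frac{\sigma \delta}{2 (1 - \delta)}\,(q_{t} - \rho q_{s}^{\star})^{2}.
\end{align*}
```

Substituting these expressions into (16) gives the second argument (1 - δ)/σ of the Moreau envelope, the term (σ δ / 2) [(1 - ρ^2) (q_s^★)^2 + (r_s^★)^2], and the last term of (19), while the terms - σ r_t^2 / 2 and (λ / 2) (q_t^2 + r_t^2) carry over unchanged.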

4.2. Phase Transitions

As illustrated in Figure 2, there is a phase transition phenomenon in the hard transfer formulation, where the problem moves from negative transfer to positive transfer as the similarity of the source and target tasks increases. For general loss functions, the exact location of the phase transition boundary can only be determined by numerically solving the scalar optimization problem in (19).

For the special case of squared loss, however, we are able to obtain analytical expressions. For the rest of this section, we restrict our discussions to the following special settings:

(a): The loss function $ℓ (\cdot, \cdot)$ in (4) and (6) is the squared loss, i.e., $ℓ (y, x) = \frac{1}{2} {(y - x)}^{2}$ .
(b): The regularization strength $λ = 0$ in the source and target formulations (4) and (6).
(c): The data/dimension ratios $α_{s}$ and $α_{t}$ satisfy $α_{s} > 1$ and $α_{t} > 1$ .

We first consider a nonlinear regression task, where the function

φ (\cdot)

in the generative models (1) and (2) can be arbitrary and where the function

\hat{φ} (\cdot)

in (9) is the identity function.

Proposition 1

(Regression Phase Transition). In addition to conditions (a)–(c) introduced above, assume that the predefined function

\hat{φ} (\cdot)

in (9) is the identity function. Let

δ^{★}

be the optimal transfer rate that leads to the lowest generalization error in the hard formulation (6). Then,

δ^{★} = \begin{cases} 0 & \text{if } ρ < ρ_{c} (α_{s}, α_{t}), \\ 1 & \text{if } ρ > ρ_{c} (α_{s}, α_{t}), \end{cases}

(20)

where

ρ_{c} (α_{s}, α_{t})

is defined in (11).

The result of Proposition 1, for which the proof can be found in Section 7.4, shows that

ρ_{c} (α_{s}, α_{t})

is the phase transition boundary separating the negative transfer regime from the positive transfer regime. When the similarity metric is

ρ < ρ_{c} (α_{s}, α_{t})

, the optimal transfer ratio is

δ^{★} = 0

, indicating that we should not transfer any source knowledge. Transfer becomes helpful only when

ρ

moves past the threshold. Note that, for this particular model, there is also an interesting feature that the optimal

δ^{★}

jumps to 1 in the positive transfer phase, meaning that we should fully copy the source weight vector.

4.3. Sufficient Condition

Next, we consider a binary classification task, where the nonlinear functions

φ (\cdot)

and

\hat{φ} (\cdot)

are both the sign function. In this part, we provide a sufficient condition for when the hard transfer is beneficial. Before stating our predictions, we need a few definitions related to the Moreau envelope function defined in (15). For simplicity of notation, we refer to the Moreau envelope function as

M_{ℓ} (\cdot, \cdot)

. Based on [25],

M_{ℓ} (\cdot, \cdot)

is differentiable in

R \times R^{+}

. We refer to its derivatives with respect to the first and second arguments as

M_{ℓ, 1}^{'} (\cdot, \cdot)

and

M_{ℓ, 2}^{'} (\cdot, \cdot)

, respectively. If

M_{ℓ} (\cdot, \cdot)

is twice differentiable, we refer to its second derivative with respect to the first and second arguments as

M_{ℓ, 1}^{″} (\cdot, \cdot)

and

M_{ℓ, 2}^{″} (\cdot, \cdot)

, respectively. Additionally, we refer to its second derivative with respect to the first then the second arguments as

M_{ℓ, 12}^{″} (\cdot, \cdot)

.

We define

q_{0}

,

r_{0}

, and

σ_{0}

as the optimal solutions of the standard learning formulation (i.e.,

δ = 0

in (19)). Moreover, we define the constants

β_{1}

and

β_{2}

as follows:

\begin{matrix} β_{1} = (1 - ρ^{2}) {(q_{s}^{★})}^{2} + {(r_{s}^{★})}^{2}, β_{2} = ρ q_{s}^{★}, \end{matrix}

(21)

where

q_{s}^{★}

and

r_{s}^{★}

are optimal solutions of the deterministic source formulation given in (14). Define the constants

I_{11}

,

I_{12}

,

I_{13}

, and

I_{14}

as follows:

\begin{matrix} \{\begin{matrix} I_{11} = α_{t} E (S H M_{ℓ, 1}^{″} [r_{0} H + q_{0} S; \frac{1}{σ_{0}}]), I_{12} = α_{t} E (S^{2} M_{ℓ, 1}^{″} [r_{0} H + q_{0} S; \frac{1}{σ_{0}}]) + λ \\ I_{13} = - \frac{α_{t}}{σ_{0}^{2}} E (S M_{ℓ, 12}^{″} [r_{0} H + q_{0} S; \frac{1}{σ_{0}}]); I_{14} = - \frac{α_{t}}{σ_{0}} E (S M_{ℓ, 12}^{″} [r_{0} H + q_{0} S; \frac{1}{σ_{0}}]) + σ_{0} (q_{0} - β_{2}), \end{matrix} \end{matrix}

where

Y = φ (S)

, and H and S are two independent standard Gaussian random variables. Now, define the constants

I_{21}

,

I_{22}

,

I_{23}

, and

I_{24}

as follows:

\begin{matrix} \{\begin{matrix} I_{21} = α_{t} E (H^{2} M_{ℓ, 1}^{″} [r_{0} H + q_{0} S; \frac{1}{σ_{0}}]) - σ_{0} + λ, I_{22} = α_{t} E (H S M_{ℓ, 1}^{″} [r_{0} H + q_{0} S; \frac{1}{σ_{0}}]) \\ I_{23} = - \frac{α_{t}}{σ_{0}^{2}} E (H M_{ℓ, 12}^{″} [r_{0} H + q_{0} S; \frac{1}{σ_{0}}]) - r_{0}, I_{24} = - \frac{α_{t}}{σ_{0}} E (H M_{ℓ, 12}^{″} [r_{0} H + q_{0} S; \frac{1}{σ_{0}}]) . \end{matrix} \end{matrix}

Finally, define the constants

I_{31}

,

I_{32}

,

I_{33}

, and

I_{34}

as follows:

\begin{matrix} \{\begin{matrix} I_{31} = - \frac{α_{t}}{σ_{0}^{2}} E (H M_{ℓ, 21}^{″} [r_{0} H + q_{0} S; \frac{1}{σ_{0}}]) - r_{0}, I_{32} = - \frac{α_{t}}{σ_{0}^{2}} E (S M_{ℓ, 12}^{″} [r_{0} H + q_{0} S; \frac{1}{σ_{0}}]) \\ I_{33} = \frac{2 α_{t}}{σ_{0}^{3}} E (M_{ℓ, 2}^{'} [r_{0} H + q_{0} S; \frac{1}{σ_{0}}]) + \frac{α_{t}}{σ_{0}^{4}} E (M_{ℓ, 2}^{″} [r_{0} H + q_{0} S; \frac{1}{σ_{0}}]) \\ I_{34} = \frac{α_{t}}{σ_{0}^{2}} E (M_{ℓ, 2}^{'} [r_{0} H + q_{0} S; \frac{1}{σ_{0}}]) + \frac{α_{t}}{σ_{0}^{3}} E (M_{ℓ, 2}^{″} [r_{0} H + q_{0} S; \frac{1}{σ_{0}}]) + \frac{1}{2} {(q_{0} - β_{2})}^{2} + \frac{β_{1}}{2} . \end{matrix} \end{matrix}

Now, we are ready to state our sufficient condition.

Proposition 2

(Classification). Assume that the Moreau envelope function is twice continuously differentiable almost everywhere in

R \times R^{+}

and that the above expectations are all well-defined. Moreover, assume that both

φ (\cdot)

and

\hat{φ} (\cdot)

are the sign function. Then,

\begin{matrix} δ^{★} > 0 if q_{0}^{'} r_{0} - q_{0} r_{0}^{'} > 0, \end{matrix}

(22)

where

q_{0}^{'}

and

r_{0}^{'}

are solutions to the following linear system of equations:

\begin{matrix} \{\begin{matrix} I_{11} r_{0}^{'} + I_{12} q_{0}^{'} + I_{13} σ_{0}^{'} + I_{14} = 0 \\ I_{21} r_{0}^{'} + I_{22} q_{0}^{'} + I_{23} σ_{0}^{'} + I_{24} = 0 \\ I_{31} r_{0}^{'} + I_{32} q_{0}^{'} + I_{33} σ_{0}^{'} + I_{34} = 0 . \end{matrix} \end{matrix}

(23)

We prove this result at the end of Section 7. Note that Proposition 2 is valid for a general family of loss functions and general regularization strength

λ \geq 0

. For instance, we can see that the results stated in Proposition 2 are valid for the squared loss and the least absolute deviation (LAD) loss, i.e.,

\begin{matrix} \{\begin{matrix} ℓ (y; x) = \frac{1}{2} {(1 - y x)}^{2} \\ ℓ (y; x) = | 1 - y x | . \end{matrix} \end{matrix}

(24)

Unlike (20), the result in (22) only provides a sufficient condition for when the hard transfer is beneficial. Nevertheless, our numerical simulations show that the sufficient condition in (22) provides a good prediction of the phase transition boundary for the majority of parameter settings.
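As an illustration of how the condition in Proposition 2 can be checked numerically, the following minimal sketch solves the linear system (23) and evaluates the sign condition (22). It assumes that the twelve constants I_{jk} and the scalars q_{0} and r_{0} have already been computed. The function name and the numerical values are hypothetical placeholders and are not taken from our experiments.

```python
import numpy as np

def positive_transfer_sufficient(I, q0, r0):
    """Solve (23) for (r0', q0', sigma0') and test the sign condition (22).

    I is a 3x4 array whose row j holds (I_j1, I_j2, I_j3, I_j4).
    Returns (condition_holds, r0_prime, q0_prime, sigma0_prime).
    """
    I = np.asarray(I, dtype=float)
    # (23) reads I[:, :3] @ [r0', q0', sigma0'] + I[:, 3] = 0.
    r0p, q0p, s0p = np.linalg.solve(I[:, :3], -I[:, 3])
    return (q0p * r0 - q0 * r0p) > 0, r0p, q0p, s0p

# Placeholder constants, chosen only so that the system is well conditioned.
I_demo = [[2.0, 0.5, 0.1, -1.0],
          [0.5, 3.0, 0.2, 0.4],
          [0.1, 0.2, 1.5, 0.3]]
holds, r0p, q0p, s0p = positive_transfer_sufficient(I_demo, q0=0.6, r0=0.8)
print("sufficient condition (22) holds:", bool(holds))
```

The unknowns are ordered as (r_{0}^{'}, q_{0}^{'}, σ_{0}^{'}), matching the column order of (23).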

5. Remarks

5.1. Learning Formulations

Given that the target task predicts the new label with

\hat{φ} (\cdot)

, it is more natural to consider loss functions satisfying the following form:

ℓ (y_{i}; \hat{φ} (a_{i}^{⊤} w)) .

In this case, the convexity assumption is not necessarily satisfied since the loss function can be viewed as the composition of a convex function with a nonlinear function. To guarantee the convexity, we need additional assumptions on the function

\hat{φ} (\cdot)

. Moreover, note that, once the convexity is guaranteed, the function

\hat{φ} (\cdot)

can be absorbed by the loss function

ℓ (\cdot; \cdot)

.

5.2. Transition from Negative to Positive Transfer

Our first simulation example in Figure 2 shows that the optimal transfer rate

δ^{★}

can be 1 while the similarity

ρ

is still less than 1. Here, we provide an intuitive explanation of this behavior.

Given that the source and target feature vectors are generated from the same distribution, one can see that the source labels can be equivalently expressed as follows:

\begin{matrix} y_{s, i} = φ (ρ a_{t, i}^{⊤} ξ_{t} + z_{i}), \forall i \in {1, \dots, n_{s}}, \end{matrix}

(25)

where

{z_{i}}_{i = 1}^{n_{s}}

is an additive noise caused by the mismatch between the source and target hidden vectors. Moreover, note that the noise strength depends on the similarity measure

ρ

.
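The representation in (25) can be illustrated with a short simulation. The sketch below assumes unit-norm hidden vectors whose inner product equals ρ and standard Gaussian feature vectors; under these assumptions (which are ours for the purpose of illustration), the mismatch term z_{i} is Gaussian with variance 1 - ρ^{2} and is uncorrelated with the target pre-activation. The function name and the mismatch direction v are hypothetical.

```python
import numpy as np

def split_source_preactivation(p=200, n=5000, rho=0.8, seed=0):
    """Return (rho * a_i^T xi_t, z_i) for n Gaussian feature vectors a_i."""
    rng = np.random.default_rng(seed)
    xi_t = rng.standard_normal(p)
    xi_t /= np.linalg.norm(xi_t)
    # Unit vector orthogonal to xi_t (assumed mismatch direction).
    v = rng.standard_normal(p)
    v -= (v @ xi_t) * xi_t
    v /= np.linalg.norm(v)
    xi_s = rho * xi_t + np.sqrt(1.0 - rho**2) * v  # so that xi_s^T xi_t = rho
    A = rng.standard_normal((n, p))
    signal = rho * (A @ xi_t)
    z = A @ xi_s - signal  # equals sqrt(1 - rho^2) * A @ v
    return signal, z

signal, z = split_source_preactivation(rho=0.8)
print("empirical Var(z):", z.var(), "| 1 - rho^2:", 1.0 - 0.8**2)
print("corr(signal, z):", np.corrcoef(signal, z)[0, 1])
```

The noise variance shrinks to zero as ρ approaches 1, which is the sense in which the noise strength depends on the similarity measure.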

First, consider the case when the number of source samples is bigger than the number of target samples (i.e.,

α_{s} > α_{t}

). We can see that a large value of

ρ

means that the source and target models are very closely related. One can then expect the additional data available in the source task to overcome the effect of the noise in (25) for large values of

ρ

. Specifically, it is expected in this regime that the source model will perform better than the standard learning formulation for values of

ρ

close to 1. However, as we decrease the similarity

ρ

, the source model carries little information about the target data. The hard formulation is then expected to perform worse than the standard formulation for small values of

ρ

. In this regime, the source information may hurt the generalization performance of the target task. Then, we need to only transfer a portion of the source information (see Figure 2a). In some settings, the transition is sharp, which means that the source information is irrelevant for the target task when

ρ

is smaller than a threshold (see Figure 2b).

Second, consider the case when the number of source samples is smaller than the number of target samples (i.e.,

α_{s} < α_{t}

). Given the observation in (25), the standard method is expected to outperform the hard formulation for all possible values of

ρ

in this regime (see Figure 7).

6. Additional Simulation Results

In this section, we provide additional simulation examples to confirm our asymptotic analysis and illustrate the phase transition phenomenon. In our experiments, we focus on the regression and classification models.

6.1. Model Assumptions

For the regression model, we assume that the source, target, and test data are generated according to

\begin{matrix} y_{i} = max (a_{i}^{⊤} ξ, 0), \forall i \in {1, \dots, n} . \end{matrix}

(26)

The data

{(a_{i}, y_{i})}_{i = 1}^{n}

can be the training data of the source or target tasks. In this regression model, we assume that the function

\hat{φ} (.)

is the identity function, i.e.,

\hat{φ} (x) = x

. Then, the generalization error corresponding to the soft formulation converges in probability as follows:

\begin{matrix} E_{test} \overset{p \to \infty}{\to} v - 2 c q_{t}^{★} + ({(q_{t}^{★})}^{2} + {(r_{t}^{★})}^{2}), \end{matrix}

where c and v are defined as follows

\begin{matrix} c = E [z max (z, 0)], v = E [{max (z, 0)}^{2}], \end{matrix}

where z is a standard Gaussian random variable and

q_{t}^{★}

and

r_{t}^{★}

are defined in Theorem 1. Additionally, the asymptotic limit of the generalization error corresponding to the hard formulation can be expressed in a similar fashion.
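For a standard Gaussian z, both constants evaluate to c = v = 1/2, since z max (z, 0) and {max (z, 0)}^{2} both equal z^{2} on the event z > 0. The sketch below evaluates the resulting limit and compares it with a Monte Carlo estimate. It assumes that the test error is the mean squared prediction error and that the prediction behaves like q S + r H with S and H independent standard Gaussians; the function names are hypothetical.

```python
import numpy as np

def regression_test_error(q, r):
    """Asymptotic limit v - 2 c q + q^2 + r^2 with c = v = 1/2."""
    c = 0.5  # E[z max(z, 0)] = E[z^2 ; z > 0]
    v = 0.5  # E[max(z, 0)^2]
    return v - 2.0 * c * q + q**2 + r**2

def regression_test_error_mc(q, r, n=200_000, seed=0):
    """Monte Carlo estimate of E[(max(S, 0) - (q S + r H))^2]."""
    rng = np.random.default_rng(seed)
    s = rng.standard_normal(n)
    h = rng.standard_normal(n)
    y = np.maximum(s, 0.0)   # label generated as in (26)
    y_hat = q * s + r * h    # assumed Gaussian model of the prediction
    return np.mean((y - y_hat) ** 2)

print(regression_test_error(0.7, 0.3), regression_test_error_mc(0.7, 0.3))
```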

For the binary classification model, we assume that the source, target, and test data labels are binary and generated as follows:

\begin{matrix} y_{i} = sign (a_{i}^{⊤} ξ), \forall i \in {1, \dots, n}, \end{matrix}

(27)

where the data

{(a_{i}, y_{i})}_{i = 1}^{n}

can be the training data of the source or target tasks. In this classification model, the objective is to predict the correct sign of any unseen sample

y_{new}

. Then, we fix the function

\hat{φ} (.)

to be the sign function. Following Theorem 1, it can be easily shown that the generalization error corresponding to the soft formulation given in (8) converges in probability as follows:

\begin{matrix} E_{test} \overset{p \to \infty}{\to} \frac{1}{π} {cos}^{- 1} (\frac{q_{t}^{★}}{\sqrt{{(q_{t}^{★})}^{2} + {(r_{t}^{★})}^{2}}}) . \end{matrix}

Here,

q_{t}^{★}

and

r_{t}^{★}

are optimal solutions of the target scalar formulation given in (16). The generalization error corresponding to the hard formulation given in (6) can be expressed in a similar fashion.
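The arccosine formula above is the probability that two jointly Gaussian variables with correlation q / \sqrt{q^{2} + r^{2}} have opposite signs. The following sketch checks this numerically, assuming that the pre-activation of the learned model behaves like q S + r H with S and H independent standard Gaussians; the function names are hypothetical.

```python
import numpy as np

def classification_test_error(q, r):
    """Asymptotic limit (1 / pi) * arccos(q / sqrt(q^2 + r^2))."""
    return np.arccos(q / np.hypot(q, r)) / np.pi

def classification_test_error_mc(q, r, n=200_000, seed=0):
    """Monte Carlo estimate of P(sign(S) != sign(q S + r H))."""
    rng = np.random.default_rng(seed)
    s = rng.standard_normal(n)
    h = rng.standard_normal(n)
    return np.mean(np.sign(s) != np.sign(q * s + r * h))

print(classification_test_error(1.0, 1.0), classification_test_error_mc(1.0, 1.0))
```

Only the ratio between q and r matters here: the error vanishes as r / q goes to zero and equals 1/2 when q = 0.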

6.2. Phase Transitions in the Hard Formulation

In Section 4, we presented analytical formulas for the phase transition phenomenon but only for the special case of squared loss with no regularization. The main purpose of this experiment, shown in Figure 3, is to demonstrate that the phase transition phenomenon still takes place in more general settings with different loss functions and regularization strengths.

Figure 3. Additional illustrations of the phase transition phenomenon. (a) Regression (squared loss,

α_{t} = 0.5

, and

α_{s} = 3 α_{t}

) (b) Regression (squared loss,

α_{t} = 2

, and

α_{s} = 2 α_{t}

) (c) Binary classification (squared loss,

α_{t} = 1.5

, and

α_{s} = 3 α_{t}

) (d) Binary classification (hinge loss,

α_{t} = 1.5

, and

α_{s} = 3 α_{t}

). In all the experiments, we set the regularization strength to be

λ = 0.1

. The blue line represents our theoretical predictions of the optimal transfer rate obtained by solving our asymptotic results in Section 4 for multiple values of

δ

. The empirical results are averaged over 100 independent Monte Carlo trials with

p = 2500

.

In all the cases shown in Figure 3, the transition from negative to positive transfer is a discontinuous jump from standard learning (i.e., no transfer) to full source transfer. Additionally, Figure 3c,d show that the loss function has a small effect on the phase transition boundary.

6.3. Sufficient Condition for the Hard Formulation

In Section 4, we presented a sufficient condition for positive transfer. This sufficient condition is valid for a general family of loss functions and a general regularization strength. The main purpose of this experiment, shown in Figure 4, is to illustrate the precision of the sufficient condition for two particular loss functions, i.e., the squared loss and LAD loss.

Figure 4. Illustrations of the sufficient condition in Proposition 2. (a) Classification (squared loss,

α_{t} = 1.5

, and

α_{s} = 8 α_{t}

) (b) Classification (LAD loss,

α_{t} = 1.5

, and

α_{s} = 8 α_{t}

). In all the experiments, we set the regularization strength to be

λ = 0.1

. The blue line represents our theoretical predictions of the optimal transfer rate obtained by solving our asymptotic results in Section 4 for multiple values of

δ

. The green line represents our sufficient condition for positive transfer stated in Proposition 2.

In all the cases shown in Figure 4, we can see that the transition from negative to positive transfer is a discontinuous jump from standard learning to full source transfer. Additionally, Figure 4a,b show that the sufficient condition summarized in Proposition 2 provides a good prediction of the phase transition boundary for the considered setting.

6.4. Soft Transfer: Impact of the Weighting Matrix and Regularization Strength

In this experiment, we empirically explore the impact of the weighting matrix

Σ

on the generalization error corresponding to the soft formulation. We focus on the binary classification problem with logistic loss. The weighting matrix in (8) takes the following form:

Σ = \sqrt{β_{t}} V,

(28)

where

V

is a diagonal matrix generated in three different ways. (1) Soft Identity:

V

is an identity matrix; (2) Soft Uniform: the diagonal entries of

V

are drawn independently from the uniform distribution and then scaled to have their mean equal to 1; and (3) Soft Beta: similar to (2), but with the diagonal entries drawn from the beta distribution, followed by rescaling to unit mean.
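The three constructions can be sketched as follows. The support of the uniform distribution (taken here to be [0, 1]) and the beta shape parameters (taken here to be (2, 5)) are assumptions made only for illustration, and the function name is hypothetical.

```python
import numpy as np

def weighting_matrix(p, beta_t, kind="identity", seed=0, beta_params=(2.0, 5.0)):
    """Return Sigma = sqrt(beta_t) * V as in (28), with V diagonal and unit mean."""
    rng = np.random.default_rng(seed)
    if kind == "identity":
        d = np.ones(p)
    elif kind == "uniform":
        d = rng.uniform(0.0, 1.0, size=p)   # assumed support [0, 1]
    elif kind == "beta":
        d = rng.beta(*beta_params, size=p)  # assumed shape parameters
    else:
        raise ValueError(f"unknown kind: {kind}")
    d = d / d.mean()  # rescale the diagonal entries to unit mean
    return np.sqrt(beta_t) * np.diag(d)

for kind in ("identity", "uniform", "beta"):
    Sigma = weighting_matrix(5, beta_t=0.1, kind=kind)
    print(kind, np.round(np.diag(Sigma), 3))
```

With this normalization, the three choices share the same average weight \sqrt{β_{t}} and differ only in how it is spread across coordinates.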

Figure 5a shows that the considered weighting matrix choices have similar generalization performances, with the identity matrix being slightly better than the other alternatives. Moreover, Figure 5b illustrates the effects of the parameter

β_{t}

in (28) on the generalization performance. It points to the interesting possibility of “designing” the optimal weight matrix to minimize the generalization error.

Figure 5. Continuous line: theoretical predictions. Circles: numerical simulations. (a)

α_{s} = 6 α_{t}

,

λ = 0.1

,

β_{t} = 1 / 10

, and

ρ = 0.9

. (b)

α_{t} = 1

,

α_{s} = 5 α_{t}

,

λ = 0.3

, and

ρ = 0.75

. In all the experiments, we consider the binary classification problem with the logistic loss function. The empirical results are averaged over 50 independent Monte Carlo trials, and we set

p = 1000

.

6.5. Soft and Hard Transfer Comparison

In this simulation example, we consider the regression model and compare the performances of the hard and soft transfer formulations as functions of

α_{t}

and

ρ

.

Figure 6a shows that the soft formulation provides the best generalization performance for all values of

α_{t}

. Moreover, we can see that the hard transfer formulation is only useful for small values of

α_{t}

. Figure 6b shows that the performances of the soft and hard transfer formulations depend on the similarity between the source and target tasks. Specifically, the generalization performances of the different transfer approaches all improve as we increase the similarity measure

ρ

. We can also see that the full source transfer approach provides the lowest generalization error when the similarity measure is close to 1, while the soft transfer method leads to the best generalization performance at moderate values of the similarity measure. At very small values of

ρ

, which means that the two tasks share little resemblance, the standard learning method (i.e., no transfer) is the best scheme one should use.

Figure 6. Continuous line: theoretical predictions. Circles: numerical simulations. (a)

α_{s} = 12 α_{t}

,

λ = 0.2

, and

ρ = 0.75

. (b)

α_{t} = 1.5

,

α_{s} = 8 α_{t}

, and

λ = 0.4

. In all the experiments, we consider the regression setting with a squared loss. The hard transfer formulation uses

δ = 0.5

, and the soft transfer formulation uses an identity weighting matrix. The empirical results are averaged over 50 independent Monte Carlo trials and we set

p = 1000

.

6.6. Effects of the Source Parameters

In the last simulation example, we consider the regression and classification models. We study the performance of the hard and soft transfer formulations when

α_{s} < α_{t}

.

Figure 7a considers the regression model. It first shows that the soft transfer formulation provides a slightly better generalization performance compared to the standard method. This behavior can be explained by the fact that the soft formulation requires the target weight vector to be close and not necessarily equal to the source weight vector. Additionally, the source model carries some information about the target task.

Figure 7. Continuous line: theoretical predictions. Circles: numerical simulations. (a)

α_{s} = 0.5 α_{t}

,

λ = 0.6

, and

ρ = 0.7

. We consider the regression setting with a squared loss. (b)

α_{s} = 0.5 α_{t}

,

λ = 0.3

, and

ρ = 0.8

. We consider the classification setting with a logistic loss. The hard transfer formulation uses

δ = 0.5

, and the soft transfer formulation uses an identity weighting matrix. The empirical results are averaged over 60 independent Monte Carlo trials, and we set

p = 1000

.

We can also see that the hard transfer approach is not beneficial when the number of source samples is smaller than the number of target samples. This result can be explained by the fact that the hard formulation restricts some entries in the target weight vector to be exactly equal to the corresponding entries in the source weight vector. Moreover, the source model is not perfectly aligned with the target model and is trained on fewer samples than the target model (see Section 5.2).

The same behavior can be observed in Figure 7b, which considers the classification model.

7. Technical Details

In this section, we provide detailed proofs of Theorem 1 and Propositions 1 and 2. Specifically, we focus on analyzing the generalized formulation in (8) using the CGMT framework introduced in the following part.

7.1. Technical Tool: Convex Gaussian Min–Max Theorem

The CGMT provides an asymptotic equivalent formulation of primary optimization (PO) problems of the following form:

Φ_{p} (G) = min_{w \in S_{w}} max_{u \in S_{u}} u^{⊤} G w + ψ (w, u) .

(29)

Specifically, the CGMT shows that the PO given in (29) is asymptotically equivalent to the following formulation:

ϕ_{p} (g, h) = min_{w \in S_{w}} max_{u \in S_{u}} ∥ u ∥ g^{⊤} w + ∥ w ∥ h^{⊤} u + ψ (w, u),

(30)

referred to as the auxiliary optimization (AO) problem. Before showing the equivalence between PO and AO, the CGMT assumes that

G \in R^{n \times p}

,

g \in R^{p}

, and

h \in R^{n}

all have independent and identically distributed standard normal entries; that the feasibility sets

S_{w} \subset R^{p}

and

S_{u} \subset R^{n}

are convex and compact; and that the function

ψ (., .) : R^{p} \times R^{n} \to R

is continuous convex-concave on

S_{w} \times S_{u}

. Moreover, the function

ψ (., .)

is independent of the matrix

G

. Under these assumptions, the CGMT [17] (Theorem 6.1) shows that, for any

χ \in R

and

ζ > 0

, the following holds:

P (| Φ_{p} (G) - χ | > ζ) \leq 2 P (| ϕ_{p} (g, h) - χ | > ζ) .

(31)

Additionally, the CGMT [17] (Theorem 6.1) provides the following conditions under which the optimal solutions of the PO and AO concentrate around the same set.

Theorem 2

(CGMT Framework). Consider an open set

S_{p}

. Moreover, define the set

S_{p}^{c} = S_{w} \setminus S_{p}

. Let

ϕ_{p}

and

ϕ_{p}^{c}

be the optimal cost values of the AO formulation in (30) with feasibility sets

S_{w}

and

S_{p}^{c}

, respectively. Assume that the following properties are all satisfied:

(1): There exists a constant ϕ such that the optimal cost $ϕ_{p}$ converges in probability to ϕ as p goes to $+ \infty$ .
(2): There exists a positive constant $ζ > 0$ such that $ϕ_{p}^{c} \geq ϕ + ζ$ with probability going to 1 as $p \to + \infty$ .

Then, the following convergence in probability holds:

| Φ_{p} - ϕ_{p} | \overset{p \to + \infty}{⟶} 0, and P ({\hat{w}}_{p} \in S_{p}) \overset{p \to + \infty}{⟶} 1,

where

Φ_{p}

and

{\hat{w}}_{p}

are the optimal cost and the optimal solution of the PO formulation in (29).

Theorem 2 allows us to analyze the generally easy AO problem to infer the asymptotic properties of the generally hard PO problem. Next, we use the CGMT to rigorously prove the technical results presented in Theorem 1.

7.2. Precise Analysis of the Source Formulation

The source formulation defined in (4) is well-studied in recent literature [26]. Specifically, it has been rigorously proven that the performance of the source formulation can be fully characterized after solving the following scalar formulation:

\begin{matrix} min_{\begin{matrix} q_{s}, r_{s} \geq 0 \end{matrix}} sup_{\begin{matrix} σ_{s} > 0 \end{matrix}} & α_{s} E [M_{ℓ (Y_{s}, .)} (r_{s} H_{s} + q_{s} S_{s}; \frac{r_{s}}{σ_{s}})] \\ - \frac{r_{s} σ_{s}}{2} + \frac{λ}{2} (q_{s}^{2} + r_{s}^{2}), \end{matrix}

(32)

where

Y_{s} = φ (S_{s})

, and

H_{s}

and

S_{s}

are two independent standard Gaussian random variables. The expectation in (32) is taken over the random variables

H_{s}

and

S_{s}

. Furthermore, the function

M_{ℓ (Y_{s}, .)}

introduced in the scalar optimization problem (32) is the Moreau envelope function defined in (15).

7.3. Precise Analysis of the Soft Transfer Approach

In this part, we provide a precise asymptotic analysis of the generalized transfer formulation given in (8). Specifically, we focus on analyzing the following formulation:

\begin{matrix} min_{w \in R^{p}} & \frac{1}{p} \sum_{i = 1}^{n_{t}} ℓ (y_{i}; a_{i}^{⊤} w) + \frac{λ}{2} {∥ w ∥}^{2} + \frac{1}{2} {∥ Σ (w - {\hat{w}}_{s}) ∥}^{2}, \end{matrix}

(33)

where

{\hat{w}}_{s}

is the optimal solution of the source formulation given in (4). Note that the vector

{\hat{w}}_{s}

is independent of the training data of the target task. For simplicity of notation, we denote by

{(a_{i}, y_{i})}_{i = 1}^{n_{t}}

the training data of the target task. Here, we use the CGMT framework introduced in Section 7.1 to precisely analyze the above formulation.
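For intuition, the formulation in (33) admits a closed-form solution in one special case. The sketch below assumes the regression squared loss ℓ (y; x) = \frac{1}{2} {(y - x)}^{2}, for which the first-order optimality condition reduces to a linear system. The function names and the synthetic data are hypothetical and serve only as an illustration.

```python
import numpy as np

def soft_transfer_squared_loss(A, y, w_s, Sigma, lam):
    """Minimizer of (33) when the loss is 0.5 * (y - a^T w)^2 (assumed)."""
    p = A.shape[1]
    Lam = Sigma.T @ Sigma
    # Stationarity: (A^T A / p + lam I + Lam) w = A^T y / p + Lam w_s.
    lhs = A.T @ A / p + lam * np.eye(p) + Lam
    rhs = A.T @ y / p + Lam @ w_s
    return np.linalg.solve(lhs, rhs)

def soft_transfer_objective(w, A, y, w_s, Sigma, lam):
    """Objective of (33) for the same squared loss."""
    p = A.shape[1]
    fit = 0.5 * np.sum((y - A @ w) ** 2) / p
    return fit + 0.5 * lam * (w @ w) + 0.5 * np.sum((Sigma @ (w - w_s)) ** 2)

# Synthetic example; w_s is a stand-in for the source solution.
rng = np.random.default_rng(0)
p, n_t = 40, 60
xi_t = rng.standard_normal(p)
xi_t /= np.linalg.norm(xi_t)
A = rng.standard_normal((n_t, p))
y = np.maximum(A @ xi_t, 0.0)
w_s = 0.5 * xi_t + 0.05 * rng.standard_normal(p)
Sigma = np.sqrt(0.1) * np.eye(p)
w_hat = soft_transfer_squared_loss(A, y, w_s, Sigma, lam=0.2)
print("objective at w_hat:", soft_transfer_objective(w_hat, A, y, w_s, Sigma, 0.2))
```

For general convex losses no such closed form is available, which is why the analysis proceeds through the CGMT.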

7.3.1. Formulating the Auxiliary Optimization Problem

Our first objective is to rewrite the generalized formulation in the form of the PO problem given in (29). To this end, we introduce additional optimization variables. Specifically, the generalized formulation can be equivalently formulated as follows:

\begin{matrix} min_{w \in R^{p}} max_{u \in R^{n_{t}}} & \frac{1}{p} u^{⊤} A w - \frac{1}{p} \sum_{i = 1}^{n_{t}} ℓ^{★} (y_{i}; u_{i}) + \frac{λ}{2} {∥ w ∥}^{2} \\ + \frac{1}{2} {∥ Σ (w - {\hat{w}}_{s}) ∥}^{2}, \end{matrix}

(34)

where the optimization vector

u \in R^{n_{t}}

is formed as

u = {[u_{1}, \dots, u_{n_{t}}]}^{⊤}

and the data matrix

A \in R^{n_{t} \times p}

is given by

A = {[a_{1}, \dots, a_{n_{t}}]}^{⊤}

. Additionally, the function

ℓ^{★} (y; .)

denotes the convex conjugate function of the loss function

ℓ (y; .)

. First, observe that the CGMT framework assumes that the feasibility sets of the minimization and maximization problems are compact. Then, our next step is to show that the formulation given in (34) satisfies this assumption.

Lemma 1

(Primal-Dual Compactness). Assume that

\hat{w}

and

\hat{u}

are optimal solutions of the optimization problem in (34). Then, there exist two constants

C_{w} > 0

and

C_{u} > 0

such that the following convergence in probability holds:

\begin{matrix} P (∥ \hat{w} ∥ \leq C_{w}) \overset{p \to + \infty}{\to} 1, P (∥ \hat{u} ∥ / \sqrt{n_{t}} \leq C_{u}) \overset{p \to + \infty}{\to} 1 . \end{matrix}

(35)

A detailed proof of Lemma 1 is provided in Appendix B. The proof uses Assumption 3 to establish the boundedness of the optimal solution

\hat{w}

. Moreover, it uses the asymptotic results in [27] (Theorem 2.1), which provides the concentration properties of the minimum and maximum eigenvalues of random matrices. To show the compactness of the optimal dual vector

\hat{u}

, we use Assumption 3 and the result in [25] (Proposition 11.3), which provides the inversion rules for subgradient relations.

The theoretical result in Lemma 1 shows that the optimization problem in (34) can be equivalently formulated with compact feasibility sets on events with probability going to one. Then, it suffices to study the constrained version of (34). Note that the data labels

{y_{i}}_{i = 1}^{n_{t}}

depend on the data matrix

A

. Then, one can decompose the matrix

A

as follows:

\begin{matrix} A & = A P_{ξ_{t}} + A P_{ξ}^{⊥} = A ξ_{t} ξ_{t}^{⊤} + A P_{ξ}^{⊥}, \end{matrix}

where the matrix

P_{ξ_{t}} \in R^{p \times p}

denotes the projection matrix onto the space spanned by the vector

ξ_{t}

and the matrix

P_{ξ}^{⊥} = I_{p} - ξ_{t} ξ_{t}^{⊤}

denotes the projection matrix onto the orthogonal complement of the space spanned by the vector

ξ_{t}

. Note that we can express

A

as follows without changing its statistics:

\begin{matrix} A = s_{t} ξ_{t}^{⊤} + G P_{ξ}^{⊥}, \end{matrix}

(36)

where

s_{t} \sim N (0, I_{n_{t}})

and the components of the matrix

G \in R^{n_{t} \times p}

are drawn independently from a standard Gaussian distribution and where

s_{t}

and

G

are independent. Here, (36) represents an equality in distribution. This means that the formulation in (34) can be expressed as follows:

\begin{matrix} min_{∥ w ∥ \leq C_{w}} max_{u \in C_{t}} \frac{1}{p} u^{⊤} G P_{ξ}^{⊥} w + \frac{1}{p} u^{⊤} s_{t} ξ_{t}^{⊤} w + \frac{λ}{2} {∥ w ∥}^{2} \\ - \frac{1}{p} \sum_{i = 1}^{n_{t}} ℓ^{★} (y_{i}; u_{i}) + \frac{1}{2} {∥ Σ (w - {\hat{w}}_{s}) ∥}^{2}, \end{matrix}

(37)

where the set

C_{t}

is defined as

C_{t} = {u : ∥ u ∥ / \sqrt{n_{t}} \leq C_{u}}

. Note that the formulation in (37) is in the form of the primary formulation given in (29). Here, the function

ψ (., .)

is defined as follows:

\begin{matrix} ψ (w, u) & = \frac{1}{p} u^{⊤} s_{t} ξ_{t}^{⊤} w + \frac{λ}{2} {∥ w ∥}^{2} - \frac{1}{p} \sum_{i = 1}^{n_{t}} ℓ^{★} (y_{i}; u_{i}) \\ + \frac{1}{2} {∥ Σ (w - {\hat{w}}_{s}) ∥}^{2} . \end{matrix}

(38)

One can easily see that the optimization problem in (37) has compact convex feasibility sets. Moreover, the function

ψ (., .)

is continuous, convex–concave, and independent of the Gaussian matrix

G

. This shows that the assumptions of the CGMT are all satisfied by the primary formulation in (37). Then, following the CGMT framework, the auxiliary formulation corresponding to our primary problem in (37) can be expressed as follows:

\begin{matrix} min_{∥ w ∥ \leq C_{w}} max_{u \in C_{t}} \frac{∥ u ∥}{p} g^{⊤} P_{ξ}^{⊥} w + \frac{1}{p} u^{⊤} s_{t} ξ_{t}^{⊤} w + \frac{h^{⊤} u}{p} ∥ P_{ξ}^{⊥} w ∥ \\ + \frac{λ}{2} {∥ w ∥}^{2} - \frac{1}{p} \sum_{i = 1}^{n_{t}} ℓ^{★} (y_{i}; u_{i}) + \frac{1}{2} {∥ Σ (w - {\hat{w}}_{s}) ∥}^{2}, \end{matrix}

(39)

where

g \in R^{p}

and

h \in R^{n_{t}}

are two independent standard Gaussian vectors. The rest of the proof focuses on simplifying the obtained AO formulation and on studying its asymptotic properties.
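The decomposition of the data matrix used above can be checked numerically. The sketch below builds the projection matrices for a unit-norm ξ_{t}, verifies that A = A ξ_{t} ξ_{t}^{⊤} + A P_{ξ}^{⊥} holds exactly, and forms the surrogate matrix in (36) with an independent Gaussian matrix G. The dimensions and variable names are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_t, p = 300, 200

xi_t = rng.standard_normal(p)
xi_t /= np.linalg.norm(xi_t)                # unit-norm hidden vector
P_perp = np.eye(p) - np.outer(xi_t, xi_t)   # projector onto the complement of xi_t

A = rng.standard_normal((n_t, p))
s_t = A @ xi_t                              # the part of A that the labels depend on
A_rebuilt = np.outer(s_t, xi_t) + A @ P_perp

# Surrogate in (36): replace A by an independent copy G on the orthogonal part.
G = rng.standard_normal((n_t, p))
A_tilde = np.outer(s_t, xi_t) + G @ P_perp

print("max |A - A_rebuilt|:", np.abs(A - A_rebuilt).max())
print("max |A_tilde xi_t - s_t|:", np.abs(A_tilde @ xi_t - s_t).max())
```

Since the labels depend on A only through s_{t}, replacing A P_{ξ}^{⊥} with G P_{ξ}^{⊥} leaves the joint distribution of the data unchanged, which is what makes the function ψ (., .) independent of G.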

7.3.2. Simplifying the AO Problem of the Target Task

Here, we focus on simplifying the auxiliary formulation corresponding to the target task. We start our analysis by decomposing the target optimization vector

w \in R^{p}

as follows:

\begin{matrix} w = (ξ_{t}^{⊤} w) ξ_{t} + B_{ξ_{t}}^{⊥} r_{t}, \end{matrix}

(40)

where

r_{t} \in R^{p - 1}

is a free vector and

B_{ξ_{t}}^{⊥} \in R^{p \times (p - 1)}

is formed by an orthonormal basis orthogonal to the vector

ξ_{t}

. Now, define the variable

q_{t}

as follows:

q_{t} = ξ_{t}^{⊤} w

. Based on the result in Lemma 1 and the decomposition in (40), there exist

C_{q_{t}} > 0

,

C_{r} > 0

, and

C_{u} > 0

such that our auxiliary formulation can be asymptotically expressed in terms of the variables

q_{t}

and

r_{t}

as follows:

\begin{matrix} min_{\begin{matrix} (q_{t}, r_{t}) \in T_{1} \end{matrix}} max_{\begin{matrix} u \in C_{t} \end{matrix}} \frac{∥ u ∥}{p} g^{⊤} B_{ξ_{t}}^{⊥} r_{t} + \frac{∥ r_{t} ∥}{p} h^{⊤} u + \frac{q_{t}}{p} u^{⊤} s_{t} + \frac{λ}{2} q_{t}^{2} \\ + \frac{λ}{2} {∥ r_{t} ∥}^{2} - \frac{1}{p} \sum_{i = 1}^{n_{t}} ℓ^{★} (y_{i}; u_{i}) + \frac{1}{2} q_{t}^{2} V_{p, t} - q_{t} V_{p, t s} \\ + \frac{1}{2} r_{t}^{⊤} {(B_{ξ_{t}}^{⊥})}^{⊤} Λ B_{ξ_{t}}^{⊥} r_{t} + q_{t} ξ_{t}^{⊤} Λ B_{ξ_{t}}^{⊥} r_{t} - r_{t}^{⊤} {(B_{ξ_{t}}^{⊥})}^{⊤} Λ {\hat{w}}_{s} . \end{matrix}

Here, we drop terms independent of the optimization variables, and the matrix

Λ \in R^{p \times p}

is defined as

Λ = Σ^{⊤} Σ

. Additionally, the feasibility set

T_{1}

is defined as follows:

\begin{matrix} T_{1} = \{(q_{t}, r_{t}) : | q_{t} | \leq C_{q_{t}}, ∥ r_{t} ∥ \leq C_{r}\} . \end{matrix}

(41)

Here, the sequences of random variables

V_{p, t}

and

V_{p, t s}

are defined as follows:

\begin{matrix} V_{p, t} = ξ_{t}^{⊤} Λ ξ_{t}, V_{p, t s} = ξ_{t}^{⊤} Λ {\hat{w}}_{s} . \end{matrix}

(42)

Next, we focus on simplifying the obtained auxiliary formulation. Our strategy is to solve over the direction of the optimization vector

r_{t} \in R^{p - 1}

. This step requires an interchange between non-convex minimization and non-concave maximization. We can justify the interchange using the theoretical result in [17] (Lemma A.3). The main argument in [17] (Lemma A.3) is that the strong convexity of the primary formulation in (37) allows us to perform such an interchange in the corresponding auxiliary formulation. The optimization problem over the vector

r_{t}

with fixed norm, i.e.,

∥ r_{t} ∥ = r_{t}

, can be formulated as follows:

\begin{matrix} C_{p}^{★} = & min_{r_{t} \in R^{p - 1}} b_{p}^{⊤} r_{t} + \frac{1}{2} r_{t}^{⊤} Λ^{⊥} r_{t}, s . t . ∥ r_{t} ∥ = r_{t} . \end{matrix}

(43)

Here, we ignore constant terms independent of

r_{t}

, and the matrix

Λ^{⊥} \in R^{(p - 1) \times (p - 1)}

and the vector

b_{p} \in R^{p - 1}

can be expressed as follows:

\begin{matrix} Λ^{⊥} = {(B_{ξ_{t}}^{⊥})}^{⊤} Λ B_{ξ_{t}}^{⊥}, b_{p} & = \frac{∥ u ∥}{p} {(B_{ξ_{t}}^{⊥})}^{⊤} g + q_{t} {(B_{ξ_{t}}^{⊥})}^{⊤} Λ ξ_{t} \\ - {(B_{ξ_{t}}^{⊥})}^{⊤} Λ {\hat{w}}_{s} . \end{matrix}

The optimization problem in (43) is non-convex given the norm equality constraint. It is well-studied in the literature [28] and is known as the trust region subproblem. Using the same analysis as in [20], the optimal cost value of the optimization problem (43) can be expressed in terms of a one-dimensional optimization problem as follows:

\begin{matrix} C_{p}^{★} = sup_{σ_{t} > - μ_{p}} \{- \frac{1}{2} b_{p}^{⊤} {[Λ^{⊥} + σ_{t} I_{p - 1}]}^{- 1} b_{p} - \frac{σ_{t} r_{t}^{2}}{2}\}, \end{matrix}

(44)

where

μ_{p}

is the minimum eigenvalue of the matrix

Λ^{⊥}

, denoted by

σ_{\min} (Λ^{⊥})

. This result can be seen by equivalently formulating the non-convex problem in (43) as follows:

\begin{matrix} C_{p}^{★} = & min_{r_{t} \in R^{p - 1}} max_{σ_{t} \in R} b_{p}^{⊤} r_{t} + \frac{1}{2} r_{t}^{⊤} Λ^{⊥} r_{t} + \frac{σ_{t}}{2} ({∥ r_{t} ∥}^{2} - r_{t}^{2}) . \end{matrix}

Then, we show that the optimal

σ_{t}

satisfies a constraint that preserves the convexity over

r_{t}

. This allows us to interchange the maximization and minimization and to solve over the vector

r_{t}

. The above analysis shows that the AO formulation corresponding to our primary problem can be expressed as follows:

\begin{matrix} min_{\begin{matrix} (q_{t}, r_{t}) \in T_{2} \end{matrix}} max_{\begin{matrix} u \in C_{t} \end{matrix}} sup_{σ_{t} > - μ_{p}} \frac{r_{t}}{p} h^{⊤} u + \frac{q_{t}}{p} u^{⊤} s_{t} + \frac{λ}{2} q_{t}^{2} + \frac{λ}{2} r_{t}^{2} \\ - \frac{1}{p} \sum_{i = 1}^{n_{t}} ℓ^{★} (y_{i}; u_{i}) + \frac{1}{2} q_{t}^{2} V_{p, t} - q_{t} V_{p, t s} - \frac{{∥ u ∥}^{2}}{2 p} T_{p, g} (σ_{t}) \\ - \frac{σ_{t} r_{t}^{2}}{2} - \frac{1}{2} q_{t}^{2} T_{p, t} (σ_{t}) - \frac{1}{2} T_{p, s} (σ_{t}) + q_{t} T_{p, t s} (σ_{t}), \end{matrix}

(45)

where the set

T_{2}

has the same definition as the set

T_{1}

except that we replace

∥ r_{t} ∥

with

r_{t}

. Here, the sequence of random functions

T_{p, g} (.)

,

T_{p, t} (.)

,

T_{p, s} (.)

, and

T_{p, t s} (.)

can be expressed as follows:

\begin{matrix} \{\begin{matrix} T_{p, g} (σ_{t}) = \frac{1}{p} g^{⊤} B_{ξ_{t}}^{⊥} {[Λ^{⊥} + σ_{t} I_{p - 1}]}^{- 1} {(B_{ξ_{t}}^{⊥})}^{⊤} g \\ T_{p, t} (σ_{t}) = ξ_{t}^{⊤} Λ B_{ξ_{t}}^{⊥} {[Λ^{⊥} + σ_{t} I_{p - 1}]}^{- 1} {(B_{ξ_{t}}^{⊥})}^{⊤} Λ ξ_{t} \\ T_{p, s} (σ_{t}) = {\hat{w}}_{s}^{⊤} Λ B_{ξ_{t}}^{⊥} {[Λ^{⊥} + σ_{t} I_{p - 1}]}^{- 1} {(B_{ξ_{t}}^{⊥})}^{⊤} Λ {\hat{w}}_{s} \\ T_{p, t s} (σ_{t}) = ξ_{t}^{⊤} Λ B_{ξ_{t}}^{⊥} {[Λ^{⊥} + σ_{t} I_{p - 1}]}^{- 1} {(B_{ξ_{t}}^{⊥})}^{⊤} Λ {\hat{w}}_{s} . \end{matrix} \end{matrix}

Note that the formulation in (45) is obtained after dropping terms that converge in probability to zero. This simplification can be justified using a similar analysis to that in [20] (Lemma 3). The main idea in [20] (Lemma 3) is to show that both loss functions converge uniformly to the same limit.
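The reduction of the norm-constrained problem (43) to the one-dimensional problem (44) can be illustrated numerically. The sketch below works with a generic symmetric matrix and vector in place of Λ^{⊥} and b_{p}, locates the optimal multiplier by bisection, and compares the primal and dual values. It assumes the generic (so-called easy) case in which b_{p} has a nonzero component along the eigenvector of the smallest eigenvalue; the function name is hypothetical.

```python
import numpy as np

def trust_region_dual(b, Lam, radius, iters=100):
    """Solve min_r b^T r + 0.5 r^T Lam r subject to ||r|| = radius via (44)."""
    mu, U = np.linalg.eigh(Lam)   # eigenvalues in ascending order
    c = U.T @ b
    gaps = mu - mu[0]             # nonnegative shifts mu_i - mu_min

    def step_norm(t):
        # ||(Lam + sigma I)^{-1} b|| with sigma = t - mu_min and t > 0.
        return np.linalg.norm(c / (gaps + t))

    lo, hi = 0.0, 1.0
    while step_norm(hi) > radius:  # step_norm is decreasing in t
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if step_norm(mid) > radius:
            lo = mid
        else:
            hi = mid
    t = hi
    sigma = t - mu[0]              # optimal multiplier, sigma > -mu_min
    r = -U @ (c / (gaps + t))
    primal = b @ r + 0.5 * r @ Lam @ r
    dual = -0.5 * np.sum(c**2 / (gaps + t)) - 0.5 * sigma * radius**2
    return sigma, r, primal, dual

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
Lam_demo = M @ M.T / 6.0
b_demo = rng.standard_normal(6)
sigma, r, primal, dual = trust_region_dual(b_demo, Lam_demo, radius=1.0)
print("||r|| =", np.linalg.norm(r), "| primal =", primal, "| dual =", dual)
```

The two values agree because, at the optimal multiplier, the minimizer of the Lagrangian over the unconstrained vector already satisfies the norm constraint.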

Next, the objective is to simplify the obtained AO formulation over the optimization vector

u \in R^{n_{t}}

. Based on the property stated in [20] (Lemma 4), the optimization over the vector

u

can be expressed as follows:

\begin{matrix} I_{p}^{★} & = max_{\begin{matrix} u \in C_{t} \end{matrix}} r_{t} h^{⊤} u + q_{t} u^{⊤} s_{t} - \sum_{i = 1}^{n_{t}} ℓ^{★} (y_{i}; u_{i}) - \frac{{∥ u ∥}^{2}}{2} T_{p, g} (σ_{t}) \\ = \sum_{i = 1}^{n_{t}} M_{ℓ (y_{i}, .)} (r_{t} h_{i} + q_{t} s_{t, i}; T_{p, g} (σ_{t})) . \end{matrix}

This result is valid on events with probability going to one as p goes to

+ \infty

. Here, the function

M_{ℓ (y_{i}, .)}

is the Moreau envelope function defined in (15). The proof of this property is omitted since it follows the same ideas as [20] (Lemma 4). The main idea in [20] (Lemma 4) is to use Assumption 3 to show that the optimal solution of the unconstrained version of the maximization problem is bounded asymptotically and then to use the property introduced in [25] (Example 11.26) to complete the proof. Now, our auxiliary formulation can be asymptotically simplified to a scalar optimization problem as follows:

\begin{matrix} min_{\begin{matrix} (q_{t}, r_{t}) \in T_{2} \end{matrix}} sup_{σ_{t} > - μ_{p}} \frac{λ}{2} (q_{t}^{2} + r_{t}^{2}) - \frac{σ_{t} r_{t}^{2}}{2} - \frac{1}{2} q_{t}^{2} Z_{p, t} (σ_{t}) - \frac{1}{2} Z_{p, s} (σ_{t}) \\ + \frac{1}{p} \sum_{i = 1}^{n_{t}} M_{ℓ (y_{i}, .)} (r_{t} h_{i} + q_{t} s_{t, i}; T_{p, g} (σ_{t})) + q_{t} Z_{p, t s} (σ_{t}), \end{matrix}

(46)

where the functions

Z_{p, t} (\cdot)

,

Z_{p, t s} (\cdot)

, and

Z_{p, s} (\cdot)

are defined as follows:

\begin{matrix} Z_{p, t} (σ_{t}) = T_{p, t} (σ_{t}) - V_{p, t}, Z_{p, t s} (σ_{t}) = T_{p, t s} (σ_{t}) - V_{p, t s} \end{matrix}

(47)

\begin{matrix} Z_{p, s} (σ_{t}) = T_{p, s} (σ_{t}) - V_{p, s}, where V_{p, s} = {\hat{w}}_{s}^{⊤} Λ {\hat{w}}_{s} . \end{matrix}

(48)

Note that the auxiliary formulation in (46) now has scalar optimization variables. Then, it remains to study its asymptotic properties. We refer to this problem as the target scalar formulation.
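The step that replaces the maximization over u with a sum of Moreau envelopes can be checked on a scalar example. The sketch below assumes the standard form of the Moreau envelope, M (x; τ) = min_{v} ℓ (v) + {(v - x)}^{2} / (2 τ), which may differ from (15) in how the second argument is parameterized, and it uses the squared loss ℓ (v) = \frac{1}{2} {(y - v)}^{2}, whose convex conjugate is u y + \frac{1}{2} u^{2}. Both sides are approximated on a grid; the function names are hypothetical.

```python
import numpy as np

def moreau_envelope_grid(loss, x, tau, grid):
    """Approximate min_v loss(v) + (v - x)^2 / (2 tau) on a grid."""
    return np.min(loss(grid) + (grid - x) ** 2 / (2.0 * tau))

def dual_value_grid(conjugate, x, tau, grid):
    """Approximate max_u u x - conjugate(u) - tau u^2 / 2 on a grid."""
    return np.max(grid * x - conjugate(grid) - 0.5 * tau * grid**2)

x, y, tau = 0.3, 1.0, 0.5
grid = np.linspace(-5.0, 5.0, 200_001)

envelope = moreau_envelope_grid(lambda v: 0.5 * (y - v) ** 2, x, tau, grid)
dual = dual_value_grid(lambda u: u * y + 0.5 * u**2, x, tau, grid)
closed_form = (x - y) ** 2 / (2.0 * (1.0 + tau))  # valid for the squared loss only

print(envelope, dual, closed_form)
```

In the AO problem, the same identity is applied coordinate by coordinate with x = r_{t} h_{i} + q_{t} s_{t, i} and τ = T_{p, g} (σ_{t}).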

7.3.3. Asymptotic Analysis of the Target Scalar Formulation

In this part, we study the asymptotic properties of the target scalar formulation expressed in (46). We start our analysis by studying the asymptotic properties of the sequence of random functions

T_{p, g} (.)

,

Z_{p, t} (.)

,

Z_{p, s} (.)

, and

Z_{p, t s} (.)

as given in the following lemma.

Lemma 2

(Asymptotic Properties). First, the random variable

μ_{p}

converges in probability to

μ_{\min}

, where

μ_{\min}

is defined in Assumption 5. For any fixed

σ > 0

, the following convergence in probability holds true:

\begin{matrix} \{\begin{matrix} Z_{p, t} (σ - μ_{p}) \overset{p \to + \infty}{\to} Z_{t} (σ - μ_{\min}) \\ Z_{p, t s} (σ - μ_{p}) \overset{p \to + \infty}{\to} Z_{t s} (σ - μ_{\min}) \\ Z_{p, s} (σ - μ_{p}) \overset{p \to + \infty}{\to} Z_{s} (σ - μ_{\min}) \\ T_{p, g} (σ - μ_{p}) \overset{p \to + \infty}{\to} T_{g} (σ - μ_{\min}) = T_{1} (σ - μ_{\min}) . \end{matrix} \end{matrix}

Here, the deterministic functions

Z_{t} (.)

,

Z_{t s} (.)

,

Z_{s} (.)

,

T_{1} (.)

, and

T_{3} (.)

are defined as follows:

\begin{matrix} \{\begin{matrix} Z_{t} (σ) = σ - 1 / T_{1} (σ), Z_{t s} (σ) = ρ q_{s}^{★} Z_{t} (σ) \\ Z_{s} (σ) = ((1 - ρ^{2}) {(q_{s}^{★})}^{2} + {(r_{s}^{★})}^{2}) T_{3} (σ) + {(ρ q_{s}^{★})}^{2} Z_{t} (σ) \\ T_{1} (σ) = E_{μ} [1 / (μ + σ)], T_{3} (σ) = - E_{μ} [μ σ / (μ + σ)] . \end{matrix} \end{matrix}

Moreover, the constants

q_{s}^{★}

and

r_{s}^{★}

are optimal solutions of the source asymptotic formulation defined in (32).

A detailed proof of Lemma 2 is provided in Appendix C. Now that we have obtained the asymptotic properties of these sequences of random functions, it remains to study the asymptotic properties of the optimal cost and optimal solution set of the scalar formulation in (46). To state our first asymptotic result, we define the following deterministic optimization problem:

\begin{matrix} min_{\begin{matrix} (q_{t}, r_{t}) \in T_{2} \end{matrix}} sup_{σ_{t} > - μ_{\min}} \frac{λ}{2} (q_{t}^{2} + r_{t}^{2}) - \frac{σ_{t} r_{t}^{2}}{2} - \frac{1}{2} Z_{s} (σ_{t}) - \frac{1}{2} q_{t}^{2} Z_{t} (σ_{t}) \\ + α_{t} E [M_{ℓ (Y_{t}, .)} (r_{t} H_{t} + q_{t} S_{t}; T_{g} (σ_{t}))] + q_{t} Z_{t s} (σ_{t}), \end{matrix}

(49)

where

H_{t}

and

S_{t}

are two independent standard Gaussian random variables and

Y_{t} = φ (S_{t})

. Here, the function

M_{ℓ (Y_{t}, .)}

denotes the Moreau envelope function defined in (15) and the expectation is taken over the random variables

H_{t}

and

S_{t}

, and the possibly random function

φ (.)

. Now, we are ready to state our asymptotic property of the cost function of (46).

Lemma 3

(Cost Function of the Target AO Formulation). Define

O_{p, t} (.)

as the cost function of the target scalar optimization problem given in (46). Additionally, define

O_{t} (.)

as the cost function of the deterministic formulation in (49). Then, the following convergence in probability holds true:

\begin{matrix} O_{p, t} (q_{t}, r_{t}, σ_{t} - μ_{p}) \overset{p \to + \infty}{\to} O_{t} (q_{t}, r_{t}, σ_{t} - μ_{\min}), \end{matrix}

(50)

for any fixed feasible

q_{t}

,

r_{t}

, and

σ_{t} > 0

.

The proof of the asymptotic property stated in Lemma 3 uses the asymptotic results stated in Lemma 2. Moreover, it uses the weak law of large numbers to show that the empirical mean of the Moreau envelope concentrates around its expected value. Based on Assumption 3, one can see that the following pointwise convergence is valid:

\begin{matrix} \frac{1}{n_{t}} \sum_{i = 1}^{n_{t}} M_{ℓ (y_{i}, .)} (r_{t} h_{i} + q_{t} s_{t, i}; x) \overset{p \to + \infty}{\to} E [M_{ℓ (Y, .)} (r_{t} H + q_{t} S; x)], \end{matrix}

where H and S are independent standard Gaussian random variables and

Y = φ (S)

. The above property is valid for any

x > 0

,

r_{t} \geq 0

, and

q_{t}

. Based on [25] (Theorem 2.26), the Moreau envelope function is convex and continuously differentiable with respect to

x > 0

. Combining this with [29] (Theorem 7.46), the above asymptotic function is continuous in

x > 0

. Then, using Lemma 2, the uniform convergence, and the continuity property, we conclude that the empirical average of the Moreau envelope converges in probability to the following function:

\begin{matrix} E [M_{ℓ (Y, .)} (r_{t} H + q_{t} S; T_{g} (σ_{t} - μ_{\min}))], \end{matrix}

(51)

for any fixed feasible

q_{t}

,

r_{t}

, and

σ_{t} > 0

. This completes the proof of Lemma 3.

Before continuing our analysis, we provide the convexity properties of the cost function of the deterministic problem in (49) in the following lemma.

Lemma 4

(Strong Convexity). Define

O_{t} (\cdot, \cdot, \cdot)

as the cost function of the optimization problem in (49). Then,

O_{t} (\cdot, \cdot, \cdot)

is concave in the maximization variable

σ_{t}

for any fixed feasible

(q_{t}, r_{t})

. Moreover, define the function

O_{t} (\cdot, \cdot)

as follows:

\begin{matrix} O_{t} (q_{t}, r_{t}) = sup_{σ_{t} > - μ_{\min}} O_{t} (q_{t}, r_{t}, σ_{t}) . \end{matrix}

(52)

Then, the function

O_{t} (\cdot, \cdot)

is strongly convex in the minimization variables

(q_{t}, r_{t})

.

The proof of Lemma 4 is provided in Appendix D. Now, we use these properties to show that the optimal solution set of the formulation in (46) converges in probability to the optimal solution set of the formulation in (49).

Lemma 5

(Consistency of the Target AO Formulation). Define

P_{p, t}

and

P_{t}

as the optimal set of

(q_{t}, r_{t})

of the optimization problems formulated in (46) and (49). Moreover, define

O_{p, t}^{★}

and

O_{t}^{★}

as the optimal cost values of the optimization problems formulated in (46) and (49). Then, the following convergence in probability holds true:

\begin{matrix} O_{p, t}^{★} \overset{p \to + \infty}{\to} O_{t}^{★}, D (P_{p, t}, P_{t}) \overset{p \to + \infty}{\to} 0, \end{matrix}

(53)

where

D (A, B)

denotes the deviation between the sets

A

and

B

and is defined as

D (A, B) = {sup}_{c_{1} \in A} {inf}_{c_{2} \in B} ∥ c_{1} - c_{2} ∥

.
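To make the set deviation used in Lemma 5 concrete, the following Python sketch evaluates D(A, B) for finite point sets. The function name is hypothetical, and the restriction to finite sets is an assumption made so that the supremum and infimum become a maximum and a minimum.

```python
import numpy as np

def deviation(A, B):
    """D(A, B) = sup_{a in A} inf_{b in B} ||a - b|| for finite sets (rows are points)."""
    A = np.atleast_2d(np.asarray(A, dtype=float))
    B = np.atleast_2d(np.asarray(B, dtype=float))
    # Pairwise Euclidean distances between every point of A and every point of B.
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return float(dists.min(axis=1).max())

A = [[0.0, 0.0], [1.0, 0.0]]
B = [[0.0, 0.0]]
print(deviation(A, B))  # 1.0
print(deviation(B, A))  # 0.0
```

The example shows that the deviation is not symmetric: D(A, B) vanishes as soon as every point of A is arbitrarily close to B, which is the sense in which the optimal set of (46) approaches that of (49).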

The stated result can be proven by first observing that the loss function

O_{t} (.)

corresponding to the deterministic formulation in (49) satisfies the following:

\begin{matrix} lim_{σ_{t} \to + \infty} O_{t} (q_{t}, r_{t}, σ_{t} - μ_{\min}) = - \infty \end{matrix}

(54)

for any

r_{t} > 0

and any fixed

q_{t}

. Combining this with the convergence result in Lemma 3, ref. [17] (Lemma B.1), and [17] (Lemma B.2), we obtain the following asymptotic result:

\begin{matrix} sup_{σ_{t} > 0} O_{p, t} (q_{t}, r_{t}, σ_{t} - μ_{p}) \overset{p \to + \infty}{\to} sup_{σ_{t} > 0} O_{t} (q_{t}, r_{t}, σ_{t} - μ_{\min}) . \end{matrix}

Here, the results in [17] (Lemma B.1) and [17] (Lemma B.2) provide convergence properties of minimization problems over open sets. Note that, if

r_{t} = 0

, the supremum in the above convergence result occurs at

σ_{t} \to + \infty

. However, it can be checked that the above convergence result still holds. Based on Lemma 4, the cost function of the minimization problem in (49) is strongly convex in

(q_{t}, r_{t})

. Moreover, the feasibility set of the minimization problem is convex and compact. Additionally, the cost function of the minimization problem in (49) is continuous in the feasibility set. Then, using the results in [30] (Theorem II.1) and [31] (Theorem 2.1), we obtain the convergence properties stated in Lemma 5. Here, the results in [30] (Theorem II.1) and [31] (Theorem 2.1) provide uniform convergence and consistency properties of convex optimization problems.

Now that we have obtained the asymptotic problem, it remains to study the asymptotic properties of the training and generalization errors corresponding to the target formulation in (8).

7.3.4. Specialization to Hard Formulation

Before starting the analysis of the generalization error, we specialize our general analysis to the hard transfer formulation. First, note that

δ = 1

implies that the hard transfer formulation is equivalent to the source formulation. Next, we assume that

δ < 1

. To obtain the asymptotic limit of the hard formulation, we specialize the general results in (49) to the following probability distribution:

\begin{matrix} P_{μ} (μ) = \{\begin{matrix} 0 & \text{with probability } (1 - δ) \\ + \infty & \text{with probability } δ . \end{matrix} \end{matrix}

(55)

Note that the probability distribution in (55) satisfies Assumption 5. Then, the asymptotic limit of the soft formulation corresponding to the probability distribution

P_{μ} (.)

, defined in (55), can be expressed as follows:

\begin{matrix} min_{\begin{matrix} (q_{t}, r_{t}) \in T_{2} \end{matrix}} sup_{σ_{t} > 0} & \frac{λ}{2} (q_{t}^{2} + r_{t}^{2}) + \frac{σ_{t} δ}{2} ((1 - ρ^{2}) {(q_{s}^{★})}^{2} + {(r_{s}^{★})}^{2}) \\ + α_{t} E [M_{ℓ (Y_{t}, .)} (r_{t} H_{t} + q_{t} S_{t}; \frac{1 - δ}{σ_{t}})] - \frac{σ_{t} r_{t}^{2}}{2} \\ + \frac{σ_{t} δ}{2 (1 - δ)} {(q_{t} - ρ q_{s}^{★})}^{2} . \end{matrix}

(56)

This shows that the asymptotic limit of the hard formulation is the deterministic problem (56).
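The specialization above can be checked numerically. The Python sketch below evaluates the functions T_1, T_3, and Z_t from Lemma 2 for a two-point distribution in which a large finite atom stands in for the atom at +∞ (this substitution, the helper names, and the numerical values are assumptions made for illustration). The outputs should approach (1 − δ)/σ, −δσ, and −σδ/(1 − δ), which are the quantities that appear in (56).

```python
import numpy as np

def T1(sigma, atoms, probs):
    """T_1(sigma) = E_mu[1 / (mu + sigma)] for a discrete distribution."""
    return float(np.sum(probs / (atoms + sigma)))

def T3(sigma, atoms, probs):
    """T_3(sigma) = -E_mu[mu * sigma / (mu + sigma)] for a discrete distribution."""
    return float(-np.sum(probs * atoms * sigma / (atoms + sigma)))

def Zt(sigma, atoms, probs):
    """Z_t(sigma) = sigma - 1 / T_1(sigma)."""
    return sigma - 1.0 / T1(sigma, atoms, probs)

delta, sigma = 0.3, 2.0            # hypothetical transfer rate and dual variable
atoms = np.array([0.0, 1e12])      # 1e12 is a finite stand-in for the atom at +infinity
probs = np.array([1.0 - delta, delta])

print(T1(sigma, atoms, probs), (1.0 - delta) / sigma)
print(T3(sigma, atoms, probs), -delta * sigma)
print(Zt(sigma, atoms, probs), -sigma * delta / (1.0 - delta))
```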

7.3.5. Asymptotic Analysis of the Training and Generalization Errors

First, the generalization error corresponding to the target task is given by

\begin{matrix} E_{test} & = \frac{1}{4^{υ}} E [{(φ (a_{t, new}^{⊤} ξ_{t}) - \hat{φ} ({\hat{w}}_{t}^{⊤} a_{t, new}))}^{2}], \end{matrix}

(57)

where

a_{t, new}

is an unseen target feature vector. Now, consider the following two random variables

\begin{matrix} ν_{1} = a_{t, new}^{⊤} ξ_{t}, and ν_{2} = {\hat{w}}_{t}^{⊤} a_{t, new} . \end{matrix}

Given

{\hat{w}}_{t}

and

ξ_{t}

, the random variables

ν_{1}

and

ν_{2}

have a bivariate Gaussian distribution with zero mean vector and covariance matrix given as follows:

\begin{matrix} C_{p} = [\begin{matrix} {∥ ξ_{t} ∥}^{2} & ξ_{t}^{⊤} {\hat{w}}_{t} \\ ξ_{t}^{⊤} {\hat{w}}_{t} & {∥ {\hat{w}}_{t} ∥}^{2} \end{matrix}] . \end{matrix}

(58)

To precisely analyze the asymptotic behavior of the generalization error, it suffices to analyze the properties of the covariance matrix

C_{p}

. Define the random variables

{\hat{q}}_{p, t}^{★}

and

{\hat{r}}_{p, t}^{★}

for the target task as follows:

\begin{matrix} {\hat{q}}_{p, t}^{★} = ξ_{t}^{⊤} {\hat{w}}_{t}, and {\hat{r}}_{p, t}^{★} = ∥ {(B_{ξ_{t}}^{⊥})}^{⊤} {\hat{w}}_{t} ∥, \end{matrix}

(59)

where

B_{ξ_{t}}^{⊥}

is defined in Section 7.3.2. Then, the covariance matrix

C_{p}

given in (58) can be expressed as follows:

\begin{matrix} [\begin{matrix} 1 & {\hat{q}}_{p, t}^{★} \\ {\hat{q}}_{p, t}^{★} & {({\hat{q}}_{p, t}^{★})}^{2} + {({\hat{r}}_{p, t}^{★})}^{2} \end{matrix}] . \end{matrix}

Hence, to study the asymptotic properties of the generalization error, it suffices to study the asymptotic properties of the random quantities

{\hat{q}}_{p, t}^{★}

and

{\hat{r}}_{p, t}^{★}

.
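The reduction to two scalars can be illustrated with a small Monte Carlo experiment. The Python sketch below assumes a unit-norm target vector, takes the estimator's output function to be the identity, drops the normalization constant in (57), and uses a ReLU data-generating function with hypothetical values of the two scalars; all of these choices are assumptions made for illustration. It compares the simulated squared error with v − 2cq + q² + r², where v = E[Y²] and c = E[SY].

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
q, r = 0.4, 0.3                      # hypothetical overlap and orthogonal norm

S = rng.standard_normal(n)           # plays the role of nu_1
H = rng.standard_normal(n)           # independent component orthogonal to the target vector
nu1, nu2 = S, q * S + r * H          # covariance [[1, q], [q, q**2 + r**2]]

phi = lambda x: np.maximum(x, 0.0)   # example data-generating function (ReLU)
mc_error = float(np.mean((phi(nu1) - nu2) ** 2))

v = float(np.mean(phi(S) ** 2))      # estimate of E[Y^2]
c = float(np.mean(S * phi(S)))       # estimate of E[S Y]
formula = v - 2.0 * c * q + q ** 2 + r ** 2

empirical_cov = np.cov(nu1, nu2)
print(mc_error, formula)
print(empirical_cov)
```

Only q and r enter the formula, which is why the consistency of these two quantities is enough to characterize the generalization error.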

Lemma 6

(Consistency of the Target Formulation). The random quantities

{\hat{q}}_{p, t}^{★}

and

{\hat{r}}_{p, t}^{★}

satisfy the following asymptotic properties:

\begin{matrix} {\hat{q}}_{p, t}^{★} \overset{p \to + \infty}{\to} q_{t}^{★}, and {\hat{r}}_{p, t}^{★} \overset{p \to + \infty}{\to} r_{t}^{★}, \end{matrix}

where

q_{t}^{★}

and

r_{t}^{★}

are the optimal solutions of the deterministic formulation stated in (49).

To prove the above asymptotic result, we define

{\tilde{q}}_{p, t}^{★}

and

{\tilde{r}}_{p, t}^{★}

as follows:

\begin{matrix} {\tilde{q}}_{p, t}^{★} = ξ_{t}^{⊤} {\tilde{w}}_{t}, and {\tilde{r}}_{p, t}^{★} = ∥ {(B_{ξ_{t}}^{⊥})}^{⊤} {\tilde{w}}_{t} ∥, \end{matrix}

(60)

where

{\tilde{w}}_{t}

is the optimal solution of the auxiliary formulation in (39). Given the result in Lemma 5 and the analysis in Section 7.3.2 and Section 7.3.3, the convergence result in Lemma 5 is also satisfied by our auxiliary formulation in (39), i.e.,

\begin{matrix} {\tilde{q}}_{p, t}^{★} \overset{p \to + \infty}{\to} q_{t}^{★}, and {\tilde{r}}_{p, t}^{★} \overset{p \to + \infty}{\to} r_{t}^{★} . \end{matrix}

The rest of the proof of the convergence result stated in Lemma 6 is based on the CGMT framework, i.e., Theorem 2. Specifically, it follows after showing that the assumptions in Theorem 2 are all satisfied. First, we define the set

S_{p}

in Theorem 2 as follows:

\begin{matrix} S_{p} = \{w \in R^{p} : | ξ_{t}^{⊤} w - q_{t}^{★} | < ϵ\} \cup \{w \in R^{p} : | ∥ {(B_{ξ_{t}}^{⊥})}^{⊤} w ∥ - r_{t}^{★} | < ϵ\}, \end{matrix}

(61)

where

q_{t}^{★}

and

r_{t}^{★}

are the optimal solutions of the deterministic formulation stated in (49). Note that the cost function of the problem (49) is strongly convex in the minimization variables. Based on the analysis in the previous sections, note that the feasibility sets of the problems defined in Theorem 2 are compact asymptotically. Moreover, the analysis in the previous sections shows that there exists a constant

ϕ

such that the optimal cost

ϕ_{p}

defined in Theorem 2 converges in probability to

ϕ

as p goes to

+ \infty

. Additionally, the same analysis in the previous sections shows that there exists a constant

ϕ^{c}

such that the optimal cost

ϕ_{p}^{c}

defined in Theorem 2 converges in probability to

ϕ^{c}

as p goes to

+ \infty

. The strong convexity property of the cost function of the optimization problem in (49) can then be used to show that there exists

ζ > 0

such that

ϕ^{c} > ϕ + ζ

. This implies that the second assumption in Theorem 2 is satisfied for the considered set

S_{p}

and any fixed

ϵ > 0

. This then shows that the convergence results in Lemma 6 are all satisfied.

Note that the CGMT framework applied to prove Lemma 6 also shows that the optimal cost value of the soft target formulation in (8) converges in probability to the optimal cost value of the deterministic formulation given in (49). Combining this with the result in Lemma 6 shows the convergence property of the training error stated in (17). Now, it remains to show the convergence of the generalization error. It suffices to show that the generalization error defined in (57) is continuous in the quantities

{\hat{q}}_{p, t}^{★}

and

{\hat{r}}_{p, t}^{★}

. This follows based on Assumption 4 and the continuity under integral sign property [32]. This shows the convergence result in (18), which completes the proof of Theorem 1. Note that the above analysis of the soft target formulation in (8) is valid for any choice of

C_{q_{t}}

and

C_{r}

that satisfy the result in Lemma 1. One can ignore these bounds given the convexity properties of the deterministic formulation in (49). This leads to the scalar formulations introduced in (16) and (19).

7.4. Phase Transitions in Hard Formulation

In this part, we provide a rigorous proof of Proposition 1. Here, we consider the squared loss function. In this case, the deterministic source formulation given in (14) can be simplified as follows:

\begin{matrix} min_{\begin{matrix} q_{s}, r_{s} \geq 0 \end{matrix}} & \frac{1}{2} max {\{- r_{s} + \sqrt{α_{s}} {(q_{s}^{2} + r_{s}^{2} + v_{s} - 2 q_{s} c_{s})}^{\frac{1}{2}}, 0\}}^{2} \\ + \frac{λ}{2} (q_{s}^{2} + r_{s}^{2}), \end{matrix}

(62)

where the constants

v_{s}

and

c_{s}

are defined as

v_{s} = E [Y_{s}^{2}]

and

c_{s} = E [S_{s} Y_{s}]

,

Y_{s} = φ (S_{s})

, and

S_{s}

is a standard Gaussian random variable. Additionally, the target scalar formulation given in (16) can be simplified as follows:

\begin{matrix} min_{\begin{matrix} q_{t}, r_{t} \geq 0 \end{matrix}} sup_{σ_{t} > 0} & \frac{λ}{2} (q_{t}^{2} + r_{t}^{2}) + \frac{σ_{t} δ}{2} ((1 - ρ^{2}) {(q_{s}^{★})}^{2} + {(r_{s}^{★})}^{2}) \\ + \frac{α_{t} σ_{t}}{2 (1 - δ) + 2 σ_{t}} (r_{t}^{2} + q_{t}^{2} + v_{t} - 2 q_{t} c_{t}) - \frac{σ_{t} r_{t}^{2}}{2} \\ + \frac{σ_{t} δ}{2 (1 - δ)} {(q_{t} - ρ q_{s}^{★})}^{2}, \end{matrix}

(63)

where the constants

v_{t}

and

c_{t}

are defined as

v_{t} = E [Y_{t}^{2}]

and

c_{t} = E [Y_{t} S_{t}]

,

Y_{t} = φ (S_{t})

, and

S_{t}

is a standard Gaussian random variable. Under the conditions stated in Proposition 1, the source deterministic formulation given in (62) can be simplified as follows:

\begin{matrix} min_{\begin{matrix} q_{s}, r_{s} \geq 0 \end{matrix}} & - r_{s} + \sqrt{α_{s}} {(q_{s}^{2} + r_{s}^{2} + v_{s} - 2 q_{s} c_{s})}^{\frac{1}{2}} . \end{matrix}

(64)

Note that one can easily solve for the variables

q_{s}

and

r_{s}

. Specifically, the optimal solutions of (64) can be expressed as follows:

\begin{matrix} q_{s}^{★} = c_{s}, and r_{s}^{★} = \sqrt{v_{s} - c_{s}^{2}} / \sqrt{α_{s} - 1} . \end{matrix}

(65)
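As a sanity check of the closed-form solution (65), the Python sketch below compares it with a brute-force grid search over the objective in (64). The constants are hypothetical and chosen so that the sampling ratio exceeds one and v_s > c_s², which the square roots in (65) require; the function names are illustrative.

```python
import numpy as np

def source_objective(q, r, alpha_s, v_s, c_s):
    """Objective of (64): -r + sqrt(alpha_s) * sqrt(q^2 + r^2 + v_s - 2 q c_s)."""
    return -r + np.sqrt(alpha_s) * np.sqrt(q ** 2 + r ** 2 + v_s - 2.0 * q * c_s)

def source_closed_form(alpha_s, v_s, c_s):
    """Closed-form minimizer stated in (65)."""
    return c_s, np.sqrt(v_s - c_s ** 2) / np.sqrt(alpha_s - 1.0)

alpha_s, v_s, c_s = 3.0, 1.5, 1.0    # hypothetical constants for illustration
q_star, r_star = source_closed_form(alpha_s, v_s, c_s)
best = float(source_objective(q_star, r_star, alpha_s, v_s, c_s))

# Brute-force search on a grid that respects the constraint r >= 0.
qs = np.linspace(-1.0, 3.0, 401)
rs = np.linspace(0.0, 3.0, 301)
Q, R = np.meshgrid(qs, rs)
grid_min = float(source_objective(Q, R, alpha_s, v_s, c_s).min())

print(q_star, r_star, best, grid_min)
```

Because the objective is the sum of a linear term and a convex norm-like term, the stationary point in (65) is a global minimizer, which is consistent with the grid search never falling below the closed-form value.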

Moreover, the target deterministic formulation given in (63) can be expressed as follows:

\begin{matrix} min_{\begin{matrix} q_{t}, r_{t} \geq 0 \end{matrix}} sup_{σ_{t} > 0} & \frac{σ_{t} δ}{2} β_{2} + \frac{α_{t} σ_{t}}{2 (1 - δ) + 2 σ_{t}} (r_{t}^{2} + q_{t}^{2} + v_{t} - 2 q_{t} c_{t}) - \frac{σ_{t} r_{t}^{2}}{2} \\ + \frac{σ_{t} δ}{2 (1 - δ)} {(q_{t} - β_{1})}^{2}, \end{matrix}

(66)

where

β_{1}

and

β_{2}

are given by

\begin{matrix} β_{1} = ρ q_{s}^{★}, β_{2} = ((1 - ρ^{2}) {(q_{s}^{★})}^{2} + {(r_{s}^{★})}^{2}) . \end{matrix}

(67)

Before solving the optimization problem in (66), we consider the following change of variable:

\begin{matrix} x_{t}^{2} = r_{t}^{2} - δ β_{2} - \frac{δ}{1 - δ} {(q_{t} - β_{1})}^{2} . \end{matrix}

(68)

Note that this change of variable is valid since the formulation in (66) requires the right-hand side of (68) to be nonnegative; otherwise, the inner supremum is unbounded. Therefore, the formulation in (66) can be expressed in terms of

x_{t}

instead of

r_{t}

as follows:

\begin{matrix} min_{\begin{matrix} q_{t}, x_{t} \geq 0 \end{matrix}} sup_{σ_{t} > 0} & \frac{α_{t} σ_{t}}{2 (1 - δ) + 2 σ_{t}} (x_{t}^{2} + δ β_{2} + \frac{δ}{1 - δ} {(q_{t} - β_{1})}^{2} + q_{t}^{2} + v_{t} - 2 q_{t} c_{t}) - \frac{σ_{t} x_{t}^{2}}{2} . \end{matrix}

(69)

Now, it can be easily checked that the above optimization problem can be solved over the variable

σ_{t}

to give the following formulation:

\begin{matrix} min_{\begin{matrix} q_{t}, x_{t} \geq 0 \end{matrix}} & \frac{1}{2} max {\{- x_{t} \sqrt{1 - δ} + \sqrt{α_{t}} {(x_{t}^{2} + δ β_{2} + \frac{δ}{1 - δ} {(q_{t} - β_{1})}^{2} + q_{t}^{2} + v_{t} - 2 q_{t} c_{t})}^{\frac{1}{2}}, 0\}}^{2} . \end{matrix}

It is now clear that one can solve the problem in (69) in closed form. Moreover, it can be easily checked that the optimal solutions of the optimization problem (66) can be expressed as follows:

\begin{matrix} \{\begin{matrix} q_{t}^{★} = (1 - δ) c_{t} + δ β_{1} \\ {(r_{t}^{★})}^{2} = \frac{1 - δ}{α_{t} + δ - 1} ((δ - 1) c_{t}^{2} + δ β_{1}^{2} + δ β_{2} + v_{t} - 2 δ β_{1} c_{t}) + δ β_{2} + δ (1 - δ) {(c_{t} - β_{1})}^{2} . \end{matrix} \end{matrix}

Then, the asymptotic limit of the generalization error corresponding to the hard formulation can be determined in closed-form. Given that the source and target models given in (1) and (2) use the same data-generating function, the constants

v_{t}

,

c_{t}

,

v_{s}

, and

c_{s}

are all equal. We express them as v and c in the rest of the proof.

Next, we assume that the function

\hat{φ} (.)

is the identity function. Based on the asymptotic result stated in Corollary 1, the asymptotic limit of the generalization error corresponding to the hard formulation can be expressed as follows:

\begin{matrix} E_{test} & = v - 2 c q_{t}^{★} + {(q_{t}^{★})}^{2} + {(r_{t}^{★})}^{2} . \end{matrix}

It can be easily checked that the generalization error can be expressed as follows:

\begin{matrix} E_{test} & = \frac{α_{t}}{α_{t} + δ - 1} (δ \{{(c - β_{1})}^{2} + β_{2}\} + (v - c^{2})) . \end{matrix}

(70)

Note that the generalization error obtained above depends explicitly on

δ

. Now, it suffices to study the derivative of

E_{test}

to find the properties of the optimal transfer rate

δ

that minimizes the generalization error. Note that the derivative can be expressed as follows:

\begin{matrix} E_{test}^{'} (δ) = \frac{α_{t} [(α_{t} - 1) \{{(c - β_{1})}^{2} + β_{2}\} - (v - c^{2})]}{{(α_{t} + δ - 1)}^{2}} . \end{matrix}

(71)

This shows that the derivative of the generalization error has the same sign as the numerator. This means that the optimal transfer rate satisfies the following:

\begin{matrix} δ^{★} = \{\begin{matrix} 1 & \text{if } Z_{t} < 0 \\ 0 & \text{if } Z_{t} > 0 \\ [0, 1] & \text{otherwise}, \end{matrix} \end{matrix}

(72)

where

Z_{t}

is given by

\begin{matrix} Z_{t} = (α_{t} - 1) \{{(c - β_{1})}^{2} + β_{2}\} - (v - c^{2}) . \end{matrix}

(73)

It can be easily shown that the condition in (72) can be expressed as the one given in (20). This completes the proof of Proposition 1.
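The phase transition can be made explicit by substituting (65) and (67) into (73), which gives Z_t = (α_t − 1)(2c²(1 − ρ) + (r_s^★)²) − (v − c²). The Python sketch below evaluates this quantity and the similarity level at which it changes sign. The closed-form threshold in the code is derived here from (65), (67), and (73) and has not been checked against the exact form of (20); the function names and constants are hypothetical, and both sampling ratios are assumed to exceed one with c ≠ 0.

```python
def z_t(rho, alpha_s, alpha_t, v, c):
    """Z_t of (73) with q_s* and r_s* from (65) and beta_1, beta_2 from (67)."""
    r_s2 = (v - c ** 2) / (alpha_s - 1.0)        # (r_s*)^2
    beta1 = rho * c                               # rho * q_s* with q_s* = c
    beta2 = (1.0 - rho ** 2) * c ** 2 + r_s2
    return (alpha_t - 1.0) * ((c - beta1) ** 2 + beta2) - (v - c ** 2)

def optimal_delta(rho, alpha_s, alpha_t, v, c):
    """Optimal transfer rate according to (72); None means any value in [0, 1]."""
    z = z_t(rho, alpha_s, alpha_t, v, c)
    if z < 0:
        return 1.0
    if z > 0:
        return 0.0
    return None

def rho_threshold(alpha_s, alpha_t, v, c):
    """Similarity at which Z_t = 0 (derived for this sketch, assumes c != 0)."""
    return 1.0 - (v - c ** 2) / (2.0 * c ** 2) * (
        1.0 / (alpha_t - 1.0) - 1.0 / (alpha_s - 1.0))

alpha_s, alpha_t, v, c = 3.0, 2.0, 1.5, 1.0       # hypothetical constants
rho_c = rho_threshold(alpha_s, alpha_t, v, c)
print(rho_c)
for rho in (0.80, 0.85, 0.90, 0.95):
    print(rho, optimal_delta(rho, alpha_s, alpha_t, v, c))
```

In this example, similarities below the threshold give an optimal transfer rate of zero (negative transfer), and similarities above it give an optimal transfer rate of one.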

7.5. Sufficient Condition for the Hard Formulation

In this part, we provide a rigorous proof of Proposition 2. Suppose that the assumptions in Proposition 2 are all satisfied. Additionally, we assume that the function

\hat{φ} (.)

is the sign function. Based on the asymptotic result stated in Corollary 1, the asymptotic limit of the generalization error corresponding to the hard formulation can be expressed as follows:

\begin{matrix} E_{test} (δ) & = \frac{1}{π} acos (\frac{q_{t} (δ)}{\sqrt{{(q_{t} (δ))}^{2} + {(r_{t} (δ))}^{2}}}), \end{matrix}

(74)

where

q_{t} (δ)

and

r_{t} (δ)

are optimal solutions to the deterministic problem in (19) for fixed

δ

. A simple sufficient condition for positive transfer is when

E_{test} (δ)

is decreasing at

δ = 0

. This means that there exists some

δ > 0

such that the transfer learning method introduced in (6) is better than the standard method when the following function increases at

δ = 0

:

\begin{matrix} g (δ) = \frac{q_{t} (δ)}{\sqrt{{(q_{t} (δ))}^{2} + {(r_{t} (δ))}^{2}}} . \end{matrix}

(75)
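The expression in (74) is the probability that two zero-mean jointly Gaussian variables with correlation g have opposite signs. The Python sketch below checks this by Monte Carlo, assuming that the labels are generated by the sign function (our reading of the setting of (74)) and using hypothetical values of q_t and r_t. It also illustrates that a larger value of g gives a smaller error, which is the monotonicity used in the argument above.

```python
import numpy as np

def classification_error(q, r):
    """Right-hand side of (74): arccos(q / sqrt(q^2 + r^2)) / pi."""
    return float(np.arccos(q / np.sqrt(q ** 2 + r ** 2)) / np.pi)

rng = np.random.default_rng(1)
n = 1_000_000
q, r = 0.6, 0.8                        # hypothetical solution values

S = rng.standard_normal(n)             # projection of the feature on the target vector
H = rng.standard_normal(n)             # independent orthogonal component
mismatch = float(np.mean(np.sign(S) != np.sign(q * S + r * H)))

print(mismatch, classification_error(q, r))
print(classification_error(0.7, 0.8) < classification_error(0.6, 0.8))
```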

After computing the derivative of the function

g (\cdot)

at zero, one can see that the transfer learning method introduced in (6) is better than the standard method when the following condition is true:

\begin{matrix} q_{t}^{'} (0) r_{t} (0) - q_{t} (0) r_{t}^{'} (0) > 0, \end{matrix}

(76)

where

q_{t} (0)

and

r_{t} (0)

denote the optimal solutions of the standard learning formulation (i.e.,

δ = 0

in (19)). Additionally,

q_{t}^{'} (0)

and

r_{t}^{'} (0)

denote the derivatives of the functions

q_{t} (δ)

and

r_{t} (δ)

at

δ = 0

. The above analysis shows that it suffices to find the values of

q_{t}^{'} (0)

and

r_{t}^{'} (0)

to fully characterize the sufficient condition in (76). Before stating our analysis, we define

β_{1}

and

β_{2}

as follows:

\begin{matrix} β_{1} = ((1 - ρ^{2}) {(q_{s}^{★})}^{2} + {(r_{s}^{★})}^{2}), β_{2} = ρ q_{s}^{★}, \end{matrix}

(77)

where

q_{s}^{★}

and

r_{s}^{★}

are the optimal solutions of the deterministic source formulation given in (14).

Note that the optimal solutions of the deterministic formulation in (19) satisfy the following system of equations:

\begin{matrix} \{\begin{matrix} α_{t} E (S M_{ℓ, 1}^{'} [r_{t} (δ) H + q_{t} (δ) S; \frac{1 - δ}{σ_{t} (δ)}]) + \frac{δ σ_{t} (δ)}{1 - δ} (q_{t} (δ) - β_{2}) + λ q_{t} (δ) = 0 \\ α_{t} E (H M_{ℓ, 1}^{'} [r_{t} (δ) H + q_{t} (δ) S; \frac{1 - δ}{σ_{t} (δ)}]) - σ_{t} (δ) r_{t} (δ) + λ r_{t} (δ) = 0 \\ \frac{δ}{2} β_{1} - \frac{α_{t} (1 - δ)}{σ_{t} {(δ)}^{2}} E (M_{ℓ, 2}^{'} [r_{t} (δ) H + q_{t} (δ) S; \frac{1 - δ}{σ_{t} (δ)}]) - \frac{r_{t} {(δ)}^{2}}{2} + \frac{δ}{2 (1 - δ)} {(q_{t} (δ) - β_{2})}^{2} = 0 . \end{matrix} \end{matrix}

The derivative of the first equation at

δ = 0

can be expressed as follows:

\begin{matrix} α_{t} E ((S H r_{t}^{'} (0) + S^{2} q_{t}^{'} (0)) M_{ℓ, 1}^{″} [r_{t} (0) H + q_{t} (0) S; \frac{1}{σ_{t} (0)}]) + σ_{t} (0) (q_{t} (0) - β_{2}) \\ - \frac{α_{t}}{σ_{t} {(0)}^{2}} (σ_{t} (0) + σ_{t}^{'} (0)) E (S M_{ℓ, 12}^{″} [r_{t} (0) H + q_{t} (0) S; \frac{1}{σ_{t} (0)}]) + λ q_{t}^{'} (0) = 0, \end{matrix}

(78)

where

q_{t} (0)

,

r_{t} (0)

, and

σ_{t} (0)

denote optimal solutions of the standard learning formulation (i.e.,

δ = 0

in (19)). This means that they are known. Moreover,

q_{t}^{'} (0)

,

r_{t}^{'} (0)

and

σ_{t}^{'} (0)

are unknown and denote the derivatives of the functions

q_{t} (δ)

,

r_{t} (δ)

, and

σ_{t} (δ)

at

δ = 0

. Now, define the constants

I_{11}

,

I_{12}

,

I_{13}

, and

I_{14}

as follows:

\begin{matrix} \{\begin{matrix} I_{11} = α_{t} E (S H M_{ℓ, 1}^{″} [r_{t} (0) H + q_{t} (0) S; \frac{1}{σ_{t} (0)}]) \\ I_{12} = α_{t} E (S^{2} M_{ℓ, 1}^{″} [r_{t} (0) H + q_{t} (0) S; \frac{1}{σ_{t} (0)}]) + λ \\ I_{13} = - \frac{α_{t}}{σ_{t} {(0)}^{2}} E (S M_{ℓ, 12}^{″} [r_{t} (0) H + q_{t} (0) S; \frac{1}{σ_{t} (0)}]) \\ I_{14} = - \frac{α_{t}}{σ_{t} (0)} E (S M_{ℓ, 12}^{″} [r_{t} (0) H + q_{t} (0) S; \frac{1}{σ_{t} (0)}]) + σ_{t} (0) (q_{t} (0) - β_{2}) . \end{matrix} \end{matrix}

(79)

This means that the equation in (78) can be expressed as follows:

\begin{matrix} I_{11} r_{t}^{'} (0) + I_{12} q_{t}^{'} (0) + I_{13} σ_{t}^{'} (0) + I_{14} = 0 . \end{matrix}

(80)

Similarly, the derivative of the second equation at

δ = 0

can be expressed as follows:

\begin{matrix} α_{t} E ((H^{2} r_{t}^{'} (0) + H S q_{t}^{'} (0)) M_{ℓ, 1}^{″} [r_{t} (0) H + q_{t} (0) S; \frac{1}{σ_{t} (0)}]) - σ_{t}^{'} (0) r_{t} (0) - σ_{t} (0) r_{t}^{'} (0) \\ - \frac{α_{t}}{σ_{t} {(0)}^{2}} (σ_{t} (0) + σ_{t}^{'} (0)) E (H M_{ℓ, 12}^{″} [r_{t} (0) H + q_{t} (0) S; \frac{1}{σ_{t} (0)}]) + λ r_{t}^{'} (0) = 0 . \end{matrix}

(81)

Now, define the constants

I_{21}

,

I_{22}

,

I_{23}

, and

I_{24}

as follows:

\begin{matrix} \{\begin{matrix} I_{21} = α_{t} E (H^{2} M_{ℓ, 1}^{″} [r_{t} (0) H + q_{t} (0) S; \frac{1}{σ_{t} (0)}]) - σ_{t} (0) + λ \\ I_{22} = α_{t} E (H S M_{ℓ, 1}^{″} [r_{t} (0) H + q_{t} (0) S; \frac{1}{σ_{t} (0)}]) \\ I_{23} = - \frac{α_{t}}{σ_{t} {(0)}^{2}} E (H M_{ℓ, 12}^{″} [r_{t} (0) H + q_{t} (0) S; \frac{1}{σ_{t} (0)}]) - r_{t} (0) \\ I_{24} = - \frac{α_{t}}{σ_{t} (0)} E (H M_{ℓ, 12}^{″} [r_{t} (0) H + q_{t} (0) S; \frac{1}{σ_{t} (0)}]) . \end{matrix} \end{matrix}

(82)

This means that the equation in (81) can be expressed as follows:

\begin{matrix} I_{21} r_{t}^{'} (0) + I_{22} q_{t}^{'} (0) + I_{23} σ_{t}^{'} (0) + I_{24} = 0 . \end{matrix}

(83)

Moreover, the derivative of the third equation at

δ = 0

can be expressed as follows:

\begin{matrix} \frac{β_{1}}{2} + \frac{α_{t}}{σ_{t} {(0)}^{3}} (σ_{t} (0) + 2 σ_{t}^{'} (0)) E (M_{ℓ, 2}^{'} [r_{t} (0) H + q_{t} (0) S; \frac{1}{σ_{t} (0)}]) - r_{t}^{'} (0) r_{t} (0) \\ - \frac{α_{t}}{σ_{t} {(0)}^{2}} E ((H r_{t}^{'} (0) + S q_{t}^{'} (0)) M_{ℓ, 21}^{″} [r_{t} (0) H + q_{t} (0) S; \frac{1}{σ_{t} (0)}]) + \frac{1}{2} {(q_{t} (0) - β_{2})}^{2} \\ + \frac{α_{t}}{σ_{t} {(0)}^{4}} (σ_{t} (0) + σ_{t}^{'} (0)) E (M_{ℓ, 2}^{″} [r_{t} (0) H + q_{t} (0) S; \frac{1}{σ_{t} (0)}]) = 0 . \end{matrix}

(84)

We define the constants

I_{31}

,

I_{32}

,

I_{33}

, and

I_{34}

as follows:

\begin{matrix} \{\begin{matrix} I_{31} = - \frac{α_{t}}{σ_{t} {(0)}^{2}} E (H M_{ℓ, 21}^{″} [r_{t} (0) H + q_{t} (0) S; \frac{1}{σ_{t} (0)}]) - r_{t} (0) \\ I_{32} = - \frac{α_{t}}{σ_{t} {(0)}^{2}} E (S M_{ℓ, 12}^{″} [r_{t} (0) H + q_{t} (0) S; \frac{1}{σ_{t} (0)}]) \\ I_{33} = \frac{2 α_{t}}{σ_{t} {(0)}^{3}} E (M_{ℓ, 2}^{'} [r_{t} (0) H + q_{t} (0) S; \frac{1}{σ_{t} (0)}]) + \frac{α_{t}}{σ_{t} {(0)}^{4}} E (M_{ℓ, 2}^{″} [r_{t} (0) H + q_{t} (0) S; \frac{1}{σ_{t} (0)}]) \\ I_{34} = \frac{α_{t}}{σ_{t} {(0)}^{2}} E (M_{ℓ, 2}^{'} [r_{t} (0) H + q_{t} (0) S; \frac{1}{σ_{t} (0)}]) + \frac{α_{t}}{σ_{t} {(0)}^{3}} E (M_{ℓ, 2}^{″} [r_{t} (0) H + q_{t} (0) S; \frac{1}{σ_{t} (0)}]) \\ + \frac{1}{2} {(q_{t} (0) - β_{2})}^{2} + \frac{β_{1}}{2} . \end{matrix} \end{matrix}

Therefore, the equation in (84) can be expressed as follows:

\begin{matrix} I_{31} r_{t}^{'} (0) + I_{32} q_{t}^{'} (0) + I_{33} σ_{t}^{'} (0) + I_{34} = 0 . \end{matrix}

(85)

The above analysis shows that the values of

q_{t}^{'} (0)

and

r_{t}^{'} (0)

can be determined after solving the following system of linear equations:

\begin{matrix} \{\begin{matrix} I_{11} r_{t}^{'} (0) + I_{12} q_{t}^{'} (0) + I_{13} σ_{t}^{'} (0) + I_{14} = 0 \\ I_{21} r_{t}^{'} (0) + I_{22} q_{t}^{'} (0) + I_{23} σ_{t}^{'} (0) + I_{24} = 0 \\ I_{31} r_{t}^{'} (0) + I_{32} q_{t}^{'} (0) + I_{33} σ_{t}^{'} (0) + I_{34} = 0, \end{matrix} \end{matrix}

(86)

over the three unknowns

q_{t}^{'} (0)

,

r_{t}^{'} (0)

, and

σ_{t}^{'} (0)

. This completes the proof of Proposition 2.
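Once the twelve constants have been evaluated, the sufficient condition reduces to a 3 × 3 linear solve followed by the sign test in (76). The Python sketch below shows this last step. The coefficient matrix contains placeholder numbers rather than values computed from a specific loss, and the function names and the values of q_t(0) and r_t(0) are hypothetical.

```python
import numpy as np

def transfer_derivatives(I):
    """Solve (86) for (r_t'(0), q_t'(0), sigma_t'(0)).

    I is a 3x4 array whose row k holds (I_k1, I_k2, I_k3, I_k4), so that
    I_k1 * r' + I_k2 * q' + I_k3 * sigma' + I_k4 = 0 for each row.
    """
    I = np.asarray(I, dtype=float)
    r_prime, q_prime, sigma_prime = np.linalg.solve(I[:, :3], -I[:, 3])
    return r_prime, q_prime, sigma_prime

def positive_transfer_condition(q0, r0, q_prime, r_prime):
    """Condition (76): q_t'(0) r_t(0) - q_t(0) r_t'(0) > 0."""
    return bool(q_prime * r0 - q0 * r_prime > 0)

# Placeholder coefficients, for illustration only.
I = np.array([[0.2, 1.5, -0.1, -0.3],
              [1.2, 0.2, -0.8, 0.1],
              [-0.7, -0.1, 0.9, 0.4]])
q0, r0 = 0.5, 0.7                      # hypothetical standard-learning solution

r_prime, q_prime, sigma_prime = transfer_derivatives(I)
print(r_prime, q_prime, sigma_prime)
print(positive_transfer_condition(q0, r0, q_prime, r_prime))
```

The column order follows (80), (83), and (85), where the first coefficient multiplies the derivative of r_t and the second multiplies the derivative of q_t.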

8. Conclusions

In this paper, we presented a precise characterization of the asymptotic properties of two simple transfer learning formulations. Specifically, our results show that the training and generalization errors corresponding to the considered transfer formulations converge to deterministic functions. These functions can be explicitly found by combining the solutions of two deterministic scalar optimization problems. Our simulation results validate our theoretical predictions and reveal the existence of a phase transition phenomenon in the hard transfer formulation. Specifically, they show that the hard transfer formulation moves from negative transfer to positive transfer when the similarity of the source and target tasks moves past a well-defined critical threshold.

Author Contributions

Conceptualization, O.D. and Y.M.L.; methodology, O.D. and Y.M.L.; software, O.D.; validation, O.D. and Y.M.L.; formal analysis, O.D. and Y.M.L.; investigation, O.D. and Y.M.L.; resources, O.D. and Y.M.L.; data curation, O.D.; writing—original draft preparation, O.D.; writing—review and editing, O.D. and Y.M.L.; visualization, O.D.; supervision, Y.M.L.; project administration, Y.M.L.; funding acquisition, Y.M.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Harvard FAS Dean’s Fund for Promising Scholarship and by the US National Science Foundation under grants CCF-1718698 and CCF-1910410.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Technical Assumptions

Note that Assumption 1 is essential to show that the soft formulation in (4) concentrates in the large system limit. It also guarantees that the vectors

ξ_{t} \in R^{p}

and

ξ_{s} \in R^{p}

have correlations equal to

ρ

, asymptotically. This is aligned with the definition in (3). Assumption 4 is also introduced to guarantee that the generalization error concentrates in the large system limit. It is satisfied by popular regression and classification models. For instance, observe that the conditions in Assumption 4 are all satisfied by the regression model considering

φ : x \to max (x, 0)

. Moreover, they are satisfied by the binary classification model considering

φ : x \to sign (x)

.

The analysis presented in this paper mostly focuses on regularized transfer learning formulations (i.e.,

λ > 0

). The convexity properties in Assumption 3 are essential to apply the CGMT framework. Moreover, the properties in (12) are used to guarantee the compactness assumptions in the CGMT framework (see Theorem 2). In this appendix, we check the validity of Assumption 3 using popular loss functions, i.e., squared loss for regression tasks and logistic and hinge losses for binary classification tasks. To this end, assume that C is an arbitrary fixed positive constant.

Squared loss: It is easy to see that the squared loss is a proper strongly convex function in $R$, where 1 is a strong convexity parameter. Moreover, $L (\cdot)$ and its sub-differential set $\partial L (\cdot)$ can be expressed as follows:

$\begin{matrix} L (v) = \frac{1}{2} {∥ v - y ∥}^{2}, \quad \partial L (v) = \{v - y\}, \end{matrix}$

(A1)

where the vector $y$ is formed by the concatenation of $\{y_{i}\}_{i = 1}^{n_{t}}$. Then, there exists $R > 0$ such that

$\begin{matrix} sup_{∥ v ∥ \leq C \sqrt{n_{t}}} | L (v) | \leq R n_{t}, sup_{∥ v ∥ \leq C \sqrt{n_{t}}} sup_{s \in \partial L (v)} ∥ s ∥ = sup_{∥ v ∥ \leq C \sqrt{n_{t}}} ∥ v - y ∥ \leq R \sqrt{n_{t}}, \end{matrix}$

(A2)

with probability going to 1 as p grows to $+ \infty$. The inequality follows using the regularity condition in Assumption 4 and the weak law of large numbers. Then, the squared loss satisfies Assumption 3 for any $λ \geq 0$.

Logistic loss: Now, we consider the logistic loss applied to a binary classification model (i.e., $y_{i} \in \{- 1, 1\}$). Note that the logistic loss is a proper convex function in $R$. Moreover, $L (\cdot)$ and its sub-differential set $\partial L (\cdot)$ are given by

$L (v) = \sum_{i = 1}^{n_{t}} \log (1 + e^{- y_{i} v_{i}}), \quad \partial L (v) = \{x\}, \quad \text{where } x_{i} = \frac{- y_{i} e^{- y_{i} v_{i}}}{1 + e^{- y_{i} v_{i}}}, \ \forall i \in \{1, \dots, n_{t}\} .$

(A3)

First, observe that the loss $L (\cdot)$ satisfies the following inequality:

$\begin{matrix} | L (v) | \leq n_{t} + {∥ v ∥}_{1} . \end{matrix}$

(A4)

This means that there exists $R_{1} > 0$ such that the following inequality is valid:

$\begin{matrix} sup_{∥ v ∥ \leq C \sqrt{n_{t}}} | L (v) | \leq R_{1} n_{t} . \end{matrix}$

(A5)

Additionally, the following results hold true:

$\begin{matrix} sup_{∥ v ∥ \leq C \sqrt{n_{t}}} sup_{s \in \partial L (v)} ∥ s ∥ & = sup_{∥ v ∥ \leq C \sqrt{n_{t}}} ∥ x ∥ \leq {(\sum_{i = 1}^{n_{t}} y_{i}^{2})}^{\frac{1}{2}} . \end{matrix}$

(A6)

This means that there exists $R_{2} > 0$ such that the following inequality is valid:

$\begin{matrix} sup_{∥ v ∥ \leq C \sqrt{n_{t}}} sup_{s \in \partial L (v)} ∥ s ∥ \leq R_{2} \sqrt{n_{t}} . \end{matrix}$

(A7)

Then, there exists a universal constant $R > 0$ such that Assumption 3 is satisfied for the logistic loss for any $λ > 0$ .

Hinge loss: Finally, we consider the hinge loss applied to a binary classification model (i.e., $y_{i} \in \{- 1, 1\}$ ). It is clear that the hinge loss is a proper convex function on $\mathbb{R}$ . Moreover, $L (\cdot)$ is given by $L (v) = \sum_{i = 1}^{n_{t}} \max (1 - y_{i} v_{i}, 0)$ . Following [33], the sub-differential set $\partial L (\cdot)$ can be expressed as follows:

$\begin{matrix} \partial L (v) = \{\frac{1}{2} D (1 + g) : {∥ g ∥}_{\infty} \leq 1, g^{⊤} (D v + 1) = {∥ D v + 1 ∥}_{1}\}, \end{matrix}$

(A8)

where $D \in \mathbb{R}^{n_{t} \times n_{t}}$ is a diagonal matrix with diagonal entries $\{y_{i}\}_{i = 1}^{n_{t}}$ . Note that the loss function $L (\cdot)$ satisfies the following inequality:

$\begin{matrix} | L (v) | \leq n_{t} + {∥ v ∥}_{1} . \end{matrix}$

(A9)

This means that there exists $R_{1} > 0$ such that the following inequality is valid:

$\begin{matrix} sup_{∥ v ∥ \leq C \sqrt{n_{t}}} | L (v) | \leq R_{1} n_{t} . \end{matrix}$

(A10)

Moreover, the result in (A8) shows that any element $s$ in the sub-differential set $\partial L (v)$ satisfies the following:

$\begin{matrix} ∥ s ∥ \leq \frac{1}{2} \sqrt{n_{t}} + \frac{1}{2} ∥ g ∥ \leq \frac{1}{2} \sqrt{n_{t}} + \frac{1}{2} \sqrt{n_{t}} {∥ g ∥}_{\infty} \leq \sqrt{n_{t}} . \end{matrix}$

(A11)

This means that there exists $R_{2} > 0$ such that the following inequality is valid:

$\begin{matrix} sup_{∥ v ∥ \leq C \sqrt{n_{t}}} sup_{s \in \partial L (v)} ∥ s ∥ \leq R_{2} \sqrt{n_{t}} . \end{matrix}$

(A12)

Then, there exists a universal constant $R > 0$ such that Assumption 3 is satisfied for the hinge loss for any $λ > 0$ .
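
The bounds (A4), (A6), (A9) and (A11) can be checked numerically on random inputs. The following is a minimal sketch, not part of the original analysis: the helper names, the sample size and the Gaussian test vector are assumptions made for illustration, and the hinge subgradient used here is the elementary coordinate-wise one rather than the parametrization in (A8).

```python
import math
import random

def logistic_loss(v, y):
    """L(v) = sum_i log(1 + exp(-y_i v_i)), evaluated in a numerically stable way."""
    total = 0.0
    for vi, yi in zip(v, y):
        z = -yi * vi
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total

def logistic_grad(v, y):
    """Entries x_i = -y_i e^{-y_i v_i} / (1 + e^{-y_i v_i}), as in (A3)."""
    return [-yi * math.exp(-yi * vi) / (1.0 + math.exp(-yi * vi))
            for vi, yi in zip(v, y)]

def hinge_loss(v, y):
    """L(v) = sum_i max(1 - y_i v_i, 0)."""
    return sum(max(1.0 - yi * vi, 0.0) for vi, yi in zip(v, y))

def hinge_subgrad(v, y):
    """One valid subgradient: -y_i where the margin is violated, 0 elsewhere."""
    return [-yi if 1.0 - yi * vi > 0.0 else 0.0 for vi, yi in zip(v, y)]

def l2_norm(x):
    return math.sqrt(sum(t * t for t in x))

def bounds_hold(v, y, tol=1e-12):
    """Check |L(v)| <= n + ||v||_1 and ||s|| <= sqrt(n) for both losses."""
    n = len(v)
    l1 = sum(abs(t) for t in v)
    return (
        logistic_loss(v, y) <= n + l1 + tol
        and hinge_loss(v, y) <= n + l1 + tol
        and l2_norm(logistic_grad(v, y)) <= math.sqrt(n) + tol
        and l2_norm(hinge_subgrad(v, y)) <= math.sqrt(n) + tol
    )

random.seed(0)
n_t = 50  # hypothetical number of target samples
y = [random.choice([-1.0, 1.0]) for _ in range(n_t)]
v = [random.gauss(0.0, 3.0) for _ in range(n_t)]
print(bounds_hold(v, y))
```

Both losses are nonnegative, so the absolute values in (A4) and (A9) reduce to upper bounds, which is what `bounds_hold` tests.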

Appendix B. Proof of Lemma 1

Appendix B.1. Primal Compactness

We start our analysis by assuming that

λ > 0

. We first consider the compactness of the source problem given in (4). Note that the formulation in (4) has a unique optimal solution. Assume that

{\hat{w}}_{s, p} \in R^{p}

is the unique optimal solution of the optimization problem given in (4). The analysis in [20] (Lemma 1) can be used to prove that there exists

C_{1} > 0

such that the following inequality is valid:

\begin{matrix} {∥ {\hat{w}}_{s, p} ∥}^{2} \leq C_{1}, \end{matrix}

(A13)

with probability going to one as

p \to \infty

. Moreover, observe that the formulation in (33) has a unique optimal solution. Assume that

{\hat{w}}_{t, p} \in R^{p}

is the unique optimal solution of the optimization problem given in (33). Assumption 3 supposes that the loss function is proper. Then, we can conclude that there exists

C_{2} > 0

such that

\begin{matrix} ℓ (y, z) \geq - C_{2}, \forall z \in R . \end{matrix}

(A14)

Now, we define

O_{t, p}^{★}

as the optimal objective value of the formulation in (33). Then, we can see that there exists

C_{3} > 0

such that

\begin{matrix} \frac{λ}{2} {∥ {\hat{w}}_{t, p} ∥}^{2} \leq O_{t, p}^{★} + C_{3} . \end{matrix}

(A15)

Given that

{\hat{w}}_{s, p}

is a feasible solution in the formulation given in (33), we obtain the following inequality:

\begin{matrix} \frac{λ}{2} {∥ {\hat{w}}_{t, p} ∥}^{2} \leq \frac{1}{p} \sum_{i = 1}^{n_{t}} ℓ (y_{i}; a_{i}^{⊤} {\hat{w}}_{s, p}) + \frac{λ}{2} {∥ {\hat{w}}_{s, p} ∥}^{2} + C_{3} . \end{matrix}

(A16)

Based on [27] (Theorem 2.1), the following convergence in probability holds:

\begin{matrix} \frac{∥ A ∥}{\sqrt{n_{t}}} \overset{p \to + \infty}{\to} \frac{\sqrt{α_{t}} + 1}{\sqrt{α_{t}}}, \end{matrix}

(A17)

where the matrix

A \in R^{n_{t} \times p}

is formed by the concatenation of vectors

\{a_{i}\}_{i = 1}^{n_{t}}

. Then, there exists

C_{4} > 0

such that the following inequality is valid:

\begin{matrix} ∥ A {\hat{w}}_{s, p} ∥ \leq ∥ A ∥ ∥ {\hat{w}}_{s, p} ∥ \leq C_{4} \sqrt{n_{t}}, \end{matrix}

(A18)

with probability going to one as

p \to \infty

. Combining this with the assumption in (12), we see that there exists

C_{5} > 0

such that the following inequality is valid:

\begin{matrix} \frac{1}{n_{t}} | \sum_{i = 1}^{n_{t}} ℓ (y_{i}; a_{i}^{⊤} {\hat{w}}_{s, p}) | \leq C_{5} . \end{matrix}

(A19)

Given that

λ > 0

and the result in (A13), we conclude that there exists

C_{6} > 0

such that the following holds:

\begin{matrix} {∥ {\hat{w}}_{t, p} ∥}^{2} \leq C_{6}, \end{matrix}

(A20)

with probability going to one as

p \to \infty

.
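
The limit in (A17) can be illustrated at finite size. The sketch below assumes that the rows of the data matrix have independent standard Gaussian entries (an assumption of this example) and estimates the spectral norm with a plain power iteration; the sizes, seeds and function names are hypothetical.

```python
import math
import random

def top_singular_value(A, iters=60, seed=0):
    """Power iteration on A^T A; returns a lower estimate of the spectral norm."""
    rng = random.Random(seed)
    n, p = len(A), len(A[0])
    x = [rng.gauss(0.0, 1.0) for _ in range(p)]
    sigma = 0.0
    for _ in range(iters):
        nx = math.sqrt(sum(t * t for t in x))
        x = [t / nx for t in x]
        # y = A x, so ||y|| is a lower bound on ||A|| because ||x|| = 1
        y = [sum(a * b for a, b in zip(row, x)) for row in A]
        sigma = math.sqrt(sum(t * t for t in y))
        # x = A^T y prepares the next iterate
        x = [sum(A[i][j] * y[i] for i in range(n)) for j in range(p)]
    return sigma

def norm_ratio(n_t, p, seed=0):
    """Return (||A|| / sqrt(n_t), predicted limit) for an n_t x p Gaussian matrix."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n_t)]
    alpha_t = n_t / p
    limit = (math.sqrt(alpha_t) + 1.0) / math.sqrt(alpha_t)
    return top_singular_value(A) / math.sqrt(n_t), limit

ratio, limit = norm_ratio(120, 60)
print(ratio, limit)
```

At these small sizes the empirical ratio is only expected to be close to the limit, not equal to it; the agreement improves as p grows with the ratio of samples to dimension held fixed.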

Now, we consider the case when

λ = 0

. Define

g_{s, p} (\cdot)

,

O_{s, p}^{★}

, and

{\hat{w}}_{s, p}

as the cost function, the optimal cost value, and the optimal solution of the formulation in (4). Moreover, define

g_{t, p} (\cdot)

,

O_{t, p}^{★}

, and

{\hat{w}}_{t, p}

as the cost function, the optimal cost value, and the optimal solution of the formulation in (33). Note that the loss function

ℓ (y, \cdot)

is strongly convex with a strong convexity parameter

S > 0

. Then, for any

x_{1}, x_{2} \in R

, the following property is valid:

\begin{matrix} ℓ (y, \frac{x_{1} + x_{2}}{2}) \leq \frac{1}{2} ℓ (y, x_{1}) + \frac{1}{2} ℓ (y, x_{2}) - \frac{S}{8} {| x_{1} - x_{2} |}^{2} . \end{matrix}

(A21)

This means that, for any

i \in \{1, \dots, n\}

, the following property is valid:

\begin{matrix} ℓ (y_{i}, \frac{a_{i}^{⊤} w_{1} + a_{i}^{⊤} w_{2}}{2}) \leq \frac{1}{2} ℓ (y_{i}, a_{i}^{⊤} w_{1}) + \frac{1}{2} ℓ (y_{i}, a_{i}^{⊤} w_{2}) - \frac{S}{8} {| a_{i}^{⊤} w_{1} - a_{i}^{⊤} w_{2} |}^{2}, \end{matrix}

(A22)

where n can be the number of samples of the source task or target task and

\{y_{i}\}_{i = 1}^{n}

are the labels of the source task or target task. Given the convexity of the norm, we obtain the following inequality:

\begin{matrix} g_{p} (\frac{w_{1} + w_{2}}{2}) \leq \frac{1}{2} g_{p} (w_{1}) + \frac{1}{2} g_{p} (w_{2}) - \frac{S}{8 p} {∥ A (w_{1} - w_{2}) ∥}^{2}, \end{matrix}

(A23)

where

g_{p} (\cdot)

can be the cost value of the source task or target task formulations. Now, we focus on the source formulation. Take

w_{1} = {\hat{w}}_{s, p}

and

w_{2} = 0

. Moreover, recall that the loss function is proper. Then, there exists

C_{7} > 0

such that

\begin{matrix} \frac{S}{8 p} {∥ A {\hat{w}}_{s, p} ∥}^{2} \leq C_{7} + \frac{1}{2} g_{s, p} (0) . \end{matrix}

(A24)

Given the assumption in (12),

S > 0

,

α_{s} > 1

, and the analysis in [27] (Theorem 2.1), there exists

C_{8} > 0

such that

\begin{matrix} {∥ {\hat{w}}_{s, p} ∥}^{2} \leq C_{8} . \end{matrix}

(A25)

Now, we focus on the target task. Take

w_{1} = {\hat{w}}_{t, p}

and

w_{2} = {\hat{w}}_{s, p}

. Moreover, recall that the loss function is proper. Then, there exists

C_{9} > 0

such that

\begin{matrix} \frac{S}{8 p} {∥ A ({\hat{w}}_{t, p} - {\hat{w}}_{s, p}) ∥}^{2} \leq C_{9} + \frac{1}{2} g_{t, p} ({\hat{w}}_{s, p}) . \end{matrix}

(A26)

Given the assumption in (12), the result in (A25),

S > 0

,

α_{t} > 1

, and the analysis in [27] (Theorem 2.1), there exists

C_{10} > 0

such that

\begin{matrix} {∥ {\hat{w}}_{t, p} ∥}^{2} \leq C_{10} . \end{matrix}

(A27)

This completes the first part of the proof of Lemma 1.
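
As a concrete instance of the midpoint inequality (A21), the squared loss has strong convexity parameter S = 1 and satisfies (A21) with equality. The sketch below checks this; the function names and the test values are assumptions made for illustration.

```python
def squared_loss(y, x):
    """Squared loss l(y, x) = (x - y)^2 / 2."""
    return 0.5 * (x - y) ** 2

def midpoint_gap(loss, y, x1, x2, S):
    """Right-hand side minus left-hand side of (A21).

    The gap is nonnegative whenever the loss is S-strongly convex in x.
    """
    lhs = loss(y, 0.5 * (x1 + x2))
    rhs = 0.5 * loss(y, x1) + 0.5 * loss(y, x2) - (S / 8.0) * (x1 - x2) ** 2
    return rhs - lhs

# With S = 1 the gap vanishes; with a smaller S the inequality is strict.
print(midpoint_gap(squared_loss, 1.0, 3.0, -2.0, 1.0))
print(midpoint_gap(squared_loss, 1.0, 3.0, -2.0, 0.5))
```

Summing the same inequality over the samples, with each argument replaced by the corresponding inner product, is what yields (A22) and then (A23).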

Appendix B.2. Dual Compactness

The analysis in Appendix B.1 shows that the formulation in (34) can be equivalently posed with the primal feasibility set restricted to

\begin{matrix} {∥ w ∥}^{2} \leq C, \end{matrix}

(A28)

where

C > 0

is a sufficiently large constant that satisfies the analysis in Appendix B.1. Now, define

{\hat{u}}_{p}

as the optimal solution of the formulation in (34). Additionally, define the function

L^{★} (\cdot)

as

L^{★} (u) = \sum_{i = 1}^{n_{t}} ℓ^{★} (y_{i}; u_{i})

. We can see that the optimal vector

{\hat{u}}_{p}

solves the following maximization problem:

\begin{matrix} {\hat{u}}_{p} = \underset{u \in R^{n_{t}}}{argmax} u^{⊤} A w - L^{★} (u), \end{matrix}

where the data matrix

A = {[a_{1}, \dots, a_{n_{t}}]}^{⊤} \in R^{n_{t} \times p}

. Now, we denote by

\partial L^{★} (u)

the sub-differential set of the function

L^{★} (\cdot)

evaluated at

u

. Therefore, the solution of the above maximization problem satisfies the following condition:

\begin{matrix} A w \in \partial L^{★} ({\hat{u}}_{p}) . \end{matrix}

(A29)

Now, we use the result in [25] (Proposition 11.3) to show that the condition in (A29) can be equivalently expressed as follows:

\begin{matrix} {\hat{u}}_{p} \in \partial L (A w), \end{matrix}

(A30)

where the loss function

L (w) = \sum_{i = 1}^{n_{t}} ℓ (y_{i}; w_{i})

based on [25] (Proposition 11.22). Note that the introduced constraint in (A28) is satisfied. Moreover, the analysis presented in (A17) shows that there exists

C_{1} > 0

such that the following inequality holds:

\begin{matrix} {∥ A w ∥}^{2} \leq C_{1} n_{t}, \end{matrix}

(A31)

with probability going to one as p goes to

+ \infty

. Now, we use the assumption in (12) to conclude that there exists

C_{2} > 0

such that the following inequality holds:

\begin{matrix} {∥ {\hat{u}}_{p} ∥}^{2} \leq C_{2} n_{t}, \end{matrix}

(A32)

with probability going to one as p goes to

+ \infty

. This completes the proof of Lemma 1.
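
The equivalence between (A29) and (A30) can be made concrete for a single coordinate of the squared loss. In the sketch below, the closed-form conjugate is computed by hand for this example (it is not stated in the text), and all function names are hypothetical.

```python
def squared_loss(y, z):
    """l(y; z) = (z - y)^2 / 2."""
    return 0.5 * (z - y) ** 2

def conj_squared_loss(y, u):
    """Fenchel conjugate in z: sup_z (u z - l(y; z)) = u^2 / 2 + u y."""
    return 0.5 * u * u + u * y

def dual_objective(y, z, u):
    """Per-coordinate objective u z - l*(y; u) of the maximization over u."""
    return u * z - conj_squared_loss(y, u)

def dual_maximizer(y, z):
    """Stationarity z - u - y = 0 gives u = z - y, the derivative of l(y; .) at z."""
    return z - y

y0, z0 = 1.0, 0.3
u_star = dual_maximizer(y0, z0)
print(u_star, dual_objective(y0, z0, u_star), squared_loss(y0, z0))
```

The maximizer equals the derivative of the loss at z, which is the scalar version of (A30), and the optimal value recovers the loss itself, as expected for a closed convex function.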

Appendix C. Proof of Lemma 2

To prove the convergence properties stated in Lemma 2, we show first that they are valid for the auxiliary formulation corresponding to the source problem.

Appendix C.1. Auxiliary Convergence

Note that the analysis presented in Section 7 is also valid for the source problem. This is because the formulation in (8) is equivalent to the source problem in (4) if

Σ

is the all-zero matrix and we use the source training data. Then, we can see that the optimal solution of the auxiliary formulation corresponding to the source problem, denoted by

{\tilde{w}}_{s}

, can be expressed as follows:

\begin{matrix} {\tilde{w}}_{s} = q_{p, s}^{★} ξ_{s} - \frac{r_{p, s}^{★}}{∥ {\tilde{g}}_{s} ∥} B_{ξ_{s}}^{⊥} {\tilde{g}}_{s}, \end{matrix}

(A33)

where

{\tilde{g}}_{s} = {(B_{ξ_{s}}^{⊥})}^{⊤} g_{s}

and

g_{s}

has independent standard Gaussian components. Here,

B_{ξ_{s}}^{⊥} \in R^{p \times (p - 1)}

is formed by an orthonormal basis orthogonal to the vector

ξ_{s}

. Additionally, our analysis in Section 7 shows that the following convergence in probability holds:

\begin{matrix} q_{p, s}^{★} \overset{p \to + \infty}{\to} q_{s}^{★} \quad \text{and} \quad r_{p, s}^{★} \overset{p \to + \infty}{\to} r_{s}^{★} . \end{matrix}

(A34)

Here,

q_{s}^{★}

and

r_{s}^{★}

are the optimal solutions of the asymptotic limit of the source formulation defined in (14).

Note that

μ_{p}

can be expressed as follows:

\begin{matrix} μ_{p} = σ_{\min} ({(B_{ξ_{t}}^{⊥})}^{⊤} Λ B_{ξ_{t}}^{⊥}) . \end{matrix}

(A35)

Using the eigenvalue interlacing theorem, one can see that

\begin{matrix} σ_{\min, 1} (Λ) \leq μ_{p} \leq σ_{\min, 2} (Λ) . \end{matrix}

(A36)

Then, using the assumption in (13), we can see that the random variable

μ_{p}

converges in probability to

μ_{\min}

, where

μ_{\min}

is defined in Assumption 5. Now, we study the properties of the remaining functions using the optimal solution of the auxiliary formulation defined in (A33), i.e.,

{\tilde{w}}_{s}

, instead of

{\hat{w}}_{s}

. For instance, we first study the random sequence

{\tilde{V}}_{p, t s} = ξ_{t}^{⊤} Λ {\tilde{w}}_{s}

to infer the asymptotic properties of

V_{p, t s}

.

First, fix

σ > - μ_{\min}

. Then, based on the convergence of

μ_{p}

and [34] (Proposition 3), the sequence of random functions

T_{p, g} (.)

converges in probability as follows:

\begin{matrix} T_{p, g} (σ) \overset{p \to + \infty}{\to} T_{g} (σ) = E_{μ} [1 / (μ + σ)] . \end{matrix}

(A37)

Now, we express

σ

as

σ = σ^{'} - x

, where

σ^{'} > 0

. This means that the following convergence in probability holds true:

\begin{matrix} T_{p, g} (σ^{'} - x) \overset{p \to + \infty}{\to} T_{g} (σ^{'} - x), \end{matrix}

(A38)

for any

x < σ^{'} + μ_{\min}

. Note that the functions

T_{p, g} (.)

and

T_{g} (.)

are both convex and continuous in the variable x in the set

[0, σ^{'} + μ_{\min})

. Then, based on [30] (Theorem II.1), the convergence in (A38) is uniform in the variable x in the compact set

[0, σ^{'} / 2 + μ_{\min}]

. Now, note that

μ_{p}

converges in probability to

μ_{\min}

. Therefore, we obtain the following convergence in probability:

\begin{matrix} T_{p, g} (σ^{'} - μ_{p}) \overset{p \to + \infty}{\to} T_{g} (σ^{'} - μ_{\min}), \end{matrix}

(A39)

valid for any fixed

σ^{'} > 0

. Using the block matrix inversion lemma, the function

T_{p, t} (.)

can be expressed as follows:

\begin{matrix} T_{p, t} (σ) & = ξ_{t}^{⊤} Λ B_{ξ_{t}}^{⊥} {[{(B_{ξ_{t}}^{⊥})}^{⊤} Λ B_{ξ_{t}}^{⊥} + σ I_{p - 1}]}^{- 1} {(B_{ξ_{t}}^{⊥})}^{⊤} Λ ξ_{t} \\ = ξ_{t}^{⊤} Λ ξ_{t} + σ - \frac{1}{ξ_{t}^{⊤} {[Λ + σ I_{p}]}^{- 1} ξ_{t}} . \end{matrix}

(A40)

Therefore, we obtain the following expression:

\begin{matrix} Z_{p, t} (σ) = σ - \frac{1}{ξ_{t}^{⊤} {[Λ + σ I_{p}]}^{- 1} ξ_{t}} . \end{matrix}

(A41)
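
The block matrix inversion identity in (A40) can be verified numerically. The sketch below works in a rotated basis where the unit vector is the first standard basis vector, so that the orthogonal complement is spanned by the remaining basis vectors; this choice, the random positive definite matrix and the helper names are assumptions of the example.

```python
import random

def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def identity_sides(Lam, sigma):
    """Both sides of (A40) with xi = e_1 and B spanning e_2, ..., e_p."""
    p = len(Lam)
    # Left side: b^T (C + sigma I)^{-1} b with b = B^T Lam xi, C = B^T Lam B
    b = [Lam[i][0] for i in range(1, p)]
    C = [[Lam[i][j] + (sigma if i == j else 0.0) for j in range(1, p)]
         for i in range(1, p)]
    lhs = sum(bi * xi for bi, xi in zip(b, solve(C, b)))
    # Right side: xi^T Lam xi + sigma - 1 / (xi^T (Lam + sigma I)^{-1} xi)
    M = [[Lam[i][j] + (sigma if i == j else 0.0) for j in range(p)]
         for i in range(p)]
    e1 = [1.0] + [0.0] * (p - 1)
    k = solve(M, e1)[0]
    rhs = Lam[0][0] + sigma - 1.0 / k
    return lhs, rhs

random.seed(1)
p = 6
G = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(p)]
# Symmetric positive definite test matrix Lam = G^T G / p + I
Lam = [[sum(G[k][i] * G[k][j] for k in range(p)) / p + (1.0 if i == j else 0.0)
        for j in range(p)] for i in range(p)]
lhs, rhs = identity_sides(Lam, 0.7)
print(lhs, rhs)
```

The two printed numbers should agree up to floating-point error, since the identity is the Schur complement formula applied to the matrix shifted by the scalar times the identity.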

Then, using the theoretical results stated in [34] (Proposition 3), the function

Z_{p, t} (.)

converges in probability as follows:

\begin{matrix} Z_{p, t} (σ) \overset{p \to + \infty}{\to} Z_{t} (σ) = σ - \frac{1}{E_{μ} [1 / (μ + σ)]} . \end{matrix}

(A42)

Combine this with the above analysis to obtain the following convergence in probability:

\begin{matrix} Z_{p, t} (σ^{'} - μ_{p}) \overset{p \to + \infty}{\to} Z_{t} (σ^{'} - μ_{\min}), \end{matrix}

(A43)

valid for any

σ^{'} > 0

. Based on the result in (A33), the sequence of random functions

{\tilde{Z}}_{p, t s} (.)

converges in probability to the following function:

\begin{matrix} Z_{t s} (σ) = q_{s}^{★} ρ Z_{t} (σ) . \end{matrix}

(A44)

Combine this with the above analysis to obtain the following convergence in probability:

\begin{matrix} {\tilde{Z}}_{p, t s} (σ^{'} - μ_{p}) \overset{p \to + \infty}{\to} Z_{t s} (σ^{'} - μ_{\min}), \end{matrix}

(A45)

valid for any

σ^{'} > 0

. Using the same analysis and based on (A33) and (A34), one can see that the sequence of random functions

{\tilde{Z}}_{p, s} (.)

converges in probability to the following function:

\begin{matrix} {\tilde{Z}}_{p, s} (σ) & \overset{p \to + \infty}{\to} Z_{s} (σ) = {(ρ q_{s}^{★})}^{2} Z_{t} (σ) \\ - ((1 - ρ^{2}) {(q_{s}^{★})}^{2} + {(r_{s}^{★})}^{2}) E_{μ} [μ σ / (μ + σ)] . \end{matrix}

(A46)

Combine this with the above analysis to obtain the following convergence in probability:

\begin{matrix} {\tilde{Z}}_{p, s} (σ^{'} - μ_{p}) \overset{p \to + \infty}{\to} Z_{s} (σ^{'} - μ_{\min}), \end{matrix}

(A47)

valid for any

σ^{'} > 0

. The above analysis shows that the asymptotic properties stated in Lemma 2 are valid for the AO formulation corresponding to the source problem. Now, it remains to show that these properties also hold for the primary formulation.

Appendix C.2. Primary Convergence

Here, we assume that

λ > 0

. The case when

λ = 0

can be handled similarly. Now, we show that the convergence properties proved above are also valid for the primary problem. To this end, we show that all the assumptions in Theorem 2 are satisfied. We start our proof by defining the following open set:

\begin{matrix} S_{ϵ} = \{w \in R^{p} : | ξ_{t}^{⊤} {[Λ + σ I_{p}]}^{- 1} w - ρ q_{s}^{★} K | < ϵ\}, \end{matrix}

where K is defined as follows:

\begin{matrix} K = E_{μ} [1 / (μ + σ)] . \end{matrix}

(A48)

Now, we consider the feasibility set

D_{ϵ} = T_{1} \setminus S_{ϵ}

, where

T_{1}

is defined in (41). Based on the analysis of the generalized target formulation in Section 7.3.2, one can see that the AO formulation corresponding to the source formulation with the set

D_{ϵ}

can be asymptotically expressed as follows:

\begin{matrix} V_{p} : min_{\begin{matrix} (q_{s}, r_{s}) \in T_{2} \end{matrix}} min_{\begin{matrix} r_{s} \in {\tilde{D}}_{ϵ} \end{matrix}} max_{u \in C_{s}} \frac{∥ u ∥}{p} g_{s}^{⊤} B_{ξ_{s}}^{⊥} r_{s} + \frac{q_{s}}{p} u^{⊤} s_{s} \\ + \frac{λ}{2} (q_{s}^{2} + {∥ r_{s} ∥}^{2}) + \frac{1}{p} ∥ r_{s} ∥ h_{s}^{⊤} u - \frac{1}{p} \sum_{i = 1}^{n_{s}} ℓ^{★} (y_{s, i}; u_{i}) . \end{matrix}

Here, the feasibility set

T_{2}

is defined in Section 7.3.2 and the feasibility set

{\tilde{D}}_{ϵ}

is given by

\begin{matrix} \{r_{s} : | q_{s} ρ K_{p, t} + q_{s} \sqrt{1 - ρ^{2}} K_{p, r} + ξ_{t}^{⊤} {[Λ + σ I_{p}]}^{- 1} B_{ξ_{s}}^{⊥} r_{s} - ρ q_{s}^{★} K | \geq ϵ, \\ ∥ r_{s} ∥ = r_{s}\} . \end{matrix}

This follows from the decomposition in (40), where

K_{p, t} = ξ_{t}^{⊤} {[Λ + σ I_{p}]}^{- 1} ξ_{t}

and

K_{p, r} = ξ_{t}^{⊤} {[Λ + σ I_{p}]}^{- 1} ξ_{r}

. Note that the optimization problem given in

V_{p}

can be equivalently formulated as follows:

\begin{matrix} V_{p} : min_{\begin{matrix} (q_{s}, r_{s}) \in {\hat{S}}_{ϵ} \end{matrix}} min_{\begin{matrix} r_{s} \in {\tilde{D}}_{ϵ} \end{matrix}} max_{u \in C_{s}} \frac{∥ u ∥}{p} g_{s}^{⊤} B_{ξ_{s}}^{⊥} r_{s} + \frac{q_{s}}{p} u^{⊤} s_{s} \\ + \frac{λ}{2} (q_{s}^{2} + {∥ r_{s} ∥}^{2}) + \frac{1}{p} ∥ r_{s} ∥ h_{s}^{⊤} u - \frac{1}{p} \sum_{i = 1}^{n_{s}} ℓ^{★} (y_{s, i}; u_{i}) . \end{matrix}

Here, we replace the feasibility set

T_{2}

by the feasibility set

{\hat{S}}_{ϵ}

defined as follows:

\begin{matrix} \{(q_{s}, r_{s}) : | q_{s} ρ K_{p, t} + q_{s} \sqrt{1 - ρ^{2}} K_{p, r} - r_{s} ξ_{t}^{⊤} {[Λ + σ I_{p}]}^{- 1} B_{ξ_{s}}^{⊥} \frac{{\tilde{g}}_{s}}{∥ {\tilde{g}}_{s} ∥} - ρ q_{s}^{★} K | \geq ϵ\} \cap T_{2}, \end{matrix}

where

{\tilde{g}}_{s} = {(B_{ξ_{s}}^{⊥})}^{⊤} g_{s}

. This follows since the first set in

{\hat{S}}_{ϵ}

satisfies the condition in the set

{\tilde{D}}_{ϵ}

. Now, assume that

{\hat{ϕ}}_{p}^{★}

is the optimal cost value of the optimization problem

V_{p}

and define the function

{\hat{h}}_{p} (.)

as follows:

\begin{matrix} {\hat{h}}_{p} (q_{s}, r_{s}) = min_{\begin{matrix} r_{s} \in {\tilde{D}}_{ϵ} \end{matrix}} max_{u \in C_{s}} \frac{∥ u ∥}{p} g_{s}^{⊤} B_{ξ_{s}}^{⊥} r_{s} + \frac{q_{s} u^{⊤} s_{s}}{p} \\ + \frac{λ}{2} (q_{s}^{2} + r_{s}^{2}) + \frac{r_{s}}{p} h_{s}^{⊤} u - \frac{1}{p} \sum_{i = 1}^{n_{s}} ℓ^{★} (y_{s, i}; u_{i}), \end{matrix}

in the set

{\hat{S}}_{ϵ}

. Based on the max–min inequality [35], the function

{\hat{h}}_{p} (.)

can be lower bounded by the following function:

\begin{matrix} {\tilde{h}}_{p} (q_{s}, r_{s}) = max_{u \in C_{s}} min_{\begin{matrix} r_{s} \in {\tilde{D}}_{ϵ} \end{matrix}} \frac{∥ u ∥}{p} g_{s}^{⊤} B_{ξ_{s}}^{⊥} r_{s} + \frac{q_{s} u^{⊤} s_{s}}{p} \\ + \frac{λ}{2} (q_{s}^{2} + r_{s}^{2}) + \frac{r_{s}}{p} h_{s}^{⊤} u - \frac{1}{p} \sum_{i = 1}^{n_{s}} ℓ^{★} (y_{s, i}; u_{i}) . \end{matrix}

This is valid for any

(q_{s}, r_{s}) \in {\hat{S}}_{ϵ}

. Moreover, note that the following inequality holds true:

\begin{matrix} min_{\begin{matrix} r_{s} \in {\tilde{D}}_{ϵ} \end{matrix}} \frac{∥ u ∥}{p} g_{s}^{⊤} B_{ξ_{s}}^{⊥} r_{s} \geq - \frac{∥ u ∥}{p} ∥ {(B_{ξ_{s}}^{⊥})}^{⊤} g_{s} ∥ r_{s}, \end{matrix}

(A49)

for any

(q_{s}, r_{s}) \in {\hat{S}}_{ϵ}

. Following the generalized analysis in Section 7.3.2, one can see that the auxiliary problem corresponding to the source formulation can be expressed as follows:

\begin{matrix} min_{\begin{matrix} (q_{s}, r_{s}) \in T_{2} \end{matrix}} sup_{σ_{s} > 0} \frac{1}{n_{s}} \sum_{i = 1}^{n_{s}} M_{ℓ (y_{s, i}, .)} (r_{s} h_{s, i} + q_{s} s_{s, i}; \frac{r_{s} ∥ {\tilde{g}}_{s} ∥}{\sqrt{n_{s}} σ_{s}}) \\ - \frac{r_{s} σ_{s}}{2} \frac{∥ {\tilde{g}}_{s} ∥}{\sqrt{n_{s}}} + \frac{λ}{2} (q_{s}^{2} + r_{s}^{2}) . \end{matrix}

(A50)

This means that the function

{\tilde{h}}_{p} (.)

can be lower bounded by the cost function of the minimization problem formulated in (A50) denoted by

{\hat{g}}_{p} (.)

, i.e.,

\begin{matrix} {\hat{g}}_{p} (q_{s}, r_{s}) \leq {\tilde{h}}_{p} (q_{s}, r_{s}) . \end{matrix}

(A51)

Here, both functions are defined in the feasibility set

{\hat{S}}_{ϵ}

. Now, define

ϕ_{p}^{★}

as the optimal cost value of the auxiliary optimization problem corresponding to the source formulation defined in Section 7.3.1. Note that the function

{\hat{g}}_{p} (.)

is strongly convex in the variables

(q_{s}, r_{s})

with strong convexity parameter

λ > 0

. This means that, for any

β \in [0, 1]

,

(q_{s, 1}, r_{s, 1}) \in T_{2}

and

(q_{s, 2}, r_{s, 2}) \in T_{2}

, we have the following inequality:

\begin{matrix} {\hat{g}}_{p} (β v_{1} + (1 - β) v_{2}) \leq β {\hat{g}}_{p} (v_{1}) \\ + (1 - β) {\hat{g}}_{p} (v_{2}) - \frac{λ}{2} β (1 - β) {∥ v_{1} - v_{2} ∥}^{2}, \end{matrix}

(A52)

where

v_{1} = [q_{s, 1}, r_{s, 1}]

and

v_{2} = [q_{s, 2}, r_{s, 2}]

. Take

v_{1}

as

v_{p}^{★}

, which represents the optimal solution of the optimization problem (A50). Then, the inequality in (A52) implies the following inequality:

\begin{matrix} ϕ_{p}^{★} \leq {\hat{g}}_{p} (v_{2}) - \frac{λ}{2} β {∥ v_{p}^{★} - v_{2} ∥}^{2} . \end{matrix}

(A53)

This is valid for any

v_{2}

in the set

T_{2}

. Now, taking

β = 1 / 2

and the minimum over

v_{2}

in the set

{\hat{S}}_{ϵ}

in both sides, we obtain the following inequality:

\begin{matrix} ϕ_{p}^{★} + \frac{λ}{4} min_{v \in {\hat{S}}_{ϵ}} {∥ v_{p}^{★} - v ∥}^{2} \leq min_{v \in {\hat{S}}_{ϵ}} {\hat{g}}_{p} (v) . \end{matrix}
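
The step from (A52) to the bound above only uses strong convexity, so it can be illustrated on any strongly convex function with a known minimizer. The sketch below uses a toy cost chosen for this example (a ridge term plus a hinge in the first variable), which is not the function analyzed in the paper; its minimizer and optimal value were worked out by hand.

```python
import random

LAM = 0.5  # hypothetical regularization strength

def toy_cost(q, r, lam=LAM):
    """Toy lam-strongly convex cost: ridge term plus a convex hinge in q."""
    return 0.5 * lam * (q * q + r * r) + max(0.0, 1.0 - q)

# For lam = 0.5 the cost decreases in q on q < 1 and increases on q > 1,
# so the minimizer is (1, 0) and the optimal value is lam / 2.
V_STAR = (1.0, 0.0)
PHI_STAR = toy_cost(*V_STAR)

def gap(q, r, lam=LAM):
    """toy_cost(v) - phi_star - (lam / 4) ||v - v_star||^2, nonnegative by (A53) with beta = 1/2."""
    dist2 = (q - V_STAR[0]) ** 2 + (r - V_STAR[1]) ** 2
    return toy_cost(q, r, lam) - PHI_STAR - 0.25 * lam * dist2

random.seed(0)
samples = [(random.uniform(-3.0, 3.0), random.uniform(-3.0, 3.0))
           for _ in range(1000)]
print(min(gap(q, r) for q, r in samples))
```

A nonnegative minimum gap over the samples is consistent with the claim that any point at distance d from the minimizer has cost at least the optimal value plus a quarter of the strong convexity parameter times the squared distance.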

Based on the above analysis, note that the following inequality also holds true:

\begin{matrix} min_{v \in {\hat{S}}_{ϵ}} {\hat{g}}_{p} (v) \leq {\hat{ϕ}}_{p}^{★} . \end{matrix}

(A54)

Then, to verify the assumption of [17] (Theorem 6.1), it remains to show that there exists

ϵ^{'} > 0

such that the following inequality holds:

\begin{matrix} \frac{λ}{4} min_{v \in {\hat{S}}_{ϵ}} {∥ v_{p}^{★} - v ∥}^{2} \geq ϵ^{'}, \end{matrix}

(A55)

with probability going to 1 as

p \to \infty

. Note that any element in the set

{\hat{S}}_{ϵ}

satisfies the following inequality:

\begin{matrix} ϵ \leq | q_{s} ρ K_{p, t} + q_{s} \sqrt{1 - ρ^{2}} K_{p, r} - r_{s} ξ_{t}^{⊤} {[Λ + σ I_{p}]}^{- 1} B_{ξ_{s}}^{⊥} \frac{{\tilde{g}}_{s}}{∥ {\tilde{g}}_{s} ∥} - ρ q_{s}^{★} K | \\ \leq | q_{s} ρ K_{p, t} - ρ q_{s}^{★} K | + | q_{s} \sqrt{1 - ρ^{2}} | | K_{p, r} | + | r_{s} | | ξ_{t}^{⊤} {[Λ + σ I_{p}]}^{- 1} B_{ξ_{s}}^{⊥} \frac{{\tilde{g}}_{s}}{∥ {\tilde{g}}_{s} ∥} | . \end{matrix}

Based on the analysis in Appendix C.1, we have the following convergence in probability:

\begin{matrix} | q_{s} ρ K_{p, t} - ρ q_{s}^{★} K | \overset{p \to + \infty}{\to} | q_{s} - q_{s}^{★} | ρ K, \\ | q_{s} | \sqrt{1 - ρ^{2}} | K_{p, r} | \overset{p \to + \infty}{\to} 0, \quad | r_{s} | | ξ_{t}^{⊤} {[Λ + σ I_{p}]}^{- 1} B_{ξ_{s}}^{⊥} \frac{{\tilde{g}}_{s}}{∥ {\tilde{g}}_{s} ∥} | \overset{p \to + \infty}{\to} 0 . \end{matrix}

(A56)

This means that there exists

ϵ^{″} > 0

such that any element in the set

{\hat{S}}_{ϵ}

satisfies the following inequality:

\begin{matrix} | q_{s} - q_{s}^{★} | ρ K \geq ϵ^{″}, \end{matrix}

(A57)

with probability going to 1 as

p \to \infty

. Combining this with Assumption 5 and the consistency result stated in (A34) shows that there exists

ϵ^{'} > 0

such that the following inequality holds:

\begin{matrix} \frac{λ}{4} min_{v \in {\hat{S}}_{ϵ}} {∥ v_{p}^{★} - v ∥}^{2} \geq ϵ^{'}, \end{matrix}

(A58)

with probability going to 1 as

p \to \infty

. This also proves that there exists

ϵ^{'} > 0

such that the following inequality holds:

\begin{matrix} {\hat{ϕ}}_{p}^{★} \geq ϕ_{p}^{★} + ϵ^{'}, \end{matrix}

(A59)

with probability going to 1 as

p \to \infty

. This completes the verification of the assumptions in Theorem 2. This means that the optimal solution of the primary problem belongs to the set

S_{ϵ}

on events with probability going to 1 as

p \to \infty

. Since the choice of

ϵ

is arbitrary, we obtain the following asymptotic result:

\begin{matrix} ξ_{t}^{⊤} {[Λ + σ I_{p}]}^{- 1} {\hat{w}}_{s} \overset{p \to + \infty}{\to} q_{s}^{★} ρ K, \end{matrix}

(A60)

where

{\hat{w}}_{s}

is the optimal solution of the source problem (4). Following the same analysis, one can also show the convergence properties stated in Lemma 2.

Appendix D. Proof of Lemma 4

Here, we assume that

λ > 0

. The case when

λ = 0

can be handled similarly. The cost function of the optimization problem (49) can be expressed as follows:

\begin{matrix} O_{t} (q_{t}, r_{t}, σ_{t}) & = \frac{λ}{2} (q_{t}^{2} + r_{t}^{2}) - \frac{σ_{t} r_{t}^{2}}{2} - \frac{1}{2} Z_{s} (σ_{t}) - \frac{1}{2} q_{t}^{2} Z_{t} (σ_{t}) \\ + α_{t} E [M_{ℓ (Y_{t}, .)} (r_{t} H_{t} + q_{t} S_{t}; T_{g} (σ_{t}))] + q_{t} Z_{t s} (σ_{t}) . \end{matrix}

(A61)

Note that the function

O_{t} (\cdot, \cdot, \cdot)

can be expressed as follows:

\begin{matrix} O_{t} (q_{t}, r_{t}, σ_{t}) & = - \frac{σ_{t} r_{t}^{2}}{2} + \frac{1}{2} ((1 - ρ^{2}) {(q_{s}^{★})}^{2} + {(r_{s}^{★})}^{2}) T_{2} (σ_{t}) \\ + α_{t} E [M_{ℓ (Y_{t}, .)} (r_{t} H_{t} + q_{t} S_{t}; T_{1} (σ_{t}))] + \frac{λ}{2} (q_{t}^{2} + r_{t}^{2}) \\ - \frac{1}{2} {(q_{t} - ρ q_{s}^{★})}^{2} (σ_{t} - 1 / T_{1} (σ_{t})) . \end{matrix}

(A62)

Here, the functions

T_{1} (.)

and

T_{2} (.)

are defined as follows:

\begin{matrix} T_{1} (σ_{t}) = E_{μ} [1 / (μ + σ_{t})], T_{2} (σ_{t}) = E_{μ} [μ σ_{t} / (μ + σ_{t})] . \end{matrix}

Based on Assumption 5, the functions

T_{1} (\cdot)

and

T_{2} (\cdot)

are twice continuously differentiable in the feasibility set. We start our analysis by showing that the function

O_{t} (\cdot, \cdot, \cdot)

is concave in the variable

σ_{t}

for fixed feasible

(q_{t}, r_{t})

. First, note that the function

T_{2} (\cdot)

is concave in the feasibility set. Now, define the function

g (\cdot)

as follows:

\begin{matrix} g (σ_{t}) = \frac{1}{T_{1} (σ_{t})} . \end{matrix}

(A63)

Then, we can see that the second derivative of the function

g (\cdot)

can be expressed as follows:

\begin{matrix} g^{''} (σ_{t}) = - \frac{T_{1}^{''} (σ_{t}) T_{1} (σ_{t}) - 2 T_{1}^{'} {(σ_{t})}^{2}}{T_{1} {(σ_{t})}^{3}} . \end{matrix}

(A64)

Here, the first and second derivatives of the function

T_{1} (\cdot)

can be expressed as follows:

\begin{matrix} T_{1}^{'} (σ_{t}) = - E_{μ} [1 / {(μ + σ_{t})}^{2}], T_{1}^{''} (σ_{t}) = 2 E_{μ} [1 / {(μ + σ_{t})}^{3}] . \end{matrix}

(A65)

Then, using the Cauchy–Schwarz inequality, one can see that the second derivative of the function

g (\cdot)

is nonpositive. This implies the concavity of the function

g (\cdot)

. Therefore, using the properties in [35] (Section 3.2), the function

O_{t} (\cdot, \cdot, \cdot)

is concave in the variable

σ_{t}

.
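
The sign of the second derivative in (A64) can be checked numerically for a discrete eigenvalue distribution. In the sketch below the atoms, the weights and the function names are assumptions made for illustration; the derivative formulas are those in (A64) and (A65).

```python
def t1_and_derivs(mus, weights, s):
    """T_1(s) = E[1 / (mu + s)] and its first two derivatives, as in (A65)."""
    t = sum(w / (m + s) for m, w in zip(mus, weights))
    tp = -sum(w / (m + s) ** 2 for m, w in zip(mus, weights))
    tpp = 2.0 * sum(w / (m + s) ** 3 for m, w in zip(mus, weights))
    return t, tp, tpp

def g(mus, weights, s):
    """g(s) = 1 / T_1(s), as in (A63)."""
    return 1.0 / t1_and_derivs(mus, weights, s)[0]

def g_second(mus, weights, s):
    """Second derivative of g from (A64)."""
    t, tp, tpp = t1_and_derivs(mus, weights, s)
    return -(tpp * t - 2.0 * tp ** 2) / t ** 3

def g_second_fd(mus, weights, s, h=1e-4):
    """Central finite-difference approximation of the second derivative of g."""
    return (g(mus, weights, s + h) - 2.0 * g(mus, weights, s)
            + g(mus, weights, s - h)) / (h * h)

MUS = [0.5, 1.0, 3.0]      # hypothetical atoms of the distribution of mu
WEIGHTS = [0.2, 0.5, 0.3]  # their probabilities
for s in (0.1, 1.0, 5.0):
    print(s, g_second(MUS, WEIGHTS, s), g_second_fd(MUS, WEIGHTS, s))
```

The Cauchy-Schwarz step amounts to the inequality between the squared second moment of 1 / (mu + s) and the product of its first and third moments, which makes the numerator in (A64) nonnegative. For a distribution concentrated on a single atom the second derivative is exactly zero, so the concavity is not strict in general.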

Now, we focus on proving the strong convexity properties. Define the function

O_{t} (\cdot, \cdot)

as follows:

\begin{matrix} O_{t} (q_{t}, r_{t}) = sup_{σ_{t} > - μ_{\min}} O_{t} (q_{t}, r_{t}, σ_{t}) . \end{matrix}

(A66)

Note that the term

\frac{λ}{2} (q_{t}^{2} + r_{t}^{2})

is strongly convex in the variables

(q_{t}, r_{t})

. Then, to prove our property, it suffices to show that the following function is jointly convex in the variables

(q_{t}, r_{t})

in the feasibility set:

\begin{matrix} h (q_{t}, r_{t}) & = sup_{σ_{t} > - μ_{\min}} - \frac{σ_{t} r_{t}^{2}}{2} + \frac{1}{2} ((1 - ρ^{2}) {(q_{s}^{★})}^{2} + {(r_{s}^{★})}^{2}) T_{2} (σ_{t}) - \frac{1}{2} {(q_{t} - ρ q_{s}^{★})}^{2} (σ_{t} - 1 / T_{1} (σ_{t})) \\ + α_{t} E [M_{ℓ (Y_{t}, .)} (r_{t} H_{t} + q_{t} S_{t}; T_{1} (σ_{t}))] . \end{matrix}

(A67)

Note that the function

h (\cdot, \cdot)

can also be expressed as follows:

\begin{matrix} h (q_{t}, r_{t}) & = sup_{σ_{t} > - μ_{\min}} min_{0 \leq τ \leq C_{τ}} \frac{σ_{t} τ^{2}}{2} - τ r_{t} σ_{t} + \frac{1}{2} ((1 - ρ^{2}) {(q_{s}^{★})}^{2} + {(r_{s}^{★})}^{2}) T_{2} (σ_{t}) \\ + α_{t} E [M_{ℓ (Y_{t}, .)} (r_{t} H_{t} + q_{t} S_{t}; T_{1} (σ_{t}))] - \frac{1}{2} {(q_{t} - ρ q_{s}^{★})}^{2} (σ_{t} - 1 / T_{1} (σ_{t})) . \end{matrix}

(A68)

Here, the feasibility set of the variable

τ

is bounded given that the optimal

τ

satisfies

τ^{★} = r_{t}

. It can be easily seen that the cost function of the optimization problem in (A68) is convex in

τ

and concave in

σ_{t}

. Then, using the result in [36], the function

h (\cdot, \cdot)

can also be expressed as follows:

\begin{matrix} h (q_{t}, r_{t}) = inf_{0 < τ \leq C_{τ}} sup_{σ_{t} > - μ_{\min}} \frac{σ_{t} τ}{2} - r_{t} σ_{t} + \frac{1}{2} ((1 - ρ^{2}) {(q_{s}^{★})}^{2} + {(r_{s}^{★})}^{2}) T_{2} (σ_{t} / τ) \\ + α_{t} E [M_{ℓ (Y_{t}, .)} (r_{t} H_{t} + q_{t} S_{t}; T_{1} (σ_{t} / τ))] - \frac{1}{2} {(q_{t} - ρ q_{s}^{★})}^{2} (σ_{t} / τ - 1 / T_{1} (σ_{t} / τ)) . \end{matrix}

(A69)

Then, to prove our property, it suffices to show that the cost function of the above problem is jointly convex in the variables

(q_{t}, r_{t}, τ)

. Using the positivity of the second derivative, it is easy to see that the function

τ \to T_{2} (σ_{t} / τ)

is convex. Now, using the analysis below Equation (161) in [20] (Appendix H), we can see that the remaining functions are jointly convex in the variables

(q_{t}, r_{t}, τ)

. We omit these steps since they are similar to the approach employed in [20] (Appendix H). This shows that the function

O_{t} (\cdot, \cdot)

is strongly convex in the variables

(q_{t}, r_{t})

.
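
The variational rewriting used between (A67) and (A68) replaces the quadratic term in the radial variable by a minimization over an auxiliary scalar. The sketch below checks this for a positive value of the dual variable, which is an assumption of the example, using a simple grid search; the numerical values and function names are hypothetical.

```python
def tau_objective(tau, r_t, sigma_t):
    """Terms of (A68) that depend on tau: sigma_t tau^2 / 2 - tau r_t sigma_t."""
    return 0.5 * sigma_t * tau * tau - tau * r_t * sigma_t

def grid_min(r_t, sigma_t, c_tau, steps=20000):
    """Minimize tau_objective over a uniform grid on [0, c_tau]."""
    best_tau, best_val = 0.0, tau_objective(0.0, r_t, sigma_t)
    for k in range(1, steps + 1):
        tau = c_tau * k / steps
        val = tau_objective(tau, r_t, sigma_t)
        if val < best_val:
            best_tau, best_val = tau, val
    return best_tau, best_val

r_t, sigma_t, c_tau = 0.7, 1.3, 2.0
tau_hat, val_hat = grid_min(r_t, sigma_t, c_tau)
print(tau_hat, val_hat, -0.5 * sigma_t * r_t ** 2)
```

For a positive dual variable the minimizer is the radial variable itself and the minimum equals the negative quadratic term appearing in (A67), which is why the bound on the auxiliary scalar can be taken to be any constant that dominates the feasible radial values.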

References

Pratt, L.Y.; Mostow, J.; Kamm, C.A. Direct Transfer of Learned Information among Neural Networks. In Proceedings of the Ninth National Conference on Artificial Intelligence—Volume 2, AAAI’91, Anaheim, CA, USA, 14–19 July 1991; pp. 584–589.
Pratt, L.Y. Discriminability-Based Transfer between Neural Networks. In Advances in Neural Information Processing Systems; Hanson, S., Cowan, J., Giles, C., Eds.; Morgan-Kaufmann: Burlington, MA, USA, 1993; Volume 5, pp. 204–211.
Perkins, D.; Salomon, G. Transfer of Learning; Pergamon: Oxford, UK, 1992.
Pan, S.J.; Yang, Q. A Survey on Transfer Learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 1345–1359.
Tan, C.; Sun, F.; Kong, T.; Zhang, W.; Yang, C.; Liu, C. A Survey on Deep Transfer Learning. arXiv 2018, arXiv:1808.01974.
Rosenstein, M.T.; Marx, Z.; Kaelbling, P.K.; Dietterich, T.G. To transfer or not to transfer. In NIPS Workshop on Transfer Learning; NIPS: Vancouver, BC, Canada, 2005.
Bakker, B.; Heskes, T. Task Clustering and Gating for Bayesian Multitask Learning. J. Mach. Learn. Res. 2003, 4, 83–99.
Ben-David, S.; Schuller, R. Exploiting Task Relatedness for Multiple Task Learning. In Learning Theory and Kernel Machines; Schölkopf, B., Warmuth, M.K., Eds.; Springer: Berlin/Heidelberg, Germany, 2003; pp. 567–580.
Kornblith, S.; Shlens, J.; Le, Q.V. Do Better ImageNet Models Transfer Better? arXiv 2019, arXiv:1805.08974.
Yosinski, J.; Clune, J.; Bengio, Y.; Lipson, H. How Transferable Are Features in Deep Neural Networks? arXiv 2014, arXiv:1411.1792.
Tommasi, T.; Orabona, F.; Caputo, B. Learning Categories From Few Examples With Multi Model Knowledge Transfer. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 928–941.
Yang, J.; Yan, R.; Hauptmann, A.G. Adapting SVM Classifiers to Data with Shifted Distributions. In Proceedings of the Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007), Omaha, NE, USA, 28–31 October 2007; pp. 69–76.
Lampinen, A.K.; Ganguli, S. An Analytic Theory of Generalization Dynamics and Transfer Learning in Deep Linear Networks. arXiv 2019, arXiv:1809.10374.
Dar, Y.; Baraniuk, R.G. Double Double Descent: On Generalization Errors in Transfer Learning between Linear Regression Tasks. arXiv 2021, arXiv:2006.07002.
Saglietti, L.; Zdeborová, L. Solvable Model for Inheriting the Regularization through Knowledge Distillation. arXiv 2020, arXiv:2012.00194.
Stojnic, M. A Framework to Characterize Performance of LASSO Algorithms. arXiv 2013, arXiv:1303.7291.
Thrampoulidis, C.; Abbasi, E.; Hassibi, B. Precise high-dimensional error analysis of regularized M-estimators. In Proceedings of the 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 29 September–2 October 2015; pp. 410–417.
Gordon, Y. On Milman’s inequality and random subspaces which escape through a mesh in $\mathbb{R}^{n}$. In Geometric Aspects of Functional Analysis; Lindenstrauss, J., Milman, V.D., Eds.; Springer: Berlin/Heidelberg, Germany, 1988; pp. 84–106.
Dhifallah, O.; Thrampoulidis, C.; Lu, Y.M. Phase Retrieval via Polytope Optimization: Geometry, Phase Transitions, and New Algorithms. arXiv 2018, arXiv:1805.09555.
Dhifallah, O.; Lu, Y.M. A Precise Performance Analysis of Learning with Random Features. arXiv 2020, arXiv:2008.11904.
Salehi, F.; Abbasi, E.; Hassibi, B. The Impact of Regularization on High-dimensional Logistic Regression. arXiv 2019, arXiv:1906.03761.
Kammoun, A.; Alouini, M.S. On the Precise Error Analysis of Support Vector Machines. arXiv 2020, arXiv:2003.12972.
Mignacco, F.; Krzakala, F.; Lu, Y.M.; Zdeborová, L. The Role of Regularization in Classification of High-Dimensional Noisy Gaussian Mixture. arXiv 2020, arXiv:2002.11544.
Aubin, B.; Krzakala, F.; Lu, Y.M.; Zdeborová, L. Generalization Error in High-Dimensional Perceptrons: Approaching Bayes Error with Convex Optimization. arXiv 2020, arXiv:2006.06560.
Rockafellar, R.T.; Wets, R.J.B. Variational Analysis; Springer: Berlin/Heidelberg, Germany, 1998.
Thrampoulidis, C.; Oymak, S.; Hassibi, B. Regularized Linear Regression: A Precise Analysis of the Estimation Error. In Proceedings of the 28th Conference on Learning Theory; Grünwald, P., Hazan, E., Kale, S., Eds.; PMLR: Paris, France, 2015; Volume 40, Proceedings of Machine Learning Research. pp. 1683–1709. [Google Scholar]
Rudelson, M.; Vershynin, R. Non-Asymptotic Theory of Random Matrices: Extreme Singular Values. arXiv 2010, arXiv:1003.2990. [Google Scholar]
Adachi, S.; Iwata, S.; Nakatsukasa, Y.; Takeda, A. Solving the Trust-Region Subproblem By a Generalized Eigenvalue Problem. SIAM J. Optim. 2017, 27, 269–291. [Google Scholar] [CrossRef]
Shapiro, A.; Dentcheva, D.; Ruszczyński, A. Lectures on Stochastic Programming: Modeling and Theory, 2nd ed.; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2014. [Google Scholar]
Andersen, P.K.; Gill, R.D. Cox’s Regression Model for Counting Processes: A Large Sample Study. Ann. Statist. 1982, 10, 1100–1120. [Google Scholar] [CrossRef]
Newey, W.K.; Mcfadden, D. Chapter 36 Large sample estimation and hypothesis testing. In Handbook of Econometrics; Elsevier: Amsterdam, The Netherlands, 1994; p. 2111. [Google Scholar]
Schilling, R.L. Measures, Integrals and Martingales; Cambridge University Press: Cambridge, UK, 2005. [Google Scholar]
Shor, N. Minimization Methods for Non-Differentiable Functions; Springer: Berlin/Heidelberg, Germany, 1985. [Google Scholar]
Debbah, M.; Hachem, W.; Loubaton, P.; de Courville, M. MMSE analysis of certain large isometric random precoded systems. IEEE Trans. Inf. Theory 2003, 49, 1293–1311. [Google Scholar] [CrossRef]
Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
Sion, M. On general minimax theorems. Pac. J. Math. 1958, 8, 171–176. [Google Scholar] [CrossRef]

Figure 1. Theoretical predictions vs. numerical simulations obtained by averaging over 100 independent Monte Carlo trials with dimension $p = 2500$. (a) Binary classification with logistic loss. We take $\alpha_{s} = 10 \alpha_{t}$, $\lambda = 0.3$, $\Sigma = I_{p} / \sqrt{5}$, and $\rho = 0.85$, where $\alpha_{s} = n_{s} / p$ and $\alpha_{t} = n_{t} / p$. The functions $\varphi(\cdot)$ and $\hat{\varphi}(\cdot)$ are both the sign function. For hard transfer, we set the transfer rate to be $\delta = 0.5$. Full source transfer corresponds to $\delta = 1.0$, whereas no transfer corresponds to $\delta = 0$. (b) Nonlinear regression using quadratic loss, where $\varphi(\cdot)$ is the ReLU function and $\hat{\varphi}(\cdot)$ is the identity function. Soft identity, beta, and uniform matrices refer to different choices of the weighting matrix in (8). Soft Identity Matrix: $\Sigma$ is an identity matrix. Soft Uniform Matrix: $\Sigma$ is a random matrix with diagonal elements drawn from the uniform distribution. Soft Beta Matrix: $\Sigma$ is a random matrix with diagonal elements drawn from the beta distribution. We scale all diagonal elements of $\Sigma$ to have the same mean. We also take $\alpha_{s} = 10 \alpha_{t}$, $\lambda = 0.1$, and $\rho = 0.8$.

Figure 2. Phase transitions of the hard transfer formulation. When the similarity $\rho$ between the two tasks is small, we are in the negative transfer regime, where we should not transfer the knowledge from the source task. However, as $\rho$ moves past a critical threshold, we enter the positive transfer regime. (a) Binary classification with squared loss, with parameters $\alpha_{t} = 2$, $\alpha_{s} = 2 \alpha_{t}$, and $\lambda = 0$. Both $\varphi(\cdot)$ and $\hat{\varphi}(\cdot)$ are the sign function. (b) Nonlinear regression with squared loss, with parameters $\alpha_{t} = 2$, $\alpha_{s} = 2 \alpha_{t}$, and $\lambda = 0$. Here, $\varphi(\cdot)$ is the ReLU function and $\hat{\varphi}(\cdot)$ is the identity function.

Figure 3. Additional illustrations of the phase transition phenomenon. (a) Regression (squared loss, $\alpha_{t} = 0.5$, and $\alpha_{s} = 3 \alpha_{t}$). (b) Regression (squared loss, $\alpha_{t} = 2$, and $\alpha_{s} = 2 \alpha_{t}$). (c) Binary classification (squared loss, $\alpha_{t} = 1.5$, and $\alpha_{s} = 3 \alpha_{t}$). (d) Binary classification (hinge loss, $\alpha_{t} = 1.5$, and $\alpha_{s} = 3 \alpha_{t}$). In all the experiments, we set the regularization strength to be $\lambda = 0.1$. The blue line represents our theoretical predictions of the optimal transfer rate obtained by solving our asymptotic results in Section 4 for multiple values of $\delta$. The empirical results are averaged over 100 independent Monte Carlo trials with $p = 2500$.

Figure 4. Illustrations of the sufficient condition in Proposition 2. (a) Classification (squared loss, $\alpha_{t} = 1.5$, and $\alpha_{s} = 8 \alpha_{t}$). (b) Classification (LAD loss, $\alpha_{t} = 1.5$, and $\alpha_{s} = 8 \alpha_{t}$). In all the experiments, we set the regularization strength to be $\lambda = 0.1$. The blue line represents our theoretical predictions of the optimal transfer rate obtained by solving our asymptotic results in Section 4 for multiple values of $\delta$. The green line represents our sufficient condition for positive transfer stated in Proposition 2.

Figure 5. Continuous line: theoretical predictions. Circles: numerical simulations. (a) $\alpha_{s} = 6 \alpha_{t}$, $\lambda = 0.1$, $\beta_{t} = 1 / 10$, and $\rho = 0.9$. (b) $\alpha_{t} = 1$, $\alpha_{s} = 5 \alpha_{t}$, $\lambda = 0.3$, and $\rho = 0.75$. In all the experiments, we consider the binary classification problem with the logistic loss function. The empirical results are averaged over 50 independent Monte Carlo trials, and we set $p = 1000$.

Figure 6. Continuous line: theoretical predictions. Circles: numerical simulations. (a) $\alpha_{s} = 12 \alpha_{t}$, $\lambda = 0.2$, and $\rho = 0.75$. (b) $\alpha_{t} = 1.5$, $\alpha_{s} = 8 \alpha_{t}$, and $\lambda = 0.4$. In all the experiments, we consider the regression setting with a squared loss. The hard transfer formulation uses $\delta = 0.5$, and the soft transfer formulation uses an identity weighting matrix. The empirical results are averaged over 50 independent Monte Carlo trials, and we set $p = 1000$.

Figure 7. Continuous line: theoretical predictions. Circles: numerical simulations. (a) $\alpha_{s} = 0.5 \alpha_{t}$, $\lambda = 0.6$, and $\rho = 0.7$. We consider the regression setting with a squared loss. (b) $\alpha_{s} = 0.5 \alpha_{t}$, $\lambda = 0.3$, and $\rho = 0.8$. We consider the classification setting with a logistic loss. The hard transfer formulation uses $\delta = 0.5$, and the soft transfer formulation uses an identity weighting matrix. The empirical results are averaged over 60 independent Monte Carlo trials, and we set $p = 1000$.
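
The captions above parameterize every experiment by the sampling ratios $\alpha_{s} = n_{s} / p$ and $\alpha_{t} = n_{t} / p$ and by the similarity $\rho$. As a rough illustration of what such a setup might look like in the binary classification case, the Python sketch below draws one pair of related tasks. It is not the authors' simulation code: the function name `make_tasks`, the Gaussian feature matrices, and the construction of unit-norm teacher vectors whose inner product equals $\rho$ are assumptions made here for illustration, with the sign function playing the role of $\varphi(\cdot)$.

```python
import numpy as np

def make_tasks(p=200, alpha_t=2.0, alpha_s=4.0, rho=0.85, seed=0):
    """Draw a source/target pair of synthetic classification tasks.

    Hypothetical helper: the data model below is an assumption for
    illustration, not a reproduction of the paper's experiments.
    """
    rng = np.random.default_rng(seed)
    # Sample sizes follow alpha_s = n_s / p and alpha_t = n_t / p.
    n_s, n_t = int(alpha_s * p), int(alpha_t * p)

    # Assumed source teacher vector: random direction with unit norm.
    xi_s = rng.standard_normal(p)
    xi_s /= np.linalg.norm(xi_s)

    # Unit vector orthogonal to xi_s (one Gram-Schmidt step).
    z = rng.standard_normal(p)
    z -= (z @ xi_s) * xi_s
    z /= np.linalg.norm(z)

    # Assumed target teacher vector: unit norm, inner product rho with xi_s.
    xi_t = rho * xi_s + np.sqrt(1.0 - rho**2) * z

    # Assumed i.i.d. standard Gaussian features; labels via the sign function.
    A_s = rng.standard_normal((n_s, p))
    A_t = rng.standard_normal((n_t, p))
    y_s = np.sign(A_s @ xi_s)
    y_t = np.sign(A_t @ xi_t)
    return (A_s, y_s, xi_s), (A_t, y_t, xi_t)


if __name__ == "__main__":
    (A_s, y_s, xi_s), (A_t, y_t, xi_t) = make_tasks()
    print(A_s.shape, A_t.shape, float(xi_s @ xi_t))
```

With `p = 2500`, `alpha_t = 2`, and `alpha_s = 4`, this would give 10,000 source samples and 5,000 target samples. The hard and soft transfer training formulations themselves are not reproduced here.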

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

