Multi-Platform Multivariate Regression with Group Sparsity for High-Dimensional Data Integration

Shanshan Qin; Guanlin Zhang; Xin Gao; Yuehua Wu

doi:10.3390/e28020135

Abstract

High-dimensional regression with multivariate responses poses significant challenges when data are collected across multiple platforms, each with potentially correlated outcomes. In this paper, we introduce a multi-platform multivariate high-dimensional linear regression (MM-HLR) model for simultaneously modeling within-platform correlation and cross-platform information fusion. Our approach incorporates a mixture of Lasso and group Lasso penalties to promote both individual predictor sparsity and cross-platform group sparsity, thereby enhancing interpretability and estimation stability. We develop an efficient computational algorithm based on iteratively reweighted least squares and block coordinate descent to solve the resulting regularized optimization problem. We establish theoretical guarantees for our estimator, including oracle bounds on prediction error, estimation accuracy, and support recovery under mild conditions. Our simulation studies confirm the method’s strong empirical performance, demonstrating low bias, small variance, and robustness across various dimensions. The analysis of real financial data further validates the performance gains achieved by incorporating multivariate responses and integrating data across multiple platforms.

Keywords:

multi-platform; multivariate regression; group sparsity; high-dimensional; data integration

1. Introduction

As data complexity continues to expand in the big-data era, modern datasets increasingly display heterogeneity, high dimensionality, and multiple sources. A common example arises when experiments with the same scientific objective are conducted across different platforms or environments [1,2,3,4]. Modeling each dataset individually may result in failing to capture their intrinsic connections, which stem from their shared goal. This motivates the development of methods capable of integrating multi-platform data for unified, simultaneous analysis.

Data integration provides a timely solution by leveraging multiple data sources to enable more robust and efficient statistical inference than relying on any single source alone [5]. In a systematic review, Ref. [5] surveyed integration methods for combining probability samples with non-probability samples and big data sources (see also the references therein). Within the regression framework, a series of studies have advanced data integration methodologies. For instance, Ref. [1] introduced a pseudolikelihood information criterion for high-dimensional multi-experiment data with mixed response types and varying predictor measurements, establishing selection consistency even with unbounded model size and demonstrating through simulations that data integration substantially outperforms single-source analysis. Building on this work, Ref. [6] implemented the FusionLearn R package, which provides a fusion learning algorithm for cross-platform data analysis. Ref. [7] extended the framework to multi-platform data with sub-Gaussian or sub-exponential errors, developing a consistent model selection criterion based on composite likelihood and Bayesian posterior probabilities to recover the union support of predictors under diverging model dimensions. Meanwhile, Ref. [4] addressed multi-task feature learning with mixed continuous and discrete responses using a mixed

ℓ_{2, 1}

-regularized composite quasi-likelihood function. In a related vein, Ref. [8] proposed a quantile regression approach for high-dimensional multi-source data exhibiting heterogeneity and heavy-tailed error distributions, providing both theoretical guarantees and practical advantages in model recovery.

However, much of the existing literature focuses primarily on univariate response modeling within a single laboratory or experiment. In many modern applications, response variables are multivariate and correlated, yet share the same set of high-dimensional covariates or predictors [9]. For instance, in the UK Biobank population-based cohort study, researchers face large-scale, ultrahigh-dimensional features alongside a wide array of correlated phenotypic outcomes, including lifestyle measures, biomarkers, and disease diagnoses [10]. Such data structures have motivated extensive methodological developments in high-dimensional multivariate regression. Seminal work includes remMap by [11], designed for multivariate response regression in high-dimension–low-sample-size settings. Ref. [9] proposed a blockwise descent algorithm for group-penalized multi-response regression. Further advancing the field, Ref. [12] introduced a regularization method that enhances variable selection by efficiently eliminating irrelevant blocks of regression coefficients. More recently, Ref. [10] developed a scalable sparse reduced-rank regression method for high-dimensional multi-task learning with correlated outcomes. Ref. [13] proposed a novel framework suited for settings with large numbers of responses, response categories, and predictors. Additionally, several quantile regression approaches have been proposed to handle multiple responses under non-Gaussian or heterogeneous error settings, like [14,15], among others.

Nevertheless, relatively few works have simultaneously addressed the complexities of multivariate responses across multiple data sources. We thus propose a multi-platform multivariate high-dimensional linear regression (MM-HLR) method, designed to jointly model within-platform correlation while promoting cross-platform group sparsity. Structured or group sparsity has been well studied recently. Zhang et al. [16] presented a probabilistic framework for subset selection under partition constraints, which aligns with the group-sparsity and cross-platform fusion objectives of our study. In addition, Li et al. [12] proposed a penalized multivariate multiple linear regression model with an arbitrary group structure for the regression coefficient matrix. Besides, the mixed

ℓ_{2, 1}

penalty has similarly been employed for multitask feature learning and selection [4,17]. Kawano et al. [18] proposed a multivariate regression modeling in integrative analysis that performed group selection by group lasso estimation. In this article, we extended the mixture of

ℓ_{1}

and

ℓ_{2}

penalty formulation from [12] to simultaneously enforce individual sparsity and cross-platform group sparsity in our multi-platform setting. An efficient optimization framework, combining block coordinate descent with a proxy approximation strategy, is introduced to solve the resulting non-convex regularized problem. Theoretically, we establish non-asymptotic bounds on the prediction error, estimation accuracy, and support recovery of the MM-HLR estimator. Empirically, we evaluate the proposed method under varying sample sizes and predictor dimensionalities, benchmarking it against two alternative approaches: (i) FusionLearn [6], which is designed for multi-platform integration but treats responses as univariate, thereby ignoring within-platform correlation structures, and (ii) multivariate Group Lasso [19], which is designed for multivariate regression within a single platform and thus overlooks cross-platform grouping. Comprehensive simulation studies are conducted to demonstrate the method’s performance across diverse data-generating scenarios. Finally, we validate the proposed method through a real-world financial data analysis to assess the gains of jointly modeling multivariate responses and integrating data across multiple platforms.

The remainder of this article is organized as follows: Section 2 introduces the model framework and parameter estimation. Section 3 establishes the theoretical guarantees of the proposed estimator. Section 4 presents simulation studies to evaluate the performance of the method. We provide a real data analysis in Section 5. Finally, Section 6 concludes this article.

2. Methodology

Let

A

and

B

be two matrices.

A ⪯ B

denotes that

B - A

is positive semidefinite.

{∥ A ∥}_{F}

denotes the Frobenius norm of

A

. In addition,

{∥ A ∥}_{1}

and

{| | A | |}_{\infty}

denote the sum and the maximum of the absolute values of all entries of

A

. Let

a

be a vector. We denote its

ℓ_{2}

and

ℓ_{1}

norms by

{| | a | |}_{2}

and

{| | a | |}_{1}

, respectively.

2.1. Model Setup

Assume we have n objects and aim to model the linear relationship between m responses and p predictors for each object. To enhance measurement accuracy, each object is sent to k different platforms, each of which returns two matrices, a response matrix

Y_{i} \in R^{n \times m}

and a design matrix

X_{i} \in R^{n \times p}

,

i = 1, \dots, k

. Due to variations in equipment and measurement protocols across platforms, the scale and continuity of the results may differ [1,6]. Our goal is to identify the common set of influential predictors that affect the responses across all platforms. Moreover, because multivariate responses are measured on the same objects, there exist unknown correlations between responses within the platform and across platforms. Key terminology is summarized in Table 1.

Table 1. Data structure of response matrix

Y_{i} \in R^{n \times m}

, design matrix

X_{i} \in R^{n \times p}

and parameter matrix

B_{i} \in R^{p \times m}

for platform i, where

y_{i j s}

and

X_{i j r}

denote the j-th observed value of response s and predictor r on platform i, respectively. The corresponding regression coefficient of predictor r on response s is

β_{i r s}

,

i = 1, \dots, k; j = 1, \dots, n; s = 1, \dots, m; r = 1, \dots, p

.

From the linearity relationship between responses and predictors, we have the following:

\begin{matrix} Y_{i} = X_{i} B_{i} + E_{i}, i = 1, \dots, k \end{matrix}

(1)

where

Y_{i} \in R^{n \times m}

,

X_{i} \in R^{n \times p}

,

B_{i} \in R^{p \times m}

are respectively the response matrix, design matrix and regression parameter matrix from ith platform.

E_{i} \in R^{n \times m}

is the error matrix, where the rows of

E_{i}

are identically independent from

N_{m} (0, Σ_{i})

, with

Σ_{i} \in R^{m \times m}

being the within-platform error covariance matrix.

To simplify the modeling complexity, we treat each platform independently within the likelihood function. Accordingly, we adopt a marginal composite likelihood approach to integrate information across platforms [7,20,21], i.e., the following is calculated:

\begin{matrix} L (B) = \prod_{i = 1}^{k} f (Y_{i}; B_{i}), \end{matrix}

(2)

where

B = (B_{1}, \dots, B_{k}) \in R^{p \times k m}

, and

f (Y_{i}; B_{i})

denotes the multivariate normal density function of

Y_{i}

, that is,

\begin{matrix} f (Y_{i}; B_{i}) \propto | Σ_{i} |^{- \frac{n}{2}} exp {- \frac{1}{2} tr [(Y_{i} - X_{i} B_{i}) Σ_{i}^{- 1} {(Y_{i} - X_{i} B_{i})}^{⊤}]} . \end{matrix}

(3)

The log-likelihood function can be given by the following (excluding the constant):

\begin{matrix} ℓ (B) = - \frac{n}{2} \sum_{i = 1}^{k} log | Σ_{i} | - \frac{1}{2} \sum_{i = 1}^{k} tr [(Y_{i} - X_{i} B_{i}) Σ_{i}^{- 1} {(Y_{i} - X_{i} B_{i})}^{⊤}] . \end{matrix}

(4)

Following [1], the objective of this study is to recover a union subset of predictors, where each selected predictor is associated with at least one outcome across the platforms. To identify these commonly influential predictors, we enforce the constraint that the set of nonzero rows of

B_{k}

is identical for all platforms

k = 1, \dots, K

. Stacking the r-th row vector of all

B_{i}

s, we obtain the regression coefficient matrix for predictor r across all m responses and k platforms, i.e., written as follows:

Θ^{(r)} = {(θ_{1}^{(r)}, \dots, θ_{k}^{(r)})}^{⊤} \in R^{k \times m},

where

θ_{i}^{(r)} \in R^{m \times 1}

represents the r-th row of

B_{i}

. We assume that the sparsity among k platforms is the same with respect to the predictors, that is,

θ_{1}^{(r)}, \dots, θ_{k}^{(r)}

are either all zero or all nonzero for each

r = 1, \dots, p

. To select the influential predictors among all platforms, we add the group lasso penalty into the log-likelihood function, i.e., written as follows:

\begin{matrix} \tilde{Q} (B) = ℓ (B) - n \sum_{j = 1}^{p} Ω_{λ_{n}} (| | Θ^{(r)} {| |}_{F}), \end{matrix}

(5)

where

| | \cdot {| |}_{F}

is the Frobenius norm (square root of sum of squares of all entries). Assuming the penalty function is the mixture of

L_{1}

and

L_{2}

norms, then a sparse estimate of

B

(denoted as

\hat{B}

) can be obtained by solving the minimization problem as follows:

\begin{matrix} min_{B} \tilde{Q} (B) = min_{B} - ℓ (B) + n \sum_{i, r, s} λ_{i r s} | θ_{i r s} | + n \sum_{r = 1}^{p} λ_{g} | | Θ^{(r)} {| |}_{F} . \end{matrix}

(6)

Denote

p_{1} (B) = n \sum_{i, r, s} λ_{i r s} | θ_{i r s} |

and

p_{2} (B) = n \sum_{r = 1}^{p} λ_{g} | | Θ^{(r)} {| |}_{F}

. The first penalty

p_{1} (B)

encourages sparsity (individual coefficients to zero), and the second penalty

p_{2} (B)

encourages group sparsity for each predictor to be zero across all platforms.

2.2. Parameter Estimation

We use an iteratively reweighted least squares approach and block coordinate descent per predictor

r, r = 1, \dots, p

. For simplicity, we assume that the element-wise regularization parameters for the Lasso penalty are identical,

λ_{i r s} = λ_{e}

. In practice, the covariance matrix

Σ_{i}

among responses is typically unknown for each platform

i = 1, \dots, k

. A simple initial estimate, such as the empirical covariance of the model residuals, can be employed. The following algorithm is then implemented conditional on these estimated

Σ_{i}

s. Denote

f (B_{i}) = \frac{1}{2} tr ((Y_{i} - X_{i} B_{i}) Σ_{i}^{- 1} {(Y_{i} - X_{i} B_{i})}^{⊤})

. Then the optimization problem (6) becomes the following:

\begin{matrix} min_{B} Q (B) = min_{B} \sum_{i = 1}^{k} f (B_{i}) + n \sum_{i, r, s} λ_{e} | θ_{i r s} | + n \sum_{r = 1}^{p} λ_{g} | | Θ^{(r)} {| |}_{F} . \end{matrix}

(7)

Rewrite

X_{i} B_{i} = \sum_{r = 1}^{p} X_{i}^{(r)} θ_{i}^{(r) ⊤}

. For predictor r, by holding

Θ^{(j)}

fixed for

j \neq r

, we define residuals without predictor r for platform i,

R_{i}^{(r)} = Y_{i} - \sum_{j \neq r} X_{i}^{(j)} θ_{i}^{(j) ⊤} \in R^{n \times m}

. Then the first gradient of f, with respect to

θ_{i}^{(r)}

, is

\nabla_{θ_{i}^{(r)}} f = - Σ_{i}^{- 1} R_{i}^{(r) ⊤} X_{i}^{(r)} \in R^{m \times 1}

, where

X_{i}^{(j)} \in R^{n \times 1}

is the column j of

X_{i}, i = 1, \dots, k

. Denote the following:

\begin{matrix} S_{i}^{(r)} = - \nabla_{θ_{i}^{(r)}} f = Σ_{i}^{- 1} R_{i}^{(r) ⊤} X_{i}^{(r)} \in R^{m \times 1} . \end{matrix}

(8)

By coordinate descent algorithm, we estimate

θ_{i}^{(r)}

by fixing the rest. Let

Θ^{(j)}

be fixed for

j \neq r

, then problem (7) can be transformed to be a minimization sub-problem for predictor

r (r = 1, \dots, p)

across k platforms, i.e., the following is calculated:

min_{θ_{i}^{(r)}} Q_{r} (θ_{i}^{(r)}) = \sum_{i = 1}^{k} (\frac{∥ X_{i}^{(r)} ∥_{2}^{2}}{2} θ_{i}^{(r) ⊤} Σ_{i}^{- 1} θ_{i}^{(r)} - S_{i}^{(r) ⊤} θ_{i}^{(r)} + λ_{e} n {∥ θ_{i}^{(r)} ∥}_{1}) + λ_{g} n {∥ Θ^{(r)} ∥}_{F} .

(9)

Next, we will show how to solve the sub-problem (9). Define the element-wise soft-thresholding

S_{i, s o f t}^{(r)} = S_{n λ_{e}} (S_{i}^{(r)})

, where

S_{τ} (x) = sign (x) max (| x | - τ, 0)

, a soft-thresholding operator. In light of the block soft-thresholding solution of the group Lasso in [22], the predictor r is set to be inactive,

θ_{i}^{(r)} = 0

across all platforms if the following is met:

\begin{matrix} T_{r} = \sum_{i = 1}^{k} {∥ S_{i, s o f t}^{(r)} ∥}_{2}^{2} \leq n^{2} λ_{g}^{2} . \end{matrix}

(10)

Otherwise, the predictor r is selected to be active,

∥ Θ^{(r)} ∥_{F} \neq 0

. If so, we can update

θ_{i}^{(r)}

via a block soft-thresholding operator, as shown in [23]. Given our objective to select a union subset of active predictors across all platforms, an active predictor r may be relevant to platform i or other platforms. We therefore consider the following two scenarios for updating

θ_{i}^{(r)}

.

Assume that the predictor r in all platforms except platform i is inactive, that is,

θ_{i^{'}}^{(r)} = 0,

for

i^{'} \neq i .

Then the group Lasso norm reduces to the following:

∥ Θ^{(r)} ∥_{F} = {∥ θ_{i}^{(r)} ∥}_{2},

and the sub-problem (9) simplifies to become

\begin{matrix} min_{θ_{i}^{(r)} \in R^{m}} \frac{∥ X_{i}^{(r)} ∥_{2}^{2}}{2} θ_{i}^{(r) ⊤} Σ_{i}^{- 1} θ_{i}^{(r)} - S_{i}^{(r) ⊤} θ_{i}^{(r)} + n λ_{e} ∥ θ_{i}^{(r)} ∥_{1} + n λ_{g} {∥ θ_{i}^{(r)} ∥}_{2}, \end{matrix}

(11)

where the objective function is a sum of a smooth quadratic term and a composite nonsmooth penalty

λ_{e} {∥ \cdot ∥}_{1} + λ_{g} {∥ \cdot ∥}_{2}

. Its proximal mapping consists of sequential soft-thresholding followed by block shrinkage. Then the minimizer of the row-wise objective with respect to

θ_{i}^{(r)}

is given by the following:

θ_{i}^{(r)} = Σ_{i} \frac{S_{n (λ_{e} + λ_{g})} (S_{i}^{(r)})}{∥ X_{i}^{(r)} ∥_{2}^{2}} .

(12)

Assume that the predictor r is active across all platforms, we conduct the closed-form updating for the nonzero group r as follows. A solution to problem (9) is a minimizer of the following:

{\tilde{Q}}_{r} (θ_{i}^{(r)}) = \sum_{i = 1}^{k} (\frac{∥ X_{i}^{(r)} ∥_{2}^{2}}{2} θ_{i}^{(r) ⊤} Σ_{i}^{- 1} θ_{i}^{(r)} - S_{i, s o f t}^{(r) ⊤} θ_{i}^{(r)}) + λ_{g} n {∥ Θ^{(r)} ∥}_{F} .

(13)

Take the first derivative of

{\tilde{Q}}_{r}

with respect to

θ_{i}^{(r)}

and let them be zero, i.e., as follows:

∥ X_{i}^{(r)} ∥_{2}^{2} Σ_{i}^{- 1} θ_{i}^{(r)} - S_{i, s o f t}^{(r)} + λ_{g} n \frac{θ_{i}^{(r)}}{∥ Θ^{(r)} ∥_{F}} = 0 .

(14)

We rewrite the above equation to group the terms linear in

θ_{i}^{(r)}

, written as follows:

(∥ X_{i}^{(r)} ∥_{2}^{2} Σ_{i}^{- 1} + \frac{λ_{g} n}{∥ Θ^{(r)} ∥_{F}} I_{m}) θ_{i}^{(r)} = S_{i, s o f t}^{(r)} .

(15)

which yields a closed-form of

θ_{i}^{(r)}

, i.e.,

θ_{i}^{(r)} = {(∥ X_{i}^{(r)} ∥_{2}^{2} Σ_{i}^{- 1} + \frac{λ_{g} n}{∥ Θ^{(r)} ∥_{F}} I_{m})}^{- 1} S_{i, s o f t}^{(r)} .

(16)

2.3. Algorithm

We estimate

{B_{i}}_{i = 1}^{k}

via a nested alternating minimization procedure that consists of an outer loop updating covariance matrices and an inner loop performing row-wise block coordinate descent on regression coefficients with sparse group penalties. For the

(t + 1)

-th outer loop,

t = 0, \dots, T_{max} - 1

, we perform a fixed number of inner loops to update the regression coefficients

B_{i}^{(t, ℓ)}, ℓ = 0, 1, \dots, ℓ_{max}

while keeping the covariance matrices

Σ_{i}^{(t)}

fixed. When the inner loop arrives at the maximum iteration, we update

B_{i}^{(t + 1)}

, followed by the updating of

Σ_{i}^{(t + 1)}

and recording the residuals. Convergence of the algorithm is then assessed at the outer-iteration level based on changes in

B_{i}^{(t + 1)}

and

Σ_{i}^{(t + 1)}

. Each outer iteration, therefore, consists of a full inner loop for coefficient updates, followed by updating the regression covariance and convergence evaluation. The details are as follows:

Step 1:: Initialization $B_{i}^{(0)}$ and $Σ_{i}^{(0)}$ for each platform $i, i = 1, \dots, k$ .
Step 2:: (Outer loop) Update $B_{i}^{(t + 1)}$ across all platforms given $Σ_{i}^{(t)}, B_{i}^{(t)}$ at t-th iteration.
Substep 2.1: (Inner loop) Initialization: for predictor $r (r = 1, \dots, p)$ , $θ_{i}^{(r) (t, 0)} = θ_{i}^{(r) (t)}$ , $i = 1, \dots, k$ , where $θ_{i}^{(r) (t)}$ is the r-th row of $B_{i}^{(t)}$ .
Substep 2.2: Update $θ_{i}^{(r) (t, ℓ + 1)}$ across all platforms given $θ_{i}^{(r) (t, ℓ)}$ and $Σ_{i}^{(t)}$ . Perform row-wise group screening. Let $θ_{i}^{(r) (t, ℓ + 1)} = 0$ if $T_{r}^{(t, ℓ)} \leq n^{2} λ_{g}^{2}$ . Otherwise, predictor r is added to the active row set. For each active predictor r, we conduct row-wise coefficient updating, i.e., calculated as follows:

$θ_{i}^{(r) (t, ℓ + 1 / 2)} = Σ_{i}^{(t)} \frac{S_{n (λ_{e} + λ_{g})} (S_{i}^{(r) (t, ℓ)})}{∥ X_{i}^{(r)} ∥_{2}^{2}}$

(17)

if $∥ θ_{i^{'}}^{(r) (t, ℓ)} ∥_{1} = 0, i^{'} \neq i$ . If not, then the following is calculated:

$θ_{i}^{(r) (t, ℓ + 1 / 2)} = {(∥ X_{i}^{(r)} ∥_{2}^{2} Σ_{i}^{- 1 (t)} + \frac{λ_{g} n}{∥ Θ^{(r) (t, ℓ)} ∥_{F}} I_{m})}^{- 1} S_{i, s o f t}^{(r) (t, ℓ)} .$

(18)

Substep 2.3: For each platform i, $B_{i}^{(t, ℓ + 1)} = ({\hat{B}}_{i, \hat{A}}, 0_{{\hat{A}}^{c}})$ , where
$\hat{A} = {r : ∥ θ_{i}^{(r) (t, ℓ + 1 / 2)} ∥_{1} \neq 0}$ , and the following:

${\hat{B}}_{i, \hat{A}} = {(X_{i, \hat{A}}^{⊤} X_{i, \hat{A}} + ε I_{| \hat{A} |})}^{- 1} X_{i, \hat{A}}^{⊤} Y_{i} .$

(19)

Substep 2.4: Repeat substeps 2.2–2.3 until it reaches the maximum inner iterations $ℓ_{max}$ .
Step 3:: Update $Σ_{i}^{(t + 1)}$ via $Σ_{i}^{(t + 1)} = \frac{1}{n} {(Y_{i} - X_{i} B_{i}^{(t + 1)})}^{⊤} (Y_{i} - X_{i} B_{i}^{(t + 1)}) .$
Step 4:: Repeat steps 2–3 until it meets the termination conditions, i.e., the following:

$R_{B}^{(t + 1)} = \frac{1}{k} \sum_{i = 1}^{k} {∥B_{i}^{(t + 1)} - B_{i}^{(t)}∥}_{F}^{2} \leq δ,$

(20)

$R_{Σ}^{(t + 1)} = \frac{1}{k} \sum_{i = 1}^{k} {∥Σ_{i}^{(t + 1)} - Σ_{i}^{(t)}∥}_{F}^{2} \leq δ,$

(21)

where

δ

is a small number, say, 1 × 10⁻⁴.

We remark that, for initialization, we set

B_{i}^{(0)} = {0.1}_{p \times m}

,

Σ_{i}^{(0)} = I_{m}

, for

i = 1, \dots, k

. Tuning parameters are typically selected by evaluating a candidate set of values via cross-validation or an information criterion like the adjusted BIC. To enhance computational efficiency and avoid a costly two-dimensional grid search, we adopt a practical simplification by fixing the relationship between

λ_{g}

and

λ_{e}

, for instance, by setting

λ_{g} = λ_{e}

or

λ_{g} = λ_{e} / 2

. Consequently, we select the optimal

λ_{e}

via two-fold cross-validation from a common empirical candidate set for

λ_{e}

, such as {0.1, 0.2, ..., 2.0 } with an increment of 0.1 or smaller. The range and granularity of the candidate set can be adapted to the specific application.

3. Theoretical Property

The theoretical property is in light of [12], where their theories are built under the framework of multivariate linear regression. Herein, we extend them to multiple platforms. For platform

i (i = 1, \dots, k)

, denote

J_{1} (B_{i}) = {r s : | θ_{i r s} | \neq 0}

the index set of nonzero elements in

B_{i}

, and

J_{2} (B) = {r : | | Θ^{(r)} {| |}_{2} \neq 0, r = 1, \dots, p}

the index set of nonzero rows in

B

. We assume that the sparsity of all

B_{i}, i = 1, \dots, k

is the same. Define

M_{1} (B) = | J_{1} (B_{i}) |

and

M_{2} (B) = | J_{2} (B) |

. For any matrix

Δ \in R^{p \times m}

and

J_{1} \subseteq {r s : 1 \leq r \leq p, 1 \leq s \leq m}

, denote

Δ_{J_{1}}

as the projection of

Δ

on the index set

J_{1}

, which is a matrix with the same elements of

Δ

on coordinates

J_{1}

and zeros on the complementary coordinates

J_{1}^{c}

. Denote

Δ_{J_{2}} = {Θ_{Δ}^{(r)}, r \in J_{2}}

, where

Θ_{Δ}^{(r)}

is a

p \times m

matrix with the rth row same as

Δ

and zeros on the other rows.

Let

B_{i}^{*}

be the true regression coefficient matrices in model (1), and

\hat{B_{i}}

be its estimated counterpart for

i = 1, \dots, k

. Assume each column of the random error matrix follows a multivariate normal distribution

N_{m} (0, Σ_{i}^{*})

. Denote

u = M_{1} (B^{*})

and

v = M_{2} (B^{*})

. The theoretical framework is built on a given covariance matrix for each platform i. In practice,

Σ_{i}^{*}

is usually unknown. So we instead use an estimator

{\hat{Σ}}_{i}

. Before proceeding, we impose mild conditions on both design matrices

X_{i}

and covariance matrices

Σ_{i}^{*}

for all platforms.

Assumption 1.

Assume that the columns of

X_{i}

are centered such that the diagonal elements of the matrix

X_{i}^{⊤} X_{i} / n

are equal to 1 for all

i = 1, \dots, k

. Let

ψ_{max} > 0

be the largest eigenvalue of

X_{i}^{⊤} X_{i} / n

for all

i = 1, \dots, k

.

Assumption 2.

For a given

Σ_{i} \in R^{m \times m}

, assume that there exists a

λ^{*} \in (0, 1)

, such that

- λ^{*} I ⪯ Σ_{i} - Σ_{i}^{*} ⪯ λ^{*} I

for all

i = 1, \dots, k

. There exist positive constants

0 < C_{1} < C_{2}

such that

C_{1} \geq λ^{*}

, and the eigenvalues of

Σ_{i}^{*}

are less than

C_{2}

and larger than

C_{1}

for all

i = 1, \dots, k

.

Assumption 3.

Let

J_{1} \subseteq {r s : 1 \leq r \leq p, 1 \leq s \leq m}

and

J_{2} \subseteq {1, \dots, p}

be any index sets that satisfy

| J_{1} | \leq u, | J_{2} | \leq v

. Let

\tilde{ρ} = {ρ_{r}, r = 1, \dots, p}

be a set of positive numbers. For any nontrivial matrices

Δ_{i} \in R^{p \times m}, i = 1, \dots, k

, if it satisfies the following:

\begin{matrix} \sum_{i = 1}^{k} | Δ_{i, J_{1}^{c}} |_{1} + 2 \sum_{r \in J_{2}^{c}} ρ_{r} | | Δ^{(r)} {| |}_{2} \leq 3 \sum_{i = 1}^{k} | Δ_{i, J_{1}} |_{1} + 2 \sum_{r \in J_{2}} ρ_{r} | | Δ^{(r)} {| |}_{2}, \end{matrix}

(22)

where

Δ^{(r)}

is built by the rth rows of

Δ_{i}

for all

i = 1, \dots, k

, then the following minimums exist and are positive, i.e., the following:

\begin{matrix} κ_{1} (u, v, \tilde{ρ}) = min_{J_{1}, J_{2}, Δ \neq 0} \frac{\sum_{i = 1}^{k} {(tr [X_{i} Δ_{i} Σ_{i}^{- 1} {(X_{i} Δ_{i})}^{⊤}])}^{1 / 2}}{n^{1 / 2} \sum_{i = 1}^{k} | | Δ_{i, J_{1}} {| |}_{F}} > 0, \end{matrix}

(23)

\begin{matrix} κ_{2} (u, v, \tilde{ρ}) = min_{J_{1}, J_{2}, Δ \neq 0} \frac{\sum_{i = 1}^{k} {(tr [X_{i} Δ_{i} Σ_{i}^{- 1} {(X_{i} Δ_{i})}^{⊤}])}^{1 / 2}}{n^{1 / 2} | | Δ_{J_{2}} {| |}_{F}} > 0 . \end{matrix}

(24)

We remark that these constants involved in assumptions represent standard regularity conditions in statistical theory. While their specific values are derived from the proofs and are not tuned in practice, their existence is required to establish the desired theoretical guarantees. Furthermore, we define

s_{{\tilde{ρ}}^{*}} = \sum_{r \in J_{2} (B^{*})} ρ_{r}^{2}

,

σ^{2} = \frac{C_{1}}{{(C_{1} - λ^{*})}^{2}}, λ_{g} = ρ_{r} λ

for

r = 1, \dots, p, ρ = {min}_{r} {1, ρ_{r}}

, and

λ = 2 σ A {log (p m k) / n}^{1 / 2}

for some constant

A > \sqrt{2}

. Rewrite

κ_{1} = κ_{1} (u, v, \tilde{ρ})

and

κ_{2} = κ_{2} (u, v, \tilde{ρ})

.

Theorem 1.

Under Assumptions 1–3, the following oracle bounds for the prediction error, the estimation error, and the order of sparsity hold, with probability at least

1 - {(p m k)}^{1 - A^{2} / 2}

, i.e., written as follows:

\begin{matrix} \frac{1}{n} \sum_{i = 1}^{k} tr [(X_{i} B_{i}^{*} - X_{i} {\hat{B}}_{i}) Σ_{i}^{- 1} {(X_{i} B_{i}^{*} - X_{i} {\hat{B}}_{i})}^{⊤}] \leq 16 λ^{2} k {(\frac{u^{1 / 2}}{κ_{1}} + \frac{s_{{\tilde{ρ}}^{*}}^{1 / 2}}{κ_{2}})}^{2}, \end{matrix}

(25)

\begin{matrix} \sum_{i = 1}^{k} | | {\hat{B}}_{i} - B_{i}^{*} {| |}_{1} \leq \frac{24 λ k^{1 / 2}}{1 + ρ} {(\frac{u^{1 / 2}}{κ_{1}} + \frac{s_{{\tilde{ρ}}^{*}}^{1 / 2}}{κ_{2}})}^{2}, \end{matrix}

(26)

\begin{matrix} M_{1} ({\hat{B}}_{i}) \leq \frac{64 ψ_{max}}{(C_{1} - λ^{*})} {(\frac{u^{1 / 2}}{κ_{1}} + \frac{s_{{\tilde{ρ}}^{*}}^{1 / 2}}{κ_{2}})}^{2}, i = 1, \dots, k . \end{matrix}

(27)

The proof of Theorem 1 is given in the Appendix A. The three inequalities (25)–(27) guarantee the performance of the estimator

\hat{B}

in terms of prediction accuracy, estimation error, and sparsity. The first inequality (25) bounds the weighted mean squared prediction error across all k platforms. The bound scales with

log (p m k) / n

and depends on the sparsity and grouping structures. The second inequality (26) controls the cumulative

ℓ_{1}

-norm of

{\hat{B}}_{i}

from

B_{i}^{*}

across all platforms. This bound grows slowly with

\sqrt{log (p q m) / n}

, typical of high-dimensional settings, and depends similarly on the same sparsity and grouping structures. The third inequality (27) bounds

M_{1} ({\hat{B}}_{i})

, a measure of the model complexity or effective sparsity of the estimator. This ensures that the estimated model is not overly dense and depends on the maximum eigenvalue

ψ_{max}

.

4. Simulations

To test our method, we simulated data from model (1) with

k = 2

,

m = 3

, and

(n, p) = {(100, 100), (100, 200), (100, 500), (200, 200), (200, 500), (200, 1000)}

. For each platform i, the design matrix

X_{i}

was generated with entries drawn independently from a p-variate normal distribution with mean zero and unit standard deviation. To introduce high correlation within each row of the error matrix

E_{i}

, we set the diagonal elements of its covariance matrix

Σ_{i}

to 1 and the off-diagonal elements to 0.8. The coefficient matrix

B_{i}

was constructed with 10 active rows. Entries in these rows were sampled uniformly, either from

U (- 1.5, - 0.5)

or

U (0.5, 1.5)

.

The evaluation metrics encompass three primary aspects: parameter estimation accuracy across all platforms, prediction accuracy, and feature selection accuracy. For each platform

i = 1, \dots, k

, we measure the Frobenius norm of the deviation between the true regression coefficient matrix

B_{i}^{*}

and its estimate

{\hat{B}}_{i}

, written as follows:

B_{F, i} = {∥ B_{i}^{*} - {\hat{B}}_{i} ∥}_{F},

the deviation between the true covariance matrix

Σ_{i}^{*}

and its estimate

{\hat{Σ}}_{i}

,

Σ_{F, i} = {∥ Σ_{i}^{*} - {\hat{Σ}}_{i} ∥}_{F},

and the root mean squared errors (RMSE), i.e.,

{RMSE}_{i} = {∥ X_{i} B_{i}^{*} - X_{i} {\hat{B}}_{i} ∥}_{F} / \sqrt{n},

where

X_{i}

is out of samples for all platforms. The overall estimation accuracy of

B

and

Σ

can be summarized by the average across all k platforms, written as follows:

\bar{B_{F}} = \frac{1}{k} \sum_{i = 1}^{k} B_{F, i}, \bar{Σ_{F}} = \frac{1}{k} \sum_{i = 1}^{k} Σ_{F, i}, and \bar{RMSE} = \frac{1}{k} \sum_{i = 1}^{k} {RMSE}_{i} .

Meanwhile, feature selection performance is evaluated using sensitivity, the proportion of truly active predictors that are correctly identified, written as follows:

Sensitivity = \frac{T P}{T P + F N},

and specificity, the proportion of truly inactive predictors that are correctly excluded,

Specificity = \frac{T N}{T N + F P},

where TP and TN represent the number of predictors correctly identified as active (non-zero true coefficients with non-zero estimates) and inactive (zero true coefficients with zero estimates), respectively. Conversely, FN and FP denote the number of predictors incorrectly identified as active (zero true coefficients with non-zero estimates) and inactive (non-zero true coefficients with zero estimates), respectively. These metrics are computed for each platform, and the results are aggregated across all k platforms to provide a comprehensive assessment of feature selection performance.

We compare our proposed method with two alternative approaches, FusionLearn and multivariate Group Lasso. The FusionLearn method, implemented in the R package FusionLearn [6], is designed for multi-platform data integration but treats responses as univariate. To adapt it to our multivariate setting, we apply it separately to each of the m responses across platforms, which ignores the covariance structure among responses. Conversely, the multivariate Group Lasso method, available via the glmnet package [19], is designed for multivariate regression within a single platform. We therefore apply it independently to the data from each platform, which does not leverage information shared across platforms.

The simulation results are given in Table 2 and Table 3. As is seen in Table 2, the comparison of MM-HLR, FusionLearn, and GroupLasso reveals distinct performance in parameter estimation and prediction. In terms of parameter estimation of

B

and

Σ

, the proposed MM-HLR consistently provides estimates closest to the true ones, reflected by their smallest measures. In contrast, both FusionLearn and GroupLasso systematically overestimate the bias of parameters. The estimates of

B

and

Σ

via GroupLasso are often excessively high, especially for smaller n and larger p. In terms of prediction performance, MM-HLR demonstrates a strong advantage, achieving the lowest RMSE in every

(n, p)

setting. Its RMSE values range from 0.38 to 0.58, which are much lower than those of the other two methods. These consistent superiorities confirm MM-HLR’s excellent parameter estimation and predictive capability.

Table 2. Comparison of averaged parameter estimation accuracy (

B_{F}, Σ_{F}

) and prediction accuracy (RMSE) across different methods under various

(n, p)

configurations (standard errors in parentheses).

Table 3. Comparison of averaged feature selection accuracy (sensitivity and specificity) across different methods under various

(n, p)

configurations (standard errors in parentheses).

Table 3 presents the comparison of averaged feature selection accuracy (sensitivity and specificity) across different methods under various

(n, p)

configurations. All three methods achieve perfect sensitivity (1.0) across all pairs of

(n, p)

, indicating their superior ability to select true predictors. However, the specificity values vary significantly among the three methods. MM-HLR achieves near-perfect or perfect specificity (0.9996 to 1.0) in all scenarios. GroupLasso has the lowest specificity overall, ranging from 0.7362 to 0.9396, indicating its weaker ability to exclude irrelevant predictors than the other two methods, while the ability of FusionLearn to exclude irrelevant predictors is in the middle, exhibiting the specificity values from 0.9166 to 0.9524.

5. Real Data Analysis

Zhang et al. [7] integrated three financial indices (S&P 500, Dow Jones, and VIX) as distinct platforms. Similarly, we consider a two-platform analysis (

K = 2

) of U.S. stock market data. Our first platform comprises the S&P 500 index and its volatility index (VIX), while the second consists of the NASDAQ-100 index and its volatility index (VXN). Thus, each platform provides a two-dimensional response vector (

m = 2

). We consider stocks that are actively traded in either the S&P 500 or NASDAQ-100 indices as predictors. The analysis uses weekly data from October 2022 to September 2025. To achieve approximately independent samples, log returns are calculated for all series. After data preprocessing and the removal of missing values, the final dataset contains 156 weekly observations and 510 predictors, which are common to both platforms. The correlations between two responses within each platform are −0.79 for Platform 1 (S&P 500 and VIX) and −0.66 for Platform 2 (NASDAQ-100 and VXN). The dataset is randomly partitioned into a training set of 100 samples (

n = 100

) and a test set comprising the remaining observations (

n_{t e s t} = 56

). Our objective is to identify a common set of predictors that are relevant to at least one of the responses in either platform.

Table 4 presents the prediction errors on the test samples for Platform i

(P E_{i}, i = 1, 2)

and their average

(P E)

, where

P E_{i} = {∥ Y_{i} - X_{i} {\hat{B}}_{i} ∥}_{F} / \sqrt{n_{test}}

. We also report the number of selected predictors (

s u p p o r t

) to quantify the sparsity of the estimated model. The proposed MM-HLR method demonstrates superior overall performance compared to the benchmark methods, FusionLearn and GroupLasso, in terms of both prediction accuracy and model sparsity. Specifically, MM-HLR achieves the lowest total prediction error (

P E = 0.0828

), which is much lower than that of FusionLearn (

P E = 0.8903

) and approximately 20% lower than that of GroupLasso (

P E = 0.1067

). This advantage is consistent across both individual platforms. Furthermore, MM-HLR yields the most parsimonious model, selecting only 6 predictors. In contrast, FusionLearn and GroupLasso select 115 and 75 predictors, respectively, suggesting a higher risk of overfitting. This effective balance between prediction accuracy and model simplicity underscores the capability of the MM-HLR method to leverage both multivariate dependence and multi-platform structures within high-dimensional regression.

Table 4. The platform-specific prediction errors

(P E_{i}, i = 1, 2)

, the average prediction error (PE), and the number of selected predictors (support).

6. Conclusions and Discussion

In this paper, we introduce the multi-platform high-dimensional multivariate linear regression model, designed to simultaneously model correlated multivariate responses across multiple platforms. The proposed framework explicitly accounts for within-platform response correlation while incorporating a group Lasso penalty to fuse information across platforms, thereby promoting both individual sparsity and structured grouping of predictors. The optimization algorithm, combining iteratively reweighted least squares with block coordinate descent, is developed to solve the resulting regularization problem efficiently. Theoretical guarantees are established for the estimator

\hat{B}

, covering prediction accuracy, estimation error bounds, and support recovery (sparsity) under regularity conditions. Simulation studies under various scenarios demonstrate the outperformance of MM-HLR in parameter estimation, prediction, and variable selection, showing minimal bias, low variance, and strong robustness across all tested conditions. The superior performance of our approach is further validated through an analysis of real financial data, which confirms its effectiveness in leveraging both multivariate dependence and multi-platform structures within high-dimensional regression.

In this article, we impose a shared sparsity structure across platforms. Extending the framework to handle scenarios with only partial overlap would be a valuable direction for future work. Accordingly, we note that the method’s performance may degrade if the common sparsity pattern is violated and that formally relaxing this assumption constitutes an important avenue for future research.

Author Contributions

Conceptualization, X.G., Y.W. and S.Q.; methodology, X.G., Y.W. and S.Q.; software, S.Q. and G.Z.; validation, S.Q.; formal analysis, S.Q.; investigation, S.Q.; resources, S.Q. and G.Z.; data curation, S.Q.; writing—original draft preparation, S.Q. and G.Z.; writing—review and editing, Y.W. and S.Q.; supervision, X.G., Y.W. and S.Q.; project administration, S.Q. All authors have read and agreed to the published version of the manuscript.

Funding

Shanshan Qin is supported by the National Natural Science Foundation of China (No.12201454) and the China Scholarship Council (No.202408120083). Xin Gao is supported by the Natural Science and Engineering Research Council of Canada (No. RGPIN-2024-06202) and Yuehua Wu is supported by the Natural Science and Engineering Research Council of Canada (No.RGPIN-2023-05655).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Theoretical Proofs

We provide the proof of Theorem 1. Before proceeding, we present Lemmas A1 and A2, which are required in the proof of Theorem 1.

Lemma A1.

Define a random matrix

W_{i} = E_{i} Σ_{i}^{- 1} \in R^{n \times m}

and a random variable

V_{i r s} = X_{i}^{(r) ⊤} W_{i}^{(s)}

, where

W_{i}^{(s)} \in R^{n}

is the sth column of

W_{i}

,

s = 1, \dots, m; i = 1, \dots, k

. Define an event

A

and its complementary event

A^{c}

, respectively, as follows:

A = {| | X_{i}^{⊤} W_{i} | |_{\infty} \leq \frac{n λ}{2}, for all i = 1, \dots, k},

A^{c} = {at least one | V_{i r s} | \geq \frac{n λ}{2}, i = 1, \dots, k, r = 1, \dots, p, s = 1, \dots, m} .

By Assumptions 1 and 2, the following holds:

P r {A^{c}} \leq {(p m k)}^{1 - A^{2} / 2} \to 0,

for some constant

A > \sqrt{2}

.

Proof of Lemma A1.

Since the rows of

E_{i}

follows

N_{m} (0, Σ_{i}^{*})

independently and identically (iid), then the rows of

W_{i}

are iid with

N_{m} (0, Σ_{i}^{- 1} Σ_{i}^{*} Σ_{i}^{- 1})

random vectors. For the covariance matrix

Σ_{i}^{- 1} Σ_{i}^{*} Σ_{i}^{- 1}

, by Assumption 2, we have the following:

\begin{matrix} - λ^{*} I ⪯ Σ_{i} - Σ_{i}^{*} ⪯ λ^{*} I, Σ^{*} ⪯ C_{2} I \Rightarrow (C_{1} - λ^{*}) I ⪯ Σ_{i} ⪯ (λ^{*} + C_{2}) I \\ \Rightarrow Σ_{i}^{- 1} Σ_{i}^{*} Σ_{i}^{- 1} ⪯ Σ_{i}^{- 1} + λ^{*} Σ_{i}^{- 2} ⪯ (\frac{1}{C_{1} - λ^{*}} + \frac{λ^{*}}{{(C_{1} - λ^{*})}^{2}}) I = \frac{C_{1}}{{(C_{1} - λ^{*})}^{2}} I, \end{matrix}

which means that the diagonal elements of

Σ_{i}^{- 1} Σ_{i}^{*} Σ_{i}^{- 1}

are less than

σ^{2} ≜ \frac{C_{1}}{{(C_{1} - λ^{*})}^{2}}

. Since

W_{i}^{(s)} \sim N_{n} (0, σ_{i s}^{2} I)

, where

σ_{i s}^{2}

is the sth diagonal element of

Σ_{i}^{- 1} Σ_{i}^{*} Σ_{i}^{- 1}

. By Assumption 1,

V a r (V_{i r s}) = X_{i}^{(r) ⊤} c o v (W_{i}^{(s)}) X_{i}^{(r)} = n σ_{i s}^{2} \leq n σ

. Therefore

{(n σ_{i s}^{2})}^{- \frac{1}{2}} V_{i r s} ≜ Z

are standard normal random variables for all

i, r, s

. We have the following:

\begin{matrix} P r {A^{c}} & \leq & \sum_{r = 1}^{p} \sum_{s = 1}^{m} \sum_{i = 1}^{k} P r {| V_{i r s} | \geq \frac{n λ}{2}} \\ = & p \sum_{s = 1}^{m} \sum_{i = 1}^{k} P r {| V_{i r s} | / \sqrt{n} σ_{i s} \geq \frac{\sqrt{n} λ}{2 σ_{i s}}} \\ \leq & p \sum_{s = 1}^{m} \sum_{i = 1}^{k} P r {| Z | \geq \frac{\sqrt{n} λ}{2 σ}} \\ \leq & p m k exp (- \frac{λ^{2} n}{8 σ^{2}}) = {(p m k)}^{1 - A^{2} / 2}, \end{matrix}

where

λ = 2 σ A {log (p m k) / n}^{1 / 2}

. □

Lemma A2.

Under Assumptions 1–3, for any

B_{i} \in R^{p \times m}, i = 1, \dots, k

, with probability at least

1 - {(p m k)}^{1 - A^{2} / 2}

, the following holds:

\begin{matrix} \frac{1}{n} \sum_{i = 1}^{k} tr [(X_{i} (B_{i}^{*} - {\hat{B}}_{i})) Σ_{i}^{- 1} {(X_{i} (B_{i}^{*} - {\hat{B}}_{i}))}^{⊤}] + λ \sum_{i = 1}^{k} | | {\hat{B}}_{i} - B_{i} {| |}_{1} + 2 \sum_{r = 1}^{p} λ_{g} | | {\hat{Θ}}^{(r)} - Θ^{(r)} {| |}_{F} \end{matrix}

\begin{matrix} \leq & \frac{1}{n} \sum_{i = 1}^{k} tr [(X_{i} (B_{i}^{*} - B_{i})) Σ_{i}^{- 1} {(X_{i} (B_{i}^{*} - B_{i}))}^{⊤}] + 4 λ \sum_{i = 1}^{k} \sum_{r s \in J_{1} (B_{i})} | {\hat{θ}}_{i r s} - θ_{i r s} | + 4 \sum_{r \in J_{2} (B)} λ_{g} | | {\hat{Θ}}^{(r)} - Θ^{(r)} {| |}_{F} \\ and \end{matrix}

(A1)

\begin{matrix} M_{1} (\hat{B_{i}}) \leq \frac{4}{λ^{2} n^{2}} | | X_{i}^{⊤} X_{i} (\hat{B_{i}} - B_{i}^{*}) Σ_{i}^{- 1} {| |}_{F}^{2} . \end{matrix}

(A2)

Proof of Lemma A2.

For any

B_{i} \in R^{p \times m}, i = 1, \dots, k

,

Q (\hat{B}) \leq Q (B)

. Plugging

Y_{i} = X_{i} B_{i}^{*} + E_{i}, i = 1, \dots, k

into the above inequality, results in the following:

\begin{matrix} \frac{1}{n} \sum_{i = 1}^{k} tr [(X_{i} (B_{i}^{*} - {\hat{B}}_{i})) Σ_{i}^{- 1} {(X_{i} (B_{i}^{*} - {\hat{B}}_{i}))}^{⊤}] \end{matrix}

\begin{matrix} \leq \frac{1}{n} \sum_{i = 1}^{k} tr [(X_{i} (B_{i}^{*} - B_{i})) Σ_{i}^{- 1} {(X_{i} (B_{i}^{*} - B_{i}))}^{⊤}] \end{matrix}

\begin{matrix} + \frac{2}{n} \sum_{i = 1}^{k} | | X_{i}^{⊤} E_{i} Σ_{i}^{- 1} {| |}_{\infty} | | {\hat{B}}_{i} - B_{i} {| |}_{1} \end{matrix}

\begin{matrix} + 2 λ \sum_{i, r, s} (| θ_{i r s} | - | {\hat{θ}}_{i r s} |) + 2 \sum_{r = 1}^{p} λ_{g} (| | Θ^{(r)} {| |}_{F} - | | {\hat{Θ}}^{(r) ⊤} {| |}_{F}) . \end{matrix}

(A3)

By Lemma A1, it holds that on event

A

, the following is calculated:

\begin{matrix} \frac{1}{n} \sum_{i = 1}^{k} tr [(X_{i} (B_{i}^{*} - {\hat{B}}_{i})) Σ_{i}^{- 1} {(X_{i} (B_{i}^{*} - {\hat{B}}_{i}))}^{⊤}] + λ \sum_{i = 1}^{k} | | {\hat{B}}_{i} - B_{i} {| |}_{1} + 2 \sum_{r = 1}^{m} λ_{g} | | {\hat{Θ}}^{(r)} - Θ^{(r)} {| |}_{F} \end{matrix}

\begin{matrix} \leq \frac{1}{n} \sum_{i = 1}^{k} tr [(X_{i} (B_{i}^{*} - B_{i})) Σ_{i}^{- 1} {(X_{i} (B_{i}^{*} - B_{i}))}^{⊤}] + 2 λ \sum_{i = 1}^{k} (| | {\hat{B}}_{i} - B_{i} {| |}_{1} + | | B_{i} {| |}_{1} - | | {\hat{B}}_{i} {| |}_{1}) \end{matrix}

\begin{matrix} + 2 \sum_{r = 1}^{p} λ_{g} (| | {\hat{Θ}}^{(r)} - Θ^{(r)} {| |}_{F} + | | Θ^{(r)} {| |}_{F} - | | {\hat{Θ}}^{(r)} {| |}_{F}) \end{matrix}

\begin{matrix} \leq \frac{1}{n} \sum_{i = 1}^{k} tr [(X_{i} (B_{i}^{*} - B_{i})) Σ_{i}^{- 1} {(X_{i} (B_{i}^{*} - B_{i}))}^{⊤}] + 4 λ \sum_{i = 1}^{k} \sum_{r s \in J_{1} (B)} | {\hat{θ}}_{i r s} - θ_{i r s} | \end{matrix}

\begin{matrix} + 4 \sum_{r \in J_{2} (B)} λ_{g} | | {\hat{Θ}}^{(r)} - Θ^{(r)} {| |}_{F} . \end{matrix}

(A4)

This completes the proof of the first inequality (A1) in Lemma A2.

To prove the second inequality (A2) in Lemma A2, we use the KKT conditions. For

\forall {\hat{θ}}_{i r s} \neq 0

, the following is written:

X_{i}^{(r) ⊤} (Y_{i} - X_{i} {\hat{B}}_{i}) {(Σ_{i}^{- 1})}_{\cdot s} = 2 n λ s i g n ({\hat{θ}}_{i r s}) + \frac{2 n λ_{g} {\hat{θ}}_{i r s}}{| | {\hat{Θ}}^{(r)} {| |}_{F}},

where

{(Σ_{i}^{- 1})}_{\cdot s}

is the s column of

Σ_{i}^{- 1}

, which implies that

\begin{matrix} λ \leq \frac{1}{n} | X_{i}^{(r) ⊤} (Y_{i} - X_{i} {\hat{B}}_{i}) {(Σ_{i}^{- 1})}_{\cdot s} | . \end{matrix}

(A5)

On the other hand, we have on event

A

, written as follows:

\begin{matrix} \frac{1}{n} | X_{i}^{(r) ⊤} (Y_{i} - X_{i} {\hat{B}}_{i}) {(Σ_{i}^{- 1})}_{\cdot s} | & \leq & \frac{1}{n} | X_{i}^{(r) ⊤} X_{i} (B_{i}^{*} - {\hat{B}}_{i}) {(Σ_{i}^{- 1})}_{\cdot s} + X_{i}^{(r) ⊤} E_{i} {(Σ_{i}^{- 1})}_{\cdot s} | \end{matrix}

\begin{matrix} \leq & \frac{1}{n} | X_{i}^{(r) ⊤} X_{i} (B_{i}^{*} - {\hat{B}}_{i}) {(Σ_{i}^{- 1})}_{\cdot s} | + \frac{1}{n} | | X_{i}^{⊤} W_{i} {| |}_{\infty} \end{matrix}

\begin{matrix} \leq & \frac{1}{n} | X_{i}^{(r) ⊤} X_{i} (B_{i}^{*} - {\hat{B}}_{i}) {(Σ_{i}^{- 1})}_{\cdot s} | + \frac{λ}{2} . \end{matrix}

(A6)

Combining Equations (A5) and (A6) together, we can obtain the following:

\frac{λ}{2} \leq \frac{1}{n} | X_{i}^{(r) ⊤} X_{i} (B_{i}^{*} - {\hat{B}}_{i}) {(Σ_{i}^{- 1})}_{\cdot s} | .

Therefore, for any platform i, the following is written:

\begin{matrix} M_{1} (\hat{B_{i}}) & \leq \frac{4}{λ^{2} n^{2}} \sum_{r s \in J_{1} ({\hat{B}}_{i})} {| X_{i}^{(r) ⊤} X_{i} (B_{i}^{*} - {\hat{B}}_{i}) {(Σ_{i}^{- 1})}_{\cdot s} |}^{2} \end{matrix}

\begin{matrix} \leq \frac{4}{λ^{2} n^{2}} | | X_{i}^{⊤} X_{i} (B_{i}^{*} - {\hat{B}}_{i}) Σ_{i}^{- 1} {| |}_{F}^{2} . \end{matrix}

(A7)

This completes the proof of Lemma A2. □

Now, we detail the proof of Theorem 1.

Proof of Theorem 1.

By the first inequality (A1) in Lemma A2 and letting

B = B^{*}

, we have, on event

A

, the following:

\begin{matrix} \frac{1}{n} \sum_{i = 1}^{k} tr [(X_{i} (B_{i}^{*} - {\hat{B}}_{i})) Σ_{i}^{- 1} {(X_{i} (B_{i}^{*} - {\hat{B}}_{i}))}^{⊤}] \end{matrix}

\begin{matrix} \leq 4 λ \sum_{i = 1}^{k} \sum_{r s \in J_{1} (B_{i}^{*})} | {\hat{θ}}_{i r s} - θ_{i r s}^{*} | + 4 \sum_{r \in J_{2} (B^{*})} λ_{g} | | {\hat{Θ}}^{(r)} - Θ^{(r) *} {| |}_{F} \end{matrix}

\begin{matrix} \leq 4 λ u^{\frac{1}{2}} \sum_{i = 1}^{k} | | {({\hat{B}}_{i} - B_{i}^{*})}_{J_{1} (B_{i}^{*})} {| |}_{F} + 4 {(\sum_{r \in J_{2} (B^{*})} λ_{g}^{2})}^{\frac{1}{2}} | | {(\hat{B} - B^{*})}_{J_{2} (B^{*})} {| |}_{F}, \end{matrix}

(A8)

which is derived by Cauchy–Schwarz inequality. On event

A

, we also have the following:

\begin{matrix} λ \sum_{i = 1}^{k} | | {\hat{B}}_{i} - B_{i}^{*} {| |}_{1} + 2 \sum_{r = 1}^{p} λ_{g} | | {\hat{Θ}}^{(r)} - Θ^{(r) *} {| |}_{F} \end{matrix}

\begin{matrix} \leq 4 λ \sum_{i = 1}^{k} \sum_{r s \in J_{1} (B_{i}^{*})} | {\hat{θ}}_{i r s} - θ_{i r s}^{*} | + 4 \sum_{r \in J_{2} (B^{*})} λ_{g} | | {\hat{Θ}}^{(r)} - Θ^{(r) *} {| |}_{F}, \end{matrix}

(A9)

which, together with the fact that

\begin{matrix} | | {\hat{B}}_{i} - B_{i}^{*} {| |}_{1} = \sum_{r s \in J_{1} (B_{i}^{*})} | {\hat{θ}}_{i r s} - θ_{i r s}^{*} | + \sum_{r s \in J_{1}^{c} (B_{i}^{*})} | {\hat{θ}}_{i r s} - θ_{i r s}^{*} |, \end{matrix}

(A10)

\begin{matrix} \sum_{r = 1}^{p} | | {\hat{Θ}}^{(r)} - Θ^{(r) *} {| |}_{F} = \sum_{r \in J_{2} (B^{*})} | | {\hat{Θ}}^{(r)} - Θ^{(r) *} {| |}_{F} + \sum_{r \in J_{2}^{c} (B^{*})} | | {\hat{Θ}}^{(r)} - Θ^{(r) *} {| |}_{F}, \end{matrix}

(A11)

yield the following inequality,

\begin{matrix} λ \sum_{i = 1}^{k} \sum_{r s \in J_{1}^{c} (B_{i}^{*})} | {\hat{θ}}_{i r s} - θ_{i r s}^{*} | + 2 \sum_{r \in J_{2}^{c} (B^{*})} λ_{g} | | {\hat{Θ}}^{(r)} - Θ^{(r) *} {| |}_{F} \end{matrix}

\begin{matrix} \leq 3 λ \sum_{i = 1}^{k} \sum_{r s \in J_{1} (B_{i}^{*})} | {\hat{θ}}_{i r s} - θ_{i r s}^{*} | + 2 \sum_{r \in J_{2} (B^{*})} λ_{g} | | {\hat{Θ}}^{(r)} - Θ^{(r) *} {| |}_{F} . \end{matrix}

(A12)

Thus the inequality (22) in Assumption 3 holds with

Δ_{i} = \hat{B_{i}} - B_{i}^{*}, i = 1, \dots, k

and

ρ_{r} = λ_{g} / λ

. Therefore, we obtain the following:

\begin{matrix} \sum_{i = 1}^{k} | | {(\hat{B_{i}} - B_{i}^{*})}_{J_{1} (B^{*})} {| |}_{F} \leq \frac{\sum_{i = 1}^{k} {(tr [(X_{i} (B_{i}^{*} - {\hat{B}}_{i})) Σ_{i}^{- 1} {(X_{i} (B_{i}^{*} - {\hat{B}}_{i}))}^{⊤}])}^{\frac{1}{2}}}{n^{1 / 2} κ_{1}}, \end{matrix}

(A13)

\begin{matrix} | | {(\hat{B} - B^{*})}_{J_{2} (B^{*})} {| |}_{F} \leq \frac{\sum_{i = 1}^{k} {(tr [(X_{i} (B_{i}^{*} - {\hat{B}}_{i})) Σ_{i}^{- 1} {(X_{i} (B_{i}^{*} - {\hat{B}}_{i}))}^{⊤}])}^{\frac{1}{2}}}{n^{1 / 2} κ_{2}} . \end{matrix}

(A14)

Plugging Equations (A13) and (A14) into (A8), we have the following:

\begin{matrix} \frac{1}{n} \sum_{i = 1}^{k} tr [(X_{i} (B_{i}^{*} - {\hat{B}}_{i})) Σ_{i}^{- 1} {(X_{i} (B_{i}^{*} - {\hat{B}}_{i}))}^{⊤}] \end{matrix}

\begin{matrix} \leq (\frac{4 λ u^{\frac{1}{2}}}{n^{\frac{1}{2}} κ_{1}} + \frac{4 {(\sum_{r \in J_{2} (B^{*})} λ_{g}^{2})}^{\frac{1}{2}}}{n^{\frac{1}{2}} κ_{2}}) \sum_{i = 1}^{k} {(tr [X_{i} (B_{i}^{*} - {\hat{B}}_{i}) Σ_{i}^{- 1} {(X_{i} (B_{i}^{*} - {\hat{B}}_{i}))}^{⊤}])}^{\frac{1}{2}} \end{matrix}

\begin{matrix} = (\frac{4 λ u^{\frac{1}{2}}}{n^{\frac{1}{2}} κ_{1}} + \frac{4 λ s_{{\tilde{ρ}}^{*}}^{\frac{1}{2}}}{n^{\frac{1}{2}} κ_{2}}) \sum_{i = 1}^{k} {(tr [X_{i} (B_{i}^{*} - {\hat{B}}_{i}) Σ_{i}^{- 1} {(X_{i} (B_{i}^{*} - {\hat{B}}_{i}))}^{⊤}])}^{\frac{1}{2}}, \end{matrix}

(A15)

which yields that, by Cauchy–Schwarz inequality,

\begin{matrix} \frac{1}{n} \sum_{i = 1}^{k} tr [(X_{i} (B_{i}^{*} - {\hat{B}}_{i})) Σ_{i}^{- 1} {(X_{i} (B_{i}^{*} - {\hat{B}}_{i}))}^{⊤}] \leq 16 λ^{2} k {(\frac{u^{\frac{1}{2}}}{κ_{1}} + \frac{s_{{\tilde{ρ}}^{*}}^{\frac{1}{2}}}{κ_{2}})}^{2} . \end{matrix}

(A16)

Next, we prove the second term (26) in Theorem 1. Define

{| | B | |}_{2, 1} = \sum_{i = 1}^{k} | | B_{i} {| |}_{1} + \sum_{r = 1}^{p} | | Θ^{(r)} {| |}_{F}

. Hence, the following is calculated:

| | \hat{B} - B^{*} {| |}_{2, 1} = \sum_{i = 1}^{k} | | {\hat{B}}_{i} - B_{i}^{*} {| |}_{1} + \sum_{r = 1}^{p} | | {\hat{Θ}}^{(r)} - Θ^{(r) *} {| |}_{F} \leq 2 \sum_{i = 1}^{k} | | {\hat{B}}_{i} - B_{i}^{*} {| |}_{1} .

Then we have the following:

\begin{matrix} (λ + λ ρ_{r}) | | \hat{B} - B^{*} {| |}_{2, 1} & = λ | | \hat{B} - B^{*} {| |}_{2, 1} + ρ_{r} λ | | \hat{B} - B^{*} {| |}_{2, 1} \end{matrix}

\begin{matrix} \leq 3 λ \sum_{i = 1}^{k} | | {\hat{B}}_{i} - B_{i}^{*} {| |}_{1} + ρ_{r} λ \sum_{r = 1}^{p} | | {\hat{Θ}}^{(r)} - Θ^{(r) *} {| |}_{F} \end{matrix}

\begin{matrix} \leq 3 λ \sum_{i = 1}^{k} | | {\hat{B}}_{i} - B_{i}^{*} {| |}_{1} + \sum_{r = 1}^{p} λ_{g} | | {\hat{Θ}}^{(r)} - Θ^{(r) *} {| |}_{F} \end{matrix}

\begin{matrix} \leq 3 [λ \sum_{i = 1}^{k} | | {\hat{B}}_{i} - B_{i}^{*} {| |}_{1} + 2 \sum_{r = 1}^{p} λ_{g} | | {\hat{Θ}}^{(r)} - Θ^{(r) *} {| |}_{F}] . \end{matrix}

(A17)

By (A9), we obtain the following:

\begin{matrix} \frac{1 + ρ_{r}}{3} λ | | \hat{B} - B^{*} {| |}_{2, 1} & \leq λ \sum_{i = 1}^{k} | | {\hat{B}}_{i} - B_{i}^{*} {| |}_{1} + 2 \sum_{r = 1}^{p} λ_{g} | | {\hat{Θ}}^{(r)} - Θ^{(r) *} {| |}_{F} \end{matrix}

\begin{matrix} \leq 4 λ \sum_{i = 1}^{k} \sum_{r s \in J_{1} (B_{i}^{*})} | {\hat{θ}}_{i r s} - θ_{i r s}^{*} | + 4 \sum_{r \in J_{2} (B^{*})} λ_{g} | | {\hat{Θ}}^{(r)} - Θ^{(r) *} {| |}_{F} \end{matrix}

\begin{matrix} \leq 4 λ u^{1 / 2} \sum_{i = 1}^{k} | | {({\hat{B}}_{i} - B_{i}^{*})}_{J_{1} (B_{i}^{*})} {| |}_{F} + 4 {(\sum_{r \in J_{2} (B^{*})} λ_{g}^{2})}^{1 / 2} | | {(\hat{B} - B^{*})}_{J_{2} (B^{*})} {| |}_{F} \end{matrix}

\begin{matrix} \leq (\frac{4 λ u^{1 / 2}}{n^{1 / 2} κ_{1}} + \frac{4 λ s_{{\tilde{ρ}}^{*}}^{1 / 2}}{n^{1 / 2} κ_{2}}) \sum_{i = 1}^{k} {(tr [X_{i} (B_{i}^{*} - {\hat{B}}_{i}) Σ_{i}^{- 1} {(X_{i} (B_{i}^{*} - {\hat{B}}_{i}))}^{⊤}])}^{1 / 2} \end{matrix}

\begin{matrix} \leq (\frac{4 λ u^{1 / 2}}{n^{1 / 2} κ_{1}} + \frac{4 λ s_{{\tilde{ρ}}^{*}}^{1 / 2}}{n^{1 / 2} κ_{2}}) 4 {(k n)}^{1 / 2} λ (\frac{u^{1 / 2}}{κ_{1}} + \frac{s_{{\tilde{ρ}}^{*}}^{1 / 2}}{κ_{2}}) \end{matrix}

\begin{matrix} = 16 k^{1 / 2} λ^{2} {(\frac{u^{1 / 2}}{κ_{1}} + \frac{s_{{\tilde{ρ}}^{*}}^{1 / 2}}{κ_{2}})}^{2} . \end{matrix}

(A18)

Therefore, we obtain the following:

\begin{matrix} \sum_{i = 1}^{k} | | {\hat{B}}_{i} - B_{i}^{*} {| |}_{1} \leq \frac{1}{2} | | \hat{B} - B^{*} {| |}_{2, 1} \leq \frac{24 λ k^{1 / 2}}{1 + ρ} {(\frac{u^{1 / 2}}{κ_{1}} + \frac{s_{{\tilde{ρ}}^{*}}^{1 / 2}}{κ_{2}})}^{2} . \end{matrix}

(A19)

Finally, we prove the third term (27) in Theorem 1. By (A2) in Lemma A2, we can obtain the following:

\begin{matrix} M_{1} (\hat{B_{i}}) & = \frac{4}{λ^{2} n^{2}} | | X_{i}^{⊤} X_{i} (B_{i}^{*} - {\hat{B}}_{i}) Σ_{i}^{- 1} {| |}_{F}^{2} \end{matrix}

(A20)

\begin{matrix} = \frac{4}{λ^{2} n^{2}} tr (Σ_{i}^{- 1} {(B_{i}^{*} - {\hat{B}}_{i})}^{⊤} X_{i}^{⊤} X_{i} X_{i}^{⊤} X_{i} (B_{i}^{*} - {\hat{B}}_{i}) Σ_{i}^{- 1}) \end{matrix}

\begin{matrix} \leq \frac{4}{λ^{2} n} ψ_{max} tr (Σ_{i}^{- 1} {(B_{i}^{*} - {\hat{B}}_{i})}^{⊤} X_{i}^{⊤} X_{i} (B_{i}^{*} - {\hat{B}}_{i}) Σ_{i}^{- 1}) \end{matrix}

\begin{matrix} = \frac{4}{λ^{2} n} ψ_{max} tr (X_{i} (B_{i}^{*} - {\hat{B}}_{i}) Σ_{i}^{- 2} {(B_{i}^{*} - {\hat{B}}_{i})}^{⊤} X_{i}^{⊤}) \end{matrix}

\begin{matrix} \leq \frac{4}{λ^{2} n} \frac{ψ_{max}}{C_{1} - λ^{*}} tr (X_{i} (B_{i}^{*} - {\hat{B}}_{i}) Σ_{i}^{- 1} {(B_{i}^{*} - {\hat{B}}_{i})}^{⊤} X_{i}^{⊤}) . \end{matrix}

Since a common sparsity structure is shared across all platforms, the following is assumed:

\begin{matrix} M_{1} (\hat{B_{i}}) & \leq \frac{4 ψ_{max}}{λ^{2} n} \frac{1}{k (C_{1} - λ^{*})} \sum_{i = 1}^{k} tr [(X_{i} (B_{i}^{*} - {\hat{B}}_{i})) Σ_{i}^{- 1} {(X_{i} (B_{i}^{*} - {\hat{B}}_{i}))}^{⊤}] \end{matrix}

\begin{matrix} \leq \frac{64 ψ_{max}}{(C_{1} - λ^{*})} {(\frac{u^{1 / 2}}{κ_{1}} + \frac{s_{{\tilde{ρ}}^{*}}^{1 / 2}}{κ_{2}})}^{2} . \end{matrix}

(A21)

This completes the proof of Theorem 1. □

References

Gao, X.; Carroll, R.J. Data integration with high dimensionality. Biometrika 2017, 104, 251–272. [Google Scholar] [CrossRef]
Liu, Q.; Xu, Q.; Zheng, V.W.; Xue, H.; Cao, Z.; Yang, Q. Multi-task learning for cross-platform siRNA efficacy prediction: An in-silico study. BMC Bioinform. 2010, 11, 1–16. [Google Scholar] [CrossRef]
Zhang, K.; Gray, J.W.; Parvin, B. Sparse multitask regression for identifying common mechanism of response to therapeutic targets. Bioinformatics 2010, 26, i97–i105. [Google Scholar] [CrossRef]
Zhong, Y.; Xu, W.; Gao, X. Heterogeneous multi-task feature learning with mixed ℓ_2,1 regularization. Mach. Learn. 2024, 113, 891–932. [Google Scholar] [CrossRef]
Yang, S.; Kim, J.K. Statistical data integration in survey sampling: A review. Jpn. J. Stat. Data Sci. 2020, 3, 625–650. [Google Scholar] [CrossRef]
Gao, X.; Zhong, Y. FusionLearn: A biomarker selection algorithm on cross-platform data. Bioinformatics 2019, 35, 4465–4468. [Google Scholar] [CrossRef] [PubMed]
Zhang, G.; Wu, Y.; Gao, X. Bayesian Model Selection via Composite Likelihood for High-dimensional Data Integration. Can. J. Stat. 2024, 52, 924–938. [Google Scholar] [CrossRef]
Dai, G.; Müller, U.U.; Carroll, R.J. Data integration in high dimension with multiple quantiles. Stat. Sin. 2023, 33, 169–191. [Google Scholar] [CrossRef]
Simon, N.; Friedman, J.; Hastie, T. A blockwise descent algorithm for group-penalized multiresponse and multinomial regression. arXiv 2013, arXiv:1311.6529. [Google Scholar] [CrossRef]
Qian, J.; Tanigawa, Y.; Li, R.; Tibshirani, R.; Rivas, M.A.; Hastie, T. Large-scale multivariate sparse regression with applications to UK biobank. Ann. Appl. Stat. 2022, 16, 1891–1918. [Google Scholar] [CrossRef]
Peng, J.; Zhu, J.; Bergamaschi, A.; Han, W.; Noh, D.Y.; Pollack, J.R.; Wang, P. Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. Ann. Appl. Stat. 2010, 4, 53–77. [Google Scholar] [CrossRef]
Li, Y.; Nan, B.; Zhu, J. Multivariate sparse group lasso for the multivariate multiple linear regression with an arbitrary group structure. Biometrics 2015, 71, 354–363. [Google Scholar] [CrossRef]
Molstad, A.J.; Zhang, X. Conditional probability tensor decompositions for multivariate categorical response regression. J. Am. Stat. Assoc. 2025, 1–25. [Google Scholar] [CrossRef]
Chen, B.; Chen, C. Fast optimization methods for high-dimensional row-sparse multivariate quantile linear regression. J. Stat. Comput. Simul. 2024, 94, 69–102. [Google Scholar] [CrossRef]
Petrella, L.; Raponi, V. Joint estimation of conditional quantiles in multivariate linear regression models with an application to financial distress. J. Multivar. Anal. 2019, 173, 70–84. [Google Scholar] [CrossRef]
Zhang, Q.; Huang, W.; Jin, C.; Zhao, P.; Shu, Y.; Shen, L.; Tao, D. Multinoulli Extension: A Lossless Yet Effective Probabilistic Framework for Subset Selection over Partition Constraints. In Proceedings of the 42nd International Conference on Machine Learning, Vancouver, BC, Canada, 13–19 July 2025. [Google Scholar]
Zhong, Y.; Gao, X.; Xu, W. Robust multitask feature learning with adaptive Huber regressions. Can. J. Stat. 2025, 53, e70022. [Google Scholar] [CrossRef]
Kawano, S.; Fukushima, T.; Nakagawa, J.; Oshiki, M. Multivariate regression modeling in integrative analysis via sparse regularization. Jpn. J. Stat. Data Sci. 2025, 1–28. [Google Scholar] [CrossRef]
Friedman, J.; Hastie, T.; Tibshirani, R.; Narasimhan, B.; Tay, J.K.; Simon, N.; Yang, J. glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models. R Package Version 4.1-7. 2023. Available online: https://CRAN.R-project.org/package=glmnet (accessed on 14 January 2026).
Cox, D.R.; Reid, N. A note on pseudolikelihood constructed from marginal densities. Biometrika 2004, 91, 729–737. [Google Scholar] [CrossRef]
Gao, X.; Song, P.X.K. Composite likelihood EM algorithm with applications to multivariate hidden Markov model. Stat. Sin. 2011, 21, 165–185. [Google Scholar]
Yuan, M.; Lin, Y. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 2006, 68, 49–67. [Google Scholar] [CrossRef]
Parikh, N.; Boyd, S. Proximal algorithms. Found. Trends Optim. 2014, 1, 127–239. [Google Scholar] [CrossRef]

Table 1. Data structure of response matrix

Y_{i} \in R^{n \times m}

, design matrix

X_{i} \in R^{n \times p}

and parameter matrix

B_{i} \in R^{p \times m}

for platform i, where

y_{i j s}

and

X_{i j r}

denote the j-th observed value of response s and predictor r on platform i, respectively. The corresponding regression coefficient of predictor r on response s is

β_{i r s}

,

i = 1, \dots, k; j = 1, \dots, n; s = 1, \dots, m; r = 1, \dots, p

.

Table 1. Data structure of response matrix

Y_{i} \in R^{n \times m}

, design matrix

X_{i} \in R^{n \times p}

and parameter matrix

B_{i} \in R^{p \times m}

for platform i, where

y_{i j s}

and

X_{i j r}

denote the j-th observed value of response s and predictor r on platform i, respectively. The corresponding regression coefficient of predictor r on response s is

β_{i r s}

,

i = 1, \dots, k; j = 1, \dots, n; s = 1, \dots, m; r = 1, \dots, p

.

	Platform 1	…	Platform k
Parameter matrix	$B_{1} = (\begin{matrix} θ_{111} & \dots & θ_{11 m} \\ ⋮ & ⋱ & ⋮ \\ θ_{1 p 1} & \dots & θ_{1 p m} \end{matrix})$	…	$B_{k} = (\begin{matrix} θ_{k 11} & \dots & θ_{k 1 m} \\ ⋮ & ⋱ & ⋮ \\ θ_{k p 1} & \dots & θ_{k p m} \end{matrix})$
Response matrix	$Y_{1} = (\begin{matrix} Y_{111} & \dots & Y_{11 m} \\ ⋮ & ⋱ & ⋮ \\ Y_{1 n 1} & \dots & Y_{1 n m} \end{matrix})$	…	$Y_{k} = (\begin{matrix} Y_{k 11} & \dots & Y_{k 1 m} \\ ⋮ & ⋱ & ⋮ \\ Y_{k n 1} & \dots & Y_{k n m} \end{matrix})$
Design matrix	$X_{1} = (\begin{matrix} X_{111} & \dots & X_{11 p} \\ ⋮ & ⋱ & ⋮ \\ X_{1 n 1} & \dots & X_{1 n p} \end{matrix})$	…	$X_{k} = (\begin{matrix} X_{k 11} & \dots & X_{k 1 p} \\ ⋮ & ⋱ & ⋮ \\ X_{k n 1} & \dots & X_{k n p} \end{matrix})$

Table 2. Comparison of averaged parameter estimation accuracy (

B_{F}, Σ_{F}

) and prediction accuracy (RMSE) across different methods under various

(n, p)

configurations (standard errors in parentheses).

Table 2. Comparison of averaged parameter estimation accuracy (

B_{F}, Σ_{F}

) and prediction accuracy (RMSE) across different methods under various

(n, p)

configurations (standard errors in parentheses).

(n, p)	Methods	$B_{F, 1}$	$B_{F, 2}$	$\bar{B_{F}}$	RMSE₁	RMSE₂	$\bar{RMSE}$	$Σ_{F, 1}$	$Σ_{F, 2}$	$\bar{Σ_{F}}$
(100, 100)	MM-HLR	0.5653	0.5830	0.5742	0.5598	0.5719	0.5659	0.4486	0.4328	0.4407
	MM-HLR	(0.13)	(0.13)	(0.09)	(0.13)	(0.11)	(0.09)	(0.24)	(0.18)	(0.13)
	FusionLearn	0.7115	0.7485	0.7300	0.7065	0.7444	0.7254	0.5298	0.5157	0.5228
	FusionLearn	(0.11)	(0.13)	(0.08)	(0.11)	(0.12)	(0.08)	(0.24)	(0.19)	(0.15)
	GroupLasso	1.0402	1.0629	1.0516	1.0522	1.0782	1.0652	0.9649	1.0862	1.0256
	GroupLasso	(0.13)	(0.14)	(0.08)	(0.14)	(0.14)	(0.09)	(0.41)	(0.36)	(0.30)
(100, 200)	MM-HLR	0.5821	0.5808	0.5814	0.5770	0.5764	0.5767	0.4282	0.4152	0.4217
	MM-HLR	(0.13)	(0.13)	(0.09)	(0.13)	(0.13)	(0.10)	(0.24)	(0.24)	(0.17)
	FusionLearn	0.7494	0.7583	0.7539	0.7437	0.7463	0.7450	0.6083	0.5350	0.5717
	FusionLearn	(0.11)	(0.15)	(0.08)	(0.11)	(0.14)	(0.09)	(0.23)	(0.21)	(0.15)
	GroupLasso	1.1937	1.1821	1.1879	1.1925	1.1590	1.1757	1.2531	1.3192	1.2862
	GroupLasso	(0.16)	(0.15)	(0.11)	(0.17)	(0.16)	(0.12)	(0.48)	(0.46)	(0.33)
(100, 500)	MM-HLR	0.5594	0.5944	0.5769	0.5566	0.5864	0.5715	0.3718	0.4083	0.3901
	MM-HLR	(0.11)	(0.12)	(0.09)	(0.12)	(0.12)	(0.09)	(0.20)	(0.23)	(0.15)
	FusionLearn	0.7646	0.7613	0.7629	0.7643	0.7593	0.7618	0.8426	0.7951	0.8189
	FusionLearn	(0.11)	(0.09)	(0.08)	(0.12)	(0.10)	(0.08)	(0.20)	(0.19)	(0.12)
	GroupLasso	1.2484	1.2021	1.2252	1.2593	1.1910	1.2252	1.6113	1.6055	1.6084
	GroupLasso	(0.15)	(0.14)	(0.11)	(0.17)	(0.15)	(0.11)	(0.49)	(0.47)	(0.38)
(200, 200)	MM-HLR	0.3998	0.3861	0.3930	0.3982	0.3864	0.3923	0.2941	0.2429	0.2685
	MM-HLR	(0.08)	(0.08)	(0.06)	(0.08)	(0.08)	(0.06)	(0.15)	(0.12)	(0.10)
	FusionLearn	0.5087	0.5190	0.5138	0.5062	0.5181	0.5122	0.3608	0.3300	0.3454
	FusionLearn	(0.09)	(0.07)	(0.06)	(0.09)	(0.07)	(0.06)	(0.17)	(0.14)	(0.11)
	GroupLasso	0.7809	0.7788	0.7798	0.7768	0.7678	0.7723	0.6529	0.6957	0.6743
	GroupLasso	(0.08)	(0.08)	(0.05)	(0.09)	(0.07)	(0.05)	(0.31)	(0.28)	(0.20)
(200, 500)	MM-HLR	0.3845	0.3871	0.3858	0.3828	0.3857	0.3843	0.2695	0.2657	0.2676
	MM-HLR	(0.08)	(0.07)	(0.05)	(0.08)	(0.08)	(0.05)	(0.13)	(0.17)	(0.10)
	FusionLearn	0.5516	0.5473	0.5495	0.5511	0.5428	0.5469	0.5278	0.5431	0.5355
	FusionLearn	(0.07)	(0.09)	(0.05)	(0.07)	(0.09)	(0.05)	(0.17)	(0.17)	(0.11)
	GroupLasso	0.8847	0.8629	0.8738	0.8806	0.8539	0.8673	1.0466	1.0352	1.0409
	GroupLasso	(0.10)	(0.10)	(0.06)	(0.10)	(0.10)	(0.07)	(0.41)	(0.42)	(0.32)
(200, 1000)	MM-HLR	0.4068	0.4044	0.4056	0.4084	0.3985	0.4035	0.2506	0.2579	0.2542
	MM-HLR	(0.09)	(0.09)	(0.06)	(0.09)	(0.08)	(0.06)	(0.14)	(0.15)	(0.10)
	FusionLearn	0.5893	0.5762	0.5827	0.5893	0.5707	0.5800	0.6546	0.6876	0.6711
	FusionLearn	(0.09)	(0.10)	(0.07)	(0.09)	(0.09)	(0.07)	(0.16)	(0.15)	(0.10)
	GroupLasso	0.9324	0.9642	0.9474	0.9336	0.9592	0.9464	1.1231	1.0740	1.0985
	GroupLasso	(0.10)	(0.11)	(0.08)	(0.11)	(0.11)	(0.09)	(0.44)	(0.48)	(0.35)

Table 3. Comparison of averaged feature selection accuracy (sensitivity and specificity) across different methods under various

(n, p)

configurations (standard errors in parentheses).

Table 3. Comparison of averaged feature selection accuracy (sensitivity and specificity) across different methods under various

(n, p)

configurations (standard errors in parentheses).

(n, p)	Sensitivity			Specifitity
(n, p)	MM-HLR	FusionLearn	GroupLasso	MM-HLR	FusionLearn	GroupLasso
(100, 100)	1.0000(0.00)	1.0000(0.00)	1.0000(0.00)	0.9996(0.00)	0.9166(0.02)	0.7362(0.08)
(100, 200)	1.0000(0.00)	1.0000(0.00)	1.0000(0.00)	0.9996(0.00)	0.9257(0.02)	0.8152(0.06)
(100, 500)	1.0000(0.00)	1.0000(0.00)	1.0000(0.00)	1.0000(0.00)	0.9496(0.01)	0.8944(0.04)
(200, 200)	1.0000(0.00)	1.0000(0.00)	1.0000(0.00)	1.0000(0.00)	0.9326(0.02)	0.8392(0.05)
(200, 500)	1.0000(0.00)	1.0000(0.00)	1.0000(0.00)	0.99996(0.00)	0.9394(0.01)	0.8962(0.04)
(200, 1000)	1.0000(0.00)	1.0000(0.00)	1.0000(0.00)	1.0000(0.00)	0.9524(0.01)	0.9396(0.02)

Table 4. The platform-specific prediction errors

(P E_{i}, i = 1, 2)

, the average prediction error (PE), and the number of selected predictors (support).

Table 4. The platform-specific prediction errors

(P E_{i}, i = 1, 2)

, the average prediction error (PE), and the number of selected predictors (support).

Method	PE₁	PE₂	PE	Support
MM-HLR	0.0993	0.0663	0.0828	6
FusionLearn	0.8895	0.8912	0.8903	91
GroupLasso	0.1257	0.0877	0.1067	75

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Article Metrics

Citations

Article Access Statistics

Journal Statistics

Multiple requests from the same IP address are counted as one view.