Multi-Task Learning for Compositional Data via Sparse Network Lasso

Multi-task learning is a statistical methodology that aims to improve the generalization performance of estimation and prediction tasks by sharing common information among multiple tasks. On the other hand, compositional data consist of proportions of components summing to one. Because the components of compositional data depend on each other, existing methods for multi-task learning cannot be directly applied to them. In the framework of multi-task learning, network lasso regularization enables us to consider each sample as a single task and construct a different model for each one. In this paper, we propose a multi-task learning method for compositional data using a sparse network lasso. We focus on the symmetric form of the log-contrast model, which is a regression model with compositional covariates. Our proposed method enables us to extract latent clusters and relevant variables for compositional data by considering relationships among samples. The effectiveness of the proposed method is evaluated through simulation studies and an application to gut microbiome data. Both results show that the prediction accuracy of our proposed method is better than that of existing methods when information about the relationships among samples is appropriately obtained.


Introduction
Multi-task learning is a methodology that assumes a different model for each task and estimates these models. It is used in various fields of research, in particular life science (Argyriou et al (2008); Lengerich et al (2018)). For example, the factors of a disease may vary from patient to patient (Cowie et al, 1997). In such a case, a model that is common to all patients cannot sufficiently extract the factors of the disease. In multi-task learning, each patient is considered as a single task, and different models can be built for each patient to extract both patient-specific and common factors for the disease (Xu et al, 2015). Yamada et al (2017) proposed the localized lasso, which performs multi-task learning using network lasso regularization (Hallac et al, 2015). By treating each sample as a single task, the localized lasso simultaneously performs multi-task learning and clustering in the framework of a regression model.
On the other hand, compositional data, which consist of the proportions of a composition, are used in the fields of geology and life science for microbiome analysis. Compositional data are constrained to always take positive values summing to one. Due to these constraints, it is difficult to apply existing multi-task learning methods to compositional data. In the field of microbiome analysis, studies on gut microbiomes (Wu et al (2011); Dillon et al (2016)) have suggested that there are multiple types of gut microbiome clusters that vary from individual to individual (Arumugam et al, 2011). In the case of such data, where multiple clusters may exist, it is difficult to extract sufficient information using existing regression models for compositional data.
In this paper, we propose a multi-task learning method for compositional data, focusing on network lasso regularization and the symmetric form of the log-contrast model (Aitchison and Bacon-Shone, 1984), which is a linear regression model with compositional covariates. The symmetric form is extended to a locally symmetric form in which each sample has a different regression coefficient vector. These regression coefficient vectors are clustered by network lasso regularization. In addition, because the dimensionality of the features of compositional data has been increasing (Lin et al, 2014), we use ℓ1 regularization (Tibshirani, 1996) to perform variable selection. Estimation of the parameters included in the model is performed using an estimation algorithm based on the alternating direction method of multipliers (Boyd et al, 2011), because the model includes non-differentiable points in the ℓ1 regularization term and zero-sum constraints on the parameters. The constructed model includes regularization parameters, which are determined by cross-validation (CV).
The remainder of this paper is organized as follows. Section 2 introduces multi-task learning based on a network lasso. In Section 3, we describe the regression models for compositional data. We propose a multi-task learning method for compositional data and its estimation algorithm in Section 4. In Section 5, we discuss the effectiveness of the proposed method through Monte Carlo simulations. An application to gut microbiome data is presented in Section 6. Finally, Section 7 summarizes this paper and discusses future work.
Multi-task learning based on a network lasso

Suppose that we have n observed p-dimensional data {x_i; i = 1, ..., n} and n observed data for the response variable {y_i; i = 1, ..., n}, and these pairs {(y_i, x_i); i = 1, ..., n} are given independently. The graph R = R^T ∈ R^{n×n} is also given, where (R)_{ij} = r_{i,j} ≥ 0 represents the relationship between the sample pair (y_i, x_i) and (y_j, x_j); thus, the diagonal components are zero.
We consider the linear regression model

y_i = w_i^T x_i + ε_i, i = 1, ..., n, (1)

where w_i = (w_{i1}, ..., w_{ip})^T ∈ R^p is the p-dimensional regression coefficient vector for sample x_i, and ε_i is an error term independently distributed as N(0, σ^2). Model (1) comprises a different model for each sample. In general, the regression coefficient vectors are assumed to be identical (i.e., w_1 = ... = w_n). For Model (1), we consider the minimization problem

min_{w_1,...,w_n} Σ_{i=1}^n (y_i − w_i^T x_i)^2 + λ Σ_{i<j} r_{i,j} ‖w_i − w_j‖_2, (2)

where λ (> 0) is a regularization parameter. The second term in (2) is the network lasso regularization term (Hallac et al, 2015). For coefficient vectors w_i and w_j, the network lasso regularization term induces w_i = w_j. If these vectors are estimated to be the same, then the i-th and j-th samples are interpreted as belonging to the same cluster. In the framework of multi-task learning, the minimization problem (2) considers one sample as one task by setting a coefficient vector for each sample. This allows us to extract the information of the regression coefficient vectors separately for each task. In addition, by clustering the regression coefficient vectors using the network lasso regularization term, we can extract the common information among tasks. Yamada et al (2017) proposed the localized lasso for minimization problem (2) by adding an ℓ_{1,2}-norm regularization term (Kong et al, 2014), which induces sparsity and group structure. The localized lasso is used for multi-task learning and variable selection.
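For concreteness, the objective in (2) can be evaluated directly. The following NumPy sketch (the function name `network_lasso_objective` is ours, not from the paper) computes the squared loss with one coefficient vector per sample plus the network lasso penalty:

```python
import numpy as np

def network_lasso_objective(y, X, W, R, lam):
    """Objective (2): squared loss with a coefficient vector per
    sample, plus the network lasso penalty. W has shape (n, p);
    R is a symmetric nonnegative weight matrix with zero diagonal."""
    n = len(y)
    # per-sample predictions: x_i^T w_i
    preds = np.einsum("ij,ij->i", X, W)
    loss = np.sum((y - preds) ** 2)
    penalty = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            penalty += R[i, j] * np.linalg.norm(W[i] - W[j])
    return loss + lam * penalty
```

When all rows of W coincide, the penalty vanishes and the objective reduces to the ordinary residual sum of squares, which is the sense in which the penalty pulls samples into shared clusters.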

Regression modeling for compositional data
The p-dimensional compositional data x = (x_1, ..., x_p)^T are defined as proportional data in the simplex space

S^{p−1} = { (x_1, ..., x_p)^T : x_j > 0 (j = 1, ..., p), Σ_{j=1}^p x_j = 1 }. (3)

This structure imposes dependence between the features of the compositional data. Thus, statistical methods defined for spaces of real numbers cannot be applied (Aitchison, 1982). To overcome this problem, Aitchison and Bacon-Shone (1984) proposed the log-contrast model, which is a linear regression model with compositional covariates. Suppose that we have n observed p-dimensional compositional data {x_i; i = 1, ..., n} and n objective variable data {y_i; i = 1, ..., n}, and these pairs {(y_i, x_i); i = 1, ..., n} are given independently. Taking the p-th component as the reference, the log-contrast model is represented as

y_i = Σ_{j=1}^{p−1} β_j log(x_{ij}/x_{ip}) + ε_i, i = 1, ..., n, (4)

where β = (β_1, ..., β_{p−1})^T ∈ R^{p−1} is a regression coefficient vector. Because the model uses an arbitrary variable as a reference for all other variables, the solution changes depending on the selection of the reference. By introducing β_p = −Σ_{j=1}^{p−1} β_j, the log-contrast model is equivalently expressed in the symmetric form

y_i = z_i^T β + ε_i subject to Σ_{j=1}^p β_j = 0, (5)

where z_i = (log x_{i1}, ..., log x_{ip})^T, and β = (β_1, ..., β_p)^T ∈ R^p is a regression coefficient vector. Lin et al (2014) proposed the minimization problem

min_β ‖y − Zβ‖_2^2 + λ ‖β‖_1 subject to Σ_{j=1}^p β_j = 0, (6)

to select relevant variables in the symmetric form by adding an ℓ1 regularization term (Tibshirani, 1996), where y = (y_1, ..., y_n)^T and Z = (z_1, ..., z_n)^T. Other models that extend this symmetric form of the problem have also been proposed (Shi et al (2016); Wang and Zhao (2017); Bien et al (2021); Combettes and Müller (2021)).
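A key property of the symmetric form is that the zero-sum constraint makes predictions invariant to the scale of the composition: multiplying x by any positive constant adds log(c)·Σ β_j = 0 to z^T β. A minimal sketch illustrating this (the function name is ours):

```python
import numpy as np

def symmetric_log_contrast(x, beta):
    """Prediction z^T beta of the symmetric log-contrast model.
    beta must satisfy the zero-sum constraint sum(beta) == 0."""
    assert abs(beta.sum()) < 1e-10, "zero-sum constraint violated"
    return np.log(x) @ beta

x = np.array([0.2, 0.3, 0.5])          # a point in the simplex
beta = np.array([1.0, -0.4, -0.6])     # sums to zero
p1 = symmetric_log_contrast(x, beta)
# rescaling the composition by a constant leaves the prediction
# unchanged, because sum(beta) = 0 absorbs log of the constant
p2 = symmetric_log_contrast(7.0 * x, beta)
```

This scale invariance is what removes the dependence on the choice of reference component in the original log-contrast form (4).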

Proposed method
In this section, we propose a multi-task learning method for compositional data based on the network lasso and the symmetric form of the log-contrast model.

Model
We consider the locally symmetric form of the log-contrast model

y_i = z_i^T w_i + ε_i subject to w_i^T 1_p = 0, i = 1, ..., n, (7)

where w_i = (w_{i1}, ..., w_{ip})^T is the regression coefficient vector for the i-th sample x_i, and 1_p is the p-dimensional vector of ones. For Model (7), we consider the minimization problem

min_{w_1,...,w_n} Σ_{i=1}^n (y_i − z_i^T w_i)^2 + λ_1 Σ_{i<j} r_{i,j} ‖w_i − w_j‖_2 + λ_2 Σ_{i=1}^n ‖w_i‖_1 subject to w_i^T 1_p = 0 (i = 1, ..., n), (8)

where λ_1, λ_2 (> 0) are regularization parameters. The second term is the network lasso regularization term, which performs the clustering of the regression coefficient vectors. The third term is the ℓ1 regularization term, which performs variable selection. A regression coefficient vector w_{i*} = (w_{i*1}, ..., w_{i*p})^T for future data z_{i*} is obtained from the solution to the constrained Weber problem

min_{w_{i*}} Σ_{i=1}^n r_{i*,i} ‖w_{i*} − ŵ_i‖_2 subject to w_{i*}^T 1_p = 0, (9)

where ŵ_i is the estimated regression coefficient vector for the i-th sample.
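The full objective (8) combines the three terms above. As a sketch, assuming each row of W already satisfies the zero-sum constraint (the function name `proposed_objective` is ours):

```python
import numpy as np

def proposed_objective(y, Z, W, R, lam1, lam2):
    """Objective (8): squared loss with per-sample coefficients on
    log-transformed compositions Z, a network lasso penalty for
    clustering, and an l1 penalty for variable selection. Each row
    of W is assumed to satisfy w_i^T 1_p = 0."""
    n = len(y)
    preds = np.einsum("ij,ij->i", Z, W)       # z_i^T w_i
    loss = np.sum((y - preds) ** 2)
    net = sum(R[i, j] * np.linalg.norm(W[i] - W[j])
              for i in range(n) for j in range(i + 1, n))
    l1 = np.abs(W).sum()
    return loss + lam1 * net + lam2 * l1
```

The two regularization parameters play distinct roles: λ_1 controls how aggressively samples are fused into clusters, while λ_2 controls how many coefficients are shrunk toward zero within each sample's model.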

Estimation algorithm
To construct the estimation algorithm for the proposed method, we rewrite minimization problem (8) by introducing auxiliary variables a_{i,j} and b_i as follows:

min Σ_{i=1}^n (y_i − z_i^T w_i)^2 + λ_1 Σ_{i<j} r_{i,j} ‖a_{i,j}‖_2 + λ_2 Σ_{i=1}^n ‖b_i‖_1 subject to a_{i,j} = w_i − w_j, b_i = w_i, w_i^T 1_p = 0, (10)

where 1_p is the p-dimensional vector of ones. The augmented Lagrangian for (10) is formulated as

L = Σ_{i=1}^n (y_i − z_i^T w_i)^2 + λ_1 Σ_{i<j} r_{i,j} ‖a_{i,j}‖_2 + λ_2 Σ_{i=1}^n ‖b_i‖_1 + Σ_{i<j} s_{i,j}^T (a_{i,j} − (w_i − w_j)) + (ρ/2) Σ_{i<j} ‖a_{i,j} − (w_i − w_j)‖_2^2 + Σ_{i=1}^n t_i^T (b_i − w_i) + (φ/2) Σ_{i=1}^n ‖b_i − w_i‖_2^2 + Σ_{i=1}^n u_i w_i^T 1_p + (ψ/2) Σ_{i=1}^n (w_i^T 1_p)^2, (11)

where s_{i,j}, t_i, u_i are the Lagrange multipliers and ρ, φ, ψ (> 0) are the tuning parameters. For simplicity of notation, the parameters in the model w_i, a_{i,j}, b_i are collectively denoted as Θ, the Lagrange multipliers as Q, and the tuning parameters as Ω.
The update algorithm for Θ with the alternating direction method of multipliers (ADMM) is obtained from the minimization problem

Θ^{(k+1)} = argmin_Θ L(Θ, Q^{(k)}), (12)

where k denotes the iteration number. By repeating the updates (12) and the update for Q, the estimation algorithm for (10) is represented by Algorithm 1. The estimation algorithm for (9) is represented by Algorithm 2 with the update from ADMM. The details of the derivations of the estimation algorithms are presented in Appendices A and B.
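Two standard building blocks recur in the ADMM updates for problems of this form: the soft-thresholding operator (the proximal map of the ℓ1 term) and the Euclidean projection onto the zero-sum constraint. These are generic operators, shown here as a sketch rather than the paper's exact update formulas:

```python
import numpy as np

def soft_threshold(v, kappa):
    """Proximal operator of kappa * ||.||_1: shrinks each entry of v
    toward zero by kappa, setting small entries exactly to zero.
    This is the building block of the l1-related ADMM update."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def project_zero_sum(w):
    """Euclidean projection onto {w : w^T 1_p = 0}: subtract the
    mean, enforcing the compositional zero-sum constraint."""
    return w - w.mean()
```

In a full ADMM loop these would be applied in each iteration (soft-thresholding in the b-update, the projection or the ψ-penalized term in the w-update), alternating with gradient-ascent steps on the multipliers.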
Algorithm 2: Estimation algorithm for the constrained Weber problem (9) via ADMM (output: w_{i*}).

Simulation studies
In this section, we report simulation studies conducted on our proposed method using artificial data.
Table 1 shows the results for the mean and standard deviation of MSE. The proposed method and SNL show better prediction accuracy than CL in all settings. The reason for this may be that CL assumes only a single regression coefficient vector and thus fails to capture the true structure, which consists of three clusters. A comparison between the proposed method and SNL shows that the proposed method has higher prediction accuracy than SNL when P_R = 0.99, 0.95, and 0.90, whereas SNL shows better results in most cases when P_R = 0.80 and 0.70. This means that the proposed method is superior to SNL when the structure of the graph R is similar to the true structure. On the whole, prediction accuracy deteriorates as P_R decreases for both the proposed method and SNL, but this deterioration is more extreme for the proposed method. For both the proposed method and SNL, which assume multiple regression coefficient vectors, the standard deviation is somewhat large.

Application to gut microbiome data
The gut microbiome affects human health and disease, for example, in terms of obesity. Gut microbiomes may exhibit inter-individual heterogeneity arising from environmental factors such as diet, as well as from hormonal factors and age (Haro et al (2016); Saraswati and Sitaraman (2015)). In this section, we apply our proposed method to the real dataset reported by Wu et al (2011). This dataset is from a cross-sectional study of 98 healthy volunteers conducted at the University of Pennsylvania to investigate the connections between long-term dietary patterns and gut microbiome composition. Stool samples were collected from the subjects, and DNA samples were analyzed by 454/Roche pyrosequencing of 16S rRNA gene segments of the V1-V2 region. As a result, counts for more than 17,000 species-level OTUs were obtained. Demographic data such as body mass index (BMI), sex, and age were also obtained.
We used centered BMI as the response and the compositional data of the gut microbiome as the explanatory variables. To reduce the number of OTUs, we used single-linkage clustering based on an available phylogenetic tree to aggregate the OTUs, which is provided as the function tip_glom in the R package "phyloseq" (see McMurdie and Holmes (2013)). In this process, some closely related OTUs defined on the leaf nodes of the phylogenetic tree are aggregated into one OTU when the cophenetic distances between the OTUs are smaller than a threshold. We set the threshold at 0.5. As a result, 166 OTUs were obtained by the aggregation. Since some OTUs had zero counts, making it impossible to take the logarithm, these counts were replaced by the value one before converting them to compositional data.
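The zero-replacement and closure step described above can be sketched in a few lines (the function name `to_composition` is ours; the pseudo-count of one follows the paper):

```python
import numpy as np

def to_composition(counts, pseudo=1.0):
    """Replace zero counts with a pseudo-count (the paper uses one)
    so that logarithms are defined, then close each row to sum to
    one, yielding compositional data on the simplex."""
    counts = np.where(counts == 0, pseudo, counts).astype(float)
    return counts / counts.sum(axis=1, keepdims=True)
```

Note that the replacement is done on the raw counts before closure, so the resulting rows are strictly positive and sum exactly to one.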
We computed the graph R ∈ R^{98×98} from the pairwise distances D_ij between the i-th and j-th samples. Distance D_ij was calculated in the following two ways. (i) The Gower distance (Gower, 1971) was calculated using the sex and age data of the subjects. (ii) The log-ratio distance (e.g., see Greenacre (2018)) was calculated using the explanatory variables. Using these two ways of calculating distance, we obtained two different graphs R and the corresponding estimation results. We refer to these two ways as Proposed (i) and Proposed (ii), respectively.
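The mapping from distances D_ij to similarities r_ij is not reproduced here. As an illustration only, a Gaussian kernel is one common choice for turning a distance matrix into a graph of the required form (symmetric, nonnegative, zero diagonal); the kernel and the bandwidth parameter below are our assumptions, not the paper's formula:

```python
import numpy as np

def graph_from_distances(D, sigma=1.0):
    """Build a symmetric similarity graph R from a pairwise
    distance matrix D via a Gaussian kernel. This kernel is an
    assumed choice for illustration; the paper's exact mapping
    from D_ij to r_ij is not reproduced. The diagonal is zeroed
    as required of R."""
    R = np.exp(-D ** 2 / (2.0 * sigma ** 2))
    np.fill_diagonal(R, 0.0)
    return R
```

Any mapping that is decreasing in distance and yields a symmetric nonnegative matrix with zero diagonal would fit the definition of R given in Section 2.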
To evaluate the prediction accuracy of our proposed method, we calculated MSE by randomly splitting the dataset into 90 samples as training data and 8 samples as validation data. We again used the method of Lin et al (2014), referred to as CL, as a comparison method. The regularization parameters were determined by five-fold CV for both our proposed method and CL. The mean and standard deviation of MSE were calculated from 100 repetitions.
Table 2 shows the mean and standard deviation of MSE in the real data analysis. We observe that Proposed (i) has the smallest mean and standard deviation of MSE. This indicates that our proposed method captures the inherent structure of the compositional data when provided with an appropriate graph R. However, the standard deviation is large even for Proposed (i), which indicates that prediction accuracy may strongly depend on how the dataset is split. Table 3 shows the coefficient of determination R^2 using leave-one-out cross-validation (LOOCV) for Proposed (i) and CL. The fittings of the observed and predicted BMI are plotted in Figures 1 (a) and (b) for Proposed (i) and CL, respectively. The horizontal axis represents the centered observed BMI values and the vertical axis represents the corresponding predicted BMI. As shown, CL does not predict data with centered observed values between −10 and −5 as being in that interval, whereas Proposed (i) predicts these data correctly to some extent. Figure 2 shows the regression coefficients estimated by Proposed (i) for all samples, where the regularization parameters are determined by LOOCV. For the results in Figure 2, we used hierarchical clustering to group together similar regression coefficient vectors, setting the number of clusters as five. With our proposed method, many of the estimated regression coefficients were not exactly zero but close to zero. Thus, we treat estimated regression coefficients with |ŵ_ij| < 0.05 as being exactly zero to simplify the interpretation. Figure 3 shows only those coefficients that satisfy |ŵ_ij| ≥ 0.05 in at least one sample, where the corresponding variables are listed in Table 4.
It is reported that the human gut microbiome can be classified into three clusters called enterotypes, which are characterized by three predominant genera, Bacteroides, Prevotella, and Ruminococcus (Arumugam et al, 2011). In the dataset, OTUs of the genus-level Prevotella and Ruminococcus were aggregated into the OTUs of the family-level Prevotellaceae and Ruminococcaceae by the single-linkage clustering. In Figure 3, Bacteroides corresponds to OTU5, 6, 7, 8, 9, and 10, Prevotellaceae to OTU12, and Ruminococcaceae to OTU30 and 31. For these OTUs, the differences between OTU6, 9, 10 and OTU30, 31 are clear among samples 81-90, in which only Bacteroides are correlated to the response. On the other hand, differences are also indicated among samples 65-74, in which only Bacteroides do not affect BMI. These results suggest that Bacteroides, Prevotellaceae, and Ruminococcaceae may have different effects on BMI that are associated with enterotypes. In addition, it is reported that women with a higher abundance of Prevotellaceae are more obese (Cuevas-Sierra et al, 2020). The non-zero regression coefficients of Prevotellaceae are all positive, and the corresponding 8 samples are all female. On the other hand, for OTU29, indicating Roseburia, 9 samples out of 10 are negatively associated with BMI. Roseburia is also reported to be negatively correlated with indicators of body weight (Zeng et al, 2019).

Conclusion
We proposed a multi-task learning method for compositional data based on the network lasso and the log-contrast model. By imposing a zero-sum constraint on the model corresponding to each sample, we could extract the information of latent clusters in the regression coefficient vectors for compositional data.
In the simulation results, the proposed method worked well when clusters existed in the compositional data and an appropriate graph R was obtained.
In a human gut microbiome analysis, our proposed method provided higher prediction accuracy than the existing method by considering the heterogeneity arising from age and sex. In addition, cluster-specific OTUs, such as ones related to enterotypes, were detected in terms of their effects on BMI.
Although our proposed method shrinks some regression coefficients that do not affect the response to zero, many coefficients close to zero remain. Furthermore, in both the simulations and the human gut microbiome analysis, the prediction accuracy of the proposed method deteriorated significantly when the obtained R did not capture the true structure. Moreover, the standard deviations of MSE were high in almost all cases. We leave these issues as topics for future research.
Acknowledgments. S. K. was supported by JSPS KAKENHI Grant Numbers JP19K11854 and JP20H02227. Supercomputing resources were provided by the Human Genome Center (the Univ. of Tokyo).

Appendix A Derivations of update formulas in ADMM
A.1 Update of w

In the update of w, we minimize the terms of the augmented Lagrangian (11) depending on w_i, given in (A1). Because this subproblem is quadratic in w_i, solving ∂L/∂w_i = 0 yields the update of w_i in closed form.

In the augmented Lagrangian (B11) for the constrained Weber problem (9), u_i and v are Lagrange multipliers and µ, η (> 0) are tuning parameters.

B.1 Update of w_{i*}

In the update of w_{i*}, we minimize the terms of the augmented Lagrangian (B11) depending on w_{i*}.

B.2 Update of m

In the update of m_i, we minimize the terms of the augmented Lagrangian (B11) depending on m_i. Because the right-hand side of (B15) is the proximal map of the function f(m_i) = (r_{i*,i}/ρ) ‖m_i − w_{i*}‖_2, the updates of m are obtained by using Moreau's decomposition (e.g., Parikh and Boyd (2014); Ramakrishnan (2015)).

B.3 Update of u and v

The updates for the Lagrange multipliers u_i (i = 1, ..., n) and v are obtained by gradient ascent.

Table 1 :
Mean and standard deviation of MSE for the simulation studies.

Table 2 :
Mean and standard deviation of MSE for the real data analysis (100 repetitions).

Table 3 :
Coefficients of determination using LOOCV