Abstract
The Least Absolute Shrinkage and Selection Operator (LASSO) regression technique has proven to be a valuable tool for fitting and reducing linear models. The trend of applying LASSO to compositional data is growing, thereby expanding its applicability to diverse scientific domains. This paper aims to contribute to this evolving landscape by undertaking a comprehensive exploration of the $\ell_1$-norm for the penalty term of a LASSO regression in a compositional context. This implies first introducing a rigorous definition of the compositional $\ell_1$-norm, as the particular geometric structure of the compositional sample space needs to be taken into account. The focus is subsequently extended to a meticulous data-driven analysis of the dimension-reduction effects on linear models, providing valuable insights into the interplay between penalty term norms and model performance. An analysis of a microbial dataset illustrates the proposed approach.
MSC:
62J07; 62P10; 62H99
1. Introduction
Linear regression serves as a powerful framework for modelling relationships between variables, as it aims to capture the underlying patterns that govern the variability in the response variable. For instance, in the microbiome domain, there is a particular interest in identifying which taxa are associated with a variable of interest, for example, the inflammatory parameter sCD14. To address such complex problems, adopting LASSO regression methods [1] has emerged as a popular choice for variable selection. LASSO regression strategically applies the Euclidean $\ell_1$-norm penalisation to the model coefficients, wherein the $\ell_1$-norm represents the sum of the absolute values of these coefficients. The penalty term shrinks some regression parameters towards zero, facilitating variable selection.
While conventional regression models assume independence among covariates, this assumption fails when dealing with compositional explanatory variables. These variables are called parts of a whole, and are usually expressed in proportions, percentages, or ppm. Historically [2], the sample space of compositional data (CoDa) is taken to be the D-part unit simplex $\mathcal{S}^D$. The fundamental idea in the analysis of CoDa is that the information is relative, and is primarily contained in the ratios between parts, not in their absolute amounts. Therefore, the use of log-ratios is advocated. The analysis of CoDa, pioneered by [2], has gained increasing significance across such diverse fields as environmental science, geochemistry, microbiology, and economics. However, the integration of CoDa as covariates in regression models introduces particular challenges. The existing literature addresses these challenges, providing methodologies for regression model simplification with CoDa. The first works on penalised regression with compositional covariates [3,4,5,6] restricted the Euclidean $\ell_1$-norm to the centered log-ratio (clr) subspace when defining the penalty term. Saperas et al. [7] introduced a new norm, called the pairwise log-ratio ($\ell_1$-plr) norm, as part of a methodology of penalised regression to simplify the log-ratios on the explanatory side of the model. These log-ratios are also known as balances [8].
The primary objective of this article lies in a comprehensive comparison of the effects of different norms on the penalty term within LASSO regression with compositional explanatory variables. The choice of a norm in the penalty term is a pivotal aspect that significantly influences the regularization mechanism, and consequently the characteristics of the resulting models. To accomplish this, a precise and rigorous definition of the induced $\ell_p$-norms for CoDa (CoDa $\ell_p$-norms) within the compositional space is necessary. A comparison between the CoDa $\ell_1$-norm and other norms for compositions established in the literature is provided. Through this analysis, we seek to contribute valuable insights into the characteristics and implications of these norms in penalised regression.
The rest of this article is organised as follows. In Section 2, fundamental concepts related to the geometric structure of CoDa are outlined. In addition, some popular measures of central tendency are written as the solution of a variational problem using $\ell_p$-norms in real space. Section 3 is devoted to defining the CoDa $\ell_p$-norms on the compositional space. In Section 4, after describing the basic concepts of standard penalised regression, we analyse LASSO regression with compositional covariates using three different $\ell_1$-norms in the penalty term, with the CoDa $\ell_1$-norm among them. A comparison of the different norms is illustrated in Section 5 using a microbiome dataset. Finally, Section 6 concludes with some closing remarks.
2. Some Basic Concepts
2.1. Elements of the Aitchison Geometry
CoDa conveys relative information because the variables describe relative contributions to a given total [2]. The formal geometrical framework for the analysis of CoDa, coined the Aitchison geometry, first appeared in [9,10]. The Aitchison geometry is based on two specific operations on $\mathcal{S}^D$, called perturbation and powering, respectively defined as $\mathbf{x}\oplus\mathbf{y}=(x_1y_1,\ldots,x_Dy_D)$ and $\alpha\odot\mathbf{x}=(x_1^{\alpha},\ldots,x_D^{\alpha})$ for $\mathbf{x},\mathbf{y}\in\mathcal{S}^D$, $\alpha\in\mathbb{R}$. In order to interpret the results of these operations, one can perform closure on the result, that is, normalise the resulting vector to a unit sum by dividing each component by its total sum. Note that the closure operation provides a compositionally equivalent vector. With a vector space structure, a metric structure can be easily defined using the clr-scores of a D-part composition [2]:

$$\operatorname{clr}(\mathbf{x})=\left(\ln\frac{x_1}{g(\mathbf{x})},\ldots,\ln\frac{x_D}{g(\mathbf{x})}\right),$$

where $g(\mathbf{x})=\left(\prod_{j=1}^{D}x_j\right)^{1/D}$ is the geometric mean of the composition. Note that clr-scores are collinear, because it holds that $\sum_{j=1}^{D}\operatorname{clr}_j(\mathbf{x})=0$.
The basic metric elements of the Aitchison geometry are the inner product ($\langle\cdot,\cdot\rangle_A$), norm ($\|\cdot\|_A$), and distance ($d_A$), defined as follows:

$$\langle\mathbf{x},\mathbf{y}\rangle_A=\langle\operatorname{clr}(\mathbf{x}),\operatorname{clr}(\mathbf{y})\rangle_E,\qquad \|\mathbf{x}\|_A=\|\operatorname{clr}(\mathbf{x})\|_E,\qquad d_A(\mathbf{x},\mathbf{y})=\|\mathbf{x}\ominus\mathbf{y}\|_A=d_E(\operatorname{clr}(\mathbf{x}),\operatorname{clr}(\mathbf{y})),$$

where “A” means the Aitchison geometry, “E” the typical Euclidean geometry, and “⊖” the perturbation difference $\mathbf{x}\ominus\mathbf{y}=\mathbf{x}\oplus\left((-1)\odot\mathbf{y}\right)$. Log-ratios, like clr-scores, have become a cornerstone of CoDa analysis; nevertheless, in the literature the concept of balance between two non-overlapping groups of parts is frequently used. A balance is defined as the log-ratio between the geometric means of the parts within each group, multiplied by a constant that depends on the number of parts in each group [8].
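To fix ideas, the following minimal R sketch (hypothetical helper names, not from the paper; R is also the language of the case study below) computes clr-scores and the Aitchison metric elements:

```r
# Minimal sketch of the Aitchison metric elements (hypothetical helpers).
clr <- function(x) log(x) - mean(log(x))                  # clr-scores of a composition
a_norm <- function(x) sqrt(sum(clr(x)^2))                 # Aitchison norm
a_dist <- function(x, y) sqrt(sum((clr(x) - clr(y))^2))   # Aitchison distance

x <- c(0.2, 0.3, 0.5); y <- c(0.1, 0.6, 0.3)
a_norm(x); a_dist(x, y)
a_norm(10 * x) == a_norm(x)   # scale invariance: rescaling does not change the norm
```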
An important scale-invariant function is the log-contrast, which plays the typical role of a linear combination of variables. Given a D-part composition $\mathbf{x}$, a log-contrast is defined as any linear combination of the logarithms of the compositional parts,

$$\sum_{j=1}^{D}a_j\ln x_j,\qquad \text{with }\sum_{j=1}^{D}a_j=0. \quad (1)$$
Given a dependent variable y and an explanatory D-part composition $\mathbf{x}$, the definition of a linear regression model in terms of a log-contrast [11] is

$$y=\beta_0+\sum_{j=1}^{D}\beta_j\ln x_j+\varepsilon,\qquad \sum_{j=1}^{D}\beta_j=0, \quad (2)$$
whereas in terms of metric elements the model formulation [12] is

$$y=\beta_0+\langle\mathbf{b},\mathbf{x}\rangle_A+\varepsilon, \quad (3)$$
where $\mathbf{b}\in\mathcal{S}^D$ is the compositional gradient vector. The expressions in Equations (2) and (3) are equivalent when one considers $\beta_j=\operatorname{clr}_j(\mathbf{b})$ and $\langle\mathbf{b},\mathbf{x}\rangle_A=\langle\operatorname{clr}(\mathbf{b}),\operatorname{clr}(\mathbf{x})\rangle_E$. Because the sum of the clr-scores is zero ($\sum_{j=1}^{D}\operatorname{clr}_j(\mathbf{b})=0$), the inner product of the transformed vectors is equal to

$$\langle\operatorname{clr}(\mathbf{b}),\operatorname{clr}(\mathbf{x})\rangle_E=\langle\operatorname{clr}(\mathbf{b}),\ln\mathbf{x}\rangle_E. \quad (4)$$
For simplicity and to avoid overloading the notation, we denote $\boldsymbol{\beta}=\operatorname{clr}(\mathbf{b})$, and write the linear regression model in terms of the Euclidean inner product as

$$y=\beta_0+\langle\boldsymbol{\beta},\ln\mathbf{x}\rangle_E+\varepsilon.$$
2.2. Norms and Measures of Central Tendency
The most popular measures of the central tendency of a real variable are the median and the arithmetic mean. Both can be defined as the solution of a variational problem [13]; indeed, the median of a dataset $\mathbf{x}=(x_1,\ldots,x_n)$ is the value $\lambda$ that minimises the average absolute deviation $\frac{1}{n}\sum_{i=1}^{n}|x_i-\lambda|$. The arithmetic mean of a dataset is the value $\lambda$ that minimises the mean squared deviation $\frac{1}{n}\sum_{i=1}^{n}(x_i-\lambda)^2$. In addition, the mid-range of a dataset is the value $\lambda$ that minimises the maximum absolute deviation $\max_{i}|x_i-\lambda|$. These definitions can be generalised to any $\ell_p$-norm [13] (Chapter 3).
Definition 1.
Let $\mathbf{x}=(x_1,\ldots,x_n)$ be a dataset and let $p\in[1,\infty]$; furthermore, let $\lambda_p$ be the p-measure of central tendency that minimises the total p-deviation function $TD_p$:

$$\lambda_p=\underset{\lambda\in\mathbb{R}}{\arg\min}\; TD_p(\lambda),$$

where $TD_p(\lambda)=\|\mathbf{x}-\lambda\mathbf{1}\|_p$, $\mathbf{1}=(1,\ldots,1)\in\mathbb{R}^n$.
With this definition, the median ($\lambda_1$), arithmetic mean ($\lambda_2$), and mid-range ($\lambda_\infty$) follow as special cases for the norms $\ell_1$, $\ell_2$, and $\ell_\infty$, respectively.
Remark 1.
Convexity of the total p-deviation function $TD_p$:
- For $p=1$, the average absolute deviation is a convex function of λ; however, it is not strictly convex. Thus, the median may be a non-unique value.
- For $p>1$, the total p-deviation function is strictly convex; thus, if $\lambda_p$ exists, it is unique.
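As a quick numerical check (our own sketch, not from the paper), one can verify in R that the median, arithmetic mean, and mid-range minimise the corresponding total p-deviation functions:

```r
# Check that median, mean, and mid-range minimise TD_p for p = 1, 2, Inf.
x <- c(1.2, 3.4, 2.2, 8.9, 5.1)
td <- function(lambda, p) {
  if (is.infinite(p)) max(abs(x - lambda)) else sum(abs(x - lambda)^p)^(1 / p)
}
# Numerical minimisers over a fine grid of candidate lambda values
grid <- seq(min(x), max(x), length.out = 1e5)
sapply(c(1, 2, Inf), function(p) grid[which.min(sapply(grid, td, p = p))])
# Compare with the closed-form measures of central tendency
c(median(x), mean(x), (min(x) + max(x)) / 2)
```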
3. $\ell_p$-Norms on the Compositional Space
To define induced $\ell_p$-norms on the compositional space (CoDa $\ell_p$-norms) in a way that is compatible with the Aitchison geometry, one must capture the geometric structure of the simplex $\mathcal{S}^D$ [14]. To achieve this objective, we initially define the induced $\ell_p$-norm within the quotient space $\mathbb{R}^D/\langle\mathbf{1}\rangle$, where $\langle\mathbf{1}\rangle=\{\lambda\mathbf{1}:\lambda\in\mathbb{R}\}$. Following Brezis [15] (Chapter 11.2), an induced $\ell_p$-norm on the quotient space can be defined by inducing the Euclidean $\ell_p$-norm in $\mathbb{R}^D$ on $\mathbb{R}^D/\langle\mathbf{1}\rangle$. The underlying idea is to assign to an equivalence class the minimum value of the $\ell_p$-norm among the elements belonging to the same equivalence class.
Definition 2.
Let $\mathbf{z}=\ln\mathbf{x}$ be a log-composition. The induced $\ell_p$-norm, denoted by $\|[\mathbf{z}]\|_p$, is

$$\|[\mathbf{z}]\|_p=\min_{\lambda\in\mathbb{R}}\|\mathbf{z}-\lambda\mathbf{1}\|_p=\|\mathbf{z}-\lambda_p(\mathbf{z})\mathbf{1}\|_p,$$

where $\lambda_p(\mathbf{z})$ is the p-measure of central tendency of $\mathbf{z}$ (Definition 1) and $\|\cdot\|_p$ denotes the typical $\ell_p$-norm in the real space.
Using the logarithmic isomorphism [14], the $\ell_p$-norm can be extended to the compositional space.
Definition 3.
Let $\mathbf{x}\in\mathcal{S}^D$ be a composition. The CoDa $\ell_p$-norm, denoted by $\|\mathbf{x}\|_{A,p}$, is

$$\|\mathbf{x}\|_{A,p}=\|[\ln\mathbf{x}]\|_p=\min_{\lambda\in\mathbb{R}}\|\ln\mathbf{x}-\lambda\mathbf{1}\|_p.$$
Proposition 1.
The CoDa $\ell_p$-norm on $\mathcal{S}^D$ is a norm, and verifies the properties of the Aitchison geometry [16]:
- Scale invariance: $\|c\,\mathbf{x}\|_{A,p}=\|\mathbf{x}\|_{A,p}$, for all $c>0$.
- Permutation invariance: $\|\mathbf{x}\|_{A,p}$ is unchanged under any permutation of the parts of $\mathbf{x}$.
- Subcompositional dominance: $\|\mathbf{x}_d\|_{A,p}\le\|\mathbf{x}\|_{A,p}$, where $\mathbf{x}_d$ denotes any subcomposition formed by d parts of $\mathbf{x}$.
Proof of Proposition 1.
The proof follows directly from Definition 3. □
Following Definition 3 and the measures of central tendency described in Section 2.2, the CoDa $\ell_1$-, $\ell_2$-, and $\ell_\infty$-norms can be developed:
- The CoDa $\ell_1$-norm on $\mathcal{S}^D$ is
$$\|\mathbf{x}\|_{A,1}=\sum_{j=1}^{D}\left|\ln\frac{x_j}{\operatorname{med}(\mathbf{x})}\right|=\sum_{j=1}^{D}\left|\ln x_j-\operatorname{med}(\ln\mathbf{x})\right|,$$
where $\operatorname{med}(\mathbf{x})$ and $\operatorname{med}(\ln\mathbf{x})$ are the medians of the sets $\{x_1,\ldots,x_D\}$ and $\{\ln x_1,\ldots,\ln x_D\}$, respectively. As the logarithm function is strictly increasing, per Definition 1 the set of points that solve the variational problem for the log-transformed values corresponds precisely to the log-transformed set of points that solve the variational problem for the raw data, that is, $\operatorname{med}(\ln\mathbf{x})=\ln\operatorname{med}(\mathbf{x})$. Wu et al. [17] proposed the median of a D-part composition as an alternative denominator to the geometric mean in an attempt to extend the definition of clr-scores. In general, the performance of the median as a robust estimator of the midpoint of a dataset is better when the data have high asymmetry. The CoDa $\ell_1$-norm captures the distance between two points when movement is restricted to paths that run parallel to the coordinate axes, as is the case in a grid or city street network (Manhattan distance, Figure 1). The CoDa $\ell_1$-norm has an equivalent expression that captures the information about the ratio between the components of a composition; indeed, the median is the central point that divides a set into two equal parts, with half of the values falling below this central position and half above it. Therefore, half of the log-ratios are positive and the other half are negative. If we rearrange the parts of a composition in increasing order (small to large), i.e., $x_{(1)}\le x_{(2)}\le\cdots\le x_{(D)}$, then the CoDa $\ell_1$-norm can be written in the following manner:
$$\|\mathbf{x}\|_{A,1}=\ln\frac{x_{(k+1)}\cdots x_{(2k)}}{x_{(1)}\cdots x_{(k)}}\quad\text{if }D=2k;\qquad\|\mathbf{x}\|_{A,1}=\ln\frac{x_{(k+2)}\cdots x_{(2k+1)}}{x_{(1)}\cdots x_{(k)}}\quad\text{if }D=2k+1.$$

Figure 1. The Manhattan distance based on the CoDa $\ell_1$-norm in the simplex: the distance between two compositions in a grid-based system, where $\mathcal{C}$ is the closure operation, is represented by three paths (red, orange, and blue) of the same length (five units).
Thus, the CoDa $\ell_1$-norm is a balance between the large parts and the small parts.
- The CoDa $\ell_2$-norm on $\mathcal{S}^D$ is
$$\|\mathbf{x}\|_{A,2}=\left(\sum_{j=1}^{D}\left(\ln\frac{x_j}{g(\mathbf{x})}\right)^2\right)^{1/2}=\|\operatorname{clr}(\mathbf{x})\|_E,$$
where $g(\mathbf{x})$ is the geometric mean of the set $\{x_1,\ldots,x_D\}$. Because $\operatorname{clr}(\mathbf{x})$ belongs to the clr-subspace, the CoDa $\ell_2$-norm is the restricted Euclidean $\ell_2$-norm on the clr-subspace. This norm is commonly referred to as Aitchison’s norm [18].
- The CoDa $\ell_\infty$-norm on $\mathcal{S}^D$ is
$$\|\mathbf{x}\|_{A,\infty}=\max_{j}\left|\ln\frac{x_j}{\operatorname{mr}(\mathbf{x})}\right|=\max_{j}\left|\ln x_j-\operatorname{mr}(\ln\mathbf{x})\right|,$$
where $\operatorname{mr}(\ln\mathbf{x})$ and $\operatorname{mr}(\mathbf{x})$ are respectively the mid-range and geometric mid-range of the sets $\{\ln x_1,\ldots,\ln x_D\}$ and $\{x_1,\ldots,x_D\}$. Note that $\operatorname{mr}(\ln\mathbf{x})=\frac{1}{2}\left(\ln x_{(1)}+\ln x_{(D)}\right)=\ln\sqrt{x_{(1)}x_{(D)}}$; thus, $\operatorname{mr}(\mathbf{x})=\sqrt{x_{(1)}x_{(D)}}$. The CoDa $\ell_\infty$-norm can be interpreted as a form of log-pairwise, as the CoDa $\ell_\infty$-norm represents half of the log-pairwise of the largest part against the smallest part. This log-pairwise is the greatest among all log-pairwises in the composition:
$$\|\mathbf{x}\|_{A,\infty}=\frac{1}{2}\ln\frac{x_{(D)}}{x_{(1)}}=\frac{1}{2}\max_{i,j}\ln\frac{x_i}{x_j}.$$
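Assuming the closed forms above, the three CoDa norms can be computed with a few lines of R (a sketch with a hypothetical helper name, not from the paper):

```r
# Sketch: the three CoDa ell_p-norms via the central-tendency closed forms.
coda_norm <- function(x, p) {
  z <- log(x)
  lambda <- switch(as.character(p),
    "1"   = median(z),              # ell_1: median of the log-parts
    "2"   = mean(z),                # ell_2: log of the geometric mean
    "Inf" = (min(z) + max(z)) / 2)  # ell_inf: mid-range of the log-parts
  if (is.infinite(p)) max(abs(z - lambda)) else sum(abs(z - lambda)^p)^(1 / p)
}

x <- c(0.1, 0.2, 0.3, 0.4)
c(coda_norm(x, 1), coda_norm(x, 2), coda_norm(x, Inf))
# Scale invariance: coda_norm(5 * x, p) equals coda_norm(x, p) for each p
```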
4. Penalised Regression with a Compositional Covariate
The LASSO regression model is formulated as the combination of an $\ell_2$-norm cost function and an $\ell_1$-norm regularisation term [1]. For a real dataset $\mathbf{Z}\in\mathbb{R}^{n\times D}$ with n observations and D predictors, along with a real response vector $\mathbf{y}$ of length n, the LASSO regression model can be formulated as follows:

$$(\hat\beta_0,\hat{\boldsymbol\beta})=\underset{\beta_0,\,\boldsymbol\beta}{\arg\min}\left\{\frac{1}{2n}\|\mathbf{y}-\beta_0\mathbf{1}-\mathbf{Z}\boldsymbol\beta\|_2^2+\lambda\|\boldsymbol\beta\|_1\right\}, \quad (5)$$

where $\beta_0$ is the intercept, the vector $\boldsymbol\beta\in\mathbb{R}^D$ is the gradient, and $\lambda\ge 0$ is the penalty parameter that controls the amount of regularisation. Note that $\|\cdot\|_2$ and $\|\cdot\|_1$ refer to the Euclidean $\ell_2$ and $\ell_1$ norms in real space, respectively. For $\lambda=0$, the LASSO regression model (Equation (5)) provides the classical least squares regression model. The larger the value of $\lambda$, the greater the number of coefficients in $\boldsymbol\beta$ that are forced to be zero. The optimal value of $\lambda$ can be chosen based on cross-validation techniques and related methods [19].
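For standard real-valued covariates, this cross-validated choice of $\lambda$ can be illustrated with the glmnet R package (our own usage sketch on simulated data, not part of the paper):

```r
# Sketch: ten-fold cross-validated LASSO with glmnet on simulated data.
library(glmnet)
set.seed(1)
Z <- matrix(rnorm(100 * 10), nrow = 100)          # n = 100 observations, D = 10 predictors
y <- Z[, 1] - 2 * Z[, 2] + rnorm(100)             # sparse true model
cvfit <- cv.glmnet(Z, y, alpha = 1, nfolds = 10)  # alpha = 1 gives the l1 penalty
cvfit$lambda.min                                   # lambda minimising the CV MSE
coef(cvfit, s = "lambda.min")                      # shrunken coefficients
```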
In the case of CoDa, additional considerations must be taken into account in order to respect the compositional nature of the covariate $\mathbf{x}$ and the gradient vector. For variable selection, [3,20] wrote the LASSO model in terms of $\boldsymbol\beta$ and the log-transformed data $\ln\mathbf{x}$ instead of the clr-scores; consequently, a linear constraint on the compositional gradient coefficients is necessary:

$$(\hat\beta_0,\hat{\boldsymbol\beta})=\underset{\beta_0,\,\boldsymbol\beta}{\arg\min}\left\{\frac{1}{2n}\sum_{i=1}^{n}\left(y_i-\beta_0-\langle\boldsymbol\beta,\ln\mathbf{x}_i\rangle_E\right)^2+\lambda\|\boldsymbol\beta\|_1\right\}\quad\text{subject to }\sum_{j=1}^{D}\beta_j=0. \quad (6)$$
Most of the literature addressing the topic of penalised regression with a compositional covariate has predominantly employed the Euclidean $\ell_1$-norm in the penalty term, leading to a clr-variable selection [3,4,5,6,20,21,22].
In Equation (6), the constraint can be incorporated into the minimising function. The constraint forces the parameter vector $\boldsymbol\beta$ to be an element of the clr-subspace. Therefore, per Equation (4), the inner product $\langle\boldsymbol\beta,\ln\mathbf{x}\rangle_E$ is equivalent to $\langle\boldsymbol\beta,\operatorname{clr}(\mathbf{x})\rangle_E$. Thus, the constrained LASSO (Equation (6)) is equivalent to the following definition.
Definition 4.
Given $\lambda\ge 0$, $\beta_0\in\mathbb{R}$, the sample $\mathbf{y}\in\mathbb{R}^n$ of the response variable, and the matrix $\mathbf{X}$ whose rows $\mathbf{x}_i$ for $i=1,\ldots,n$ contain the compositional sample, the $\ell_1$-clr LASSO estimator is defined as

$$(\hat\beta_0,\hat{\boldsymbol\beta})=\underset{\beta_0,\,\boldsymbol\beta}{\arg\min}\left\{\frac{1}{2n}\sum_{i=1}^{n}\left(y_i-\beta_0-\langle\boldsymbol\beta,\operatorname{clr}(\mathbf{x}_i)\rangle_E\right)^2+\lambda\|\boldsymbol\beta\|_{\operatorname{clr},1}\right\}, \quad (7)$$

where $\|\boldsymbol\beta\|_{\operatorname{clr},1}=\sum_{j=1}^{D}|\beta_j|$, with $\boldsymbol\beta$ restricted to the clr-subspace ($\sum_{j=1}^{D}\beta_j=0$).
The key innovation here is that the linear constraint becomes embedded in the penalty term through the $\ell_1$-clr norm. This change in approach is not merely an algebraic or formal change; rather, it implies a deeper understanding of the variable selection process in CoDa. The penalty term imposes a constraint on the sum of the absolute values of the clr-scores within the gradient vector $\boldsymbol\beta$. This constraint compels the model to shrink or eliminate certain clr-scores, effectively driving them to zero. Consequently, this results in a balance selection. Without loss of generality, let us assume that the coefficient $\beta_1$ is zero. This implies that the corresponding balance between the part $x_1$ and the remaining parts has no influence on the response variable y. Therefore, the maximum variation in y is concentrated in the subspace orthogonal to this balance, i.e., the subspace of balances among the subcomposition $(x_2,\ldots,x_D)$. This selective regularization process facilitates variable selection, as $x_1$ does not influence the response variable y.
In order to establish a unified framework, the generalised LASSO problem [23] can be adapted to penalised linear models with a compositional covariate:

$$(\hat\beta_0,\hat{\boldsymbol\beta})=\underset{\beta_0,\,\boldsymbol\beta}{\arg\min}\left\{\frac{1}{2n}\|\mathbf{y}-\beta_0\mathbf{1}-\mathbf{Z}\boldsymbol\beta\|_2^2+\lambda\|\mathbf{M}\boldsymbol\beta\|_1\right\}, \quad (8)$$

where $\mathbf{Z}$ is the matrix of clr-scores, and $\|\cdot\|_2$ and $\|\cdot\|_1$ respectively refer to the Euclidean $\ell_2$ and $\ell_1$ norms in real space. The generalised LASSO problem allows for a broader range of applications by considering a matrix $\mathbf{M}$ associated with the penalty term. The matrix $\mathbf{M}$ is related to the $\ell_1$-norm considered in the penalty term. The choice of one norm over another determines the type of regularization. Different models can be formulated within the framework of the generalised LASSO problem and addressed through convex optimization algorithms. Solving each of these different penalised regression models yields distinct coefficients, each characterised by unique properties. Indeed, Definition 4 can be expressed as a generalised LASSO problem in the following manner:

$$(\hat\beta_0,\hat{\boldsymbol\beta})=\underset{\beta_0,\,\boldsymbol\beta}{\arg\min}\left\{\frac{1}{2n}\|\mathbf{y}-\beta_0\mathbf{1}-\mathbf{Z}\boldsymbol\beta\|_2^2+\lambda\|\mathbf{H}\boldsymbol\beta\|_1\right\}, \quad (9)$$
where $\mathbf{M}=\mathbf{H}$, and $\mathbf{H}=\mathbf{I}_D-\frac{1}{D}\mathbf{J}_D$ is the centering matrix onto the clr-subspace, with $\mathbf{J}_D$ the $D\times D$ matrix of ones.
On the other hand, following [7], it is possible to consider the matrix $\mathbf{M}$ equal to $\mathbf{P}$, that is, the matrix associated with the linear transformation mapping $\boldsymbol\beta$ to the vector of all pairwise differences $(\beta_i-\beta_j)_{i<j}$. Note that $\beta_i-\beta_j=\operatorname{clr}_i(\mathbf{b})-\operatorname{clr}_j(\mathbf{b})=\ln(b_i/b_j)$, which is a log-pairwise. In this case, the penalty term in a generalised LASSO problem can be written as $\|\mathbf{P}\boldsymbol\beta\|_1=\sum_{i<j}|\beta_i-\beta_j|$, meaning that the generalised LASSO problem results in the following:

$$(\hat\beta_0,\hat{\boldsymbol\beta})=\underset{\beta_0,\,\boldsymbol\beta}{\arg\min}\left\{\frac{1}{2n}\|\mathbf{y}-\beta_0\mathbf{1}-\mathbf{Z}\boldsymbol\beta\|_2^2+\lambda\sum_{i<j}|\beta_i-\beta_j|\right\}. \quad (10)$$
The model can be defined as follows.
Definition 5.
Given $\lambda\ge 0$, $\beta_0\in\mathbb{R}$, the sample $\mathbf{y}\in\mathbb{R}^n$ of the response variable, and the matrix $\mathbf{X}$ whose rows $\mathbf{x}_i$, for $i=1,\ldots,n$, contain the compositional sample, the $\ell_1$-plr LASSO estimator is defined as

$$(\hat\beta_0,\hat{\boldsymbol\beta})=\underset{\beta_0,\,\boldsymbol\beta}{\arg\min}\left\{\frac{1}{2n}\sum_{i=1}^{n}\left(y_i-\beta_0-\langle\boldsymbol\beta,\operatorname{clr}(\mathbf{x}_i)\rangle_E\right)^2+\lambda\|\boldsymbol\beta\|_{\text{plr}}\right\}, \quad (11)$$

where $\|\boldsymbol\beta\|_{\text{plr}}=\sum_{i<j}|\beta_i-\beta_j|$.
Importantly, because $\beta_i-\beta_j=\operatorname{clr}_i(\mathbf{b})-\operatorname{clr}_j(\mathbf{b})$, the penalty term shrinks the absolute values of the differences of the clr-scores within the gradient vector, which forces some pairwise differences of clr-scores to be zero, i.e., it forces equality on some clr-scores. Therefore, each set of equal clr-scores defines a subcomposition with non-influential balances within its parts. This selective regularization process facilitates balance selection [7].
Finally, using the CoDa $\ell_1$-norm introduced in Section 3, it is possible to define another generalised LASSO problem.
Definition 6.
Given $\lambda\ge 0$, $\beta_0\in\mathbb{R}$, the sample $\mathbf{y}\in\mathbb{R}^n$ of the response variable, and the matrix $\mathbf{X}$ whose rows $\mathbf{x}_i$ for $i=1,\ldots,n$ contain the compositional sample, the CoDa $\ell_1$-norm LASSO estimator is defined as

$$(\hat\beta_0,\hat{\boldsymbol\beta})=\underset{\beta_0,\,\boldsymbol\beta}{\arg\min}\left\{\frac{1}{2n}\sum_{i=1}^{n}\left(y_i-\beta_0-\langle\boldsymbol\beta,\operatorname{clr}(\mathbf{x}_i)\rangle_E\right)^2+\lambda\|\boldsymbol\beta\|_{A,1}\right\}, \quad (12)$$

where $\|\boldsymbol\beta\|_{A,1}=\sum_{j=1}^{D}\left|\beta_j-\operatorname{med}(\boldsymbol\beta)\right|$.
In this case, the penalty term compels certain coefficients to be equal to the median of the coefficients, which in particular ensures equality among them. Consequently, the effect produced is also a balance selection, as in the previous case; however, unlike the $\ell_1$-plr LASSO estimator, with the CoDa $\ell_1$-norm estimator there is only one set of equal clr-scores, and all non-influential balances belong to a single subcomposition.
As there is no algebraic formula to express the median, it is necessary to include a new variable $\gamma$ in the penalty term when formulating the minimization problem (Equation (12)) as a generalised LASSO problem:

$$(\hat\beta_0,\hat{\boldsymbol\beta},\hat\gamma)=\underset{\beta_0,\,\boldsymbol\beta,\,\gamma}{\arg\min}\left\{\frac{1}{2n}\|\mathbf{y}-\beta_0\mathbf{1}-\mathbf{Z}\boldsymbol\beta\|_2^2+\lambda\|\boldsymbol\beta-\gamma\mathbf{1}\|_1\right\}, \quad (13)$$

where $\gamma\in\mathbb{R}$ plays the role of the median $\operatorname{med}(\boldsymbol\beta)$, and the matrix $\mathbf{M}=\left(\mathbf{I}_D\;\;-\mathbf{1}\right)$ is associated with the transformation $(\boldsymbol\beta,\gamma)\mapsto\boldsymbol\beta-\gamma\mathbf{1}$.
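To make the three penalty matrices concrete, the following R sketch (our own illustration; the names H, P, and M_coda follow the notation assumed above) constructs the matrix $\mathbf{M}$ for each penalty:

```r
# Sketch: penalty matrices M for the three generalised LASSO problems (D parts).
D <- 4

# ell_1-clr: centering matrix H, so that H %*% beta gives centered coefficients
H <- diag(D) - matrix(1 / D, D, D)

# ell_1-plr: pairwise-difference matrix P, rows indexed by pairs i < j
pairs <- t(combn(D, 2))
P <- matrix(0, nrow(pairs), D)
P[cbind(seq_len(nrow(pairs)), pairs[, 1])] <- 1
P[cbind(seq_len(nrow(pairs)), pairs[, 2])] <- -1

# CoDa ell_1: augmented matrix acting on (beta, gamma), giving beta - gamma * 1
M_coda <- cbind(diag(D), -rep(1, D))
```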
5. Case Study
We used the microbial dataset analysed in [24,25] to compare the different $\ell_1$-norms in a CoDa LASSO regression problem. The dataset, collected and described in [24], comprises the compositions of 60 taxa spanning various taxonomic levels (e.g., g for genus, f for family, o for order, and k for kingdom) within a set of 151 individuals. The dependent variable y is an inflammatory parameter, specifically, the level of soluble CD14 (sCD14 variable) measured for each individual. These data are available in the R package coda4microbiome [26]. An individual having a zero value recorded for some parts indicates that certain taxa were not detected. A zero value prevents the application of the log-ratio methodology. Following a procedure analogous to that in [26], the genus Thalassospira, an unclassified genus of the class Alphaproteobacteria, and an unclassified genus of the family Porphyromonadaceae, all with a high proportion of zeros, were removed. The rest of the zeros recorded in the remaining 57 taxa were replaced by a small value using an imputation method [27,28]. Because the zeros are of count type, it is appropriate to apply methods based on Dirichlet-multinomial duality [29].
To solve the convex optimization problems in Equations (9), (10), and (13), we first select the optimal parameter $\lambda$ for the penalised model by performing ten-fold cross-validation. Each iteration involves dividing the data into ten equal parts, training the model on nine of them, and then evaluating it on the remaining part; the value of $\lambda$ producing the lowest Mean Squared Error (MSE) is selected. With the parameter $\lambda$ selected, we proceeded to solve the optimization problem in order to find the parameters $\beta_0$ and $\boldsymbol\beta$. The CVXR package in R version 4.3.2 [30] offers an interface for defining and solving convex optimization problems. CVXR utilises a domain-specific language, making it user-friendly and allowing users to express optimization problems in natural mathematical syntax. The package supports various solvers, enabling users to choose the one that best suits their needs. In our case, we opted for the Operator Splitting Quadratic Program (OSQP) solver, which handles quadratic programming problems using an operator-splitting method [31]. This solver is highly efficient even in cases where the matrices are not full rank, as in our situation, because the clr-scores are used. Following the procedure detailed above, we outline Algorithm 1 for the generalised LASSO method below.
The algorithm is applied in the three cases discussed in the previous section, namely, when considering the three different norms in the penalty term, i.e., the $\ell_1$-clr (Definition 4), $\ell_1$-plr (Definition 5), and CoDa $\ell_1$-norms (Definition 6).
For the $\ell_1$-clr estimator, the LASSO regression algorithm is applied iteratively within the cross-validation framework. Figure 2 illustrates the cross-validated MSE across different $\lambda$ values. The optimal $\lambda$ is determined by selecting the point on the curve where the mean squared error is minimised: $\lambda$ = 35,769.42.
| Algorithm 1 Generalised LASSO for CoDa: select $\lambda$ by ten-fold cross-validation, then solve the generalised LASSO problem (Equation (8)) with the chosen penalty matrix $\mathbf{M}$ using CVXR with the OSQP solver. |
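A minimal CVXR sketch of the core solving step of Algorithm 1 (our own illustration under assumed variable names, not the authors' published code; Z denotes the matrix of clr-scores and M one of the penalty matrices introduced above):

```r
# Sketch: generalised LASSO for CoDa with CVXR and the OSQP solver.
library(CVXR)

generalised_lasso <- function(Z, y, M, lambda) {
  # Z: n x D matrix of clr-scores; y: response vector; M: penalty matrix
  n <- nrow(Z); D <- ncol(Z)
  beta0 <- Variable(1)                      # intercept
  beta  <- Variable(D)                      # gradient coefficients
  loss  <- sum_squares(y - beta0 - Z %*% beta) / (2 * n)
  pen   <- lambda * p_norm(M %*% beta, 1)   # generalised l1 penalty ||M beta||_1
  fit   <- solve(Problem(Minimize(loss + pen)), solver = "OSQP")
  list(beta0 = fit$getValue(beta0), beta = as.vector(fit$getValue(beta)))
}
```

Wrapping this function in a loop over a grid of $\lambda$ values inside the ten-fold cross-validation described above yields lambda.min.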
Figure 2.
$\ell_1$-clr: cross-validation MSE curve for different log-transformed values of the penalty parameter ($\lambda$). The circle (∘) is the arithmetical mean of the ten-fold CV. The red lines (above and below the mean) indicate the mean ± stdev value, where stdev is the standard deviation of the ten-fold CV. The vertical line represents the log-transformed value of lambda.min = 35,769.42.
For $\lambda$ = 35,769.42, the generalised LASSO (Equation (9)) identifies which ones among all the coefficients $\beta_j$ are set to zero, in particular ensuring equality among them. Importantly, when computing the representative of the clr-subspace, we find that some clr-scores are equal to zero. Therefore, the regularization process effectively splits the composition into two subcompositions. The first subcomposition represents the 33 non-influential parts, whose coefficients are driven to zero, contributing to model simplicity. The second subcomposition identifies the 24 parts that actively contribute to the influential balances on the response variable y (see Table A1). The intercept is equal to 6563.19. Figure 3a shows the non-zero clr-scores for the parameter $\hat{\boldsymbol\beta}$. We highlight that the most influential pairwise is formed by the genus Subdoligranulum and the unclassified genus of the family Lachnospiraceae.
Figure 3.
Comparison of the parameter $\hat{\boldsymbol\beta}$, with the taxon order maintained on the vertical axis to facilitate comparison: (a) clr-scores for the $\ell_1$-clr LASSO estimator, (b) clr-scores for the CoDa $\ell_1$-norm LASSO estimator, and (c) clr-scores for the $\ell_1$-plr LASSO estimator.
For the CoDa $\ell_1$-norm estimator (Equation (13)), the LASSO regression algorithm is applied with the same cross-validation partition used for the $\ell_1$-clr estimator. Figure 4 illustrates the model’s performance across different regularization parameters $\lambda$. The optimal value is $\lambda$ = 45,582.21.
Figure 4.
CoDa $\ell_1$-norm: cross-validation MSE curve for different log-transformed values of the penalty parameter ($\lambda$). The circle (∘) is the arithmetical mean of the ten-fold CV. The red lines (above and below the mean) represent the mean ± stdev value, where stdev is the standard deviation of the ten-fold CV. The vertical line represents the log-transformed value of lambda.min = 45,582.21.
For $\lambda$ = 45,582.21, the generalised LASSO (Equation (13)) identifies which ones among all the coefficients $\beta_j$ are set to the median of $\boldsymbol\beta$, in particular ensuring equality among them. This equality among some $\beta_j$ indicates that the balances involving their respective parts have a non-influential role on the response variable y. However, in contrast to the $\ell_1$-clr scenario, when computing the representative of the clr-subspace, in general all clr-scores are non-zero. Consequently, variable selection cannot be performed in this case. The regularization process effectively splits the composition into two subcompositions. The first subcomposition represents the 36 internally independent parts, that is, a subcomposition in which the balances between the respective parts do not influence y [32]. Their coefficients are driven to the median of $\boldsymbol\beta$, contributing to model simplicity. The second subcomposition identifies the 21 parts that actively contribute to the influential balances on the response variable y (see Table A1).
To highlight the model’s simplicity, it is crucial to accurately summarise the information contained in the first subcomposition. Without loss of generality, let $(x_1,\ldots,x_d)$ be an internally independent subcomposition, so that its coefficients share a common value $\beta_s$. The linear model in clr-scores is

$$y=\beta_0+\beta_s\sum_{j=1}^{d}\operatorname{clr}_j(\mathbf{x})+\sum_{j=d+1}^{D}\beta_j\operatorname{clr}_j(\mathbf{x})+\varepsilon.$$

As explained by [16] in Chapter 4, the best approach to represent a subcomposition is through its geometric mean, denoted as $g(x_1,\ldots,x_d)$. Therefore, the linear model is

$$y=\beta_0+\beta_s^{*}\ln\frac{g(x_1,\ldots,x_d)}{g(\mathbf{x})}+\sum_{j=d+1}^{D}\beta_j\operatorname{clr}_j(\mathbf{x})+\varepsilon,$$

where $\beta_s^{*}=d\,\beta_s$. This model has $D-d+1$ degrees of freedom, as opposed to the D degrees of freedom of the general linear model. The intercept value is 7023.88, and Figure 3b shows the clr-scores of $\hat{\boldsymbol\beta}$.
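As a brief illustration of this reduction step (a sketch with hypothetical helper names, not the authors' code), an internally independent subcomposition can be replaced by its row-wise geometric mean before refitting the reduced model:

```r
# Sketch: replace an internally independent subcomposition by its geometric mean.
# X: n x D matrix of compositions; idx: column indices of the subcomposition.
reduce_subcomp <- function(X, idx) {
  g_sub <- exp(rowMeans(log(X[, idx, drop = FALSE])))  # row-wise geometric means
  cbind(g_subcomp = g_sub, X[, -idx, drop = FALSE])    # reduced (D - d + 1)-part data
}
```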
$\ell_1$-clr regularization creates a subcomposition that is both internally and externally independent [32]; that is, both the balances within the parts of the subcomposition and the full balance between the parts of the subcomposition and the rest of the parts are all non-influential. In contrast, CoDa $\ell_1$-norm regularization relaxes the conditions and establishes only one subcomposition that is internally independent. In this context, the CoDa $\ell_1$-norm is somewhat more permissive. When comparing the results, we observe that both are quite similar; what stands out is the significance of the new variable $\gamma$ in the CoDa $\ell_1$-norm penalised linear model. $\ell_1$-clr regularization eliminates the full balance between the two subcompositions without prior analysis. This observation prompts us to consider that the direct application of $\ell_1$-clr regularization might be premature. Furthermore, when dealing with a penalised model, it is always possible to subsequently test the nullity of any parameter [33].
$\ell_1$-clr and CoDa $\ell_1$-norm regularization share the fact that both shrink the difference between the coefficients and a central measure, respectively the mean and the median; consequently, each regularization technique generates a unique subcomposition with certain properties related to its influence on the dependent variable y. Because the goal of a CoDa analysis is to describe the subcompositional structure of the data, the use of the $\ell_1$-clr and CoDa $\ell_1$-norms in the penalty term leads to a result that has to be considered limited. To overcome this limitation, the $\ell_1$-plr norm enables the construction of more than one internally independent subcomposition, which can better capture the subcompositional structure of the data regarding the variable y [7].
With the same data partition as in the previous cases, we performed cross-validation to find the optimal lambda value for the $\ell_1$-plr estimator. The optimal parameter is $\lambda$ = 69,669.31 (Figure 5).
Figure 5.
$\ell_1$-plr: cross-validation MSE curve for different log-transformed values of the penalty parameter ($\lambda$). The circle (∘) is the arithmetical mean of the ten-fold CV. The red lines (above and below the mean) represent the mean ± stdev value, where stdev is the standard deviation of the ten-fold CV. The vertical line represents the log-transformed value of lambda.min = 69,669.31.
For $\lambda$ = 69,669.31, the generalised LASSO (Equation (10)) splits the composition into six distinct subcompositions: five internally independent subcompositions with respect to the response variable y, and one subcomposition comprising 14 parts actively contributing to the influential balances on the response variable y (see Table A1 to compare the $\ell_1$-plr estimator with the $\ell_1$-clr and CoDa $\ell_1$-norm estimators, and Table A2 to explore its subcompositional structure). Each of the five internally independent subcompositions related to y contributes to reducing the dimension of the linear model. This reduction is achieved by substituting each subcomposition with its geometric mean (g_subcomp_1, …, g_subcomp_5 in Table A1), following the approach outlined for the CoDa $\ell_1$-norm estimator. The intercept value is 7244.88, and Figure 3c shows the clr-scores of $\hat{\boldsymbol\beta}$.
The $\ell_1$-plr estimator is the simplest and provides the most information about the subcompositional structure of the composition as regards the variable y.
6. Discussion
This paper has rigorously defined the CoDa $\ell_p$-norms, providing a foundation for their application. The specific cases of the CoDa $\ell_1$-, $\ell_2$-, and $\ell_\infty$-norms have been studied, interpreting these metrics in terms of log-ratios to enhance the reader’s understanding. Additionally, a unified treatment of three distinct $\ell_1$-norms tailored for compositional data has been presented in the context of a generalised LASSO problem. Through a detailed examination of the regularization effects of each norm, we have uncovered valuable insights. The $\ell_1$-clr norm is well suited for variable selection, creating a unique subcomposition that is both internally and externally independent. The CoDa $\ell_1$-norm, on the other hand, emphasises internal independence. Lastly, the $\ell_1$-plr norm showcases a balance selection effect. Consequently, the $\ell_1$-plr norm enables a more detailed study of the subcompositional structure of the compositional covariate in relation to the explained variable y.
In this article, we have expanded the methodological toolkit for performing penalised regression with compositional covariates. For low dimensions, our recommendation is to run penalised regression with the $\ell_1$-plr norm. However, we cannot ignore that variable selection becomes imperative in higher dimensions. Therefore, we suggest conducting an initial examination using the CoDa $\ell_1$-norm or $\ell_1$-plr norm to gain insights into the subcompositional structure. Following this analysis, it is possible to proceed with penalised regression employing the $\ell_1$-clr norm.
As part of our future work, we aim to investigate penalised regression models that effectively integrate both the $\ell_1$-plr and $\ell_1$-clr norms into the penalty term. This research is expected to offer deeper insight into the underlying structure of compositional data, allowing for a more thorough understanding. Moreover, we aim to improve the flexibility of modelling, especially for datasets with high dimensionality. This holistic approach will contribute to advancing the applicability and effectiveness of penalised regression techniques in the context of compositional data analysis.
Author Contributions
Conceptualization, J.A.M.-F., J.S.-R. and G.M.-F.; Formal analysis, J.A.M.-F., J.S.-R. and G.M.-F.; Methodology, J.A.M.-F., J.S.-R. and G.M.-F.; Software, J.S.-R.; Supervision, J.A.M.-F. and G.M.-F. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the Agency for Administration of University and Research, grant number 2021SGR01197, and by the Ministerio de Ciencia e Innovación, grant numbers PID2021-123833OB-I00 and PRE2019-090976.
Data Availability Statement
Data are contained within the article.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| LASSO | Least Absolute Shrinkage and Selection Operator |
| CoDa | Compositional Data |
| clr | Centered Log-Ratio |
| $\ell_1$-plr | Pairwise Log-Ratio norm |
| $\ell_1$-clr | Centered Log-Ratio norm |
| TD | Total p-Deviation Function |
| MSE | Mean Squared Error |
| OSQP | Operator-Splitting Quadratic Program |
Appendix A
Table A1.
clr-scores for the three LASSO estimators grouped into subcompositions.
| Taxon | $\ell_1$-clr | CoDa $\ell_1$ | $\ell_1$-plr |
|---|---|---|---|
| Intercept | 6563.19 | 7023.88 | 7244.88 |
| g_Subdoligranulum | 455.51 | 409.93 | 249.27 |
| f_Lachnospiraceae_g_Incertae_Sedis | 245.69 | 77.35 | |
| g_Dialister | 229.23 | 182.53 | |
| g_Bacteroides | 193.46 | 186.23 | |
| g_Dorea | 159.97 | 104.06 | |
| g_Desulfovibrio | 88.32 | 53.57 | |
| f_Peptostreptococcaceae_g_Incertae_Sedis | 70.37 | 37.38 | 26.54 |
| g_Faecalibacterium | 63.74 | 33.07 | |
| g_Paraprevotella | 45.15 | 38.71 | |
| f_Defluviitaleaceae_g_Incertae_Sedis | 39.35 | 34.48 | 54.86 |
| g_Alistipes | 27.18 | 14.05 | |
| g_Clostridium_sensu_stricto_1 | 20.52 | ||
| g_Brachyspira | 4.33 | ||
| g_Elusimicrobium | 2.25 | ||
| g_Butyricimonas | 0.16 | ||
| k_Bacteria_g_unclassified | −7.48 | −33.79 | −41.56 |
| g_Streptococcus | −11.71 | ||
| g_Catenibacterium | −21.83 | −12.10 | −19.66 |
| g_Succinivibrio | −34.64 | −31.51 | −10.83 |
| g_Lachnospira | −40.80 | −4.21 | −29.65 |
| g_Parabacteroides | −52.41 | −0.28 | |
| o_Clostridiales_g_unclassified | −111.82 | −107.08 | −85.66 |
| g_Mitsuokella | −219.27 | −192.13 | −144.62 |
| g_Collinsella | −233.08 | −235.68 | −228.94 |
| g_Bifidobacterium | −263.23 | −225.95 | −154.34 |
| f_Lachnospiraceae_g_unclassified | −626.20 | −563.00 | −365.84 |
| g_subcomp | | 231.83 | |
| g_subcomp_1 | | | 212.52 |
| g_subcomp_2 | | | 248.34 |
| g_subcomp_3 | | | 41.25 |
| g_subcomp_4 | | | 255.78 |
| g_subcomp_5 | | | −27.66 |
Table A2.
Details of the subcompositional structure for the $\ell_1$-plr LASSO estimator.
| Taxon | clr-score |
|---|---|
| g_Subdoligranulum | 249.27 |
| g_subcomp_1: | 212.52 |
| g_Bacteroides, g_Dialister | |
| f_Defluviitaleaceae_g_Incertae_Sedis | 54.86 |
| g_subcomp_2: | 248.34 |
| f_Lachnospiraceae_g_Incertae_Sedis, g_Dorea, g_Faecalibacterium, | |
| g_Alistipes, g_Desulfovibrio, g_Paraprevotella | |
| f_Peptostreptococcaceae_g_Incertae_Sedis | 26.54 |
| g_Clostridium_sensu_stricto_1 | 20.52 |
| g_subcomp_3: | 41.25 |
| g_Escherichia-Shigella, f_Ruminococcaceae_g_unclassified, g_Butyricimonas | |
| g_subcomp_4: | 255.78 |
| g_Brachyspira, g_Barnesiella, g_Blautia, f_Rikenellaceae_g_unclassified, | |
| g_Odoribacter, f_Erysipelotrichaceae_g_unclassified, g_Streptococcus, | |
| g_Anaerostipes, g_Phascolarctobacterium, g_Acidaminococcus, | |
| g_Anaerovibrio, g_Roseburia, g_Alloprevotella, | |
| f_Erysipelotrichaceae_g_Incertae_Sedis, g_Megasphaera, g_Coprococcus, | |
| g_Intestinimonas, g_Solobacterium, g_Oribacterium, g_Anaeroplasma, | |
| g_Victivallis, f_Ruminococcaceae_g_Incertae_Sedis, o_NB1-n_g_unclassified, | |
| g_Sutterella, o_Bacteroidales_g_unclassified, g_Prevotella, g_RC9_gut_group, | |
| f_Christensenellaceae_g_unclassified, g_Anaerotruncus | |
| g_Parabacteroides | −0.28 |
| g_subcomp_5: | −27.66 |
| g_Ruminococcus, g_Elusimicrobium, f_vadinBB60_g_unclassified | |
| g_Succinivibrio | −10.83 |
| g_Catenibacterium | −19.66 |
| g_Lachnospira | −29.65 |
| k_Bacteria_g_unclassified | −41.56 |
| o_Clostridiales_g_unclassified | −85.66 |
| g_Mitsuokella | −144.62 |
| g_Bifidobacterium | −154.34 |
| g_Collinsella | −228.94 |
| f_Lachnospiraceae_g_unclassified | −365.84 |
References
- Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267–288. [Google Scholar] [CrossRef]
- Aitchison, J. The Statistical Analysis of Compositional Data; Chapman & Hall: London, UK, 1986. [Google Scholar]
- Lin, W.; Shi, R.; Feng, R.; Li, H. Variable selection in regression with compositional covariates. Biometrika 2014, 101, 785–797. [Google Scholar] [CrossRef]
- Shi, P.; Zhang, A.; Li, H. Regression analysis for microbiome compositional data. Ann. Appl. Stat. 2016, 10, 1019–1040. [Google Scholar] [CrossRef]
- Lu, J.; Shi, P.; Li, H. Generalized linear models with linear constraints for microbiome compositional data. Biometrics 2019, 75, 235–244. [Google Scholar] [CrossRef] [PubMed]
- Susin, A.; Wang, Y.; Lê Cao, K.A.; Calle, M.L. Variable selection in microbiome compositional data analysis. NAR Genom. Bioinform. 2020, 2, lqaa029. [Google Scholar] [CrossRef] [PubMed]
- Saperas-Riera, J.; Martín-Fernández, J.; Mateu-Figueras, G. Lasso regression method for a compositional covariate regularised by the norm L1 pairwise logratio. J. Geochem. Explor. 2023, 255, 107327. [Google Scholar] [CrossRef]
- Egozcue, J.J.; Pawlowsky-Glahn, V. Groups of parts and their balances in compositional data analysis. Math. Geol. 2005, 37, 795–828. [Google Scholar] [CrossRef]
- Pawlowsky-Glahn, V.; Egozcue, J.J. Geometric approach to statistical analysis on the simplex. Stoch. Environ. Res. Risk Assess. 2001, 15, 384–398. [Google Scholar] [CrossRef]
- Billheimer, D.; Guttorp, P.; Fagan, W.F. Statistical Interpretation of Species Composition. J. Am. Stat. Assoc. 2001, 96, 1205–1214. [Google Scholar] [CrossRef]
- Aitchison, J.; Bacon-Shone, J. Log contrast models for experiments with mixtures. Biometrika 1984, 71, 323–330. [Google Scholar] [CrossRef]
- Van der Boogaart, K.G.; Tolosana, R. Analyzing Compositional Data with R; Use R! Springer: Berlin/Heidelberg, Germany, 2013. [Google Scholar]
- Dave, A. Measurement of Central Tendency. In Applied Statistics for Economics; Horizon Press: Toronto, ON, Canada, 2014; Chapter 3. [Google Scholar]
- Barceló-Vidal, C.; Martín-Fernández, J.A. The Mathematics of Compositional Analysis. Austrian J. Stat. 2016, 45, 57–71. [Google Scholar] [CrossRef]
- Brezis, H. Functional Analysis, Sobolev Spaces and Partial Differential Equations; Universitext; Springer: New York, NY, USA, 2011. [Google Scholar]
- Pawlowsky-Glahn, V.; Egozcue, J.J.; Tolosana-Delgado, R. Modeling and Analysis of Compositional Data; John Wiley & Sons: Chichester, UK, 2015. [Google Scholar]
- Wu, J.R.; Macklaim, J.M.; Genge, B.L.; Gloor, G.B. Finding the Centre: Compositional Asymmetry in High-Throughput Sequencing Datasets. In Advances in Compositional Data Analysis; Springer: Berlin/Heidelberg, Germany, 2021; Chapter 17; pp. 329–342. [Google Scholar] [CrossRef]
- Martín-Fernández, J. Measures of Difference and Non-Parametric Classification of Compositional Data. Ph.D. Thesis, Department of Applied Mathematics, Universitat Politècnica de Catalunya, Barcelona, Spain, 2001. [Google Scholar]
- James, G.; Witten, D.; Hastie, T.; Tibshirani, R. Introduction to Statistical Learning, 2nd ed.; Springer: New York, NY, USA, 2021. [Google Scholar]
- Bates, S.; Tibshirani, R. Log-ratio lasso: Scalable, sparse estimation for log-ratio models. Biometrics 2019, 75, 613–624. [Google Scholar] [CrossRef]
- Monti, G.; Filzmoser, P. Sparse least trimmed squares regression with compositional covariates for high-dimensional data. Bioinformatics 2021, 37, 3805–3814. [Google Scholar] [CrossRef] [PubMed]
- Monti, G.; Filzmoser, P. Robust logistic zero-sum regression for microbiome compositional data. Adv. Data Anal. Classif. 2022, 16, 301–324. [Google Scholar] [CrossRef]
- Tibshirani, R.; Taylor, J. The solution path of the generalized lasso. Ann. Statist. 2011, 39, 1335–1371. [Google Scholar] [CrossRef]
- Noguera-Julian, M.; Rocafort, M.; Guillén, Y.; Rivera, J.; Casadellà, M.; Nowak, P.; Hildebrand, F.; Zeller, G.; Parera, M.; Bellido, R.; et al. Gut Microbiota Linked to Sexual Preference and HIV Infection. eBioMedicine 2016, 5, 135–146. [Google Scholar] [CrossRef] [PubMed]
- Rivera-Pinto, J.; Egozcue, J.J.; Pawlowsky-Glahn, V.; Paredes, R.; Noguera-Julian, M.; Calle, M.L. Balances: A new perspective for microbiome analysis. mSystems 2018, 3, e00053-18. [Google Scholar] [CrossRef] [PubMed]
- Calle, M.L.; Susin, A.; Pujolassos, M. coda4microbiome: Compositional Data Analysis for Microbiome Studies. BMC Bioinf. 2023, 24, 82. [Google Scholar]
- Palarea-Albaladejo, J.; Martín-Fernández, J.A. zCompositions—R package for multivariate imputation of left-censored data under a compositional approach. Chemom. Intell. Lab. Syst. 2015, 143, 85–96. [Google Scholar] [CrossRef]
- Palarea-Albaladejo, J.; Martin-Fernandez, J. zCompositions: Treatment of Zeros, Left-Censored and Missing Values in Compositional Data Sets; R Package Version 1.5. 2023. Available online: https://cran.r-project.org/web/packages/zCompositions/zCompositions.pdf (accessed on 13 March 2024).
- Martín-Fernández, J.; Hron, K.; Templ, M.; Filzmoser, P.; Palarea-Albaladejo, J. Bayesian-multiplicative treatment of count zeros in compositional data sets. Stat. Model. 2015, 15, 134–158. [Google Scholar] [CrossRef]
- Fu, A.; Narasimhan, B.; Boyd, S. CVXR: An R Package for Disciplined Convex Optimization. J. Stat. Softw. 2020, 94, 1–34. [Google Scholar] [CrossRef]
- Stellato, B.; Banjac, G.; Goulart, P.; Bemporad, A.; Boyd, S. OSQP: An Operator Splitting Solver for Quadratic Programs. Math. Program. Comput. 2020, 12, 637–672. [Google Scholar] [CrossRef]
- Boogaart, K.; Filzmoser, P.; Hron, K.; Templ, M.; Tolosana-Delgado, R. Classical and robust regression analysis with compositional data. Math. Geosci. 2021, 53, 823–858. [Google Scholar] [CrossRef]
- Hyun, S.; G’Sell, M.; Tibshirani, R.J. Exact post-selection inference for the generalized lasso path. Electron. J. Stat. 2018, 12, 1053–1097. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).