Lp-Norm for Compositional Data: Exploring the CoDa L1-Norm in Penalised Regression

Jordi Saperas-Riera; Glòria Mateu-Figueras; Josep Antoni Martín-Fernández

doi:10.3390/math12091388

,

and

Department of Computer Science, Applied Mathematics and Statistics, University of Girona, 17003 Girona, Spain

^*

Author to whom correspondence should be addressed.

Mathematics2024, 12(9), 1388;https://doi.org/10.3390/math12091388

This article belongs to the Special Issue Multivariate Statistical Analysis and Application

Version Notes

Order Reprints

Abstract

The Least Absolute Shrinkage and Selection Operator (LASSO) regression technique has proven to be a valuable tool for fitting and reducing linear models. The trend of applying LASSO to compositional data is growing, thereby expanding its applicability to diverse scientific domains. This paper aims to contribute to this evolving landscape by undertaking a comprehensive exploration of the

L^{1}

-norm for the penalty term of a LASSO regression in a compositional context. This implies first introducing a rigorous definition of the compositional

L^{p}

-norm, as the particular geometric structure of the compositional sample space needs to be taken into account. The focus is subsequently extended to a meticulous data-driven analysis of the dimension reduction effects on linear models, providing valuable insights into the interplay between penalty term norms and model performance. An analysis of a microbial dataset illustrates the proposed approach.

Keywords:

Aitchison’s geometry; compositional data; L^p-norm; balance selection

MSC:

62J07; 62P10; 62H99

1. Introduction

Linear regression serves as a powerful framework for modelling relationships between variables, as it aims to capture the underlying patterns that govern the variability in the response variable. For instance, in the microbiome domain, there is a particular interest in identifying which taxa are associated with a variable of interest, for example, the inflammatory parameter sCD14. To address such complex problems, adopting LASSO regression methods [1] has emerged as a popular choice for variable selection. LASSO regression strategically applies the Euclidean

L^{1}

-norm penalisation to the model coefficients, wherein the

L^{1}

-norm represents the sum of the absolute values of these coefficients. The penalised term shrinks some regression parameters towards zero, facilitating variable selection.

While conventional regression models assume independence among covariates, this assumption fails when dealing with compositional explanatory variables. These variables are called parts of a whole, and are usually expressed in proportions, percentages, or ppm. Historically [2], the sample space of the compositional data (CoDa) is designed as the D-part unit simplex

S^{D} = {x \in R^{D} : x_{j} > 0; \sum x_{j} = 1; j = 1, \dots, D}

. The fundamental idea in the analysis of CoDa is that the information is relative, and is primarily contained in the ratios between parts, not the absolute amounts of the parts. Therefore, the use of log-ratios is advocated. The analysis of CoDa, pioneered by [2], has witnessed increasing significance across such diverse fields as environmental science, geochemistry, microbiology, and economics. However, the integration of CoDa as covariates in regression models introduces particular challenges. The existing literature addresses these challenges, providing methodologies for regression model simplification with CoDa. The first works on penalised regression with compositional covariates [3,4,5,6] restricted the Euclidean

L^{1}

-norm on the centered log-ratio (

clr

) subspace when defining the penalty term. Saperas et al. [7] introduced a new norm, called the pairwise log-ratio (

L^{1}

-plr), as a part of a methodology on penalised regression to simplify the log-ratios on the explanatory side of the model. These log-ratios are also known as balances [8].

The primary objective of this article lies in comprehensive comparison of the effects of different norms on the penalty term within LASSO regression with different compositional explanatory variables. The choice of a norm in the penalty term is a pivotal aspect that significantly influences the regularization mechanism, and consequently the characteristics of the resulting models. To accomplish this, a precise and rigorous definition of the induced

L^{p}

-norms for CoDa (CoDa

L^{p}

-norms) within the compositional space is necessary. A comparison between the CoDa

L^{1}

-norm and other norms for compositions established in the literature is provided. Through this analysis, we seek to contribute valuable insights into the characteristics and implications of these norms in penalised regression.

The rest of this article is organised as follows. In Section 2, fundamental concepts related to the geometric structure of CoDa are outlined. In addition, some popular measures of central tendency are written as the solution of a variational problem using

L^{p}

-norms in real space. Section 3 is devoted to defining the CoDa

L^{p}

-norms on the compositional space. In Section 4, after describing the basic concepts of standard penalty regression, we analyse LASSO regression with compositional covariates using three different

L^{1}

-norms in the penalty term, with the CoDa

L^{1}

-norm among them. A comparison of the different norms is illustrated in Section 5 using a microbiome dataset. Finally, Section 6 concludes with some closing remarks.

2. Some Basic Concepts

2.1. Elements of the Aitchison Geometry

CoDa conveys relative information because the variables describe relative contributions to a given total [2]. The formal geometrical framework for the analysis of CoDa, coined the Aitchison geometry, first appeared in [9,10]. The Aitchison geometry is based on two specific operations on

S^{D}

, called perturbation and powering, respectively defined as

x \oplus y = (x_{1} y_{1}, x_{2} y_{2}, \dots, x_{D} y_{D})

and

α ⊙ x = (x_{1}^{α}, x_{2}^{α}, \dots, x_{D}^{α})

for

x, y \in S^{D}

,

α \in R

. In order to interpret the results of these operations, one can perform closure on the result, that is, normalise the resulting vector to a unit sum by dividing each component by its total sum. Note that the closure operation provides a compositionally equivalent vector. With a vector space structure, a metric structure can be easily defined using the

clr

-scores of a D-part composition

x = (x_{1}, \dots, x_{D})

[2]:

clr (x) = (clr {(x)}_{1}, \dots, clr {(x)}_{D}) = (ln \frac{x_{1}}{g (x)}, \dots, ln \frac{x_{D}}{g (x)}),

where

g (\cdot)

is the geometric mean of the composition. Note that

clr

-scores are collinear, because it holds that

\sum_{j = 1}^{D} clr {(x)}_{j} = 0

.

The basic metric elements of the Aitchison geometry are the inner product (

< \cdot, \cdot >_{A}

),

L^{2}

-norm (

| | \cdot {| |}_{A}

), and distance (

d_{A} (\cdot, \cdot)

), defined as follows:

< x, y >_{A} = < clr (x), clr (y) >_{E} {, ∥ x ∥}_{A}^{2} = < x, x >_{A}, d_{A} (x, y) = {∥ x ⊖ y ∥}_{A}, for x, y \in S^{D},

(1)

where “

A

” means the Aitchison geometry, “E” the typical Euclidean geometry, and “⊖” the perturbation difference

x ⊖ y = x \oplus ((- 1) ⊙ y)

. Log-ratios, like

clr

-scores, have become a cornerstone of CoDa analysis; nevertheless, in the literature the concept of balance between two non-overlapping groups of parts is frequently used. A balance is defined as the log-ratio between the geometric means of the parts within each group multiplied by a constant that depends on the number of parts in each group [8].

An important scale-invariant function is the log-contrast, which plays the typical role of the linear combination of variables. Given a D-part composition

x

, a log-contrast is defined as any linear combination of the logarithms of the compositional parts

\sum_{j = 1}^{D} a_{j} ln x_{j}, with \sum_{j = 1}^{D} a_{j} = 0, a_{j} \in R .

Given a dependent variable y and an explanatory D-part composition

x

, the definition of a linear regression model in terms of a log-contrast [11] is

y = α_{0} + \sum_{j = 1}^{D} α_{j} ln x_{j}, with \sum_{j = 1}^{D} α_{j} = 0, α_{j} \in R,

(2)

whereas in terms of metric elements the model formulation [12] is

y = β_{0} + {⟨ β, x ⟩}_{A} = β_{0} + {⟨ clr (β), clr (x) ⟩}_{E},

(3)

where

β

is the compositional gradient vector. The expressions in both Equations (2) and (3) are equivalent when one considers

α_{0} = β_{0}

and

α = clr (β)

. Because the sum of the

clr

-scores is zero (

\sum_{j = 1}^{D} clr {(β)}_{j} = \sum_{j = 1}^{D} clr {(x)}_{j} = 0

), the inner product of

clr

transformed vectors is equal to

{⟨ clr (β), clr (x) ⟩}_{E} = {⟨ clr (β), ln (x) ⟩}_{E} = {⟨ ln β, clr (x) ⟩}_{E} .

(4)

For simplicity and to avoid overloading the notation, we denote

β^{*} = ln β

, and write the linear regression model in terms of the Euclidean inner product as

y = β_{0} + {⟨ β^{*}, clr (x) ⟩}_{E} .

2.2. Norms and Measures of Central Tendency

The most popular measures of the central tendency of a real variable are the median and the arithmetic mean. Both can be defined as solving a variational problem [13]; indeed, the median

M e d (z)

of a dataset

z = {z_{1}, \dots, z_{D} | z_{i} \in R}

is the value that minimises the average absolute deviation

M e d (z) = arg min_{λ} \frac{1}{D} \sum_{j = 1}^{D} | z_{j} - λ |

. The arithmetic mean

\bar{z}

of a dataset

z = {z_{1}, \dots, z_{D} | z_{i} \in R}

is the value that minimises the mean squared deviation

\bar{z} = arg min_{λ} \frac{1}{D} \sum_{j = 1}^{D} {(z_{j} - λ)}^{2}

. In addition, the mid-range

M R (z)

of a dataset

z

is the value that minimises the maximum absolute deviation

M R (z) = arg min_{λ} (max_{j} | z_{j} - λ |)

. These definitions can be generalised to any

L^{p}

-norm [13] (Chapter 3).

Definition 1.

Let

z = {z_{1}, \dots, z_{D}}

be a dataset and let

p \geq 1

; furthermore, let

μ_{p}

be the p-measure of central tendency that minimises the total p-deviation function

T D_{p} (λ)

:

μ_{p} = arg min_{λ} \{T D_{p} (λ) = {∥ z - Λ ∥}_{p}^{p} = \frac{1}{D} \sum_{j = 1}^{D} {(z_{j} - λ)}^{p}\},

where

Λ = (λ, \dots, λ)

,

λ \in R

.

With this definition, the median (

μ_{1} = M e d (z)

), arithmetic mean (

μ_{2} = \bar{z}

), and mid-range (

μ_{\infty} = M R (z)

) follow as special cases for the norms

L^{1}

,

L^{2}

, and

L^{\infty}

, respectively.

Remark 1.

Convexity of the total p-deviation function

T D_{p} (λ)

:

For $p = 1$ , the average absolute deviation $T D_{1} (λ)$ is a convex function of λ; however, it is not strictly convex. Thus, the median may be a non-unique value.
For $p > 1$ , the total p-deviation function $T D_{p} (λ)$ is strictly convex; thus, if $μ_{p} (Z)$ exists, this is unique.

3. $L^{p}$ -Norms on the Compositional Space

To define induced

L^{p}

-norms on the compositional space (CoDa

L^{p}

-norms) in a compatible way with the Aitchison geometry, one must capture the geometric structure of the

S^{D}

[14]. To achieve this objective, we initially define the induced

L^{p}

-norm within the quotient space

L^{D} = {z + λ 1_{D} | z \in R^{D}, 1_{D} = (1, \dots, 1)}

. Following Brezis [15] (Chapter 11.2), an induced

L^{p}

-norm on the quotient space

L^{D}

can be defined by inducing the Euclidean

L^{p}

-norm in

R^{D}

on

L^{D}

. The underlying idea is to assign to an equivalence class the minimum value of the

L^{p}

-norm among the elements belonging to the same equivalence class.

Definition 2.

Let

z \in L^{D}

be a log-composition. The induced $L^{p}$ -norm, denoted by

{∥ z ∥}_{p, L^{D}}

, is

{∥ z ∥}_{p, L^{D}} = min_{λ} {∥ z + λ 1_{D} ∥}_{p},

where

1_{D} = (1, \dots, 1)

and

{∥ \cdot ∥}_{p}

denotes the typical

L^{p}

-norm in the real space.

Using the logarithmic isomorphism [14], the

L^{p}

-norm can be extended to the compositional space.

Definition 3.

Let

x \in S^{D}

be a composition. The CoDa $L^{p}$ -norm, denoted by

{∥ x ∥}_{p, S^{D}}

, is

{∥ x ∥}_{p, S^{D}} = {∥ ln x ∥}_{p, L^{D}} = min_{λ} {∥ ln x + λ 1_{D} ∥}_{p} .

Proposition 1.

The CoDa

L^{p}

-norm on

S^{D}

is

{∥ x ∥}_{p, S^{D}}

, and verifies the properties of the Aitchison geometry [16]:

Scale invariance: ${∥ x ∥}_{p, S^{D}} = {∥ k x ∥}_{p, S^{D}}$ , $k > 0$ .
Permutation invariance: ${∥ (x_{1}, \dots, x_{i}, \dots, x_{j}, \dots, x_{D}) ∥}_{p, S^{D}} = {∥ (x_{1}, \dots, x_{j}, \dots, x_{i}, \dots, x_{D}) ∥}_{p, S^{D}}$ .
Subcompositional dominance: ${∥ x ∥}_{p, S^{D}} \geq {∥ sub (x) ∥}_{p, S^{d}}$ , where $sub (x) \in S^{d}$ denotes any subset formed by d parts of $x$ .

Proof of Proposition 1.

The proof directly follows from the Definition 3. □

Following Definition 3 and the measures of central tendency described in Section 2.2, the CoDa

L^{p}

-norms

L^{1}

,

L^{2}

, and

L^{\infty}

can be developed:

The CoDa $L^{1}$ -norm on $S^{D}$ is

${∥ x ∥}_{1, S^{D}} = {∥ ln x - M e d (ln x) 1_{D} ∥}_{1} = {∥ln \frac{x}{M e d (x)}∥}_{1} = \sum_{j = 1}^{D} |ln \frac{x_{j}}{M e d (x)}|,$

where $M e d (ln x)$ and $M e d (x)$ are the median of the sets ${ln x_{1}, \dots, ln x_{D}}$ and ${x_{1}, \dots, x_{D}}$ , respectively. As the logarithm function is strictly increasing, as per Definition 1, the set of points that serve as solutions to the variational problem $T D_{1}$ when applied to log-transformed values $μ_{1} = Med (ln (x))$ precisely corresponds to the log-transformed set of points that are solutions to the variational problem $T D_{1}$ when applied to the raw data, that is, $ln (Med (x))$ .
Wu et al. [17] proposed the median of a D-part composition as an alternative denominator to the geometric mean in an attempt to extend the definition of $clr$ -scores. In general, the performance of the median as a robust estimator of the midpoint of a dataset is better when the data have high asymmetry. The CoDa $L^{1}$ -norm captures the distance between two points when movement is restricted to paths that run parallel to the $clr$ -axes ( $ln (\frac{x_{i}}{g (x)})$ ), as is the case in a grid or city street network (Manhattan distance, Figure 1). The CoDa $L^{1}$ -norm has an equivalent expression that captures the information about the ratio between the components of a composition; indeed, the median is the central point that divides a set into two equal parts, with half of the values falling below this central position and half above it. Therefore, half of the log-ratios $ln (\frac{x_{j}}{M e d (x)})$ are positive and the other half are negative. If we rearrange the parts of a composition in increasing order (small to large), i.e., $x_{(1)} \leq \dots \leq x_{(D)}$ , then the CoDa $L^{1}$ -norm can be written in the following manner:

Figure 1. The Manhattan distance based on the CoDa $L^{1}$ -norm in the simplex $S^{3}$ : the distance between two points $x = C [e^{3}, 1, e]$ and $y = C [1, e^{2}, e]$ in a grid-based system, where $C [\cdot]$ is the closure operation, is represented by three paths (red, orange, and blue) of the same length (five units).

∗
${∥ x ∥}_{1, S^{D}} = ln (\frac{x_{(n + 1)} \cdot \dots \cdot x_{(2 n)}}{x_{(1)} \cdot \dots \cdot x_{(n)}})$ if $D = 2 n$ ;
∗
${∥ x ∥}_{1, S^{D}} = ln (\frac{x_{(n + 1)} \cdot \dots \cdot x_{(2 n - 1)}}{x_{(1)} \cdot \dots \cdot x_{(n - 1)}})$ if $D = 2 n - 1$ .
Thus, the CoDa $L^{1}$ -norm is a balance between the large parts and small parts.
The CoDa $L^{2}$ -norm on $S^{D}$ is

${∥ x ∥}_{2, S^{D}} = ∥ ln x - \bar{ln (x)} 1_{D} ∥_{2} = {∥ ln x - ln (g (x)) 1_{D} ∥}_{2} = {∥ln \frac{x}{g (x)}∥}_{2} = {(\sum_{j = 1}^{D} {(ln \frac{x_{j}}{g (x)})}^{2})}^{\frac{1}{2}},$

where $g (x)$ is the geometric mean of the set ${x_{1}, \dots, x_{D}}$ . Because $ln \frac{x}{g (x)} \in clr$ -subspace, the CoDa $L^{2}$ -norm is the restricted Euclidean $L^{2}$ -norm on the $clr$ -subspace. This norm is commonly referred to as Aitchison’s norm ${∥ x ∥}_{A}$ [18].
The CoDa $L^{\infty}$ -norm on $S^{D}$ is

${∥ x ∥}_{\infty, S^{D}} = {∥ ln x - M R (ln x) 1_{D} ∥}_{\infty} = {∥ln \frac{x}{G R (x)}∥}_{\infty} = max_{j} \{| ln \frac{x_{j}}{G R (x)} |\},$

where $M R (ln x)$ and $G R (x)$ are respectively the mid-range and geometric mid-range of the sets ${ln x_{1}, \dots, ln x_{D}}$ and ${x_{1}, \dots, x_{D}}$ . Note that $M R (ln (x)) = ln (G R (x))$ ; thus, $G R (x) = {(max_{i} {x_{i}} \cdot min_{j} {x_{j}})}^{\frac{1}{2}}, i, j = 1, \dots, D$ . The CoDa $L^{\infty}$ -norm can be interpreted as a form of log-pairwise, as the CoDa $L^{\infty}$ -norm represents half of the log-pairwise between the largest part against the smallest part. This log-pairwise is the greatest among all log-pairwise in the composition:

${∥ x ∥}_{\infty, S^{D}} = \frac{1}{2} ln (\frac{max_{i} {x_{i}}}{min_{j} {x_{j}}}) = \frac{1}{2} max_{i, j} \{ln \frac{x_{i}}{x_{j}}\} .$

4. Penalised Regression with a Compositional Covariate

The LASSO regression model is formulated as the combination of the

L^{2}

-norm cost function and the

L^{1}

-norm regularisation term [1]. For a real dataset

Z

with n observations and D predictors along with a real response vector

Y

of length n, the LASSO regression model can be formulated as follows:

min \{\frac{1}{2} ∥ Y - β_{0} - {⟨ β, Z ⟩}_{E} ∥_{2}^{2} + λ {∥ β ∥}_{1}\}

(5)

where

β_{0}

is the intercept, the vector

β

is the gradient, and

λ

is the penalty parameter that controls the amount of regularisation. Note that

{∥ \cdot ∥}_{2}

and

{∥ \cdot ∥}_{1}

refer to the Euclidean

L^{2}

and

L^{1}

norms in real space, respectively. For

λ = 0

, the LASSO regression model (Equation (5)) provides the classical least squares regression model. The larger the value of

λ

, the greater the number of coefficients in

β

that is forced to be zero. The optimal value of

λ

can be chosen based on cross-validation techniques and related methods [19].

In the case of CoDa, additional considerations must be taken into account in order to respect the compositional nature of both the covariate

X

and the intercept

β

. In variable selection, [3,20] wrote the LASSO model in terms of

β^{*} = ln β

and the log-transformed data instead of the

clr

-scores; consequently, a linear constraint on the compositional gradient coefficient

β^{*}

is necessary:

min \{\frac{1}{2} ∥ Y - β_{0} - {⟨ β^{*}, ln (X) ⟩}_{E} ∥_{2}^{2} + λ {∥ β^{*} ∥}_{1}\}, subject to \sum_{j = 1}^{D} β_{j}^{*} = 0 .

(6)

Most of the literature addressing the topic of penalised regression with a compositional covariate has predominantly employed the Euclidean

L^{1}

-norm in the penalty term, leading to a

clr

-variable selection ([3,4,5,6,20,21,22]).

In Equation (6), the constraint

\sum_{j = 1}^{D} β_{j}^{*} = 0

can be incorporated in the minimising function. The constraint

\sum_{j = 1}^{D} β_{j}^{*} = 0

forces the

β^{*}

parameter to be an element in the

clr

-subspace. Therefore, per Equation (4), the inner product

{⟨ β^{*}, ln (X) ⟩}_{E}

is equivalent to

{⟨ clr (β), clr (X) ⟩}_{E} = {⟨ β, X ⟩}_{A}

. Thus, the constrained LASSO (Equation (6)) is equivalent to the following definition.

Definition 4.

Given

y_{i}

,

i = 1, \dots, n

, the sample of the response variable

X

, and the

n \times D

matrix whose rows

X_{i} = (x_{i 1}, \dots, x_{i D})

for

i = 1, \dots, n

contain the compositional sample, the

L^{1}

-

clr

LASSO estimator is defined as

β \in \underset{β_{0}, β}{argmin} \{\frac{1}{2} {∥ Y - β_{0} - {⟨ β, X ⟩}_{A} ∥}_{2}^{2} + λ {∥ β ∥}_{1 - c l r}\},

(7)

where

{∥ β ∥}_{1 - c l r} = \sum_{j = 1}^{D} | clr {(β)}_{j} | = \sum_{j = 1}^{D} |ln \frac{β_{j}}{g (β)}| .

The key innovation here is that the linear constraint becomes embedded in the penalty term through the

L^{1}

-

clr

norm. This change in approach is not merely an algebraic or formal change; rather it implies a deeper understanding of the variable selection process in CoDa. The penalty term imposes a constraint on the sum of the absolute values of

clr

-scores within the gradient vector

β

. This constraint compels the model to shrink or eliminate certain

clr

-scores, effectively driving them to zero. Consequently, this results in a balance selection. Without loss of generality, let us assume that the balance

ln \frac{β_{1}}{g (β)}

is zero. This implies that the corresponding balance

ln \frac{x_{1}}{g (x)}

has no influence on the response variable y. Therefore, the maximum variation in y is concentrated in the subspace orthogonal to the balance

ln \frac{x_{1}}{g (x)}

, i.e., the subspace of balances among the subcomposition

(x_{2}, \dots, x_{D})

. This selective regularization process facilitates variable selection, as

x_{1}

does not influence the response variable y.

In order to establish a unified framework, the generalised LASSO problem [23] can be adapted to penalised linear models with a compositional covariate:

min \{\frac{1}{2} ∥ Y - β_{0} - {⟨ β^{*}, clr (X) ⟩}_{E} ∥_{2}^{2} + λ {∥ D \cdot β^{*} ∥}_{1}\},

(8)

where

{∥ \cdot ∥}_{2}

and

{∥ \cdot ∥}_{1}

respectively refer to the Euclidean

L^{2}

and

L^{1}

norms in real space. The generalised LASSO problem allows for a broader range of applications by considering a matrix

D

associated with the penalty term. The matrix

D

is related to the

L^{1}

-norm considered in the penalty term. The choice of one norm over another determines the type of regularization. Different models can be formulated within the framework of the generalised LASSO problem and addressed through convex optimization algorithms. Solving each of these different penalised regression models yields distinct coefficients, each characterised by unique properties. Indeed, Definition 4 can be expressed as a generalised LASSO problem in the following manner:

β^{*} \in \underset{β_{0}, β^{*}}{argmin} \{\frac{1}{2} \sum_{i = 1}^{n} {(y_{i} - β_{0} - {⟨ β^{*}, clr {(X)}_{i} ⟩}_{E})}^{2} + λ {∥ C_{D} β^{*} ∥}_{1}\},

(9)

where

β^{*} = ln β

and

D = C_{D}

is the centering matrix on the

clr

-subspace, with

C_{D} β^{*} = β^{*} - \bar{β^{*}} 1_{D}

.

On the other hand, following [7], it is possible to consider the matrix

D

equal to

F

, that is, the matrix associated with the linear transformation

F (β_{1}^{*}, \dots, β_{D}^{*}) = \frac{1}{D - 1} (β_{1}^{*} - β_{2}^{*}, β_{1}^{*} - β_{3}^{*}, \dots, β_{1}^{*} - β_{D}^{*}, β_{2}^{*} - β_{3}^{*}, \dots, β_{2}^{*} - β_{D}^{*}, \dots, β_{D - 1}^{*} - β_{D}^{*})

. Note that

β_{i}^{*} - β_{j}^{*} = ln (\frac{β_{i}}{β_{j}})

, which is a log-pairwise. In this case, the penalty term in a generalised LASSO problem can be written as

{∥ F \cdot β^{*} ∥}_{1}

, meaning that the generalised LASSO problem results in the following:

β^{*} \in \underset{β_{0}, β^{*}}{argmin} \{\frac{1}{2} \sum_{i = 1}^{n} {(y_{i} - β_{0} - {⟨ β^{*}, clr {(X)}_{i} ⟩}_{E})}^{2} + λ {∥ F β^{*} ∥}_{1}\} .

(10)

The model can be defined as follows.

Definition 5.

Given

y_{i}

,

i = 1, \dots, n

, the sample of the response variable

X

, and the

n \times D

matrix whose rows,

X_{i} = (x_{i 1}, \dots, x_{i D})

for

i = 1, \dots, n

contain the compositional sample, the

L^{1}

-plr LASSO estimator is defined as

β \in \underset{β_{0}, β}{argmin} \{\frac{1}{2} ∥ Y - β_{0} - ⟨ β, X) {⟩_{A} ∥}_{2}^{2} + λ {∥ β ∥}_{1 - p l r}\},

(11)

where

{∥ β ∥}_{1 - p l r} = \frac{1}{D - 1} \sum_{i < j} |ln (\frac{β_{i}}{β_{j}})| .

Importantly, because

ln \frac{β_{i}}{β_{j}} = clr {(β)}_{i} - clr {(β)}_{j}

, the penalty term shrinks the absolute value of the differences of the

clr

-scores within the gradient vector, which forces some pairwise differences of

clr

-scores to be zero, i.e., it forces equality on some

clr

-scores. Therefore, each set of equal

clr

-scores defines a subcomposition with non-influential balances within its parts. This selective regularization process facilitates balanced selection [7].

Finally, using the CoDa

L^{1}

-norm introduced in Section 3, it is possible to define another generalised LASSO problem.

Definition 6.

Given

y_{i}

,

i = 1, \dots, n

, the sample of the response variable

X

, and the

n \times D

matrix whose rows

X_{i} = (x_{i 1}, \dots, x_{i D})

for

i = 1, \dots, n

contain the compositional sample, the CoDa

L^{1}

-norm LASSO estimator is defined as

β \in \underset{β_{0}, β}{argmin} \{\frac{1}{2} ∥ Y - β_{0} - ⟨ β, X) {⟩_{A} ∥}_{2}^{2} + λ {∥ β ∥}_{1, S^{D}}\},

(12)

where

{∥ β ∥}_{1, S^{D}} = \sum_{j = 1}^{D} |ln \frac{β_{j}}{M e d (β)}| .

In this case, the penalty term compels certain parts

β_{j}

to be equal to the median of the parts, ensuring equality among them in particular. Consequently, the effect produced is also a balance selection, as in the previous case; however, unlike the

L^{1}

-plr LASSO estimator, with the CoDa

L^{1}

-norm estimator there is only one set of equal

clr

-scores, and all non-influential balances belong to a single subcomposition.

As there is no algebraic formula to express the median,

M e d (β)

, it is necessary to include a new variable

m \in R

in the penalty term when formulating the minimization problem (Equation (12)) as a generalised LASSO problem:

β^{*} \in \underset{β_{0}, β^{*}, m}{argmin} \{\frac{1}{2} \sum_{i = 1}^{n} {(y_{i} - β_{0} - {⟨ β^{*}, clr {(X)}_{i} ⟩}_{E})}^{2} + λ {∥ G \cdot (β^{*}, m) ∥}_{1}\},

(13)

where

β^{*} = ln β

,

D = G

, and the matrix associated with the transformation

G (β^{*}, m) = (β_{1}^{*} - m, β_{2}^{*} - m, \dots, β_{D}^{*} - m)

.

5. Study Case

We used the microbial dataset analyzed in [24,25] to compare the different

L^{1}

-norms in a CoDa LASSO regression problem. The dataset, collected and explained in [24], comprises the compositions of

D = 60

taxa spanning various taxonomic levels (e.g., g for genus, f for family, o for order, and k for kingdom) within a set of

n =

151 individuals. The dependent variable y is an inflammatory parameter, specifically, the levels of soluble CD14 (sCD14 variable) measured for each individual. These data are available in the R package [26]. An individual having a zero value recorded for some parts indicates that certain taxa were not detected. A zero value prevents the application of the log-ratio methodology. Following a more analogous procedure than in [26], the genus Thalassospira, unclassified genus of the class Alphaproteobacteria, and unclassified genus of the family Porphyromonadaceae, all with more than

80 %

of zeros, were removed. The rest of the zeros recorded in the remaining 57 taxa were replaced by a small value using an imputation method [27,28]. Because the zeros are of count type, it is appropriate to apply methods based on Dirichlet-multinomial duality [29].

To solve the convex optimizations problems in Equations (9), (10), and (13), we first select the optimal

λ

parameter for the penalised model by performing a ten-fold cross-validation. Each iteration involves dividing the data into ten equal parts, training the model on nine of them, and then evaluating it on the remaining part to produce the lowest Mean Squared Error (MSE). With the parameter

λ

selected, we proceeded to solve the optimization problem in order to find the parameters

β_{0}

and

β^{*}

. The CVXR package in R version 4.3.2 [30] offers an interface for defining and solving convex optimization problems. CVXR utilises a domain-specific language, making it user-friendly and allowing users to express optimization problems. The package supports various solvers, enabling users to choose the one that best suits their needs. In our case, we opted for the Operator Splitting Quadratic Program (OSQP). The OSQP is a solver for quadratic programming problems and employs an operator-splitting method [31]. This solver is highly efficient even in cases where the matrices are not full-rank, such as our situation, because the

clr

-scores are used. Referring to the procedure detailed above, we have outlined an Algorithm 1 for a generalised LASSO method below.

The algorithm is applied in the three cases discussed in the previous section, namely, when considering the three different

L^{1}

norms in the penalty term, i.e., the

L^{1}

-

clr

(Definition 4),

L^{1}

-plr (Definition 5), and CoDa

L^{1}

-norms (Definition 6).

For the

L^{1}

-

clr

estimator, the LASSO regression algorithm is applied iteratively within the cross-validation framework. Figure 2 illustrates the cross-validated MSE across different

λ

values. The optimal

λ

is determined by selecting the point on the curve where the mean squared error is minimised:

λ_{m i n}

= 35,769.42.

Algorithm 1 Generalised LASSO for CoDa

1.: Fit the generalised LASSO model with tuning parameter $λ$ (Equations (9), (10) or (13)).
2.: Calculate the $clr$ -representative: $β^{*} - \bar{β^{*}}$
3.: Express the generalised LASSO model in terms of $clr$ -scores.

Figure 2.

L^{1}

-

clr

: cross-validation MSE curve for different log-transformed values of the penalty parameter (

ln (λ)

). The circle (∘) is the arithmetical mean of the ten-fold CV. The red lines (above and below the mean) indicate the mean ± stdev value, where stdev is the standard deviation of the ten-fold CV. The vertical line represents the log-transformed values of lambda.min = 35,769.42.

For

λ_{m i n}

= 35,769.42, the generalised LASSO (Equation (9)) identifies which ones among all the

β_{i}^{*}

are set to

\bar{β^{*}}

, particularly ensuring equality among them. Importantly, when computing the representative

β^{*} - \bar{β^{*}} 1_{D} \in clr

-subspace, we find that some

clr

-scores are equal to zero. Therefore, the regularization process effectively splits the composition into two subcompositions. The first subcomposition represents the 33 non-influential parts, where coefficients

clr {(β)}_{k}, k = 1, \dots, 33

are driven to zero, contributing to model simplicity. The second subcomposition identifies the 24 parts that actively contribute to the influential balances on the response variable y (see Table A1). The intercept

β_{0}

is equal to

6563.19

. Figure 3a shows the non-zero

clr

-scores for parameter

β

. We highlight that the most influential pairwise is formed by the genus Subdoligranulum and the unclassified genus of the family Lachnospiraceae.

Figure 3. Comparison of the

clr (β)

parameter, with the taxon order maintained on the vertical axis to facilitate comparison: (a)

clr

-scores for the

L^{1}

-

clr

LASSO estimator, (b)

clr

-scores for the

L^{1}

-CoDa LASSO estimator, and (c)

clr

-scores for the

L^{1}

-plr LASSO estimator.

For the CoDa

L^{1}

-norm estimator (Equation (13)), the LASSO regression algorithm is applied with the same cross-validation partition used in the

L^{1}

-

clr

estimator. Figure 4 illustrates the model’s performance across different regularization parameters

λ

. The optimal value is

λ_{m i n}

= 45,582.21

Figure 4. CoDa

L^{1}

-norm: cross-validation MSE curve for different log-transformed values of the penalty parameter (

ln (λ)

). The circle (∘) is the arithmetical mean of the ten-fold CV. The red lines (above and below the mean) represent the value mean ± stdev, where stdev is the standard deviation of the ten-fold CV. The vertical line represents the log-transformed values of lambda.min = 45,582.21.

For

λ_{m i n}

= 45,582.21, the generalised LASSO (Equation (13)) identifies which ones among all the

β_{i}^{*}

are set to the median of

β^{*}

, particularly ensuring equality among them. This equality among some

β_{i}^{*}

indicates that the balances involving their respective parts

x_{i}

have a non-influential role in the response variable y. However, in contrast to the

L^{1}

-

clr

scenario when computing the representative

β^{*} - \bar{β^{*}}

, in general, all

clr

-scores are non-zero. Consequently, variable selection cannot be performed in this case. The regularization process effectively splits the composition into two subcompositions. The first subcomposition represents the 36 internally independent parts, that is, a subcomposition in which the balances between the respective parts do not influence y [32]. The coefficients

clr {(β)}_{k}, k = 1, \dots, 36

are driven to the median of

β^{*}

(

6.44

), contributing to model simplicity. The second subcomposition identifies the 21 parts that actively contribute to the influential balances on the response variable y (see Table A1).

To highlight the model’s simplicity, it is crucial to accurately summarise the information contained in the first subcomposition. Without loss of generality, let

(x_{1}, \dots, x_{k})

be an internally independent subcomposition. The linear model in

clr

-scores is

y = β_{0} + \sum_{j = 1}^{k} clr {(β)}_{j} ln x_{i} + \sum_{j = k + 1}^{D} clr {(β)}_{j} ln x_{j}, clr {(β)}_{1} = \dots = clr {(β)}_{k} .

As explained by [16] in Chapter 4, the best approach to represent a subcomposition is through its geometric mean, denoted as

g_{s u b} = {(x_{1} \cdot \dots \cdot x_{k})}^{\frac{1}{k}}

. Therefore, the linear model is

y = β_{0} + k clr (β_{s u b}) ln g_{s u b} + \sum_{j = k + 1}^{D} clr {(β)}_{j} ln x_{j},

where

clr (β_{s u b}) = clr {(β)}_{1} = \dots = clr {(β)}_{k}

. This model has

D - k

degrees of freedom, as opposed to the

D - 1

degrees of freedom of the general linear model. The intercept value is

β_{0} = 7023.879

, and Figure 3b shows the

clr

-scores of

β

.

L^{1}

-

clr

regularization creates a subcomposition that is both internally and externally independent [32], that is, both the balances within the parts of the subcomposition and the full balance between the parts of the subcomposition and the rest of the parts are all non-influential. In contrast, CoDa

L^{1}

-norm regularization relaxes the conditions and establishes only one subcomposition that is internally independent. In this context, the CoDa

L^{1}

-norm is somewhat more permissive. When comparing the results, we observe that both are quite similar; what stands out is the significance of the new variable

g_{s u b}

in the CoDa

L^{1}

-norm penalised linear model.

L^{1}

-

clr

regularization eliminates the balance

ln \frac{g_{s u b}}{g (x)}

without prior analysis. This observation prompts us to consider that the direct application of

L^{1}

-

clr

regularization might be premature. Furthermore, when dealing with a penalised model, it is always possible to subsequently test the nullity of any parameter [33].

L^{1}

-

clr

and CoDa

L^{1}

-norm regularization share the fact that both shrink the difference between

β_{i}^{*}

coefficients and a central measure, respectively, the mean and the median; consequently, each regularization technique generates a unique subcomposition with certain properties related to its influence on the dependent variable y. Because the goal of a CoDa analysis is to describe the subcompositional structure of the data, the use of the

L^{1}

-

clr

and CoDa

L^{1}

-norms in the penalty term leads to a result that has to be considered as limited. To overcome this limitation, the

L^{1}

-plr norm enables the construction of more than one internally independent subcomposition, which can better capture the subcompositional structure of the data regarding the variable y [7].

With the same data partition as executed in previous cases, we performed cross-validation to find the optimal lambda value for the

L^{1}

-plr estimator. The optimal parameter is

λ

= 69,669.31 (Figure 5).

Figure 5.

L^{1}

-plr: cross-validation MSE curve for different log-transformed values of the penalty parameter (

ln (λ)

). The circle (∘) is the arithmetical mean of the ten-fold CV. The red lines (above and below the mean) represent the value mean ± stdev, where stdev is the standard deviation of the ten-fold CV. The vertical line represents the log-transformed values of lambda.min = 69,669.31.

For

λ

= 69,669.31, the generalised LASSO (Equation (13)) splits the composition into six distinct subcompositions: five internally independent subcompositions on response variable y and one subcomposition comprising 14 parts actively contributing to the influential balances on the response variable y (see Table A1 to compare

L^{1}

-plr estimator with

L^{1}

-

clr

and

L^{1}

-CoDa estimators, and Table A2 to explore its subcompositional structure). Each of the five internally independent subcompositions related to y contributes to reducing the dimension of the linear model. This reduction is achieved by substituting each subcomposition with its geometric mean (

g - s u b c o m p_{k}, k = 1, \dots, 5

), following the approach outlined in the CoDa

L^{1}

-norm estimator. The intercept value is

β_{0} = 7023.879

and Figure 3c shows the

clr

-scores of

β

.

The

L^{1}

-plr estimator is the simplest and provides us with the most information about the subcompositional structure of the composition as regards the variable y.

6. Discussion

This paper has rigorously defined CoDa

L^{p}

-norms, providing a foundation for their application. The specific cases of the CoDa

L^{1}

,

L^{2}

, and

L^{\infty}

norms have been studied, interpreting these metrics in terms of log-ratios to enhance the reader’s understanding. Additionally, a unified treatment of three distinct

L^{1}

-norms tailored for compositional data has been presented in the context of a generalised LASSO problem. Through a detailed examination of the regularization effects of each norm, we have uncovered valuable insights. The

L^{1}

-

clr

norm is well suited for variable selection, creating a unique subcomposition that is both internally and externally independent. The CoDa

L^{1}

-norm, on the other hand, emphasises internal independence. Lastly, the

L^{1}

-plr norm showcases a balance selection effect. Consequently, the

L^{1}

-plr norm enables more detailed study of the subcompositional structure of the compositional covariate

x

in relation to the explained variable y.

In this article, we have expanded the methodological toolkit for performing penalised regression with compositional covariates. For low dimensions, our recommendation is to run penalised regression with the

L^{1}

-plr norm. However, we cannot ignore that variable selection becomes imperative for higher dimensions. Therefore, we suggest conducting an initial examination using the CoDa

L^{1}

-norm or

L^{1}

-plr norm to gain insights into the subcompositional structure. Following this analysis, it is possible to proceed with penalised regression employing the

L^{1}

-

clr

norm.

As part of our future work, we aim to investigate penalised regression models that effectively integrate both the

L^{1}

-plr and

L^{1}

-

clr

norms into the penalty term. This research is expected to offer deeper insight into the underlying structure of compositional data, allowing for a more thorough understanding. Moreover, our aim is to improve the flexibility of modelling, especially in datasets with high dimensionality. This holistic approach will contribute to advancing the applicability and effectiveness of penalised regression techniques in the context of compositional data analysis.

Author Contributions

Conceptualization, J.A.M.-F., J.S.-R. and G.M.-F.; Formal analysis, J.A.M.-F., J.S.-R. and G.M.-F.; Methodology, J.A.M.-F., J.S.-R. and G.M.-F.; Software, J.S.-R.; Supervision, J.A.M.-F. and G.M.-F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Agency for Administration of University and Research grant number 2021SGR01197, and Ministerio de Ciencia e Innovación grant number PID2021-123833OB-I00, and Ministerio de Ciencia e Innovación grant number PRE2019-090976.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

LASSO	Least Absolute Shrinkage and Selection Operator
CoDa	Compositional Data
clr	Centered Log-Ratio
$L^{1}$ -plr	Pairwise Log-Ration norm
$L^{1}$ - $clr$	Centered Log-Ration norm
TD	Total p-Deviation Function
MSE	Mean Squarred Error
OSQP	Operator-Splitting Quadratic Program

Appendix A

Table A1.

clr

-scores for the three LASSO estimators grouped into subcompositions.

Table A1.

clr

-scores for the three LASSO estimators grouped into subcompositions.

Taxon	L1-clr	L1-CoDa	L1-plr
Intercept	6563.19	7023.88	7244.88
g_Subdoligranulum	455.51	409.93	249.27
f_Lachnospiraceae_g_Incertae_Sedis	245.69	77.35
g_Dialister	229.23	182.53
g_Bacteroides	193.46	186.23
g_Dorea	159.97	104.06
g_Desulfovibrio	88.32	53.57
f_Peptostreptococcaceae_g_Incertae_Sedis	70.37	37.38	26.54
g_Faecalibacterium	63.74	33.07
g_Paraprevotella	45.15	38.71
f_Defluviitaleaceae_g_Incertae_Sedis	39.35	34.48	54.86
g_Alistipes	27.18	14.05
g_Clostridium_sensu_stricto_1			20.52
g_Brachyspira	4.33
g_Elusimicrobium		2.25
g_Butyricimonas	0.16
k_Bacteria_g_unclassified	−7.48	−33.79	−41.56
g_Streptococcus	−11.71
g_Catenibacterium	−21.83	−12.10	−19.66
g_Succinivibrio	−34.64	−31.51	−10.83
g_Lachnospira	−40.80	−4.21	−29.65
g_Parabacteroides	−52.41		−0.28
o_Clostridiales_g_unclassified	−111.82	−107.08	−85.66
g_Mitsuokella	−219.27	−192.13	−144.62
g_Collinsella	−233.08	−235.68	−228.94
g_Bifidobacterium	−263.23	−225.95	−154.34
f_Lachnospiraceae_g_unclassified	−626.20	−563.00	−365.84
g_subcomp		231.83
g_subcomp_1			212.52
g_subcomp_2			248.34
g_subcomp_3			41.25
g_subcomp_4			255.78
g_subcomp_5			−27.66

Table A2. Details of the subcompositional structure for the

L^{1}

-plr LASSO estimator.

Table A2. Details of the subcompositional structure for the

L^{1}

-plr LASSO estimator.

Taxon	$clr (β)$
g_Subdoligranulum	249.27
g_subcomp_1:	212.52
g_Bacteroides, g_Dialister
f_Defluviitaleaceae_g_Incertae_Sedis	54.86
g_subcomp_2:	248.34
f_Lachnospiraceae_g_Incertae_Sedis, g_Dorea, g_Faecalibacterium,
g_Alistipes, g_Desulfovibrio, g_Paraprevotella
f_Peptostreptococcaceae_g_Incertae_Sedis	26.54
g_Clostridium_sensu_stricto_1	20.52
g_subcomp_3:	41.25
g_Escherichia-Shigella, f_Ruminococcaceae_g_unclassified, g_Butyricimonas
g_subcomp_4:	255.78
g_Brachyspira, g_Barnesiella, g_Blautia, f_Rikenellaceae_g_unclassified,
g_Odoribacter, f_Erysipelotrichaceae_g_unclassified, g_Streptococcus,
g_Anaerostipes, g_Phascolarctobacterium, g_Acidaminococcus,
g_Anaerovibrio, g_Roseburia, g_Alloprevotella,
f_Erysipelotrichaceae_g_Incertae_Sedis, g_Megasphaera, g_Coprococcus,
g_Intestinimonas, g_Solobacterium, g_Oribacterium, g_Anaeroplasma,
g_Victivallis, f_Ruminococcaceae_g_Incertae_Sedis, o_NB1-n_g_unclassified,
g_Sutterella, o_Bacteroidales_g_unclassified, g_Prevotella, g_RC9_gut_group,
f_Christensenellaceae_g_unclassified, g_Anaerotruncus
g_Parabacteroides	−0.28
g_subcomp_5:	−27.66
g_Ruminococcus, g_Elusimicrobium, f_vadinBB60_g_unclassified
g_Succinivibrio	−10.83
g_Catenibacterium	−19.66
g_Lachnospira	−29.65
k_Bacteria_g_unclassified	−41.56
o_Clostridiales_g_unclassified	−85.66
g_Mitsuokella	−144.62
g_Bifidobacterium	−154.34
g_Collinsella	−228.94
f_Lachnospiraceae_g_unclassified	−365.84

References

Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267–288. [Google Scholar] [CrossRef]
Aitchison, J. The Statistical Analysis of Compositional Data; Chapman & Hall: London, UK, 1986. [Google Scholar]
Lin, W.; Shi, R.; Feng, R.; Li, H. Variable selection in regression with compositional covariates. Biometrika 2014, 101, 785–797. [Google Scholar] [CrossRef]
Shi, P.; Zhang, A.; Li, H. Regression analysis for microbiome compositional data. Ann. Appl. Stat. 2016, 10, 1019–1040. [Google Scholar] [CrossRef]
Lu, J.; Shi, P.; Li, H. Generalized linear models with linear constraints for microbiome compositional data. Biometrics 2019, 75, 235–244. [Google Scholar] [CrossRef] [PubMed]
Susin, A.; Wang, Y.; Lê Cao, K.A.; Calle, M.L. Variable selection in microbiome compositional data analysis. NAR Genom. Bioinform. 2020, 2, lqaa029. [Google Scholar] [CrossRef] [PubMed]
Saperas-Riera, J.; Martín-Fernández, J.; Mateu-Figueras, G. Lasso regression method for a compositional covariate regularised by the norm L¹ pairwise logratio. J. Geochem. Explor. 2023, 255, 107327. [Google Scholar] [CrossRef]
Egozcue, J.J.; Pawlowsky-Glahn, V. Groups of parts and their balances in compositional data analysis. Math. Geol. 2005, 37, 795–828. [Google Scholar] [CrossRef]
Pawlowsky-Glahn, V.; Egozcue, J.J. Geometric approach to statistical analysis on the simplex. Stoch. Environ. Res. Risk Assess. 2001, 15, 384–398. [Google Scholar] [CrossRef]
Billheimer, D.; Guttorp, P.; Fagan, W.F. Statistical Interpretation of Species Composition. J. Am. Stat. Assoc. 2001, 96, 1205–1214. [Google Scholar] [CrossRef]
Aitchison, J.; Bacon-Shone, J. Log contrast models for experiments with mixtures. Biometrika 1984, 71, 323–330. [Google Scholar] [CrossRef]
Van der Boogaart, K.G.; Tolosana, R. Analyzing Compositional Data with R; Use R! Springer: Berlin/Heidelberg, Germany, 2013. [Google Scholar]
Dave, A. Measurement of Central Tendency. In Applied Statistics for Economics; Horizon Press: Toronto, ON, Canada, 2014; Chapter 3. [Google Scholar]
Barceló-Vidal, C.; Martín-Fernández, J.A. The Mathematics of Compositional Analysis. Austrian J. Stat. 2016, 45, 57–71. [Google Scholar] [CrossRef]
Brezis, H. Functional Analysis, Sobolev Spaces and Partial Differential Equations; Universitext; Springer: New York, NY, USA, 2011. [Google Scholar]
Pawlowsky-Glahn, V.; Egozcue, J.J.; Tolosana-Delgado, R. Modeling and Analysis of Compositional Data; John Wiley & Sons: Chichester, UK, 2015. [Google Scholar]
Wu, J.R.; Macklaim, J.M.; Genge, B.L.; Gloor, G.B. Finding the Centre: Compositional Asymmetry in High-Throughput Sequencing Datasets. In Advances in Compositional Data Analisys; Springer: Berlin/Heidelberg, Germany, 2021; Chapter 17; pp. 329–342. [Google Scholar] [CrossRef]
Martín-Fernández, J. Measures of Difference and Non-Parametric Classification of Compositional Data. Ph.D. Thesis, Department of Applied Mathematics, Universitat Politècnica de Catalunya, Barcelona, Spain, 2001. [Google Scholar]
James, G.; Witten, D.; Hastie, T.; Tibshirani, R. Introduction to Statistical Learning, 2nd ed.; Springer: New York, NY, USA, 2021. [Google Scholar]
Bates, S.; Tibshirani, R. Log-ratio lasso: Scalable, sparse estimation for log-ratio models. Biometrics 2019, 75, 613–624. [Google Scholar] [CrossRef]
Monti, G.; Filzmoser, P. Sparse least trimmed squares regression with compositional covariates for high-dimensional data. Bioinformatics 2021, 37, 3805–3814. [Google Scholar] [CrossRef] [PubMed]
Monti, G.; Filzmoser, P. Robust logistic zero-sum regression for microbiome compositional data. Adv. Data Anal. Classif. 2022, 16, 301–324. [Google Scholar] [CrossRef]
Tibshirani, R.; Taylor, J. The solution path of the generalized lasso. Ann. Statist. 2011, 39, 1335–1371. [Google Scholar] [CrossRef]
Noguera-Julian, M.; Rocafort, M.; Guillén, Y.; Rivera, J.; Casadellà, M.; Nowak, P.; Hildebrand, F.; Zeller, G.; Parera, M.; Bellido, R.; et al. Gut Microbiota Linked to Sexual Preference and HIV Infection. eBioMedicine 2016, 5, 135–146. [Google Scholar] [CrossRef] [PubMed]
Rivera-Pinto, J.; Egozcue, J.J.; Pawlowsky-Glahn, V.; Paredes, R.; Noguera-Julian, M.; Calle, M.L. Balances: A new perspective for microbiome analysis. mSystems 2018, 3, e00053-18. [Google Scholar] [CrossRef] [PubMed]
Calle, M.; Susin, T.; Pujolassos, M. coda4microbiome: Compositional Data Analysis for Microbiome Studies; R Package Version 0.2.1. BMC Bioinf. 2023; 24, 82. [Google Scholar]
Palarea-Albaladejo, J.; Martín-Fernández, J.A. zCompositions—R package for multivariate imputation of left-censored data under a compositional approach. Chemom. Intell. Lab. Syst. 2015, 143, 85–96. [Google Scholar] [CrossRef]
Palarea-Albaladejo, J.; Martin-Fernandez, J. zCompositions: Treatment of Zeros, Left-Censored and Missing Values in Compositional Data Sets; R Package Version 1.5. 2023. Available online: https://cran.r-project.org/web/packages/zCompositions/zCompositions.pdf (accessed on 13 March 2024).
Martín-Fernández, J.; Hron, K.; Templ, M.; Filzmoser, P.; Palarea-Albaladejo, J. Bayesian-multiplicative treatment of count zeros in compositional data sets. Stat. Model. 2015, 15, 134–158. [Google Scholar] [CrossRef]
Fu, A.; Narasimhan, B.; Boyd, S. CVXR: An R Package for Disciplined Convex Optimization. J. Stat. Softw. 2020, 94, 1–34. [Google Scholar] [CrossRef]
Stellato, B.; Banjac, G.; Goulart, P.; Bemporad, A.; Boyd, S. OSQP: An Operator Splitting Solver for Quadratic Programs. Math. Program. Comput. 2020, 12, 637–672. [Google Scholar] [CrossRef]
Boogaart, K.; Filzmoser, P.; Hron, K.; Templ, M.; Tolosana-Delgado, R. Classical and robust regression analysis with compositional data. Math. Geosci. 2021, 53, 823–858. [Google Scholar] [CrossRef]
Hyun, S.; G’Sell, M.; Tibshirani, R.J. Exact post-selection inference for the generalized lasso path. Electron. J. Stat. 2018, 12, 1053–1097. [Google Scholar] [CrossRef]

Figure 1. The Manhattan distance based on the CoDa

L^{1}

-norm in the simplex

S^{3}

: the distance between two points

x = C [e^{3}, 1, e]

and

y = C [1, e^{2}, e]

in a grid-based system, where

C [\cdot]

is the closure operation, is represented by three paths (red, orange, and blue) of the same length (five units).

Figure 2.

L^{1}

-

clr

: cross-validation MSE curve for different log-transformed values of the penalty parameter (

ln (λ)

). The circle (∘) is the arithmetical mean of the ten-fold CV. The red lines (above and below the mean) indicate the mean ± stdev value, where stdev is the standard deviation of the ten-fold CV. The vertical line represents the log-transformed values of lambda.min = 35,769.42.

Figure 3. Comparison of the

clr (β)

parameter, with the taxon order maintained on the vertical axis to facilitate comparison: (a)

clr

-scores for the

L^{1}

-

clr

LASSO estimator, (b)

clr

-scores for the

L^{1}

-CoDa LASSO estimator, and (c)

clr

-scores for the

L^{1}

-plr LASSO estimator.

Figure 4. CoDa

L^{1}

-norm: cross-validation MSE curve for different log-transformed values of the penalty parameter (

ln (λ)

). The circle (∘) is the arithmetical mean of the ten-fold CV. The red lines (above and below the mean) represent the value mean ± stdev, where stdev is the standard deviation of the ten-fold CV. The vertical line represents the log-transformed values of lambda.min = 45,582.21.

Figure 5.

L^{1}

-plr: cross-validation MSE curve for different log-transformed values of the penalty parameter (

ln (λ)

). The circle (∘) is the arithmetical mean of the ten-fold CV. The red lines (above and below the mean) represent the value mean ± stdev, where stdev is the standard deviation of the ten-fold CV. The vertical line represents the log-transformed values of lambda.min = 69,669.31.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

L^p-Norm for Compositional Data: Exploring the CoDa L¹-Norm in Penalised Regression

Abstract

1. Introduction

2. Some Basic Concepts

2.1. Elements of the Aitchison Geometry

2.2. Norms and Measures of Central Tendency

3. $L^{p}$ -Norms on the Compositional Space

4. Penalised Regression with a Compositional Covariate

5. Study Case

6. Discussion

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Abbreviations

Appendix A

References

Article Metrics

Citations

Article Access Statistics

Lp-Norm for Compositional Data: Exploring the CoDa L1-Norm in Penalised Regression

Abstract

1. Introduction

2. Some Basic Concepts

2.1. Elements of the Aitchison Geometry

2.2. Norms and Measures of Central Tendency

3. L p -Norms on the Compositional Space

4. Penalised Regression with a Compositional Covariate

5. Study Case

6. Discussion

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Abbreviations

Appendix A

References

Article Metrics

Citations

Article Access Statistics

L^p-Norm for Compositional Data: Exploring the CoDa L¹-Norm in Penalised Regression

3. $L^{p}$ -Norms on the Compositional Space