Principal Amalgamation Analysis for Microbiome Data

In recent years microbiome studies have become increasingly prevalent and large-scale. Through high-throughput sequencing technologies and well-established analytical pipelines, relative abundance data of operational taxonomic units and their associated taxonomic structures are routinely produced. Since such data can be extremely sparse and high dimensional, there is often a genuine need for dimension reduction to facilitate data visualization and downstream statistical analysis. We propose Principal Amalgamation Analysis (PAA), a novel amalgamation-based and taxonomy-guided dimension reduction paradigm for microbiome data. Our approach aims to aggregate the compositions into a smaller number of principal compositions, guided by the available taxonomic structure, by minimizing a properly measured loss of information. The choice of the loss function is flexible and can be based on familiar diversity indices for preserving either within-sample or between-sample diversity in the data. To enable scalable computation, we develop a hierarchical PAA algorithm to trace the entire trajectory of successive simple amalgamations. Visualization tools including dendrogram, scree plot, and ordination plot are developed. The effectiveness of PAA is demonstrated using gut microbiome data from a preterm infant study and an HIV infection study.


Introduction
Microbiome, defined as the set of microorganisms inhabiting a specific biological niche, plays a critical role in the development, nutrition, immunity, and health of their host organisms when the microorganisms are host associated [1]. The human gut microbiome, for instance, is known to not only control digestion but also affect the immune and nervous systems of human [2][3][4]. Whether host associated or free living, microbial communities are indispensable components of their ecosystems, and a deep understanding of their community structure and their interactions with the environment could lead to important biological and ecological insights. Indeed, in recent years, microbiome studies have become increasingly prevalent and large-scale, in part due to the rapid advances in high-throughput sequencing technologies. Raw sequencing reads can now be processed by well-established analytical tools such as quantitative insights into microbial ecology (QIIME) and mothur, to produce abundance tables of operational taxonomic units (OTU) [5][6][7]. Since the number of sequencing reads per sample (i.e., library size) may vary dramatically, proper sampling and normalization procedures are further adopted to produce relative abundance data of the OTUs [8,9], from which downstream analysis is performed.
Microbiome data is complex in nature, subject to constraints such as compositionality, high dimensionality, zero inflation, overdispersion, and taxonomic hierarchy. Specifically, microbiome data, as presented in relative abundances or proportions, are compositional; each compositional vector resides in a simplex that does not admit the standard Euclidean geometry. Second, the data are often very sparse with a large portion of zeros, arising from either under-sampling or true absence of the corresponding taxa or outlier mechanism defined by Kaul et al. [10]. Third, the number of OTUs or taxa is often much larger than the number of samples, making the data analysis prone to many curses of dimensionality. In addition, a unique feature of microbiome data is the presence of the evolutionary history

Existing Work on Dimension Reduction of Microbiome Data
Microbiome data are often normalized as compositions [8,9], which reside in a simplex that does not admit the standard Euclidean geometry. It creates significant challenges for statistical analysis, as many standard methods do not directly apply. There have been developments on compositional data analysis based on transformations and the so-called Aitchison geometry [16,22,[25][26][27]. However, these transformations could be inadequate to accommodate the unique features of microbiome data such as zero inflation, over dispersion and the presence of taxonomic tree structure among microbes.
The existing data reduction approaches for microbiome compositional data can be summarized to four categories, namely, the indexing approach, the selection approach, the transformation-based approach, and the amalgamation-based approach. The indexing approach, which represents data by some diversity or complexity indices, typically disables taxon-level analysis and results in oversimplification of data [28][29][30]. The selection approach [31,32], which only keeps a subset of "dominant" compositions, often ignores intrinsic relations due to compositionality and can be vulnerable to the extremely low abundances of many OTUs. In practice, such selection could trivially end up with a few most prevalent ones [33]. The transformation-based approach conveniently utilizes existing reduction methods such as PCA after transforming data to the Euclidean space [25,[34][35][36][37]. The required transformations usually involve logarithm operations and cannot be directly applied on the excessive zeros in the microbiome data. An alternative is to use power transformations such as square-root transformation [36,38], which avoids zero replacement and puts the data onto the unit sphere to enable manifold-based PCA. However, these PCA approaches may compromise interpretation of the data in terms of individual taxon and impede incorporation of the extrinsic taxonomic tree structure. Other proximity-based methods such as PCoA [39,40] could accommodate several special features of the data but fail to pinpoint specific taxon that drive data reduction. While Aitchison's formal terminology of "amalgamation" may not be as widely spread as it should be, the operation itself has nevertheless been widely used in microbiome data analysis, although often as a pragmatic and rather ad-hoc way of dealing with the most rare compositional components in the data. For example, rare taxon with excessive number of zeros or low abundances at lower taxonomic ranks are aggregated to a higher rank for analysis [11,41]. It is also common in the microbiome analysis that rare taxon are simply removed by some ad-hoc filtering process [42]. These naive approaches may lead to unwanted information loss and potential conflicts between analyses performed at different ranks or with different filter rules.
Not until recently, a few studies on amalgamation-based dimension reduction emerged. Greenacre [43] and Greenacre et al. [23] argued that amalgamation provides an interpretable way to reduce the dimensionality of compositions and could make substantive sense in practical applications, despite the non-linearity in the Aitchison geometry of the simplex and its possibility to distort between-sample distances. They advocated for expert-driven amalgamation, i.e., the use of domain-knowledge to perform amalgamation, and proposed amalgamation-based hierarchical clustering with log-ratio transformed data. Quinn and Erb [24] further discussed the usage of amalgamation as an alternative to the commonlyadopted dimension reduction methods and proposed an optimization approach to preserve a suitable between-sample distance measure with centered log-ratio transformation. However, the method does not work without zero-replacement, and its genetic algorithm can be extremely computational intensive and hinders the incorporation of extrinsic information such as the taxonomic tree structure.

Setup with an Illustrative Example
To illustrate the proposed taxonomy-guided dimension reduction, we consider a preterm infant study conducted at a Neonatal Intensive Care Unit (NICU) in the northeast region of the U.S. Fecal samples of preterm infants were collected daily when available during the infant's first month of postnatal age. Bacterial DNA was isolated and extracted from each sample [44,45]; V4 regions of the 16S rRNA gene were sequenced using the Illumina platform and clustered and analyzed using QIIME [6] to produce microbiome abundance data. When infant reached 36-38 weeks of postmenstrual age, neurobehavioral outcomes were measured using the NICU Network Neurobehavioral Scale (NNNS) [45]. The main interest was examining the gut-brain axis, i.e., whether and how gut microbiome compositions during early postnatal stage impact later neurobehavioral outcomes.
The raw microbiome data is longitudinal and has more than one thousand operational taxonomic units (OTU); these OTUs were classified up to the genus level using the Ribosomal Database Project (RDP) Classifier [46]. For the purpose of illustration, we consider the average compositions over the postnatal period at the genus level, which results in a single dataset with n = 34 subjects and p = 62 taxa. Figure 1a displays a heatmap of the data and Figure 1b shows the relative abundance barplot of the data. Figure 2 displays the taxonomic tree of the 62 taxa up to the genus level. Taxon 1  Taxon 2  Taxon 3  Taxon 4  Taxon 5  Taxon 6  Taxon 7  Taxon 8  Taxon 9  Taxon 10  Taxon 11  Taxon 12  Taxon 13  Taxon 14  Taxon 15  Taxon 16  Taxon 17  Taxon 18  Taxon 19  Taxon 20  Taxon 21  Taxon 22  Taxon 23  Taxon 24  Taxon 25  Taxon 26  Taxon 27  Taxon 28  Taxon 29  Taxon 30  Taxon 31  Taxon 32  Taxon 33  Taxon 34  Taxon 35  Taxon 36  Taxon 37  Taxon 38  Taxon 39  Taxon 40  Taxon 41  Taxon 42  Taxon 43  Taxon 44  Taxon 45  Taxon 46  Taxon 47  Taxon 48  Taxon 49  Taxon 50  Taxon 51  Taxon 52  Taxon 53  Taxon 54  Taxon 55  Taxon 56  Taxon 57  Taxon 58  Taxon 59  Taxon 60  Taxon 61  Taxon 62 Subject  To set up, suppose we observe n independent compositional samples on p taxa; let x = (x 1 , . . . , x n ) T = ( x 1 , . . . , x p ) = [x ij ] n×p be the observed n × p compositional data matrix, where each row x i , i = 1, . . . , n, lies in S p−1 , with S p−1 = {x ∈ R p : ∑ p j=1 x j = 1, x j ≥ 0, j = 1, . . . , p} representing a (p − 1)-simplex in R p . As seen from the heatmap in Figure 1, microbiome data is often very sparse with the presence of many rare taxa; even after being aggregated to the genus level, the percentage of zero entries in the data is still close to 40%.
In addition, we assume the availability of a taxonomic tree structure of the p taxa. Some general terminologies of a tree structure are defined as follows. Let T represent a p-leafed taxonomic tree, I(T) the set of internal nodes, and |T| the total number of nodes in a tree. Each leaf node of the tree corresponds to a taxon, and each internal node corresponds to a group of taxa. We follow the commonly used notions of child, parent, sibling, descendant, and ancestor to describe relations between nodes. Let the depth of a node E, denoted as D(E), be the number of ancestors from the node to the root, and let the depth of a tree, denoted as D * (T), be the maximum depth of its leaf nodes. For a leaf node E, we use A * (E) to denote its lowest multi-child ancestor that has more than one child. For example, in Figure 2, the depth of Taxon 1 is 1, while that of Taxon 2 is 5. The lowest multi-child ancestor of Taxon 12 is its parent, while that of Taxon 13 is its grandparent as its parent has only one child; that is, they share the same lowest multi-child ancestor. Taxon 26 and Taxon 27, on the other hand, do not share the same parent nor the lowest multi-child ancestor. We remark that we do not require the tree to be "complete", which means that different taxa could be classified at different taxonomic ranks. For instance, Taxon 1 is classified at the Phylum level, and Taxa 10 is only identified at the Class level.  Figure 2. The NICU data: Taxonomic structure of the 62 taxa at the genus level. Taxon 2 and Taxon 3 (dark green nodes) share the same parent and are at the same taxonomic rank. Taxon 12 and Taxon 13 (light green nodes) are at different ranks but they share the same lowest multi-child ancestor. Taxon 26 and Taxon 27 (red nodes) do not share the same lowest multi-child ancestor.

Taxonomy-Guided Principal Amalgamation Analysis
Analogous to PCA, which finds a number of principal components to best preserve the total variation in the data, PAA aims to aggregate the compositions to a smaller number of principal compositions, guided by the available taxonomic structure, to best preserve a proper measure of information in the data.
In this section, we start with a general framework of PAA as an information-preserving optimization procedure in Section 4.1. To achieve scalable computation and conveniently utilize the taxonomic structure, a hierarchical agglomerative PAA (HPAA) algorithm is developed in Section 4.2. The computational details with various loss functions of interest are provided in Section 4.3, and the graphical tools for visualizations of PAA are illustrated in Section 4.4.

Framework
The amalgamation is a fundamental operation for compositions; it is formally defined as follows.
Then, y = Rx gives an amalgamation of x when R ∈ M 0 (k, p).
The matrix R = (r 1 , . . . , r k ) T ∈ R k×p is called an amalgamation matrix. It is clear that the k many r j vectors represent k mutually exclusive and exhaustive subsets of the p components in x, and each r T j x then computes a sum of the components of x in the jth subset. The operation of amalgamation reduces the original compositional vector in S p−1 to a simplex with dimension at most k, as it is possible that ∑ p j=1 r ij = 0 for some i = 1, . . . , k. Apparently, this operation may result in a loss of information whenever k < p.
Naturally, given the observed data X, PAA can be formulated as a set of optimization problems: for k = 2, . . . , p − 1, where L(·) is a properly specified loss function that measures certain information reduction from X to X R T . Borrowing the terminology from PCA, we call the resulting R k matrix as the loading matrix or the principal amalgamation matrix and the amalgamated data X R T k as the score matrix. Subsequently, the k amalgamated vectors, X r 1 , . . . , X r k are called principal compositions. The construction of the loss function is flexible, and it is tied to the choice of how to measure the information in the data. For microbiome data or compositional data in general, of particular interest in practice is to measure the information loss by the reduction in some diversity index, for preserving a specific aspect of diversity in the original data as much as possible. Popular choices include the family of α diversity such as Simpson's diversity index (SDI) and Shannon-Wiener index (SWI), which measures within-sample diversity, and the family of β diversity such as Bray-Curtis dissimilarity (BC) and Weighted UniFrac (WUF), which measures between-sample diversity [29,[47][48][49]. There are also entropy-based or model-based measures that incorporate different aspects of the above indices [50][51][52][53][54]. Other methods for constructing the loss function include the likelihood approach, i.e., through making distributional assumption of the data (e.g., Dirichlet), and the transformation approach, i.e., through transforming the data to the Euclidean space such that familiar statistics such as sample variance can be used. To focus on the main idea, we defer the detailed discussions on these choices in Section 4.3.
However, due to the complex structure of M 0 (k, p), conducting PAA through exactly solving the optimization problems in Equation (1) would be computationally prohibitive when p is large, not even to mention that utilizing the taxonomic structure may introduce further complications. Intriguingly, we realize that PAA can be pursued from a cluster analysis perspective. With any specified number of principal compositions, the objective of PAA is essentially to search for a cluster pattern of the compositions, such that each cluster of compositions aggregates into a new principal composition.

Hierarchical PAA Algorithm
Motivated by the connection between PAA and clustering, we develop an agglomerative hierarchical PAA (HPAA) algorithm to utilize taxonomic structure and enable scalable computation. Our approach starts from the original compositions and gradually amalgamates them through a sequence of simple amalgamations, i.e., at each step only a single pair of compositions is being aggregated. As such, HPAA generates the entire path of the simple amalgamations for reducing the data from its most informative original form to an utterly non-informative vector of ones. Along this process, a dendrogram of the successive amalgamations and the associated information losses is naturally generated.
We first describe the HPAA algorithm and the growth of the associated dendrogram in its basic form, without consideration of taxonomic guidance. To initialize, let t = 0 and denote X 0 = X = (x 1 , . . . , x n ) T = ( x 1 , . . . , x p ) as the original n × p compositional data matrix. We start from the p taxa in the original data X 0 , each of which forms its own cluster. Let S 0 = {{1}, . . . , {p}} be the initial partition of the p taxa, which correspondingly forms the initial leaf nodes of the dendrogram, denoted as E 0,1 , . . . , E 0,p .
At the tth step, for t = 1, . . . , p − 1, let S t−1 denote the set of |S t−1 | (i.e., p − t + 1) nodes and X t−1 denote the n × |S t−1 | current amalgamated data from the last step. With these inputs, the core problem is to search for a pair of the current nodes, (E t−1, j , E t−1, j ), to be aggregated into a new node such that the information loss of the amalgamated data is minimized. That is, where R(j, j ) is a simple amalgamation matrix in M 0 (|S t |, |S t−1 |) that aggregates the jth and j th columns of X t−1 , and P t−1 is the active set of "legitimate" pairs of nodes that can be amalgamated. For instance, if no restriction is imposed, we set P t−1 = {(j, j ); 1 ≤ j < j ≤ |S t−1 |}, consisting of all possible pairs of the current leaf nodes. With the solution from Equation (2), we then update and denote the reduced set of nodes as E t,1 , . . . , E t,p−t . For example, if at the first step (t = 1), E 0,1 and E 0,2 are chosen to be combined, we have . , x p ), and the new reduced set of nodes are denoted as E 1,1 , . . . , E 1,p−1 .
The above procedure is repeated until only two nodes are left; they are then bound to be combined as a vector of ones. The proposed algorithm is summarized in Algorithm 1.
We propose three levels of taxonomy guidance: unconstrained, weak taxonomic hierarchy, and strong taxonomic hierarchy, which, as the names suggest, produce amalgamation patterns with different degrees of conformity with the taxonomic tree. It all boils down to properly set the active set of the paired nodes P t−1 in solving Equation (2). Moreover, when either weak or strong taxonomic hierarchy is enforced, the successive growth of the dendrogram through guided amalgamations is always coupled with the successive reduction of the taxonomic tree.

•
Unconstrained. In each step, we search over all possible pairs of nodes in solving Equation (2), • Weak taxonomic hierarchy. In each step, we only search over pairs of nodes that share the same lowest multi-child ancestor in the reduced taxonomic tree. That is, Equation (2) is solved over For example, consider the first step of HPAA (t = 1) with the p = 62 leaf nodes in Figure 2. Both (Taxon 2, Taxon 3) and (Taxon 12, Taxon 13) are in P 0 , while (Taxon 26, Taxon 27) is not. • Strong taxonomic hierarchy. In each step, we further restrict the search to be among pairs of nodes that have the largest depth in the taxonomic tree. As a result, taxa at the lowest taxonomic rank will always be aggregated first. That is, Equation (2) is solved over For example, in Figure 2, the pair (Taxon 2, Taxon 3) remains in P 0 , while (Taxon 12, Taxon 13) is no longer in P 0 as they are not of the lowest taxonomic rank.

Construction of Loss Function with Common Diversity Measures
We illustrate the implementation of HPAA using loss functions constructed from several commonly-used α diversity and β diversity measures.
The α diversity measures the richness (number of different entities) and evenness (the homogeneity in abundance of the entities) within each compositional sample. It can be calculated for each sample in the data, i.e., α(x i ) for i = 1, . . . , n. In general, larger α(x i ) indicates larger within-sample diversity among species, and the index is non-increasing along successive amalgamations. Therefore, a general loss function based on α diversity can be constructed as

Simpson's Diversity Index (SDI)
Consider first the Simpson's diversity index (SDI), defined as where x ij represents the abundance of the jth components in the ith sample with x i ∈ S p−1 .
The SDI can be understood as the probability that two individuals randomly selected from a sample will belong to different species. A small SDI indicates that a few components dominate, while a large SDI indicates a diverse and balanced distribution among components. It is seen that SDI is non-increasing along successive amalgamations, as where x j,t=1 denotes the jth column of X t−1 , which is equivalent to find the minimal offdiagonal element of X T t−1 X t−1 within the active set specified by P t−1 . We remark that in each step only two columns of the amalgamated data are affected; this is utilized to simplify the computation.

Shannon-Wiener Index (SWI)
Unlike the SDI which weights more on dominant components, the Shannon-Wiener index (SWI) is equally sensitive to rare and dominant components, defined as The logarithmic transformation is applied entrywisely on the enclosed vector or matrix. As such, to compute SWI, we do need to first replace zeros in the data. The SWI is non-increasing along successive amalgamations since x ij log x ij + x ij log x ij ≤ (x ij + x ij ) log(x ij + x ij ) for x ij , x ij > 0. Therefore, with the loss function form in Equation (3), the general PAA criterion in Equation (1) becomes min R∈M 0 (k,p) tr{RX T log(X R T )}, and the tth step simple amalgamation problem in Equation (2) becomes While the α diversity focuses on within-sample diversity, the β diversity reflects the between-sample differences. It can be calculated for each pair of samples in the data, i.e., β(x i , x i ) for i, i = 1, . . . , n, resulting in a between-sample distance or dissimilarity matrix D(X) = [β(x i , x i )] n×n . As such, PAA aims to best preserve the dissimilarity pattern in the amalgamated data. We thus construct the loss function based on β diversity as the sum of squared differences between the original distance matrix and that of the amalgamated data,

Bray-Curtis Dissimilarity Index (BC)
Consider the Bray-Curtis dissimilarity index (BC) defined as for any pair of samples x i , x i ∈ S p−1 . Note in the context of compositional data, the Bray-Curtis dissimilarity is simplified to the Manhattan (City-Block) distance. It is non-increasing along successive amalgamations as min(x ij , . The tth step simple amalgamation problem in Equation (2) can be expressed as

Weighted UniFrac Distance (WUF)
The weighted UniFrac distance (WUF) further incorporates information from the phylogenetic or taxonomic tree when computing the between-sample distance. It can also be viewed as a plug-in estimate of the Wasserstain distance between two probability distributions defined on the taxonomic tree [55]. It is commonly used in exploratory microbiome data analysis and a number of variants were developed. To mention a few, double principal coordinate analysis (DPCoA) proposed by Pavoine et al. [56] generalized PCA by incorporating the relationship among variables from the phylogenetic structure that can be described using dissimilarity measures like UniFrac or weighted UniFrac. Chen et al. [57] compared the power of statistical tests using a number of variants of UniFrac including unweighted/weighted UniFrac and generalized UniFrac. Randolph et al. [11] proposed a kernel-based regression framework that incorporates the unweighted/weighted UniFrac dissimilarity matrix from the phylogenetic structure.
Here we briefly illustrate HPAA with weighted UniFrac distance. For any pair of samples where l j for j = 1, . . . , p denotes the length of the j branch, i.e., the length between the node for jth entity and its parent, and L j denotes the distance of jth entity from the root node of the phylogenetic tree. Here the length of branches may change with compositions at lower levels of taxonomic tree amalgamated to higher level.
At the tth step, the simple amalgamation problem in Equation (2) can be expressed as where l j,j denotes the the length between the newly formed entity from jth and j th entities and its parent, and L j,j denotes the distance of new entity from the root node of the phylogenetic tree. During the successive amalgamations, the lengths of branches in computing WUF are also getting updated.

Visualization with Examples
We use the NICU data to illustrate the graphical tools developed for visualizing the PAA results. These tools can be extremely useful for visualizing and understanding compositional data, as well as helping to determine the desired number of principal compositions in practice.

Dendrogram
We construct a HPAA dendrogram to simultaneously visualize both the tree diagram of the successive amalgamations and the taxonomic structure of the taxon. To illustrate, Figure 3 shows the HPAA dendrogram from performing HPAA with SDI loss and strong taxonomic hierarchy on the NICU data. The top part of figure shows the dendrogram of amalgamations, where the y-axis shows the percentage decrease in total diversity as measured by SDI (on the log-scale) along the successive amalgamations, from the bottom to the top. As such, any horizontal cut of the dendrogram at a desired level of diversity loss/preservation shows the corresponding amalgamated data. In particular, each red dashed horizontal line indicates the steps at which the original data are aggregated to a higher taxonomic rank. It shows that, for example, aggregating data to the order level (22 taxa or principal compositions left) through HPAA leads to 22.3% loss in total SDI. At the bottom part, we use color bars to show taxonomic structure of the taxa (as shown in Figure 1), where in each horizontal bar taxa of the same color belong to the same category of that rank. Taxon 50  Taxon 48  Taxon 51  Taxon 49  Taxon 54  Taxon 55  Taxon 56  Taxon 52  Taxon 53  Taxon 57  Taxon 61  Taxon 62  Taxon 60  Taxon 58  Taxon 59  Taxon 47  Taxon 45  Taxon 46  Taxon 42  Taxon 43  Taxon 44  Taxon 40  Taxon 41  Taxon 25  Taxon 26  Taxon 38  Taxon 36  Taxon 37  Taxon 29  Taxon 30  Taxon 31  Taxon 24  Taxon 34  Taxon 33  Taxon 35  Taxon 28  Taxon 27  Taxon 32  Taxon 10  Taxon 19  Taxon 20  Taxon 22  Taxon 23  Taxon 17  Taxon 18  Taxon 21  Taxon 15  Taxon 12  Taxon 13  Taxon 14  Taxon 11  Taxon 16  Taxon 39  Taxon 7  Taxon 8  Taxon 9  Taxon 1  Taxon 5  Taxon 6  Taxon 4  Taxon 2  Figures S1 and S2 in Supplementary Information A show HPAA dendrograms with SDI loss and BC loss, respectively, under all three levels of taxonomy guidance. Not surprisingly, the patterns of amalgamations vary under different settings. Without taxonomic constraint, the change in diversity appears to be very smooth along the amalgamations, but the resulting principal compositions may not be easily interpretable, as indicated by the mixed color patterns in the color bars of the taxonomic rank. On the other hand, for the setting of strong taxonomic hierarchy, while the principal compositions are forced to closely follow the taxonomic structure, the percentage change in diversity tends to exhibit dramatic jumps, especially at the steps that the last remaining taxon at a lower taxonomic rank is forced to be aggregated to a higher rank. As a compromise, for the setting of weak taxonomic hierarchy, the resulting principal compositions remain interpretable, and the percentage change in diversity remains smooth and can be quite close to that of the unconstrained setting in the early stage of amalgamations.
The HPAA dendrogram also reveal several interesting insights on the microbiome of preterm infants. As shown in Figure 3, while Taxa 49-56 are all genus of the Enterobacteriaceae family, the pattern of amalgamation suggests that Taxon 50, which is Klebsiella, is distinctive with the rest. It turns out that Klebsiella is a genus of Enterobacteriaceae that has emerged as a significant nosocomial pathogen in neonates [58,59], and its species have been implicated as a cause of various neonatal infections [60,61] and neonatal sepsis [62,63].

Scree Plot
The scree plot shows the percentage change in the diversity loss as a function of the number of principal compositions. Figure 4 shows the scree plots from performing HPAA on the NICU data under different settings. The difference among the three levels of taxonomic guidance is very revealing, which confirms the previous observation from the dendrograms that the setting of weak taxonomic hierarchy reaches a good balance between preserving information and interpretability.

Ordination Plot
We construct an ordination plot to visualize the changes in the between-sample distance patterns before and after HPAA with any given number of principal compositions. Specifically, we perform the non-metric multidimensional scaling (NMDS) analysis with Bray-Curtis dissimilarity on the combined original data and the principal compositions from HPAA, which produces a low-dimensional ordination plot of all samples before and after amalgamation. For each sample, it is represented by a pair of points from either the original data or the principal compositions; the smallest circle that covers the pair is drawn, whose radius then indicates the level of distortion due to HPAA data reduction. Figure 5 shows the ordination plots from performing HPAA on the NICU data with three different loss functions and weak taxonomic hierarchy, in which 20 principal compositions are kept. All three settings preserve the between-sample diversity reasonably well, as indicated by the fact that the circles generally have a small radius; as expected, HPAA with the BC loss performs the best as it directly targets on preserving between-sample diversity. Figure S3 in Supplementary Information A shows the ordination plots from performing HPAA on the NICU data with the BC loss and weak taxonomic hierarchy, with different numbers of principal compositions. As expected, the larger the number of principal compositions, the better the preservation of the between-sample diversity and the less reduction of the size of the data. In practice, such plots, together with the associated statistics, could be very useful in determining the appropriate number of principal compositions.

Comparison with Competing Methods
To the best of our knowledge, a highly relevant competitor of PAA is the approach proposed by Quinn and Erb [24]; it is implemented in an R package amalgam that is available from GitHub repository amalgam https://github.com/tpq/amalgam (accessed on 13 June 2022). For any prespecified dimension of the amalgamated data, the method, referred to herein as amalgam, maximizes the correlation between the two Euclidean distance matrices computed from centered log-ratio transformed data before and after amalgamation. Due to the combinatory and nonconvex nature of the problem, a genetic algorithm was proposed to conduct local optimization. Besides amalgam, a similar amalgamation method by Greenacre [43] is implemented in an R package easyCODA, which is based on preserving the variance of log-ratio transformed data. The method, referred to as ACLUST, performs hierarchical clustering, in which two clusters that give the least loss of variation in the log-ratio transformed data [64] are amalgamated at each step. ACLUST can be regarded as an unconstrained HPAA with a transformation-based loss function that requires zero replacement. Both amalgam and ACLUST algorithms can be extremely computational intensive and hinder the incorporation of the taxonomic tree structure. We also consider naive prevalence-based filtering method that simply discard taxa with low abundance.

Simulation
We first compare the computation efficiency of HPAA, amalgam, and ACLUST. The results show that HPAA is very computationally efficient and scales well with the increase of the dimension or the sample size. In contrast, amalgam and ACLUST are very computationally intensive even for moderately large p or n, making it unsuitable for large-scale microbiome studies.
We also compare different dimension reduction methods on how well they preserve the between-sample distance pattern, which is very importance in many biological applications. The results show that HPAA methods with different loss functions outperform the baseline, the amalgam, and the ACLUST methods. PAA with the Bray-Curtis loss performs the best, as it directly aims at preserving the Bray-Curtis dissimilarity. To our surprise, the amalgam method performs worse than the baseline, which may be due to its requirement of zero-replacement and log-ratio transformation and the slow convergence of its genetic algorithm. Moreover, the ACLUST method performs even worse than the amalgam. See details in Supplementary Information B.

Application: Microbiome and HIV Infection
Understanding the association between microbiome richness and HIV-1 risk may help to design novel interventions to improve HIV-1-associated immune dysfunction. Here we considered a cross-sectional HIV microbiome study conducted in Barcelona, Spain, that included both HIV-infected subjects and HIV-negative controls [65]. Gut microbiome data were obtained from MiSeq 16S rRNA sequence data on fecal microbiomes and bioinformatically processed with mothur. The main goal of the study is to find the association between HIV transmission group (MSM: men who have sex with men vs non-MSM), HIV infection status and relative abundance of microbiome composition. As reported by Noguera-Julian et al. [65], risk factors related with sexual preference such as MSM and non-MSM might greatly affect the gut microbiome composition, and thus the relative abundance of taxa might be able to identify the risk clusters of subjects. Following Quinn and Erb [24], the microbiome abundance data were preprocessed to produce a genus-level relative abundance data matrix for p = 60 taxa and n = 128 HIV-infected subjects, including 60 MSM and 55 non-MSM subjects. The percentage of zeros is 36.6%. The taxonomic tree structure of the p = 60 taxa was also available as extrinsic information.
With this dataset, we compared different dimension reduction methods in terms of their performance on preserving the between-sample distance and on the classification accuracy of the MSM factor of subjects with the reduced data. For amalgam, the number of amalgamations is fixed at k = 20. We omitted ACLUST as its implementation in the R package easyCODA becomes computationally infeasible for the dimension p = 60. To be comparable, we use HPAA with BC loss and weak taxonomic hierarchy to produce k = 20 principal compositions. We also include a simple prevalence-based filtering method that only keeps the k = 20 taxa with the highest prevalence. Figure 6 shows the ordination plots of the three methods. The average Euclidean distance (with standard deviation in brackets) between the points representing the original compositions and principal compositions are 0.05 (0.04), 0.09 (0.07), 0.11 (0.09) for HPAA, the naive method, and the amalgam method, respectively. It is revealing that HPAA performs the best in preserving the between-sample distances of the data, which is partly owing to the proper use of the taxonomic structure. To evaluate the predictive performance of the dimension reduction methods in downstream classification analysis, we conduct an out-of-sample random splitting procedure. For completeness, we also included transformation-based methods, namely, PCA and PCoA with log-ratio transformed data and k = 20. In each run, we random split the original data into a training set of 80% samples and a testing set of 20% samples. Each of the three methods is applied on the training data to produce k = 20 features, which are then used as predictors to train a logistic regression model of MSM status. The trained feature construction approach and the logistic regression model are then applied to the testing data, for which the classification performance is measured by the value of the area under the receiver operating characteristic curve (AUC). The whole procedure is repeated 100 times. The average AUC values are 0.84 (0.07), 0.83 (0.08), 0.82 (0.07), 0.82 (0.07), 0.83 (0.08) for HPAA, the naive method, the amalgam method, PCA, and PCoA, respectively. We see that the principal compositions from HPAA leads to slightly improved classification. The results showcase the potential of HPAA for improving downstream statistical analysis on both interpretability and prediction.

Discussion
We have developed a new approach, principal amalgamation analysis, to perform dimension reduction of microbiome compositional data. The proposed method aggregates the compositions to a smaller number of principal compositions, by minimizing a userspecified loss function subject to conformity to the taxonomic structure. We hope to advocate using it as a preprocessing tool to reduce the dimension of highly-sparse OTUlevel relative abundance data.
It is also of particular interest to generalize the current framework to other microbiome data, e.g., metagenomes, metaproteomes, among others. Although we mainly focus on the microbiome compositional data in the proposed principal amalgamation analysis, the proposed framework is applicable as long as the amalgamation operation is meaningful for a particular data type and an appropriate loss function can be used to measure the information loss before and after the amalgamation.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/genes13071139/s1, Section A: Visualization Example; Section B: Simulation; Figure S1: HPAA dendrograms with SDI and different constraints on taxonomic hierarchy for the NICU data; Figure S2: HPAA dendrograms with Bray-Curtis and different constraints on taxonomic hierarchy for the NICU data; Figure S3: 2D NMDS ordination plots for comparing original data and different numbers of principal compositions from HPAA with Bray-Curtis and weak taxonomic hierarchy for the NICU data; Figure S4: Simulation on average running time (in second) of HPAA methods with Simpson's index (SDI), Shannon's index (SWI) and Bray-Curtis dissimilarity (BC), and the amalgam method by Quinn and Erb [24]. Figure S5: Simulation on accuracy in preserving between-sample Bray-Curtis dissimilarity after dimension reduction.
Author Contributions: Conceptualization and methodology, K.C. and G.L.; analysis design, K.C.; method implementation and numerical analysis, Y.L. All authors have read and agreed to the published version of the manuscript. Data Availability Statement: Source code and the HIV data for numerical studies are available in a public GitHub repository https://github.com/LiYanStat/paaPack(accessed on 13 June 2022). The NICU data from preterm infant study that support the findings in this paper are provided by Dr. Xiaomei Cong. Restrictions apply to the availability of these data, which were used under license for this study. Data are available at https://figshare.com/s/8f0d7f9a5078c2030c2a (accessed on 13 June 2022). with the permission of Dr. Xiaomei Cong.

Conflicts of Interest:
The authors declare no conflict of interest.