Impact of Experimental Bias on Compositional Analysis of Microbiome Data

Microbiome data are subject to experimental bias that is caused by DNA extraction and PCR amplification, among other sources, but this important feature is often ignored when developing statistical methods for analyzing microbiome data. McLaren, Willis, and Callahan (2019) proposed a model for how such biases affect the observed taxonomic profiles; this model assumes the main effects of bias without taxon–taxon interactions. Our newly developed method for testing the differential abundance of taxa, LOCOM, is the first method to account for experimental bias and is robust to the main effect biases. However, there is also evidence for taxon–taxon interactions. In this report, we formulated a model for interaction biases and used simulations based on this model to evaluate the impact of interaction biases on the performance of LOCOM as well as other available compositional analysis methods. Our simulation results indicate that LOCOM remained robust to a reasonable range of interaction biases. The other methods tend to have an inflated FDR even when there were only main effect biases. LOCOM maintained the highest sensitivity even when the other methods could not control the FDR. We thus conclude that LOCOM outperforms the other methods for compositional analysis of microbiome data considered here.


Introduction
Experimental bias is a pervasive feature of microbiome data because each step in the sequencing experiment workflow (i.e., DNA extraction, PCR amplification, amplicon sequencing, and bioinformatics processing) favors certain taxa over others [1][2][3].Absent experimental procedures that produce unbiased data, it is therefore necessary to account for experimental bias when analyzing microbiome data.Fortunately, McLaren, Willis, and Callahan (MWC) [4] have proposed a model explaining how experimental bias affects microbiome data.In this model, the observed relative abundance of each taxon is a product of the taxon's true relative abundance and a taxon-specific bias factor, normalized over all taxa observed in the sample.Each taxon-specific bias factor represents the accumulation of multiplicative biases over all steps in the experimental pipeline so that multiple sources of bias are described by a single factor.With the MWC model in mind, we developed LOCOM [5], a logistic regression-based compositional analysis method for detecting differentially abundant taxa, that only estimates parameters that are free from bias, i.e., are not affected by bias factors.To the best of our knowledge, LOCOM was the first method that accounted for experimental bias and was shown both analytically and numerically to be fully robust to any bias that follows the MWC model; none of the other existing methods for testing taxon differential abundance have considered experimental bias.
The MWC model assumes no between-taxon "interaction" bias, i.e., the presence or abundance of one taxon does not affect the bias factors of any other taxa in the sample.LOCOM may not be bias-robust if the "main effect" biases of the MWC model were supplemented by interaction biases.In fact, Zhao and Satten [6] investigated the existence of interaction biases by expanding the MWC model to include interaction terms and then fitting this generalized model to mock community samples [7].They found a small amount of evidence for interaction biases, albeit with smaller magnitudes than the main effect biases in the MWC model.This finding is the motivation for this study.Here, we formulate a new model for generating interaction biases and use simulations based on this model to evaluate the impact of these interaction biases on the performance of LOCOM, as well as a number of existing approaches (ANCOM [8], ANCOM-BC [9], fastANCOM [10], ALDEx2 [11], WRENCH [12], DACOMP (v1.26) [13], LinDA [14], and the Wilcoxon ranksum test of log-ratio transformed data after adding pseudo counts of either 0.5 or 1).Thus, this report provides the first assessment of the impact of interaction biases on the analysis of microbiome data.Note that this assessment can only be achieved by simulations instead of real data since we cannot control the nature of the bias even using model community data.

Methods
We generalize the MWC model using the framework of Zhao and Satten [6] by adding an interaction bias θ jj to the log-linear model: where π ij is the true relative abundance of taxon j in sample i, p ij is the expected value of the observed relative abundance, γ j is the main effect bias for taxon j in the MWC model, and α i is the sample-specific normalization factor that ensures the compositional constraint on p ij .We followed [6] to introduce the effects of covariates X i on taxon j by replacing log(π ij ) in (1) with log(π 0 j ) + X T i β j , where β j contains the effect sizes, and π 0 j is the true relative abundance of taxon j when X i = 0. Note that, similar to all compositional analysis methods, the effect size parameters β j here characterize differences in "absolute" abundances rather than "relative" abundances; relative abundances are further modulated by the normalization factor α i which is a function of all the β j s.The interaction θ jj determines the extent to which taxon j affects the bias factor for taxon j.The assumption underlying the MWC model is θ jj = 0 for all j and j .We adopt the following model for generating the interaction biases relative to the main effect biases in our simulation studies: where φ is a constant for all taxa pairs that we call the magnitude of interaction biases, and jj is a non-negative error term with a mean of one.The value of φ and the minus sign in (2) reflect the findings of Zhao and Satten [6], who considered a model similar to (1) but with an interaction that depended on the presence of a taxon rather than its relative abundance.In particular, they considered the model where θ jj determines the extent to which the presence of taxon j affects the bias factor for taxon j.Using data from all seven taxa in the Brooks mock community samples, they estimated the main effect biases γ j s and interaction biases θ jj s and presented the results in their Table 4.For the convenience of our readers, we display their results in Figure 1.First, we observe that the interaction biases have smaller magnitudes than the main effect biases; we calculated the mean of the interaction-main effect ratios (i.e., | θ jj /γ j |) to be 0.204.Because θ jj in model ( 3) corresponds to the product θ jj π ij in our model ( 1) and because most of the Brooks samples have three taxa with equal proportions, i.e., π ij = 1/3, we obtained the (approximate) mean of |θ jj /γ j | and, thus, an estimate for φ of 0.612.In our simulations, we varied the value of φ from 0 to 4, which includes our empirical estimate of 0.612 while also extending the range of φ to allow for stronger interaction biases than those found in the Brooks data.In addition, Figure 1 shows that the interaction bias in taxon j caused by taxon j has an opposite sign to the sign of the main effect bias in taxon j , except for some very small interaction biases.This implies that if taxon j has a measured abundance higher than its true abundance, it is associated with decreased measured abundances of other taxa, which seems plausible if we assume that only a relatively constant total number of amplicons are sequenced.Finally, interaction biases are not necessarily symmetric between a pair of taxa, as also noted by Zhao and Satten [6], so our model (2) also does not guarantee such symmetry.

Main effect biases & interaction biases
Bias value Figure 1.Main effect biases (red dots) of the seven taxa in the Brooks mock community samples and interaction biases (black dots) that each taxon (in the x-axis) caused to the other six taxa, as estimated by Zhao and Satten [6].
Our simulation studies were based on models (1) and ( 2) and data on 856 taxa of the upper-respiratory tract (URT) microbiome by Charlson et al. [15].We considered three scenarios for the distribution of interaction biases.In the first scenario (referred to as S-nondiff), we sampled jj /2 from N(0.5, 0.1 2 ) for all taxa pairs j and j .In the second scenario (referred to as S-diff-causal), we modified S-nondiff to sample jj /2 from Beta(0.5, 0.5) when taxon j was "causal" (i.e., associated with the trait of interest) and for all j .Both distributions have a mean of 0.5; however, N(0.5, 0.1 2 ) has one mode at 0.5, whereas Beta(0.5, 0.5) has two modes 0 and 1 and, hence, a variance of larger than 0.1 2 .In the third scenario (referred to as S-diff-half), we used N(0.5, 0.1 2 ) for half of the randomly selected taxa js and Beta(0.5, 0.5) for the remaining half of the taxa.Unlike S-nondiff, S-diff-causal has a modest (but trait-related) variation in jj , while S-diff-half has a large variation in jj that is not trait-related.The distribution of the bias factor due to taxon-taxon interactions, η ij = ∑ j =j θ jj π ij , in each taxon j across all samples is displayed in Figures S1 and S2 and has the following features.In all scenarios and for all taxa js, the mean of η ij was approximately zero due to the averaging of contributions with different directionalities.The variance of η ij increased as γ j increased at a given φ.S-nondiff and S-diff-causal had the same η ij values at the null taxa and different η ij values at the causal taxa.Finally, in S-diff-half, η ij had a larger variance in one half of the taxa than in the other half of the taxa.
Additional details of the simulation settings generally follow the simulations in [5].Specifically, we considered both binary and continuous traits of interest without any confounding covariates; we also considered a binary confounder when the trait was binary.We used the two sets of "causal" taxa (i.e., taxa that are associated with the trait) that were used in [5] and referred to as M1 and M2.In M1, 20 taxa were randomly sampled to be causal from the set of taxa having mean relative abundances in the URT data [15] greater than 0.005 (but excluding the most prevalent taxon).In M2, the top five most abundant taxa (having mean relative abundances of 0.105, 0.062, 0.054, 0.050, and 0.049) were selected to be causal.For simulations with a binary confounder, we assumed that the confounder was associated with 20 taxa under M1 (10 sampled at random from the causal taxa and 10 from the null taxa) and 5 taxa under M2 (2 from the causal taxa and 3 from the null taxa).We set the sample size to 100.Let T i denote the trait and C i denote the confounder for the ith sample so that X i = (T i , C i ) T .To generate a binary trait, we selected an equal number of samples with T i = 1 and T i = 0.When a binary confounder was present, we drew C i from Bernoulli(0.2) in samples with T i = 0 and from Bernoulli(0.8) in samples with T i = 1.To generate a continuous trait, we sampled T i from U[−1, 1].To simulate the read count data for the 856 taxa, we first sampled the baseline relative abundances of all taxa for each sample from Dirichlet(π, θ), in which π and θ took the estimated mean and overdispersion (0.02) from fitting the Dirichlet-Multinomial model to the URT data.We formed the true relative abundances π ij for all taxa by spiking the causal taxon j with an exp(β j ,1 T i ) fold change, spiking the confounder-associated taxon j with an exp(β j ,2 C i ) fold change, and normalizing the relative abundances, so that We formed the observed relative abundances p ij by additionally multiplying the bias factor exp(γ j + ∑ j θ jj π ij ), so that Note that β j,1 = 0 for null taxa and β j,2 = 0 for confounder-independent taxa.For simplicity, we set β j,1 = β, a common effect size of the trait, on all causal taxa.We generated the main effect bias γ j from N(0, 0.8 2 ), which gave a range between 0.2 and 5 for most (95%) fold changes (exp γ j ) caused by the main effect bias.The scheme for generating the interaction bias θ jj has been described earlier.Finally, we generated the taxon count data using the Multinomial model with the mean (p i1 , p i2 , . . ., p iJ ) and a library size sampled from an N(10,000, (10,000/3) 2 ) distribution that was left-truncated at 2000.Following [5], we applied the following compositional analysis methods: LOCOM, ANCOM, ANCOM-BC, ALDEx2, DACOMP (v1.26),WRENCH, Wilcox-alr-half, and Wilcoxalr-one.The latter two are Wilcoxon rank-sum tests of additive-log-ratio transformed count data after adding pseudocount 0.5 or 1 to all count data and using the most abundant null taxon (known in simulated data) as the reference; the p-values were adjusted for multiple testing by the Benjamini-Hochberg [16] procedure.In addition, we included two newly developed methods, fastANCOM and LinDA.LOCOM requires that the taxa present in fewer than 20% of samples are filtered out.Recall that, since LOCOM fits logistic regression models to taxa count data, zero read counts are naturally handled as possible outcome values.We applied all methods to each replicate of simulated data after using this filter.For ANCOM, ANCOM-BC, fastANCOM, and LinDA, we also applied them to data using their own filter with 10% presence as the cutoff and denoted them as ANCOM o , ANCOM-BC o , fastANCOM o , and LinDA o .The empirical FDR and sensitivity (proportion of truly causal taxa that were detected) were evaluated at the nominal level of 20% based on 1000 replicates of data.A relatively high nominal FDR level was chosen due to the small numbers of causal taxa in both M1 and M2.

Results
The empirical FDR and sensitivity of all aforementioned compositional analysis methods for detecting the "causal" taxa (i.e., taxa that are associated with the trait), for simulations with a binary trait and no confounder under scenarios M1 and M2, are displayed in Figures 2 and 3, respectively (results for simulations with confounders and continuous traits showed similar patterns and are therefore deferred to Figures S3-S6 in Supplementary Materials).Note that φ = 0 corresponds to no interaction bias at any taxa (i.e., main effect biases only), and β = 0 corresponds to no differential abundance at any taxa (i.e., the global null).In all cases, the FDR of LOCOM remained at or close to the nominal level as long as the magnitude of the interaction bias φ < 1, which is substantially larger than the empirical value of 0.612 observed in the Brooks data [6]; under the global null in particular, LOCOM always controlled the FDR regardless of the value of φ.It was only when both φ and β became unrealistically large that we observed moderate inflation in the FDR of LOCOM.The FDR inflation of LOCOM was similar in S-nondiff and S-diff-causal because the interaction biases were similarly distributed at the majority of taxa, which were null taxa; the inflation was (slightly) larger in S-diff-half because the interaction biases had the largest variability among the three scenarios at a given φ.The interaction biases caused some loss of sensitivity for LOCOM, but the drop was relatively small, and LOCOM maintained the highest sensitivity among all methods in all of our simulation scenarios.The results of the other methods showed a similar trend to LOCOM with the FDR increasing and the sensitivity decreasing as φ increased, as well as an increase in FDR inflation as β increased.
The other methods had very different performances even when there were only main effect biases (φ = 0).ANCOM-BC and fastANCOM performed the best among the other methods, controlling the FDR for the full range of β values we have considered when φ was small and having similar FDR inflation as LOCOM when both β and φ became large; however, this performance came at the cost of a substantially lower sensitivity than that of LOCOM.ANCOM had a moderately inflated FDR when β was small but performed better when β was increased; nevertheless, its sensitivity was among the lowest.ALDEx2 had an inflated FDR and poor sensitivity when β was large in M1 but not in M2.WRENCH, Wilcox-alr-half, and Wilcox-alr-one had highly inflated FDRs except in the global null case, and LinDA had a highly inflated FDR in all cases including the global null.These results were based on data after applying the filter of rare taxa as recommended by LOCOM; in general, ANCOM, ANCOM-BC, fastANCOM, and LinDA had worse FDR control when a less stringent filter was applied.Note that, unlike the LOCOM paper containing results from the older version (v1.1) of DACOMP, we used the latest version (v1.26) here, which yielded a highly inflated FDR and low sensitivity in M2 because some causal taxa were incorrectly selected to be among the reference set.FDR and sensitivity at the nominal FDR 20% (black dotted line) of LOCOM and other compositional methods for data simulated under M1 and with a binary trait.With superscript o , the filter for rare taxa that are present in less than 10% of samples as adopted by the original programs was used; without o , the more stringent filter with 20% cutoff as recommended by LOCOM was used.
The robust performance of all methods to the interaction bias is likely due to each term θ jj π ij in (1) that governs the effect of taxon j on the bias factor of taxon j being small because the relative abundance π ij is generally small.Further, even if all contributions from different taxa to the total interaction bias ∑ j θ jj π ij have the same sign, any nonzero mean interaction bias across taxa (or individuals) would be automatically included in the normalization factor α i (or the main effect bias γ j ).Thus, the interaction bias is "automatically centered" by these constraints, which may also decrease the apparent effect of the interaction bias.We used the Brooks mock community data to motivate our simulation studies, which may have two limitations.First, the Brooks samples contain at most seven taxa, all with equal true relative abundance; other mock community datasets, especially datasets that contain a large number of taxa and that mimic real microbiome data, should be used to study the interaction bias.Second, bacteria in a real microbial community may have different interactions from bacteria in a mock community.Therefore, the study of experimental biases and their impact on downstream analysis continues to be an important topic for the foreseeable future.

•
Microbiome data are subject to experimental bias, which not only takes the form of taxon-specific main effects but also taxon-taxon interactions.

•
LOCOM is robust to all main effect biases and a reasonable range of interaction biases.• With the exception of LOCOM, the currently available methods tend to have inflated FDRs even when there are only main effect biases.

Supplementary Materials:
The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/genes14091777/s1, Figure S1 Funding: This research was funded by the National Institutes of Health awards R01GM116065 (Hu) and R01GM141074 (Hu).

Figure 2 .
Figure 2.FDR and sensitivity at the nominal FDR 20% (black dotted line) of LOCOM and other compositional methods for data simulated under M1 and with a binary trait.With superscript o , the filter for rare taxa that are present in less than 10% of samples as adopted by the original programs was used; without o , the more stringent filter with 20% cutoff as recommended by LOCOM was used.

:
Distribution of the bias factor due to taxon-taxon interactions; Figure S2: Distribution of the bias factor due to taxon-taxon interactions, based on one replicate of data that were generated under M2 with exp(β) = 3 and φ = 4 and had a binary trait; Figure S3: Similar to Figure 2 except that the data were simulated under M1 and had a binary trait and a binary confounder; Figure S4: Similar to Figure 2 except that the data were simulated under M2 and had a binary trait and a binary confounder; Figure S5: Similar to Figure 2 except that the data were simulated under M1 and had a continuous trait; Figure S6: Similar to Figure 2 except that the data were simulated under M2 and had a continuous trait.Author Contributions: Conceptualization, Y.-J.H. and G.A.S.; methodology, Y.-J.H., G.A.S. and Y.H.; simulations, Y.H.; writing-original draft preparation, Y.-J.H. and G.A.S.; writing-review and editing, Y.H.; funding acquisition, Y.-J.H.All authors have read and agreed to the published version of the manuscript.