MarZIC: A Marginal Mediation Model for Zero-Inflated Compositional Mediators with Applications to Microbiome Data

Background: The human microbiome can contribute to pathogeneses of many complex diseases by mediating disease-leading causal pathways. However, standard mediation analysis methods are not adequate to analyze the microbiome as a mediator due to the excessive number of zero-valued sequencing reads in the data and that the relative abundances have to sum to one. The two main challenges raised by the zero-inflated data structure are: (a) disentangling the mediation effect induced by the point mass at zero; and (b) identifying the observed zero-valued data points that are not zero (i.e., false zeros). Methods: We develop a novel marginal mediation analysis method under the potential-outcomes framework to address the issues. We also show that the marginal model can account for the compositional structure of microbiome data. Results: The mediation effect can be decomposed into two components that are inherent to the two-part nature of zero-inflated distributions. With probabilistic models to account for observing zeros, we also address the challenge with false zeros. A comprehensive simulation study and the application in a real microbiome study showcase our approach in comparison with existing approaches. Conclusions: When analyzing the zero-inflated microbiome composition as the mediators, MarZIC approach has better performance than standard causal mediation analysis approaches and existing competing approach.


Introduction
Emerging evidence suggests that the human microbiome and the immune system are constantly shaping each other [1]. The human microbiome can contribute to disease pathogeneses by mediating disease-leading causal pathways in complex diseases such as Alzheimer's disease [2] and cancer [3,4]. To study the human microbiome, 16S ribosomal RNA gene sequencing and metagenomic shotgun sequencing have been popular methods to quantify microbiome composition in microbiome studies. A challenging feature of microbiome sequencing data is that it has excessive number of zeros [5]. Many microbiome data sets have more than 50% of the sequencing reads being 0, and it could be as high as 80% or more. These zeros are likely to be a mixture of structural zeros (i.e., true zeros) that represent true absence of microbial taxa and undersampling zeros (i.e., false zeros) that result from failure of detection. The zero-inflated data feature compounded by a compositional structure poses a challenge that needs to be addressed specifically in mediation analyses. Although there have been some exciting efforts to model microbiome as a high-dimensional mediator [6][7][8], it remains a daunting task to address the zero-inflated data structure.
Mediation analysis is an important tool to investigate the role of intermediate variables (i.e., mediators) in a causal pathway where the causal effect partially or completely relies on the mediators. For example, people with higher socioeconomic status tend to have longer life expectancy, but this causal pathway may be explained by many possible mediators including access to better health care, fewer stressors, better living environment and so forth. In a mediation analysis, the indirect effect (i.e., mediation effect) through one or more mediators can be estimated and tested along with the direct effect. This technique was first popularized in psychology and social sciences and it has become a common analysis tool in many research areas such as epidemiology, environmental health sciences, medicine, randomized trials and psychiatry. There are two general types of mediation analysis approaches: potential-outcomes (PO) or counterfactual-outcomes methods [9][10][11] and traditional linear mediation analysis methods [12,13]. The latter approach can be considered as a special case of the former approach that can allow for nonlinear associations and interactions between independent variables and mediators. PO approaches are more flexible because they can allow interaction effects of the independent variable with mediators as well as nonlinear effects. Reviews of mediation analysis approaches and their assumptions can be found in the literature [14][15][16].
Although mediation modeling frameworks have been well established, to the best of our knowledge, there have been few studies to address zero-inflated compositional mediators. In a typical mediation analysis, the total effect of an independent variable can be decomposed into a mediation effect and a direct effect where the mediation effect measures the amount of the total causal effect attributable to change in the mediator caused by the independent variable and the direct effect measures the causal effect due to change in the independent variable while keeping the mediator variable constant. When the mediator has a marginal zero-inflated distribution such as a zero-inflated Beta (ZIB) distribution, we show that its mediation effect can be further decomposed into two parts with one part being the mediation effect attributable to the amount of numeric change in the mediator and the other part being the mediation effect attributable to the binary change of the mediator from zero to a non-zero state. This phenomenon can be explained by the two-part nature of a zero-inflated distribution. For example, a ZIB distribution is essentially a two-component mixture distribution [17]: one component is a degenerate distribution with probability mass of one at zero, and the other component is a Beta distribution. The mediator changing from zero to a positive value results in the discrete jump from zero to a non-zero state as well as the change in the numerical metric of the mediator and thus the mediation effect can be decomposed accordingly. Both changes have important interpretations in microbiome research. What makes it more complicated is that the observed zero-valued data points could be false zeros meaning that the true values are non-zero but observed as zero due to failure of detection. This is similar to a missing data problem and will be addressed here as well.
To fill the research gap in mediation modeling development, we propose a novel marginal mediation analysis approach under the PO framework to deal with zero-inflated compositional mediators. This approach can allow a mixture of truly zero-valued data points and false zeros. Our method is able to decompose the mediation effect into two components that are inherent to zero-inflated mediators: one component is the mediation effect attributable to the numeric change of the mediator on its continuum scale and the other component is the mediation effect attributable to the binary change of the mediator from zero to a non-zero state. So the mediation effect is actually the total mediation effect of the two components each of which can be estimated and tested. An extensive simulation study is conducted to evaluate our approach MarZIC in comparison with a standard PO mediation analysis approach [10] and another approach [6] that can analyze microbiome composition as a mediator. We introduce the model and its associated notations in Section 2. Estimation and inference procedures are provided in Section 3. A simulation study to assess the performance of our model in comparison with existing approaches is presented in Section 4, followed by an application of our model in Section 5, and a discussion in Section 6. Additional details and derivations can be found in the Appendices A-C.

Model and Notation
For simplicity, we suppress the subject index in all notations in this section. Let Y, M = (M 1 , . . . , M K+1 ) and X denote the continuous outcome variable, the compositional mediator variable and the independent variable respectively. For example, M could be the vector of relative abundances (RA) of microbial taxa. Before constructing the model for zeroinflated data, we first describe the model for the special case where the mediator M have no zeros which could happen if investigators choose to impute zeros with a Pseudocount or a small positive number. The model for zero-inflated data will be provided after that.

Model for Data without Zeros
We first consider cases where there are no zeros for the mediator M in the data which is very rare, but it could happen if zeros are replaced by a Pseudocount or a small positive number. We will move to cases with M containing zeros in the next section. As Dirichlet distributions have been widely used for modeling the RA of microbiome taxa [18][19][20][21][22][23], let M follow a (K + 1)−dimensional Dirichlet distribution indexed by its mean parameters µ 1 , . . . , µ K+1 with ∑ K+1 k=1 µ k = 1 and a dispersion parameter φ. We assume the outcome Y depends on M and X through the following regression equation: where the random error follows a normal distribution with mean of 0 and a constant variance, β k , β X and β kk are regression coefficients, and XM k is the interaction term between the independent variable X and the mediator M k . An advantage of using M k instead of log (M k ) in the model is that it does not require imputing zeros with a positive number. All taxa and their interactions with X are included and thus the compositional structure is accounted for in this model. Later, we will show that a marginal model can also account for the compositional structure. Equation (1) implies that the marginal association between Y and any taxon M j , j = 1, . . . , K + 1, has the following form (derivation can be found in the Appendix A): where E X (Y|M j ) is the mean of Y conditional on M j given X, and It is straightforward to see that the full model (1) uniquely determines the marginal association for each taxon. Therefore, without violating model (1), we can construct the following marginal regression model for the association between Y and M j and X such that it is equivalent to model (1): where the random error * has a normal distribution with mean of 0. An advantage of the above marginal model over model (1) is that it is straightforward to interpret the regression coefficient β 1 as a typical regression coefficient, whereas the corresponding regression coefficient β j in Equation (1) does not have such a straightforward interpretation. That is because there has to be at least one M k , k = j, changing when M j changes due to the compositional structure, and thus it is not possible to hold all M k 's, k = j, constant while changing M j to interpret β j as a typical regression coefficient. Another nice feature of the marginal model (3) is that the true values of its regression parameters (β 0 , β 1 , β 2 and β 3 ) are functions of the parameters µ 1 , . . . , µ K+1 of the Dirichlet distribution of M as shown in Equation (2); therefore, the marginal model accounts for the compositional structure.
It is also much more convenient to work on the marginal model (3) due to its simpler form. With that and the above advantages, we propose to use the marginal model (3) for constructing the mediation model. When the vector M has the Dirichlet distribution as previously assumed in this section, M j has a Beta distribution with mean parameter µ j and scale parameter φ. The following equation can be used to model the association between M j and X: Equations (3) and (4) together form our marginal mediation model for the scenario without zeros for M.

Model for Data with Zeros
Now we consider scenarios where the data for M contain zeros. Given the advantages of a marginal model as demonstrated in the above subsection, we will again use a marginal model for the association between Y and any taxon M j to form a mediation model. For any taxon M j , j = 1, · · · , K + 1, we construct the marginal model as follows: where 1 (·) is an indicator function indicating whether M j is 0, the random error follows a normal distribution N(0, δ), and β 1 , β 2 , β 3 , β 4 and β 5 are regression coefficients. This model is fully compatible with allowing interactions between the independent variable and mediators as the two interaction terms: X1 (M j >0) and XM j are included in Equation (5). In practice, investigators can also include only one or no interaction term depending on the hypothesis of interest. For the marginal distribution of M j , it is natural to use a zero-inflated Beta (ZIB) distribution because the marginal of a Dirichlet distribution is a Beta distribution [18,19]. Its two-part density function is given as follows: where ∆ is the probability of being 0, B(·, ·) is the Beta function and µ j and φ are the mean and dispersion parameters respectively of the Beta distribution for the non-zero part [24,25].
To model the association of the mediator M j with X, we use the following equations: ln Equations (5)- (7) together form our mediation model. The parameter α 1 in Equation (6) measures the association between X and the RA level of the mediator and γ 1 in Equation (7) measures the association between X and the binary presence of the mediator. Notice that X is a scalar here, but it is obvious that other covariates such as potential confounders can be included in the model equations.

Mechanism for Observing Zeros of the Mediator
For microbiome abundance data, observations that cannot be detected are set to be zero. Consequently, there are two types of zeros in the observed abundance data: true abundance of zero (i.e., absence) and abundance that is reported as zero as a consequence of the measurement failure. Let M * j denote the observed value of M j . When the observed value is positive (i.e., M * j > 0), we assume that M * j = M j . But when M * j = 0, we don't know whether M j is truly zero or M j is positive but observed as zero. We consider the following mechanism for the probability of observing a zero of the microbial taxon abundance: where L is the library size (i.e., sequencing depth) and the product M j L can be interpreted as the sample absolute abundance (SAA) of the jth taxon in a sample. Under this mechanism, all SAA below 1 have an observed value of zero. Here 1 can be considered as the Limit of Detection (LOD). We refer to this mechanism as "LOD mechanism" hereafter. Since SAA depends on both L and M j , the LOD mechanism is not deterministic conditional on the library size. The probability of observing a zero conditional on L, the library size, is equal

Marginal Mediation Effect and Direct Effect
Under the potential-outcomes (PO) framework [15], we can define the natural indirect effect (NIE), natural direct effects (NDE) and controlled direct effect (CDE) where NIE is the mediation effect. We refer to NIE as the marginal mediation effect because the proposed mediation models are based on marginal models as shown in Section 2. The total effect of X is equal to the summation of NIE and NDE. For any j, j = 1, . . . , The average NIE, NDE and CDE for X changing from x 1 to x 2 are defined as: is a counterfactual outcome. By plugging the Equations (5)-(7) into the above definitions and using Riemann-Stieljes integration [26], we can obtain the following formulas: and dF M j (x) (m) denotes the stieltjes integration [26] with respect to F M j (x) (m). So NIE, NIE 1 , NIE 2 , NDE and CDE can be estimated by plugging the parameter estimates into the formulas. Confidence intervals (CI) are obtained using the multivariate delta method as outlined in the Appendix B. An alternative approach for finding standard errors to construct CI is bootstrapping [27]. NIE 1 can be interpreted as the marginal mediation effect due to the change of the mediator on its numeric scale and NIE 2 can be interpreted as the marginal mediation effect due to the discrete binary change of the mediator from zero to a non-zero status. This decomposition can be also seen in Figure 1 where there are two possible indirect causal pathways from X to Y through the mediator M j .

Sequential Ignorability Assumption
Mediation analyses require assumptions to make causal inference and there have been different forms of assumptions proposed in the liteature [9,[28][29][30][31][32]. The key of the assumptions is to identify the terms involving counterfactual outcomes so that they can be estimated with the observed data. One of the popular assumptions is the sequential ignorability assumption proposed in [28]. In the definition of NIE and NDE, the variable Y x 2 M j (x 1 ) is a counterfactual outcome because M j (x 1 ) can not be observed when X takes the value of x 2 . The sequential ignorability assumption [28] for identifying E Y x 2 M j (x 1 ) can be written as follows in our setting: where x and x are any values in the support of X, m is any value in the support of M j , and Z is a vector of confounders (if any). The first assumption in Equation (9) says the outcome Y at any given value of the vector (X, M j ) and the mediator M j at any given value of X should all be dependent of X conditional on confounders in Z. A randomized trial where X is the random assignment typically makes this assumption automatically satisfied. The second assumption in Equation (10) says the outcome Y at any given value of the vector (X, M j ) is independent of the mediator M j at X = x conditional on X = x and confounders in Z. The second assumption is essentially saying that the mediator M j is effectively randomly assigned given X and Z. A straightforward interpretation for the first assumption is that there are no unmeasured confounders for the X − M j association and the X − Y association. A straightforward interpretation for the second assumption is that there are no unmeasured confounders for the M j − Y association. In our setting, the indicator variable 1 (M j >0) is also considered as a mediator. Because it is completely determined by M j , the above assumptions are enough to ensure the identifiability of E(Y x 2 M j (x 1 ) ) such that it can be estimated by the observed data.

Parameter Estimation
Maximum likelihood estimation (MLE) will be used to estimate the parameters. The data that is needed to estimate the marginal mediation effects for the jth taxon is The estimation challenge is that M j is not always observable due to false zeros. The log-likelihood contribution from those subjects with false zeros cannot be directly calculated. However, given that we know the probability of observing a zero in Equation (8), we can still obtain their log-likelihood contributions by integrating the joint density function over all possible values of M j using Riemann-Stieltjes integration [26]. Let (y i , r i , m * ij , l i , x i ) denote the observed data values of (Y, R, M * j , L, X) for the ith subject in a study and m ij denote the true value of the mediator M j for the ith subject. We use i for subject index hereafter throughout the paper. The subjects can be divided into two groups by whether m * ij is non-zero and we derive the log-likelihood contribution for each group. The first group consists of subjects whose observed value of the mediator is non-zero (i.e., m * ij > 0). Based on the assumptions in the Equations (5)- (7) where is assumed to have a normal distribution, the log-likelihood contribution from the ith subject (if it is in group 1) can be calculated as: are the (conditional) density (or probability mass function) for Y, R and M j respectively, ∆ i = expit(γ 0 + γ 1 x i ) and µ i = expit(α 0 + α 1 x i ). Let F(m|x) denote the (conditional) cumulative distribution function for M j . The second group consists of subjects with m * ij = 0. The log-likelihood contribution from the ith subject (if it is in group 2) can be calculated as: Taken together, we have the complete log-likelihood function given by: The MLE of the parameters can be obtained by maximizing the above complete loglikelihood function. With the parameter estimates and the observed Fisher information matrix, we will be able to calculate NIE, NIE 1 , NIE 2 , NDE and CDE and their CI's.

Simulation
Extensive simulations were carried out to demonstrate the performance of our approach MarZIC in comparison with two existing approaches under two settings. In setting 1 where the mediator was generated by univariate ZIB distributions which is univariate version of Dirichlet distributions, we compared MarZIC with a current standard practice in causal mediation analyses developed by Imai, Keele and Tingley [10] (IKT approach hereafter) which is a PO approach and can be implemented in R using the package "mediation" [33]. The Marginal Structural Models [9] is also a standard PO approach with a very similar definition of indirect effect. These causal mediation analysis approaches were not developed to analyze microbiome data, and thus could have poor performance when applied to microbiome data. In setting 2 where the mediator was generated by multivariate zero-inflated Dirichlet-Multinomial distributions, MarZIC was compared with IKT and CCMM [6] which was developed specifically to model microbiome composition as a mediator.
In all simulation settings, the independent variable X was binary and generated using the Bernoulli distribution Ber(0.5) such that the number of subjects was balanced between the two groups. To mimic the real study data, the library size was generated by randomly picking the library size with replacement from the real study data in Section 5 where the library size ranges from 31,607 to 911,652. The RA data was generated in a way such that it mimicked the distribution of RA in the real data. Multivariate delta method was used to derive confidence intervals in all settings.

Simulation Setting 1: Univariate ZIB Distribution
In this setting, the outcome Y was assumed to be a continuous variable and generated using Equation (5) where β 5 is set to be 0 and other true parameter values can be found in Table 1. Similar to simulation studies in the literature [18,19] where RA were generated individually, we generated individual taxon RA with ZIB distributions (i.e., univariate version of Dirichlet distributions) based on Equations (6) and (7). The LOD mechanism in Equation (8) for observing zero-valued data points of the mediator was used to generate false zeros for the mediator M j . Two scenarios were considered for the taxon RA: low RA (Scenario 1: mean of positive RA is equal to 0.0025) and high RA (Scenario 2: mean of positive RA is equal to 0.5). We generated 100 random data sets for each scenario and the sample size was 200 for each data set. About 20% of all sequencing reads were generated as true zeros (i.e., structured zeros) in both scenarios. Under the LOD mechanism in Equation (8), about 30% sequencing reads were false zeros in Scenario 1 and there were no false zeros in Scenario 2 because the RA in Scenario 2 was high and thus SAA were greater than 1 for all truly non-zero RA. Model performance was evaluated by estimation bias, standard error, coverage probability (CP) of 95% CI of the estimators for parameters and the mediation effects in this comparison. For Scenario 1, the simulation results (Table 1) showed good performance for MarZIC in terms of bias and CP of the mediation effects and the parameter estimates. All the biases were small and the CP were around the desired level of 95%. The IKT approach, however, had a poor performance with a large bias (84.81%) and a small CP (9%). These poor performances were likely due to the false zeros not being appropriately accounted for by the IKT approach. Another disadvantage of IKT is that it cannot decompose the mediation effect into NIE 1 and NIE 2 . For Scenario 2 with high RA where there were no false zeros, MarZIC showed good performance again in terms of the performance measures. IKT also showed satisfactory performance for the estimation of the NIE because there were no false zeros in the data under this scenario, but IKT cannot decompose the mediation effect according to the zero-inflated distribution of mediator.

Simulation Setting 2: Multivariate Zero-Inflated Dirichlet-Multinomial Distribution
The subject index i is suppressed in this subsection for simplicity. The microbiome data was generated using a zero-inflated Dirichlet-multinomial model that can account for variability from both the Dirichlet distribution and the multinomial distribution. The microbiome data generation process can be found in Appendix C. As shown in Table 2, six different scenarios were considered, of which some had the number of taxa smaller than the sample size and the others had the number of taxa larger than the sample size. We generated 100 random data sets for each scenario and the sample size was 200 for each data set. The outcome Y was generated using the following equation: where M 1 and M 2 denote the RA of the first taxon and the second taxon respectively, (β 0 , β 11 , β 12 , β 2 , β 3 , β 4 , β 5 ) = (1, 80, 2, 3, 1, 1, 1) and follows the standard normal distribution.
Notice that the data generation models are different from the analysis models in a few aspects: (a). The data generation model (12) involves both M 1 and M 2 which is different than the marginal model (5) where only one M j is in the model; (b). The relationships between X and M 1 and M 2 in the data generation in Appendix C are different from the data analysis model (6); (c). The zero mechanism for generating false zeros in the data generation as outlined in Appendix C is also different from the proposed mechanism in Section 2.3. Thus, to some extent, this simulation also demonstrated the robustness of our approach with respect to mis-specification of the model and the zero mechanism. Under the data generation model (12), Y has marginal associations with all taxa, but only the first two taxa marginally mediate the effect of X on Y because only their marginal mean values depend on X conditional on their presence according to the data generation in Appendix C. The indicator variable for the second taxon 1 (M 2 >0) also has a mediation effect because it has an impact on Y as shown in Equation (12) and the probability of presence of the second taxon depends on X. In summary, NIE 1 should be significant for M 1 and M 2 , and NIE 2 should be significant for M 2 in the analysis results of this simulation. This setting also mimicked the real study case where there were only two OTU's with significant NIE 1 .
Three indices were used to evaluate the model performance: Recall, Precision and F1 which were calculated as follows: where TP, FP, TN and FN denote true positive, false positive, true negative and false negative respectively. Recall is a measure of statistical power, the higher the better. Precision has an inverse relationship with false discovery rate (FDR) which is equal to (1-Precision), and thus the higher the Precision, the lower the FDR. When FP = 0, Precision was set to be 1 regardless of whether TP = 0. F1 is the Harmonic mean [34] of Recall and Precision that measures the overall performance in terms of Recall and Precision. In the data analysis step of the simulation, MarZIC analyzed each taxon as a mediator one by one whereas CCMM employed 1 regularization to handle high dimensionality. Multiple testing was adjusted using the Benjamini-Hochberg Procedure [35] such that the targeted FDR is 20% for all approaches in this comparison which means that the targeted Precision should be around 80%.
The simulation results (See Table 2) showed that MarZIC had a very good overall performance for identifying NIE 1 and NIE 2 in terms of Recall (>77.5%), Precision (>87.2%) and F1 (>87.3%). MarZIC achieved the targeted Precision of 80% across all cases. CCMM had good performance in terms of Recall, but its Precision rates (38.8-52.4%) were much lower than the targeted Precision rate (80%) which resulted in low F1 values (55.3-66.1%). This suboptimal performance is likely due to: (a). CCMM was proposed to model the RA on log-scale whereas Equation (12) is on the original scale of RA, (b). CCMM was not developed to incorporate the mediation effect of the binary variable 1 (M 1 >0) , and (c). CCMM could not handle interactions between the independent variable and mediators such as X1 (M 1 >0) in model (12). And CCMM could not generate any results for those scenarios with the number of taxa greater than or equal to 300 (See Table 2) due to computational issues whereas MarZIC can handle all cases very well. This is likely because CCMM is too computationally demanding for its 1 regularization algorithm which is not computationally capable of handling such high dimensionality. IKT had good Precision rates (>99.7%), but low recall rates (23.5-58.0%) compared to MarZIC, and thus also low F1 values.
In addition, we also considered cases with 5 taxa having significant NIE 1 and one taxon having significant NIE 2 and cases with 10 taxa having significant NIE 1 and one taxon having significant NIE 2 . The simulation results (See Table 3) also showed that MarZIC outperformed the other approaches. It had good recall rates for NIE 1 (>85.3%) and NIE 2 (>93%), and also achieved the target precision rate (80%) for both NIE 1 and NIE 2 except that it was 77.10%, slightly lower than 80%, for the case with 300 taxa of which 10 taxa had significant NIE 1 . Its F1 values were also good for both NIE 1 (>79.6%) and NIE 2 (>86.6%). CCMM had fair recall (>66.0%), but much lower precision rate (19.0-66.2%) and therefore low F1 values (31.2-43.9%). IKT, on the other hand, achieved target precision rate for all cases (>99.1%), but low recall rate (29.3-66.2%), and thus low F1 values (44.3-78.2%). Per the suggestion of a referee, we also did a simulation study with only 5 taxa (i.e., K = 4) in the data. The sample size was still 200 and the mean RA of the five taxa were approximately 0.196, 0.250, 0.220, 0.146 and 0.188 respectively. There were no false zeros because the five RA were large. The first two taxa had non-zero NIE 1 and the second taxon had non-zero NIE 2 . The simulation results from 100 random data sets showed good performance for both NIE 1 (Recall = 0.95, Precision = 0.96 and F1 = 0.94) and NIE 2 (Recall = 1, Precision = 0.97 and F1 = 0.98).
Orally administered VSL#3 has shown success in ameliorating symptoms and reducing inflammation in human pouchitis [36] and ulcerative colitis [37]. Preventive VSL#3 administration can also attenuate colitis in Il10−/− mice [38] and ileitis in SAMP1/YitFc mice [39]. When used as a preventative strategy, it has the potential capability to prevent inflammation and carcinogenesis. In a mouse model, Arthur et al. [40] studied the ability of a probiotic cocktail VSL#3 to alter the colonic microbiota and decrease inflammationassociated colorectal cancer when administered as interventional therapy after the onset of inflammation. The study duration was 24 weeks. In this study, there were 24 mice of which 10 were treated with VSL#3 and 14 served as control. Gut microbiome data were collected from stools at the end of the study with 16S rRNA sequencing. We obtained sequence data from Arthur et al. [40] and generated open reference OTUs using the Quantitative Insights into Microbial Ecology (QIIME) [41] version 1.9.1 at 97% similarity level using the Greengenes 97% reference dataset (release 13_8). Chimeric sequences were detected and removed using QIIME. OTUs that had 0.005% of the total number of sequences were excluded according to Bokulich and colleagues [42]. Taxonomic assignment was done using the RDP (ribosomal database project) classifier [43] through QIIME with confidence set to 50%. There were 362 OTUs in total in the data sets after quality control and data cleaning. 40% of the OTU RA data points were zero.
The relative abundance (RA) of each OTU was analyzed as a mediator variable using a ZIB distribution. The outcome variable in our analysis was dysplasia score (the higher the worse) which is a ordinal categorical variable measuring the abnormality of cell growth and it is treated as a continuous variable in the analysis because of its ordinal nature and its roughly bell-shaped density curve. The treatment variable is coded as 1/0 indicating VSL#3/control. Again, the FDR approach was used for adjusting for multiple testing such that the targeted FDR is 20% and the 95% CI were calculated before adjustment. NIE 1 of two OTUs were found to be statistically significant. The first OTU was assigned to the family S24-7 under order Bacteroidales and the second one was assigned to class Bacilli. The estimates of NIE 1 were 0.27 (95% CI: 0.1, 0.42) and −1.28 (95% CI: −2.06, −0.49) respectively. The interpretation for the mediation effects are that the treatment had a marginal positive effect of 0.27 on the dysplasia score through changing the RA of the first OTU and it also had a marginal negative effect of −1.28 on the dysplasia score through changing the RA of the second OTU. The family S24-7 and class Bacilli found by our approach have also been reported to be related with colorectal cancer in the literature [44,45]. To give a full picture of the mediation effects in this data set, a heatmap based on p-values was constructed (see Figure 2) to illustrate the NIE 1 of all OTUs. CCMM and IKT did not find any significant mediation effects of the OTUs.

Discussion
We developed an innovative marginal mediation modeling approach under the PO framework to analyze zero-inflated compositional mediators such as microbiome. We showed that the marginal mediation effect for zero-inflated mediators can be decomposed into two components of which the first is due to the change in the mediator over its positive domain and the second is due to the discrete binary change from zero to a non-zero status. These two components have different interpretations and are equally important for investigating causal mechanisms. The marginal model approach can also account for the compositional structure. When the point mass at zero (i.e., ∆) is equal to zero for the mediator (i.e., the distribution is not zero-inflated), the model reduces to a marginal mediation model for data without zeros as described in Section 2.1. Therefore, this approach can be also used for data sets after zero-valued data points are imputed with a positive number such as a Pseudocount (or after other normalization techniques are applied). R scripts for implementing the method are available upon request. This paper considered X as a univariate variable and did not include covariates as potential confounders in the models. It is straightforward to adjust for a set of covariates using our approach. Let C denote a vector of covariates or potential confounders. Then the NIE and NDE can be calculated at a specific value, c, of C as NIE The value of c can be taken as the mean value of the covariates similar to how least squares mean is calculated in regression models [46]. CI can be obtained using the delta method or resampling methods. Decomposition of NIE follows the same procedure as shown in Section 2.4.
Misspecification of the mechanisms for observing zero-valued data points could have an impact on the model performance. This is similar to missing data issues where partial information is available on the missing data. It can be considered as missing not at random (MNAR) [47] because the probability of a data point being observed as zero depends on its true value. Besides the LOD mechanism in Equation (8), another possible mechanism could be P(M * j = 0|M j , L) = exp(−ηM j L) where η > 0. Model selection approaches such BIC or AIC can be used to choose the optimal mechanism among different mechanisms. Although these mechanisms may not be perfect to account for MNAR, it can, to a large extent, alleviate the burden of not accounting for false zeros in the data at all. A future project has been planned to study the robustness of our model with respect to the mechanism for observing zeros using sensitivity analysis techniques.

Data Availability Statement:
The dataset analyzed in this paper is available upon request.
Next we need to derive the expression for E X M k M j for all k = 1, . . . , K + 1 in the above equation. It is trivial to see that E X M j M j = M j . Let M −j denote the vector containing all but M j and thus M −j = (M 1 , . . . , M j−1 , M j+1 , . . . , M K+1 ) T . Since M has a Dirichlet distribution, the subcomposition . Let z 0.025 denotes the 97.5th percentile of the standard normal distribution and the 95% CI of NIE 1 can calculated as f 1 (ζ) − z 0.025 var( f 1 (ζ)), f 1 (ζ) + z 0.025 var( f 1 (ζ)) . The 95% CI for NIE 2 , NDE and CDE can be calculated similarly.

Appendix C. Microbiome Data Generation Process for Simulation Setting 2
Let K + 1 be the number of taxa, for the ith subject, the microbiome data generation steps are listed below: Step 1: Generate true zeros for all taxa. The zeros for a taxon were generated using a Bernoulli distribution, Ber(∆), with ∆ given in Equation (7). So the probability of the taxon being 0 is equal to ∆. For taxon 1, we set ∆ = 0 so that there's no zero in taxon 1. For taxon 2, we set γ 0 = 1 and γ 1 = −3 in Equation (7) for ∆. For all the other taxa, γ 0 were generated from U(1, 2) and γ 1 = 0 in Equation (7) for ∆. So only the absence (or presence) of taxon 2 was associated with X. The total percentage of zeros was between 68.8% and 81.6% with K + 1 ranging from 10 to 500, which indicates high data sparsity.
Step 2: Generate RA for the non-zero taxa from a Dirichlet distribution. Assume we had P non-zero taxa (from Step 1) indexed by (t 1 , t 2 , · · · , t P ) in the ascending order meaning t 1 < · · · < t P . Here t 1 = 1 since the first taxon does not have any zeros from Step 1. The RA of those non-zero taxa was generated by the P-dimensional Dirichlet distribution with the dispersion parameter φ and mean parametesr (µ t 1 , · · · , µ t P ) that satisfies ∑ P p=1 µ t p = 1.
The dispersion parameter φ was set to be 50 to mimic the over-dispersion in real data. The values of mean parameters were chosen in a way such that it has smaller values for taxa with larger ∆'s in Step 1 so that taxa with lower abundance are more likely to have zeros. More specifically, the mean parameters were determined as follows: If t 2 = 2 (i.e., taxa 2 is one of the non-zero taxa): × exp(a 0 + a 1 X) 1 + exp(a 0 + a 1 X) , , p ∈ {3, . . . , P}, where a 0 = −2, a 1 = 5, α 1 0 was set to be the value such that the false zeros for taxon 2 that will be generated in next step will be around 20%, and α k 0 , k ∈ {2, . . . , K + 1} were generated from U(0, 1) If t 2 > 2 (i.e., taxa 2 is not one of the non-zero taxa): , p ∈ {2, . . . , P}.
Under this data generation, only the RA of taxa 1 and taxa 2 (conditional on presence) were dependent on X. The RA of taxa 3 to taxa K were independent of X.
Step 3: Generate sample absolute abundance (SAA) and false zeros from a multinomial distribution. Let (R t 1 , · · · , R t P ) denote the RA generated in Step 2 for the P non-zero taxa and thus ∑ P p=1 R t p = 1. The P-dimensional multinomial distribution with the parameter vector (R t 1 , · · · , R t P ) and the library size (randomly selected from real data) was used to generate SAA for all the P non-zero taxa. Those taxa with SAA = 0 generated from the multinomial distribution are false zeros.
Step 4: Getting final RA for all non-zero taxa. After SAA were generated for all nonzero taxa in Step 3, the SAA were divided by the library size to get the final RA for all non-zero taxa.
Step 5: Repeat the above Steps 1-4 for each subject to get a full data set of microbiome data for 200 subjects.