Microbiome as Mediator of Diet on Colorectal Cancer Risk: The Role of Vitamin D, Markers of Inflammation and Adipokines

Obesity and diet are associated with colorectal cancer (CRC) risk, and microbiome could mediate this risk factor. To investigate this interaction, we performed a case–control study (34 CRC cases and 32 controls) and analyzed fecal microbiota composition using 16S rRNA metabarcoding and sub-sequential shotgun analyses of genomic bacterial DNA to evaluate the role of microbiome and diet in CRC etiology, taking into account vitamin D and other risk biomarkers. Dietary habits were evaluated using a short questionnaire. Multivariate methods for data integration and mediation analysis models were used to investigate causal relationships. CRC cases were significantly more often deficient in vitamin D than controls (p = 0.04); FokI and CYP24A1 polymorphism frequency were different between cases and controls (p = 0.03 and p = 0.02, respectively). A diet poor in fatty fish and rich in carbohydrates was found to be significantly associated with CRC risk (p = 0.011). The mediation analysis confirmed the significant role of the microbiome in mediating CRC risk—increasing levels of Bifidobacteria/Escherichia genera ratio, an indicator of “healthy” intestinal microbiome, can overcome the effect of diet on CRC risk (p = 0.03). This study suggests that microbiome mediates the diet effect on CRC risk, and that vitamin D, markers of inflammation, and adipokines are other factors to consider in order to achieve a better knowledge of the whole carcinogenic process.

ence database and taxonomy: greengenes 13.8); (iv) chimeric sequences were removed using Chimera-Slayer [8]. The OTU table was re-generated by excluding chimeric OTUs and normalized by rarefaction.
12.5 million PE reads were produced in the 16S rRNA analysis. The mean number of PE reads per sample was about 187,000. About 97.5% of the sequences were taxonomically annotated using BioMaS. In particular, 89.5% and 68.2% of the sequences were taxonomically annotated at genus and species rank, respectively. In total, 744 operational taxonomic units (OTUs) were detected using 97% similarity threshold. The total number of sequences represents ranges from 115,437 to 204,779 sequences (Median 169,084, Average 168,265). Taxonomic data were summarized at phylum, class, order, family, genus and species level, normalized by applying DESeq2 [9] and analyzed by using Linear Discriminant Analysis (LDA) effect size (LEfSe) [10]. LEfSe results were plotted using an inhouse developed R script and GraPhlAn [10]. Univariate analysis on a per dataset basis was performed using LEfSe to identify features that were statistically different among groups and estimate their effect size. All p-values were set at 0.05, two-sided, adjusting for False Discovery Rate (FDR).
We also presented results of multivariable logistic models: taxa significantly associated with CRC status, after adjustment for FDR and for significant confounding factors.
To better understand microbiota composition and its role in the CRC carcinogenesis we analyzed also microbiota species and pathways from shotgun metagenomics. Shotgun data were missing for 2 CRC patients and 5 controls. Quantitative taxonomic profiling was performed using MetaPhlAn2 [11], whereas HUMANn2 [12] was used to profile pathway and gene family abundances. The generated profiles are available through the curated MetagenomicData R package [13].

Statistical analysis to investigate independent role of microbiome and interactive factors
Overall we applied classical statistical multivariate methods measuring significant abundances, adjusted only for FDR (such as Linear Discriminant Analysis effect size [10]), and multivariate methods adjusted also for confounders, which integrate different types of dataset (such as multivariable logistic models, Data Integration Analysis for Biomarker Discovery and priority-LASSO hierarchical approaches) [14]. Furthermore, we compared results obtained with 16S and shotgun data.
Information on lifestyle risk factors, diet and serum biomarkers are summarized with descriptive characteristics and compared between cases and controls. Median and interquartile range for continuous variables and absolute and relative frequencies for categorical variables are presented for cases and controls. Two-independent samples Wilcoxon test, Chi-square (Fisher exact test for sparse data) were used to assess differences between groups. We employed linear regression models after verifying normal distribution of residuals from full models. We graphically checked normal probability plots, which are designed to specifically test for the assumption of normality, Q Q plots, which compare our data to data from a distribution with known normality, and also the histograms of residuals, which are useful to investigates whether data meets assumption.
Serum biomarkers were investigated as continuous variables and categorical variables. 25-OHD levels are subjected to great seasonal variation and vitamin D supplementation is usually prescribed taking into account the levels of 25-OHD in winter [15][16][17]. Therefore, the comparison of 25-OHD in subjects enrolled in summer with 25-OHD in subjects enrolled in winter would have introduced a bias. To avoid this bias, we decided to consider a cut-off value for 25-OHD based on the season of blood collection (Supplementary Table S1), we defined the following cut-off points for vitamin D deficiency: values below 10 ng/ml in winter, spring and initial summer (from November to June) and below 20 ng/ml in late summer and autumn [18]. The cut-off point for hs-CRP was chosen based on the median value of controls, the cut-off point for adiponectin was chosen based on first quartile among controls. The cut-off point chosen for IL-6 was based on studies showing that IL-6 was associated with bone loss and resorption [19].
To develop the World Cancer Research Fund International (WCRF) score, we took into account the dietary recommendations for cancer prevention published by WCRF/American Institute for Cancer Research [20,21]. Subjects were considered adherent if their BMI was lower than 25, they declared high physical activity and a healthy diet (high consumption of fruit and vegetable or low meat sweets, cakes and pastries consumption). We also build two dietary scores: one was a categorical variable and the other was a continuous variable obtained from the multivariable logistic model, including significant risk factors and confounders.
Non-parametric Wilcoxon-Ranks and Kruskall-Wallis tests were used to investigate differences in abundance of normalized taxa among CRC and controls as shown in the box plots.
The network inference was based on Spearman's rank Correlation Coefficient and plotted by using the graph R package [22] including all the following continuous variables: serum biomarkers, species (normalized data) that were significantly associated with CRC in the sPLS-DA analysis (adjusting for FDR), BMI and a dietary score obtained from a multivariable logistic model, with cases vs controls as endpoint, adjusted for confounders. Correlation values were computed on pairwise complete observation, by adopting the cor function of stats R package (R 3.4 version, https://www.rproject.org/) and choosing as a measure the Spearman coefficient, a non-parametric measure of rank correlation.
We used Canonical Correspondence Analysis (CCA) [23] to further explore possible clinical inflammatory markers associated with salivary microbiota community structure. CCA is an unsupervised method aimed to identify the linear combinations of two sets of variables that are maximally correlated with each other and provides an important tool to summarize the overall dependency structures between two sets of variables such as taxa and serum biomarkers. We carried out the analysis on the log transformation of taxa with the highest association with cases and we obtained a graphical representation of the first two components (triplot) for cases and controls.
We combined microbiome normalized data with serum biomarkers, BMI and diet variables by using the Data Integration Analysis for Biomarker Discovery [24]. DIABLO is a recent framework of the mixOmics R package for the integration of different dataset in supervised analysis [25]. This allowed us to discriminate between the CRC and healthy controls by a block sPLS-DA supervised model.
A first step includes the creation of a priori knowledge design matrix for determining which dataset blocks should be maximally correlated between components. A correlation of 0.2 was imposed among the above mentioned dataset blocks. Then the best number of components was tuned, by running a 10-fold cross-validation repeated 10 times DIABLO model without selecting any variables. A grid of values indicating the range (min-max) of variables, that might be selected among blocks, was provided. In particular, a range from 5 to 11 variables was adopted and the optimum number of variables, to select in each block, was chosen for the first two components. Finally, the final sparse Partial Least Square -Differential Analysis (sPLS-DA) model was fit (10-fold crossvalidation and 100 repeats) by using the tuning setting found in previous steps.
A Heatmap plot was generated to graphically selecting the most discriminative species using the first and second component loading vectors.Heatmap plot was generated by performing a sparse Partial Least Square -Differential Analysis (sPLS-DA) (10-fold cross-validation and 100 repeats) and selecting the most discriminative species using the first and second component loading vectors.

Mediation analysis
To estimate the role of microbiome as mediator of dietary factors on CRC development, we performed a mediation analysis based on a counterfactual framework approach [26,27]. We decomposed the total effect (TE) of diet on CRC into a natural direct effect (NDE) and a natural indirect effect (NIE) acting on cancer through microbiome, considering alcohol consumption and physical activity as possible confounders. The NDE of diet was estimated by comparing the effect of a high risk diet (low fat fish and high cereal/carbohydrates consumption) versus the effect of a healthy diet (high fat fish and low cereal/carbohydrates consumption) on CRC, having the microbiome set to the value it would naturally have under the non-risk condition. The NIE was estimated by considering the exposure to risk and comparing the effect of the microbiome under a high risk diet versus the effect of the microbiome under a low risk diet on CRC. In agreement with previous publications [28,29], microbiome was evaluated considering the ratios of Bifidobacteria to Escherichia genera and Firmicutes to Bacteroides phyla, which have been shown to be modified with obesity [30]. Both ratios were log-transformed, adding 1 unit, and modelled as mediators with linear regression models, whereas ORs with 95% CIs were obtained for NIE using unconditional logistic regression models adjusting for alcohol and physical activity. For comparison, we evaluated also BMI and adiponectin as mediators of diet.
The formulas used to calculate each effect are as follows.
Let Y be the binary outcome (CRC status), A the exposure (diet), M the mediator variable (microbiome) and C a set of multiple confounders (alcohol and physical activity).
The outcome Y was modelled using logistic regression as follows: The mediator M was modelled using linear regression as follows: where c is the vector of confounders.
Provided that the assumption that the outcome Y is rare holds, we derived NDE and NIE on the Odds Ratio scale as following: where is the variance of the error term in the regression model on mediator M. For the binary exposure A, the two levels being compared are a*=0 and a=1.

Prediction of CRC
Abundances were log-transformed (after adding a pseudo-count of 1 to avoid non-finite values resulting from log(0)) and priority-LASSO (hierarchical approach) [14] unconditional logistic regression model was applied as well as backward and forward selection of variables. We used priority-LASSO models with the following block structure, according to their level of priority: "Serum biomarkers" (25-OHD, adiponectin, hs-CRP etc.) and "metagenomic data". The lambda parameter, i.e. regularization strength was selected as the minimum value that maximizes the area under the Receiver Operating Characteristics curve (ROC). A clustering analysis was applied to group similar microbial communities, using Spearman correlation, in order to identify taxa representative of the variability but not closely correlated with others. The number of clusters was established adopting the R package "NbClust", choosing the "Calinski and Harabasz Index" as score (maximum value).
We then integrated the clustering results together with the variables selected by priority-LASSO models and PLS-DA in order to choose significant independent biomarkers for the multivariable logistic models. Models were then evaluated by calculating the area under the ROC curve (AUC).
Leave-one-out cross-validation of the multivariable models was also applied and the correspondent cross-validated AUC was calculated.

Subgroup analyses to evaluate associations with prognostic factors and CRC recurrence
We evaluated differences in abundance of microbiome biomarkers among controls and cases grouped considering pT (pT1-2 vs pT3-4), pN (pN+, i.e. evidence of lymph-node involvement, and pN0) and early recurrence. Non-parametric Wilcoxon rank and Kruskall-Wallis tests were used.
F. nucleatum was categorised in quartiles to verify whether high abundance was associated with time to relapse in cases, calculated from diagnosis to first cancer relapse or last follow-up. We calculated Kaplan-Meier curves for disease free survival. Log-rank tests were evaluated to investigate differences between curves. All statistical tests were two-sided. Statistical analyses were performed using the SAS statistical software (version 9.02 for Windows) and R 3.4 version.

SUPPLEMENTARY FIGURE LEGENDS
Supplementary Figure  On the left, bar-plot representing the association of Genus Parvimonas in CRC patients with pathological tumor (pT) 1-2, with pT3-4 and controls. On the right, bar-plot representing the association of Genus dialister in CRC patients without lymph-node (pN0), with lymph-node involvement (pN+) and controls.

SUPPLEMENTARY TABLES
Supplementary