Integration of the Microbiome, Metabolome and Transcriptomics Data Identified Novel Metabolic Pathway Regulation in Colorectal Cancer

Integrative multiomics data analysis provides a unique opportunity for the mechanistic understanding of colorectal cancer (CRC) in addition to the identification of potential novel therapeutic targets. In this study, we used public omics data sets to investigate potential associations between microbiome, metabolome, bulk transcriptomics and single cell RNA sequencing datasets. We identified multiple potential interactions, for example 5-aminovalerate interacting with Adlercreutzia; cholesteryl ester interacting with bacterial genera Staphylococcus, Blautia and Roseburia. Using public single cell and bulk RNA sequencing, we identified 17 overlapping genes involved in epithelial cell pathways, with particular significance of the oxidative phosphorylation pathway and the ACAT1 gene that indirectly regulates the esterification of cholesterol. These findings demonstrate that the integration of multiomics data sets from diverse populations can help us in untangling the colorectal cancer pathogenesis as well as postulate the disease pathology mechanisms and therapeutic targets.


Introduction
Colorectal cancer (CRC) is the second most common cause of death due to cancer worldwide, with an incidence of almost two million cases in 2018 [1]. Early detection and treatment are critical factors in the course and prognosis of CRC, as the survival rate decreases with disease progression [2]. It is thought that CRC arises due to complex interactions of the transcriptome, metabolome, microbiome and immune system [3]. Assessment of these omics platforms and their associations may reveal important pathways that can be used for early cancer detection as well as therapeutic targets. Recent development of

Microbiome and Metabolome Analysis Identified Novel Interactions
In order to identify potential microbiome-metabolome interactions, we applied a Bayesian additive regression trees classification method (BART) to discriminate CRC cases from healthy controls in two published datasets; Kim et al. [8] and Clos-Garcia et al. [11]. The BART method was chosen because it is capable of both capturing nonlinear effects and estimating uncertainties associated with the estimates. Models were built with (1) microbiome only and (2) combined microbiome and metabolome data. In each case, two steps were followed. First, BART was used to select important features by local thresholding (please see Section 4 for more details). Second, having obtained the set of important features, a cross-validated BART model was built to optimise the classification accuracy. Since, within Kim et al. [8] dataset, controls (n = 102) were over-represented in relation to cases (n = 36), step 1 was repeated 10 times, each time using a different, but balanced, data resampling, in order to reduce the bias introduced by subsampling. Hypothesising that interactions that appear more frequently are potentially interacting ones, potential interaction effects were examined by counting the number of times each pair of features co-occurred in any downward path of the fitted trees. Partial dependence plots were constructed for both individual features and high frequency interacting pairs.
Both microbiome and metabolome features, within the Kim et al. [8] dataset, were selected (Supplementary Table S2). The out-of-sample (oos) misclassification error of the cross-validated BART model was 0.306. When only microbiome features were used, however, the oos error increased to 0.528, implying that the model, built on microbiome features alone, does not convey predictive power in differentiating carcinoma from control samples. Figure 1a,b shows the partial dependence of harmane and 5-aminovalerate, two of the metabolic features that were selected in all 10 repeats. We observe a decrease in the risk of cancer in relation to an increase of harmane levels, in addition, the risk is elevated as 5-aminovalerate increases. Partial dependence plots for the rest of the features, which were selected in at least half of the runs, are presented in Supplementary Figure S1. Although these features are selected by the model as significantly impacting the prediction of the outcome, no strong relationship with CRC risk was observed when considered individually, as illustrated, for example, by the partial dependence plots of ornithine, 1,2-dilinoleoyl-GPC (18:2/18:2), glycochenodeoxycholate, cholesterol and bacterium genus Eubacterium (OTU37). However, this inconsistency may be explained by interaction effects, which are not captured by partial dependence, since partial dependence, by definition, averages out the effect of all other variables.
We therefore investigated pairwise interaction effects for all selected microbiome and metabolomic features. Figure 1c displays the interaction count matrix resulting from the BART model. The most noticeable metabolite is 5-aminovalerate, which is shown to interact with many other metabolites and some bacteria, in particular, the genus Eubacterium, Adlercreutzia and SMB53. Although less prominent, 1,2-Dilinoleoyl-GPC (18:2/18:2), guanosine and indolin-2-one also were shown to be involved in multiple interactions.
These metabolites were shown to interact with 5-aminovalerate as well amongst themselves, as indicated by the interactions among all such pairs highlighted in Figure 1c. This suggests that high-order interactions may be present.
The top 2% feature pairs with the highest counts in Figure 1c were selected for partial dependence investigation. For each such pair, we estimate log-odds of cancer on a regular 2D grid of feature values of this pair, by fixing the feature values to be on the grid and averaging over all predictions from the training instances. Among all feature pairs, higher risk of CRC implies elevated level of 5-aminovalerate, however, increased 5-aminovalerate does not necessarily imply higher risk of CRC, when considering the effect of other metabolites ( Figure 2 and Supplementary Figure S2). In particular, increased harmane, guanosine, arabonate_xylonate, glycerate, N_Formylmethionine and Pyridoxamine levels or decreased 1,2-Dilinoleoyl-GPC (18:2/18:2), N_Acetylvaline and indolin-2-one levels, are shown to lower the risk of CRC, even though 5-aminovalerate is abundant. When 5-aminovalerate is not abundant, CRC risk is minimised, for example, by increasing harmane, guanosine, or by decreasing 1,2-Dilinoleoyl-GPC (18:2/18:2) levels. In addition, the absence of the genus Adlercreutzia is shown to lower the risk of CRC when 5-aminovalerate is elevated. The complete results are represented within Supplementary Table S3. is elevated. The complete results are represented within Supplementary Table S3. Although when not considering interaction effects, CRC risk increases as 5-aminovalerate levels increase (Figure 1b), the result of BART interaction analysis suggests that the risk is attenuated when controlling for other metabolites. This observation offers additional insights to the regulation of metabolomic and microbiome markers in CRC tumorigenesis, and could serve as a basis for future experimental and validation studies in the search for potential targets for therapeutic intervention.  Although when not considering interaction effects, CRC risk increases as 5-aminovalerate levels increase (Figure 1b), the result of BART interaction analysis suggests that the risk is attenuated when controlling for other metabolites. This observation offers additional insights to the regulation of metabolomic and microbiome markers in CRC tumorigenesis, and could serve as a basis for future experimental and validation studies in the search for potential targets for therapeutic intervention.
The Clos-Garcia [11] dataset analysis revealed a better performance in discriminating CRC from healthy controls, in both combined microbiome metabolome dataset (oos: 0.094) and microbiome dataset alone (oos: 0.113). Compared with Kim et al. [8], the Clos-Garcia [11] microbiome samples confer considerably higher predictive power. This noticeable difference in the classification error between the two published datasets may be partly explained by the observation that no common microbes, at the genus level, were selected by BART. The BART interaction analysis identified three microbiome-metabolome pairs and three microbial pairs that exhibit strong interactions compared to other feature pairs. All microbiome-metabolome pairs involve ChoE(20:4), which is shown to interact with the bacteria genera Staphylococcus, Blautia and Roseburia. Partial dependence of ChoE (20:4) showed that the risk of CRC increases when ChoE(20:4) is high, although the relation is nonlinear as the risk remains low as long as ChoE(20:4) level is below some threshold ( Figure 3a). Partial dependence of these feature pairs (Figure 3b), however, provides us with valuable insights into how the risk may be mitigated by controlling for specific microbes. As can be seen from the contour plots, under high values (lognormalised) of ChoE(20:4), decreasing Staphylococcus and Roseburia, or increasing Blautia, reduces the risk of CRC. Further inspection of the contour plots for microbial interactions reveals that the risk is minimised when Staphylococcus and Roseburia decrease simultaneously; and when Blautia and an unknown genus (OTU17213), from the same family as Blautia, Lachnospiraceae, increase simultaneously. Supplementary Table S4 provides a summary of these results.
able difference in the classification error between the two published datasets may be partly explained by the observation that no common microbes, at the genus level, were selected by BART. The BART interaction analysis identified three microbiome-metabolome pairs and three microbial pairs that exhibit strong interactions compared to other feature pairs. All microbiome-metabolome pairs involve ChoE(20:4), which is shown to interact with the bacteria genera Staphylococcus, Blautia and Roseburia. Partial dependence of ChoE (20:4) showed that the risk of CRC increases when ChoE(20:4) is high, although the relation is nonlinear as the risk remains low as long as ChoE(20:4) level is below some threshold ( Figure 3a). Partial dependence of these feature pairs (Figure 3b), however, provides us with valuable insights into how the risk may be mitigated by controlling for specific microbes. As can be seen from the contour plots, under high values (log-normalised) of ChoE(20:4), decreasing Staphylococcus and Roseburia, or increasing Blautia, reduces the risk of CRC. Further inspection of the contour plots for microbial interactions reveals that the risk is minimised when Staphylococcus and Roseburia decrease simultaneously; and when Blautia and an unknown genus (OTU17213), from the same family as Blautia, Lachnospiraceae, increase simultaneously. Supplementary Table S4 provides a summary of these results.

Metabolite Enrichment Analysis
The Kim et al. [8] metabolites dataset was also analysed using the IMPaLA (Integrated Molecular Pathway Level Analysis) tool. IMPaLA is a web tool, developed for integrated pathway analysis of metabolomics data alongside gene expression or protein abundance data [12,13]. It works through extending over-representation and enrichment analyses to multiple data types. Supplementary Table S5 depicts the pathways enriched with the Kim et al. [8] metabolites. The ABC transporters and Purine Nucleoside Phosphorylase pathways are the most significant pathways associated with the Kim et al. [8] metabolites, KEGG (p value: 6.9 × 10 −10 ) and SMPDB (p value: 7.65 × 10 −8 ), respectively.
Supplementary Table S6 shows the list of IMPaLA identified enriched pathways using the Clos-Garcia metabolite dataset. Amongst the most significantly enriched pathways were the ABC transporter (p value: 6.09 × 10 −9 ) and the Purine Nucleoside Phosphorylase Deficiency (p value: 7.65 × 10 −8 ) pathways. Cholesterol was found to be significantly enriched among several biochemical and metabolic pathways, primarily featuring path-ways involving vitamin digestion and absorption (p value: 0.0003), and digestion of dietary lipids (p value: 0.0006).

Bulk RNA Seq Analysis
A list of 6066 differential genes were obtained from COAD-TCGA (The Cancer Genome Atlas-Colon Adenocarcinoma) [14] study after iterative undersampling, detailed in Section 2, and the resulting p value significance of the fold change was 0.0001. Out of the 6066 genes, 2653 (43.74%) were upregulated and 3413 (56.26%) were downregulated. Figure 4a shows a volcano plot for the selected 6066 genes. The application of a Recursive Feature Elimination [15] over the 6066 differential gene expression dataset resulted in the identification of 121 unique genes. Out of these 121 genes, 76 appear more than once among the iterations. We used Enrichr [16] for the enrichment analysis for the selected 76 genes which revealed that the Glutamatergic synapse pathway was significantly associated (p value: 0.001) with the GNG3, GLS2, GNG7, GRIK1, GRIA3 and GRIN2D genes (Figure 4b). Furthermore, the alanine, aspartate and glutamate metabolism pathways were significantly associated (p value: 0.02) with GLS2, GPT and ASPA. A list of KEGG pathways can be found in Supplementary Table S1. tegrated pathway analysis of metabolomics data alongside gene expression or protein abundance data [12,13]. It works through extending over-representation and enrichment analyses to multiple data types. Supplementary Table S5 depicts the pathways enriched with the Kim et al. [8] metabolites. The ABC transporters and Purine Nucleoside Phosphorylase pathways are the most significant pathways associated with the Kim et al. [8] metabolites, KEGG (p value: 6.9 × 10 −10 ) and SMPDB (p value: 7.65 × 10 −8 ), respectively.
Supplementary Table S6 shows the list of IMPaLA identified enriched pathways using the Clos-Garcia metabolite dataset. Amongst the most significantly enriched pathways were the ABC transporter (p value: 6.09 × 10 −9 ) and the Purine Nucleoside Phosphorylase Deficiency (p value: 7.65 × 10 −8 ) pathways. Cholesterol was found to be significantly enriched among several biochemical and metabolic pathways, primarily featuring pathways involving vitamin digestion and absorption (p value: 0.0003), and digestion of dietary lipids (p value: 0.0006).

Bulk RNA Seq Analysis
A list of 6066 differential genes were obtained from COAD-TCGA (The Cancer Genome Atlas-Colon Adenocarcinoma) [14] study after iterative undersampling, detailed in Section 2, and the resulting p value significance of the fold change was 0.0001. Out of the 6066 genes, 2653 (43.74%) were upregulated and 3413 (56.26%) were downregulated. Figure 4a shows a volcano plot for the selected 6066 genes. The application of a Recursive Feature Elimination [15] over the 6066 differential gene expression dataset resulted in the identification of 121 unique genes. Out of these 121 genes, 76 appear more than once among the iterations. We used Enrichr [16] for the enrichment analysis for the selected 76 genes which revealed that the Glutamatergic synapse pathway was significantly associated (p value: 0.001) with the GNG3, GLS2, GNG7, GRIK1, GRIA3 and GRIN2D genes (Figure 4b). Furthermore, the alanine, aspartate and glutamate metabolism pathways were significantly associated (p value: 0.02) with GLS2, GPT and ASPA. A list of KEGG pathways can be found in Supplementary Table S1.

Single Cell RNA Seq Analysis
We obtained four lists of differential genes in four different cell types described by Li H et al. (2017) [17] listed in Supplementary Table S7. DAVID enrichment analysis was performed to obtain the significant pathways for these genes [18]. For the case of fibroblast
Significant downregulated pathways identified were the oxidative phosphorylation (p value 0.000347) (NDUFB1, COX6B1, COX7A2, COX7C, ATP5C1, ATP5G1, ATP5H) and the nitrogen metabolism (p value 0.000595) (CA1, CA7, CA2) pathways. The p53 signalling pathway (p value 0.00244) (ATR, MDM2, PERP, SESN3) was depicted as significantly upregulated. We further compared the 121 significant genes obtained from the bulk RNAseq analysis and compared them to the 342 genes obtained from single-cell epithelial cells. A total of 17 genes were identified to be present in both analyses (Figure 4c). All 17 genes show a similar regulatory pattern in both the bulk and single-cell RNA-seq and are represented in Table 1. A heatmap for the 17 genes reveals a differential gene expression pattern across both the bulk (Figure 5a) and single cell RNA seq datasets (Figure 5b). We then employed the Network Analyst platform [20] and constructed a tissue specific interaction network using these 17 genes (Figure 5c). In the resulting network the seed nodes are shown in red while the genes that are involved in metabolism are shown in blue.   . Red denotes upregulation, white indicates no expression or zero expression and blue depicts low expression; both heatmaps were performed using hierarchical clustering on the rows. The three clusters are coloured as red, black and green; (c) the network was generated using the 17 genes that were found to be differentially expressed in the network. The seed genes are labelled and shown in red, while the proteins connected to these genes are shown in green. The interactions between the genes and the proteins are shown in red. The blue coloured proteins or seed genes are involved in the metabolic pathways from KEGG.

Qualitative Integration
We attempt to integrate multiple features from diverse data sets based on the enrichment analysis and literature review. ABCG2 and AQP8 genes are responsible for bile secretion (AQP8 is downregulated), which is converted to cholesterol by the gut enzymes [63]. Cholesterol is further converted to coprostanol by members of the Lachnospiraceae family, including Blautia and Roseburia [64]. In Figure 6, we provide an example of the qualitative integration.

Qualitative Integration
We attempt to integrate multiple features from diverse data sets based on the enri ment analysis and literature review. ABCG2 and AQP8 genes are responsible for bile cretion (AQP8 is downregulated), which is converted to cholesterol by the gut zymes [63]. Cholesterol is further converted to coprostanol by members of the Lachnos raceae family, including Blautia and Roseburia [64]. In Figure 6, we provide an example the qualitative integration Figure 6. ABCG2 and AQP8 genes are responsible for bile secretion and bile is converted to cholesterol by the gut enzymes [63]. Cholesterol is further converted to coprostanol by members of the Lachnospiraceae family [64].

Discussion
In this study, we used multiomics data sets from different populations to understa pathophysiological pathways and identify potential therapeutic CRC targets. In an eff to understand the complex interactions between microbiome and metabolome we e ployed two datasets (Kim et al. [8] and Clos-Garcia [11]) and quantified the top rank interactions using the BART model. We observed a decrease in the predictive risk of C as the level of harmane increased with estimated credible intervals from BART. In o BART model, sex and age were used as a covariate. Kim et al. also considered age, sex, a smoking status (smoker/non-smoker) as covariates. Moreover, we were able to show t this predictive risk is elevated as serum levels of faecal 5-aminovalerate increases. In t case, 5-aminovalerate are considered as interactions as they appear in a contiguous dow ward path of the tree from the root node to a terminal node. Moreover, a partial depen ence of ChoE (20:4) showed that the risk of CRC increases when cholesteryl es Figure 6. ABCG2 and AQP8 genes are responsible for bile secretion and bile is converted to cholesterol by the gut enzymes [63]. Cholesterol is further converted to coprostanol by members of the Lachnospiraceae family [64].

Discussion
In this study, we used multiomics data sets from different populations to understand pathophysiological pathways and identify potential therapeutic CRC targets. In an effort to understand the complex interactions between microbiome and metabolome we employed two datasets (Kim et al. [8] and Clos-Garcia [11]) and quantified the top ranked interactions using the BART model. We observed a decrease in the predictive risk of CRC as the level of harmane increased with estimated credible intervals from BART. In our BART model, sex and age were used as a covariate. Kim et al. also considered age, sex, and smoking status (smoker/non-smoker) as covariates. Moreover, we were able to show that this predictive risk is elevated as serum levels of faecal 5-aminovalerate increases. In this case, 5-aminovalerate are considered as interactions as they appear in a contiguous downward path of the tree from the root node to a terminal node. Moreover, a partial dependence of ChoE (20:4) showed that the risk of CRC increases when cholesteryl ester, ChoE(20:4), is high. Importantly, a pairwise interaction analysis suggests that the risk may be mitigated by controlling for other microbial and/or metabolomic features, as shown in Supplementary  Tables S3 and S4. This offers great potential for targeted experimental pathophysiology studies which could provide further evidence for our findings. To make our analysis consistent across the two datasets of Kim et al. and Clos-Garcia et al., we removed the adenoma samples and considered only the carcinoma and healthy controls, instead of pooling adenomas and carcinomas together. Clos-Garcia et al. identified increased levels of Fusobacterium, Parvimonas and Staphylococcus, with decreased levels of Lachnospiracaea family members in CRC [11]. We also note that in Clos-Garcia et al., a logistic regression model was only fitted to 16 genera and 6 metabolites, which were identified by comparison with their previously published results. Conversely, in our case all available microbiome and metabolome features were included in BART modelling. Kim et al. identified multiple associations between bacteria and metabolites [8].
To add biological pathway information on the metabolomics and microbiome data sets, we used enrichment analysis. Interestingly, based on an analysis of the Kim et al. metabolomics dataset, we identified many biochemical pathways enriched with 5-aminovalerate. We found 5-aminovalerate to be significantly enriched within the arginine and proline metabolism (p-value: 0.00243) and D-arginine and D-ornithine metabolism (p value: 0.0541) pathways. Lysine, proline and arginine have previously shown antitumour effects on cancer cell line HCT 116 by inhibiting MMP expression and invasion [65]. Although it may be thought that their metabolism would be downregulated in cancer if a tumour has been able to develop, a systematic review evaluating the effects of arginine on colorectal cancer found decreased tumour production and crypt cell hyperproliferation during the initiation stage of carcinogenesis, but a stimulation of tumour growth during the promotion stage [66]. Evidence also indicates that the gut microbiome plays an important role in the production of 5-aminovalerate. Lower levels of 5-aminovalerate are found in germ free mice in comparison to normal mice [67,68], and high levels are thought to indicate bacterial overgrowth [69]. In addition to that, 5-aminovalerate is thought to be a metabolite of dietary proteins, particularly lysine, produced by bacteria in the gastrointestinal tract and has also been found to be increased in a haem-enriched diet, which has also been associated with a higher risk of CRC [70].
A metabolic pathway enrichment analysis also identified the short chain fatty acids (SCFAs) acetate, propionate and butyrate to be significantly associated (p < 0.05) with CRC. However, in the data set the expression label was low, so it might be due to the quantification of the metabolomics experimental process. Fibre is digested by anaerobic bacteria in the gut to produce these SCFAs [71]. SCFAs have been detected in higher levels in healthy individuals in comparison to CRC patients, and have been shown to inhibit histone deacetylase activity and interact with cell surface receptors in enterocytes affecting the epigenetic gene expression [72]. Their role is postulated to be protective against CRC, as they have been shown to arrest growth and differentiation in human colon carcinoma cells [73].
In our analysis we performed an integration of metabolomic, microbiome and transcriptome datasets from multiple different CRC cohort studies to identify potential new disease targets. Methodologically it is different from Acharjee et al. [73], and Quraishi et al. [74]. In both of these studies, omics integration was performed from the same cohorts for obesity and inflammatory bowel disease. The data integration performed here for different CRC patient cohorts is being done for the first time. However, in the case of humanised microbiome mouse models [75] the transcriptome, microbiome and metabolites were integrated.
In this analysis, we identified 17 genes common between bulk and single cell RNA sequencing datasets and these genes were used for the pathway analysis. Interestingly several genes like KIAA1199, CDH3, GUCA2B, LGALS4, CA7, NR3C2, ABCG2, AQP8, etc. were found to be implicated in CRC pathophysiology. Few genes that we have identified are involved in various metabolic processes, for example HSD11B2 (converts cortisone to cortisol), NR3C2 and HSD11B2 (Aldosterone regulated sodium reabsorption), ABCG2 and AQP8 (Bile secretion), CA7 (nitrogen metabolism). In our metabolic analysis we identified choE(20:4), also known as cholesteryl ester, as a significant metabolite. It is synthesised by the esterification of cholesterol; the enzyme responsible for it is ACAT1 [76]. Our network analysis suggests that ACAT1 and several other metabolic genes are interacting with the 17 genes we identified either directly or indirectly. Metabolites with high ranking seem to be interacting with multiple microbiome species, for example Indolin-2-one is a tryptophan metabolite, likely of gut microbiome origin [77]. Indolin-2-one compounds have been identified as an effective cancer treatment, with indolin-2-one derivatives showing anticancer effects in vitro against ovarian cancer cells [78][79][80]. Louis et al. [81] have also identified higher log blood harmane concentration in colon cancer and other cancers in comparison to those without cancer.
There are some limitations of this study. Firstly, the different cohorts of populations employed exhibit heterogeneity which limits comparability of results. Secondly, we only considered pairwise interactions. However, it is anticipated that higher order interactions may be present, as indicated by all pairwise interactions between Staphylococcus, Roseburia and ChoE(20:4) (Cholesterol Ester). Thirdly, nutrition, exercise and other lifestyle factors strongly influence the composition of the gut microbiome and metabolome. Unfortunately, these could not be considered in our analysis. Fourthly, the targets (metabolites, genes and microbiota) generated from this study were not validated in follow-up for the participants as only one stool sample was collected. Potential future studies could include the collection of multiple longitudinal stool samples for causal analysis to improve validity of results.
A small proportion of colorectal cancers, which are not included in our analysis, are directly caused by inherited gene mutations, including mutations of the MUTYH gene or mismatch repair genes [82]. Nonetheless, alterations in gene expression are seen across all patients with colorectal cancer [82]. Mutations in the APC, Kirsten-ras and p53 genes are thought to be an alternative pathway in tumour development [83]. Additionally, the transcription of other genes may be upregulated or downregulated in CRC [84], as seen in our analysis.

Materials and Methods
We used existing public datasets (Table 2) to identify potential mechanisms and interactions between microbiome, metabolome and genes for CRC. A workflow of all the methods and processes are described in Figure 7.  Figure 7. Diagrammatic representation of the qualitative integration of different methods and omics data sets to identify markers for colorectal cancer using metabolomics, transcriptomics (single-cell RNA and bulk RNA) and microbiome data sets. Gene set enrichment was used to identify associations between the genes and metabolites.
The R package "bartMachine" [86] is a Bayesian model characterised by the sum of regression trees. Consider an input x = (x1, ⋯, xp) with "p" predictors, the prediction label "y" is modelled as y = f(x) + ε, ε∼N(0, 2), where f(x) = ∑ , and Tk(x) represent regression trees. Given an input x, Tk assigns x to one of the terminal node values which is consistent with the rules used to split the predictors in x in the tree, and the prediction target is obtained by summing all Tk(x), k = 1, ⋯, m from the m trees, plus some Gaussian noise. . Diagrammatic representation of the qualitative integration of different methods and omics data sets to identify markers for colorectal cancer using metabolomics, transcriptomics (single-cell RNA and bulk RNA) and microbiome data sets. Gene set enrichment was used to identify associations between the genes and metabolites. The R package "bartMachine" [86] is a Bayesian model characterised by the sum of regression trees. Consider an input x = (x 1 , · · · , x p ) with "p" predictors, the prediction label "y" is modelled as y = f (x) + ε, ε∼N(0, 2), where f (x) = ∑ m k=1 T k (x), and T k (x) represent regression trees.
Given an input x, T k assigns x to one of the terminal node values which is consistent with the rules used to split the predictors in x in the tree, and the prediction target is obtained by summing all T k (x), k = 1, · · · , m from the m trees, plus some Gaussian noise.
Tree-based machine learning methods, such as decision tree and random forest, typically directly learn from the data the partition and the terminal node values. For the case of the BART method, these are assigned priors, and inference is drawn from the posterior distribution of the sum of trees given the data observed. One distinct feature of BART is the specification of the priors. First, as in many Bayesian machine learning methods, the prior is chosen to regularise the fit so that individual tree effect is small and no single tree dominates the prediction, a property shared by gradient boosted trees (GBM); however, unlike GBM, the number of trees m required in BART is often much smaller-studies have shown that BART performs well with no more than 200 trees across a variety of datasets [87] and in some cases the performance improvement is only marginal for using more than 50 trees [86].
Second, data-informed priors for terminal node values and observation variance are used are used to ensure that they do not deviate much from what the training data entails, a desirable property when external information is unavailable. The posterior distribution of sum-of-trees given data is simulated through the backfitting MCMC algorithm [87]. Parameter estimates and predictions can then be obtained from the posterior. Additionally, the uncertainty of the estimates can be quantified through Bayesian credible intervals. With some modification to the backfitting algorithm, the BART probit model for (binary) classification: P(Y = 1|x) = Φ(f (x)) can be used.
The variable importance was assessed by the fraction of times each variable is used to split the nodes in the posterior sum of trees, called "inclusion proportion". Variables that appear frequently in the trees, hence higher inclusion proportion, are likely to be important. Significance was assessed by local thresholding [86], that is, a variable will be selected if its inclusion proportion exceeds the 0.95 quantile of its null distribution obtained by permuting the response. The interaction effect was estimated by counting the number of times a set of variables appear together in a downward path of the tree from the root node to a terminal node [86]. When the number of trees m is large, the flexibility offered by so many trees means that inclusion of irrelevant variables is unlikely to significantly degrade the fit, and so these variables could appear frequently in the trees, resulting in spurious interactions and less effective variable selection. Therefore, by limiting the number of trees to be small and forcing the variables to compete to enter into the model, BART can be an effective method for determining variable importance and interaction effect.
Metabolic enrichment analysis was conducted using IMPaLA, a publicly available web platform, that has been developed for integrated pathway analysis of metabolomics data alongside gene expression or protein abundance data [12]. It performs an overrepresentation or enrichment analysis with user-specified lists of metabolites and genes using over 3000 preannotated pathways from 11 databases. It provides pathway overrepresentation and enrichment analysis functionality with user-specified lists of genes/proteins and/or metabolites, generally termed physical entities.
We analysed the differential expression of genes involved in colon adenocarcinoma using. The TCGA-COAD (The Cancer Genome Atlas-Colon Adenocarcinoma) mRNASeq level 3 raw count data [14] generated by the UNC version 2 analysis pipeline. The inspected 20,532 gene loci out of which 29 were hypothetical loci and 500 samples (Colon adenocarcinoma: 459 and Normal: 41). We used DESeq2 to investigate the RNAseq raw count and identify differentially expressed genes [88]. DESeq2 reduces the number of genes tested by removing the genes unlikely to be significantly differentially expressed prior to testing and performs gene-level quality control. The p-values are attained by the Wald test and are corrected for multiple testing using the Benjamini and Hochberg method. The number of genes, in the output table of DESeq analysis with the significance cutoff set to 0.1, was the same as the original (20,532). As unbalanced class distribution of labels (Colon adenocarcinoma: 459 and Normal: 41) can affect predictive performance, specifically for minority class [89], we performed under sampling of the data [90]. We chose 41 colon adenocarcinoma samples and 41 normal samples iteratively (not repeating the samples) and analysed them using DESeq2. We then combined the results by taking the union of the list of genes in the resultant table from each iteration. We subset the results iteratively by decreasing the significance cut off from 0.05 to 0.0001 (0.05, 0.01, 0.001, 0.0001), which resulted in 12,290 genes with p adjusted < 0.001. We then selected the overlapping genes across all the iterations, 6066 genes.
We used Recursive Feature Elimination (RFE) to select a small subset of genes from a broad range of gene expression data [15]. RFE fits a model, like random forests (rfFuncs), and removes the weakest feature (or features) until the specified number of features is reached. We used the RFE via CARET [91]. Due to the unbalanced classes, we performed RFE iteratively on the undersampled data as we did earlier. We used random forest to select the subset of genes from each iteration. This results in a list of 345 genes collectively from all iterations. Taking the union of the list of genes from each iteration resulted in a list of 121 genes and out of 345 genes 76 appeared more than once among 11 iterations. For the 76 genes selected, we used Enrichr for the enrichment analysis [16]. Enrichr is an integrative web-based software application that includes new gene-set libraries, an alternative approach to rank enriched terms, and various interactive visualisation approaches to display enrichment results.
The FPKM values of the significant genes from the single-cell data for four different cell types were obtained from the Li et al. [13] and Zhang et al. [19] datasets. The list of genes used in this study can be found in Supplementary Table S1.
We used the DAVID (Database for Annotation, Visualization and Integrated Discovery) online tool [18] to find the significant pathways that are enriched for the different single cells' significant genes lists.
The heatmaps were generated using gplots package heatmap2 function in R. The network analysis was done using NetworkAnalyst (v3.0) [91] where a tissue specific coexpression network was built with the degree filter of 1.0 on all the nodes. The seed genes are shown in red colour and are labelled while the proteins connected to the seed are shown in yellow. The blue colour nodes are the proteins involved in the metabolic processes.
We used enrichment analysis to associate and integrate genes and metabolites identified from transcriptomics and metabolomics datasets derived from diverse populations. Moreover, a literature-based validation was performed to gain insights to CRC associated pathways.

Conclusions
Our microbiome and metabolome analysis identified novel interactions related to 5-aminovalerate and cholesterol. Using bulk and single cell RNA sequencing, we identified 17 genes (e.g., KIAA1199, CDH3, GUCA2B, LGALS4, CA7, NR3C2, ABCG2, AQP8, etc.) including several metabolic genes (HSD11B2, NR3C2, ABCG2, CA7) associated with CRC pathophysiology. A metabolic enrichment analysis revealed a prominent of cholesterol pathways in CRC physiology. Finally, our qualitative integration approach catered the identification of the downregulation of ABCG2 and AQP8 genes, responsible for bile secretion, which directly increases the cholesterol synthesis facilitated by gut microbes.