Bovine Milk Fat Intervention in Early Life and Its Impact on Microbiota, Metabolites and Clinical Phenotype: A Multi-Omics Stacked Regularization Approach

The integration and analysis of multi-omics modalities is an important challenge in bioinformatics and data science in general. A standard approach is to conduct a series of univariate tests to determine the significance for each parameter, but this underestimates the connected nature of biological data and thus increases the number of false-negative errors. To mitigate this issue and to understand how different omics’ data domains are jointly affected, we used the Stacked Regularization model with Bayesian optimization over its full parameter space. We applied this approach to a multi-omics data set consisting of microbiota, metabolites and clinical data from two recent clinical studies aimed at detecting the impact of replacing part of the vegetable fat in infant formula with bovine milk fat on healthy term infants. We demonstrate how our model achieves a high discriminative performance, show the advantages of univariate testing and discuss the detected outcome in its biological context.


Introduction
Integrating data from different data sources such as genomics, metabolomics, proteomics, images and/or clinical features is an active area of research. Traditionally, a paired t-test or a non-parametric counterpart is employed to determine significant differences and to detect relevant biomarkers. However, such approaches overlook the multivariate interaction between different biomarkers. This might lead to a suboptimal detection of biomarkers and an increased number of false-negative errors. Similarly, the use of multivariate but linear techniques such as the popular mixOmics [1] do not take into account the full complexity of biological data. Recently, several multi-omics methods have been developed, which can be categorized into three types: concatenation-based, where the data are concatenated for analysis; model-based, where separate models are built for each data set, and a final model is built from the bottom ones; and transformation-based models, where each data set is transformed into a particular format and then joined with the rest to build the model [2,3]. The model-based stacking generalization method is an attractive solution due its simplicity and high empirical performance [4,5]. However, it has two drawbacks: the models' hyperparameters are usually optimized separately, missing the global optimum, and each model only has access to its own domain and thus misses out on the information conveyed by other potentially correlated features/biomarkers. Here, we use a particular model-based strategy, Manifold Mixing for Stacked Regularization (MMSR) [6], that addresses both issues. This method is based on the manifold assumption, the idea that complex high-dimensional data lie on or close to a lower-dimensional manifold. Each data set manifold is then used to mix information across the various modalities, and the transformed data are then fed to a stacked model whose parameters are jointly optimized via Bayesian optimization. We extend the MMSR model for highly curved manifolds by partitioning the space into flat and curved regions while creating region-specific maps between the different manifolds. We then demonstrate its performance on the combined data of two intervention trials. The intervention studies examined the effect of infant formula (IF) containing bovine milk fat on microbiota, metabolomics and clinical parameters in healthy term infants. In both studies, we aim to understand the impact of different IFs with varying sn-2 palmitate content on various physiological and clinical parameters. Specifically, we compare an IF with a mixed-fat blend of 50% vegetable oils and 50% bovine milk fat (MF), containing 39% of sn-2-palmitate, with a standard formula with the same total fat content but originating from a 100% vegetable fat blend (VF), containing 10.1% of sn-2-palmitate. To our knowledge, this is the first IF intervention study analyzed using a multi-omics, multivariate model.
In order to understand the impact of different IFs on the overall biomarker profile, we investigate the model feature importance. Due to the model complexity, we take a model-agnostic approach related to permutation importance (PI). However, since PI is biased in the presence of highly correlated data, we use the pairwise permutation algorithm (PPA) [7] and discuss the multi-omics profile differences that the model exploits to discriminate between the two groups.

•
Novel application of the Manifold Mixing for Stacked Regularization framework; • Extension of MMSR by using curvature-dependent domain partition and non-linear inter-manifold maps; • Novel application of PPA; • Exploration of the effect of milk-fat-containing formula on infant gut parameters using a multi-omics multivariate model.

Data set
The data set consisted of three sub-sets: microbiota, metabolites and clinical parameters. These data were obtained from 2 different clinical trials [8,9] including a total of 33 infants. In both studies, the infants consumed standard IF with 100% vegetable fat (VF) or IF with 50% milk fat (MF) for two weeks in a cross-over design. In this study, the intervention period was 2 weeks. At the end of each two-week intervention period, fecal samples were collected. The fecal samples were analyzed for microbiota composition and fecal metabolites specifically for the current model. Clinical parameters included anthropometric measurements, amount of formula consumed and data from questionnaires on stool characteristics and gastrointestinal symptoms as well as fecal biochemical profiles. This information was also collected at the end of each intervention period and has been reported before [8,9]. Both studies were approved by the Harokopio University's ethics committee (Athens, Greece). The first trial [8] was conducted in Athens and Larissa (Greece) between December 2017 and July 2018. The second study [9] was conducted between May 2019 and November 2019 in Athens, Greece. The first and second trials were conducted in agreement with the International Conference on Harmonisation guidelines on Good Clinical Practice and registered at the Dutch Trial Register (trialregister.nl) as Trial NL6702 and Trial NL7815, respectively.

Metabolomics
Lyophilized fecal samples were analyzed for metabolite composition by Biocrates (Innsbruck, Austria) for the first and by Fraunhofer Institute (Hannover, Germany) for the second trial. To extract the metabolites, samples were supplemented with 10 volumes of extraction buffer (85% ethanol in phosphate buffer) and vortexed thoroughly until dissolution. The homogenized samples were placed in a chilled ultrasonic bath for 5 min and centrifuged, and the resulting supernatants were used for the quantification of metabolites with an MxP Quant 500 kit (Biocrates, Innsbruck, Austria). Lipids and hexoses were measured by flow injection analysis-tandem mass spectrometry (FIA-MS/MS) using a Xevo ® TQ-S instrument (Waters, Milford, MA, USA) with an electrospray ionization (ESI) source, and small molecules were measured by ultra-performance liquid chromatography-tandem mass spectrometry (UPLC-MS/MS) using the same Xevo ® TQ-S instrument. Data were quantified using TargetLynx mass spectrometry software (Waters). In the model, only the data of the small molecules were included.

Microbiota Composition
DNA was isolated from the lyophilized fecal materials of both studies using a QIAmp fast FNA stool mini kit (Qiagen, Venlo, The Netherlands) and quantified using Quantit assays (Thermo Fisher Scientific, Waltham, MA, USA). Per sample, 50 ng of DNA was used to generate dual-indexed sequencing libraries using the DNA Flex method (Illumina) with 5 cycles of PCR amplification. The resulting libraries were sequenced on an Illumina HiSeq2500 sequencer (Illumina). Paired-end reads of 300 base-pairs in length were generated using a modified sequencing protocol. Per sample, between 10 and 12 million clusters were generated. Basecalling was performed using BCL2FASTQ2 software (Illumina). Raw reads were checked and quality-filtered using fastp (v.0.20.0) [10]. Here, the adapter was detected and removed; a total of 5 bp in front for read1 was trimmed, and sliding-window quality trimming was applied (with a window width of 4 bp and threshold Q-score of 15). After trimming and adapter removal, reads shorter than 70 bp were removed. Paired-end reads that passed quality filtering were then mapped against the human genome (hg19) using Bowtie 2 (v.2.4.1) [11]. SAMtools (v.1.9) [12], sambamba (v.0.7.1) [13] and BEDtools (v.2.27.1) [14] were used to remove reads that were mapped to the human genome. The remaining high-quality, non-human reads were subsampled to 20 million paired-end reads per sample using seqtk (v.1.3r106) and fed to a humann3 pipeline (v.3.0.0.alpha.3) [15]. For each sample, the species-level microbial composition was inferred using MetaPhlAn3 (v.3.0.2) [15].

Software
All our models were implemented in Python 3.7, NumPy (version 1.21. 5), and Pandas (version 0.24.2) were used to prepare and manipulate the data set. For the base models, we used the tree-based XGBoost model from the XGBoost package version 1.6.1.

Stacked Regularization
We used a stacking framework [6], where each domain data set is fed to a different model, and the outputs of these models are subsequently fed into one or multiple layers of models that learn how to optimally combine the base-level predictions (see Figure 1). The goal is to compute p y|x 1 , . . . , x M , where x i are the coordinates of an instance from X in domain X i . The input is passed to a first layer of W 0 predictors g 0 1 (x), . . . , g 0 W 0 (x), with: where θ 0 i are the hyperparameters of the ith model. We pass one data source per model, g 0 i (x) = p y|x i , θ 0 i , so that the width of the first layer, W 0 , is equal to the number of domains, M. The output from this layer is then passed to one or more layers of W k models g k 1 , . . . , g k W k , which blend the outputs of the previous ones: where L is the total number of blending layers and θ k i the hyperparameters of ith model from the kth layer. The last blending layer is then passed to a final model f that produces the

Manifold Mixing for Stacked Regularization
The stacking approach described in the previous section only shares information across different domains via the predictions produced by the base models. Instead, we would like the base models to have access to the other domains' information, since there are likely cross-domain interactions. To achieve this, we first assume that each domain's data lie in a lower-dimensional manifold and create a map between each pair of domains X s → X t . Next, we use these maps to project the points from the original to the target manifolds to deform the local geometry so that the two become more similar. Consider a set of points S and two mappings taking the points in S to two coordinate systems of domains X t and X s , ϕ : S → R |t| , ψ : S → R |s| ; suppose subsets X t , X s of data set X are measured in these coordinate systems. Let us introduce an approximation to the mapping, ϕ • ψ −1 : R |s| → R |t| , from the coordinates of domain X s to the coordinates of domain X t : L t s = X t X s (X s X s ) −1 defined as the solution to minimization problem min Let the array of the points in X s , which are the neighbors of instance Our goal is to 'mix' information from different manifolds. This is accomplished by projecting the neighbors of x t i from the source to the target domain and then finding the linear combination of the points that best reconstructs x t i in the original domain: min wherex t←s i is the reconstruction of x t i using domain X s . We visualize how substituting x i withx t←s i might affect the target manifold in Figure 2. After setting the derivative with respect to w i to zero, the optimal solution corresponds to: whereÑ t←s i = L t s N s←t , the neighbors of x i in X t projected from their coordinates in X s back to the coordinates in X t . We can now transform original space X t into the space reconstructed from the other domains,X t , by computing for each instance the weighted mean of its reconstructions: where β j can be seen as the prior of domain X j 's relevance, and ∑ j β j = 1. When evaluating a new point x new , first, the nearest neighbors from the training set are found; then, the reconstruction is given byÑ s←t i w new . Linear map L t s might incur a large error when dealing with highly curved manifolds. In this case, a more accurate solution can be found by partitioning the source domain into disjoint curved subspaces and then constructing a map for each subspace plus one for the flat regions' union. We extend the original method by using a recursive algorithm to partition the space in this way, which is presented in Algorithm 1. The curvature is determined by computing the nearest neighbor matrix first and then finding the difference between the direct second neighbor distance and the sum of the first plus the second neighbor distances. Additionally, the linear map can be substituted for a non-linear map φ : R |s| → R |t| ; thus,Ñ t←s i = φ N s←t would be used in Equation (4) instead.

Pairwise Permutation Algorithm
Permutation importance (PI) is a popular algorithm that measures feature importance by comparing model performance before and after randomly permuting the feature values. Whenever there are highly correlated features, the values given by PI are biased, since the model can still rely on the other correlated features. In this work, we use the pairwise permutation algorithm (PPA) [7] to account for the correlation bias. The PPA starts by computing the correlation matrix between all the features and then permutes pairs of features whose hierarchically clustered correlation exceeds a pre-defined distance threshold α = 1. The importance value is then given by: where R is the feature hierarchical clustered correlation distance matrix; 1 is the indicator function; and PI i,j is the permutation importance value computed by permuting the ith and jth features together.

IF Data Set Model Performance
For each feature in the metabolomics, microbiota and clinical parameters, we computed the value differences between the first and second measurements (cross-over design) and normalized them using zero-mean unit variance standardization. A binary variable indicating whether MF was fed first or second was used as the outcome variable.
In order to evaluate the stacked model performance, we used Leave-One-Out crossvalidation combined with subsampling to prevent overfitting and computed the area under the receiver operating characteristic curve (ROC AUC). We tested the model on 10 subsamples with 75% of the data size. Moreover, we performed Bayesian optimization over the whole model's combined parameter space using 20% of the data as validation set. This routine is depicted in Figure 3. To account for errors due to the relatively low sample size, we used a binary tuning parameter indicating whether the mixing described in Section 2.6 coupled with a non-linear map was turned on. In order to control for potential confounders such as age, gender and length/weight, we included them in the clinical feature data set.
The model obtained an ROC AUC of 0.93 ± 0.03, thus achieving high discrimination between both interventions (Figure 4). To determine the statistical significance of the results, we performed a permutation test to evaluate the model performance on bootstrapped data with randomly permuted output values. The model was significant at the p ≤ 0.01 level. The permutation test score distribution is included in Appendix A ( Figure A1). The complete data set was subsampled T times with a sample size of S. For each sample, the data set was divided into its domains (microbes, metabolites and clinical), and each was split into the training/validation set over which we performed Bayesian optimization to tune the model. In parallel, one data instance whose label was predicted by the model was left out. This was performed for each of the S data instances in the sample. The ROC AUC curve was then computed for the aggregated predictions, and the score was averaged over T samples.

Feature Importance
The goal of our approach is to determine (combinations of) biomarkers that change due to fat sources in IF. Since we are confident in the model pattern recognition ability due to its high performance, we now investigated the most discriminating features between the two groups. We performed a permutation importance test to determine the relative contribution of each feature to the model, first for each modality and then for the overall data set. We then plotted the box plots to check the univariate differences for the top 10 features of each data set. The top 25 feature importance values and the top 12 box plots for each data set are depicted in Figures A2 and A3. As can be seen from Figure A2b,d, the model relies on lauric acid as the most important discriminant feature, with a higher fecal excretion in VF intervention. The biochemical parameters account for most of the discrimination performance, with a higher fecal excretion of total soaps and palmitic acid soap in the VF intervention. We further investigated the feature profile by creating a normalized spider-plot using the top 10 features of each data set ( Figure 5A,B). Since the overwhelming difference in lauric acid (C12) between the two groups likely overshadowed the difference in other markers in the combined profile, we recomputed feature importance in the absence of this fatty acid. Excluding C12 highlighted the importance of a.o glutamate and its descendant GABA (Figure 5A,B). We computed the standard permutation importance results in Appendix A ( Figure A4). The PPA ranked total and palmitic acid soaps on top, while it ranked stearic acid considerably lower. This suggests that stearic acid was likely highly ranked in the standard permutation importance list because the degree of correlation among the other fatty acid measurements masked its true importance.
Since all measures of weight and length, as well as age and gender, were attributed approximately zero importance, it is safe to assume these were not relevant confounders in this study.
We computed the difference in means p-values for each of the overall top 25 biomarkers using the t-test or Wilcoxon test when appropriate. From Figure 6, all of the bacteria except Veilonella parvula, as well as most metabolites in the list, had non-significant concentration differences. Using the traditional univariate approach would therefore cause one to miss these biomarkers.
In order to investigate feature interaction, we measured the Spearman correlation coefficient among the top features (Figure 7). The correlation between the microbial species and the top markers in the metabolites/clinical markers might explain why the model was able to pick up some bacteria with no significant differences between the groups, such as Veilonella parvula and its positive association with the FA soaps. Overall, there was a strong correlation between the different soap measurements, and GABA was also correlated with most of these. Interestingly, Veilonella parvula exhibited a very similar pattern of correlation with soaps compared to GABA.

Discussion
In this study, we demonstrate how a multivariate 'multi-omics' model is capable of identifying relevant biomarkers that do not meet the usual univariate statistical significance. This is evidenced by the high performance of the model and non-zero permutation importance for those non-statistically significant features. Although some of the attributed importance can be explained by the correlation with the markers exhibiting a more pronounced difference, the fact that the model picked up features with little mean differences and no correlations with markers with large differences is evidence that the model is able to capture highly non-linear interactions between the features and output that cannot be accounted by linear nor rank-based correlation alone. This is a major advantage over linear methods such as the popular mixOmics [1].
Regarding biomarker discovery, a major advantage of using a multi-omics approach is the possibility to compare the relative importance of features across different modalities. Although permutation importance is one of the most frequently used feature-ranking methods, it is known to be biased in the presence of correlated features. This is especially relevant in the medical/biology domains where the features form an intertwined network of interacting components. The contrast between the highest-ranked features in the PPA and that of standard PI makes it clear that PI underestimates the relevance of members in the fatty acid group due to their high correlation.
The top feature with the most significant difference between the VF and MF interventions was lauric acid (C12). The large difference in fecal lauric acid likely results from differences in its content between the two IFs; lauric acid concentrations were 6 and 10.4 mol% of triglycerides in, respectively, MF and VF [8]. Despite the reported antimicrobial activity of this FA against dominant gut bacteria, including Bifidobacterium spp., in the work by Matsue et al. [16], the relative abundances were not significantly lower in the VF intervention for any of the bacteria included in our model. The model is over-reliant on lauric acid due to its large difference, which overshadows patterns arising from other markers. Interestingly, removing it resulted in higher feature importance for the metabolites GABA, glutamate, glycine and taurine. GABA is produced by several microbial species, such as Bifidobacterium, Bacteroides and Escherichia species [17]. The correlation between GABA/Glutamate and several microbes in the top 10 list could mean that changing milk fat content has an impact on the synthesis of GABA by specific microorganisms. GABA and glutamate are inhibitory/excitatory neurotransmitters; thus, their balance in the central nervous system plays an important role in maintaining a stable nervous condition [18]. It is, however, still unclear how this is related to GABA and glutamate concentrations in the gut. Interestingly, Veillonella parvula and GABA correlated positively with each other and both correlated similarly with the most important fatty acids and fatty acid soaps. This positive correlation between GABA and Veillonella parvula is consistent with results of a recent clinical trial with preterm infants (Russell et al., 2021). Our data further showed a negative correlation between glutamate and GABA, which is likely to be explained by the fact that the decarboxylation of glutamate results in the synthesis of GABA and CO 2 . Although associations of Veillonella and intestinal GABA levels with brain-related diseases have been suggested [19,20], further studies are required to establish causal relationships and to determine the relevance of small differences in healthy populations.
The contribution of taurine and glycine to discriminate between the interventions might be indicative of differences in (secondary) bile salt metabolism due to the different fat sources used. This is further corroborated by the strong positive correlation between taurine and Collinsella aerofaciens that might be related to the known bile salt hydrolase activity of this organism. Thus, some of the discriminatory factors presented in this work suggest a relationship between the nervous and microbial digestive systems. Further studies are required to establish causal relationships among diet, changes in the microbiome and its metabolites, and clinical endpoint. One pitfall of the current methodology is the lack of detail in explaining the relationship between the covariates. Specifically, the impact on the microbiome is likely highly non-linear based on its distribution and model reliance. Unfortunately, the current importance method only provides a holistic explanation of which features play a role in distinguishing the two groups and does not discriminate the output distribution as a function of different covariate values. To achieve this ambitious goal, methods that study the impact on individual infants' biomarker profiles should be used in future studies.

Conclusions
Determining the optimal IF can impact infants' overall health and growth at their critical early stage in human development. Modern technology allows a large amount of biomarkers to be collected, providing a more complete picture of the impact different IFs have on infants' metabolism. The usual univariate analysis underscores the importance of the biological context and feature interactions; thus, a multivariate approach is preferred to reduce type II errors. However, this presents a challenge since data from different domains likely have very different distributions which can be difficult to model. We demonstrated how a stacked model optimized over the whole space of parameter combinations achieved very high performance when discriminating infant treatment interventions. This resulted in a general overview of the IF impact, and as expected, fatty acid soaps were the leading factor of discrimination, as was also found using the univariate analysis. Although the difference in microbe concentrations was the smallest of all the feature groups, the combined model selected several bacteria, highlighting a possible impact on the infant microbiome. Presumably, the model was able to pick up the microbial signal due to the correlation with the top markers in the clinical and metabolomics groups, something that would be missed using univariate models.