A New Pipeline for the Normalization and Pooling of Metabolomics Data

Pooling metabolomics data across studies is often desirable to increase the statistical power of the analysis. However, this can raise methodological challenges as several preanalytical and analytical factors could introduce differences in measured concentrations and variability between datasets. Specifically, different studies may use variable sample types (e.g., serum versus plasma) collected, treated, and stored according to different protocols, and assayed in different laboratories using different instruments. To address these issues, a new pipeline was developed to normalize and pool metabolomics data through a set of sequential steps: (i) exclusions of the least informative observations and metabolites and removal of outliers; imputation of missing data; (ii) identification of the main sources of variability through principal component partial R-square (PC-PR2) analysis; (iii) application of linear mixed models to remove unwanted variability, including samples’ originating study and batch, and preserve biological variations while accounting for potential differences in the residual variances across studies. This pipeline was applied to targeted metabolomics data acquired using Biocrates AbsoluteIDQ kits in eight case-control studies nested within the European Prospective Investigation into Cancer and Nutrition (EPIC) cohort. Comprehensive examination of metabolomics measurements indicated that the pipeline improved the comparability of data across the studies. Our pipeline can be adapted to normalize other molecular data, including biomarkers as well as proteomics data, and could be used for pooling molecular datasets, for example in international consortia, to limit biases introduced by inter-study variability. This versatility of the pipeline makes our work of potential interest to molecular epidemiologists.


Introduction
Metabolomics is a powerful tool for investigating candidate etiological pathways for chronic diseases [1][2][3][4]. Using either untargeted or targeted (via sets of pre-defined annotated metabolites) approaches, prior metabolomics studies have identified metabolites associated with the risk of several chronic conditions, including type-2 diabetes (T2D) [5], cardiovascular diseases (CVD) [6], and cancer [7][8][9]. Metabolomics has also been used to characterize specific signatures of anthropometric measures and lifestyle exposures, including body mass index (BMI) [7,10], adherence to a Mediterranean diet [6], and coffee consumption [5], as a way to investigate candidate biological mechanisms underpinning the relationship between these exposures and chronic diseases.
As with other-omics technologies, pre-processing metabolomics data is critical before relating them to phenotypes, such as cancer endpoints or lifestyle exposures [11,12]. After a matrix of p metabolites (or features) measured in n samples has been generated, pre-processing usually involves (i) feature and sample filtering, where low-quality features and samples are excluded, (ii) data imputation, to take care of missing values, and (iii) data normalization, to correct for sources of unwanted variation in metabolomics data, such as batch effects and other factors related to the handling of samples [11,[13][14][15][16]. Following the success of data acquisition efforts in large-scale epidemiological investigation, collaborative consortia have been put in place, offering the possibility to pool metabolomics data acquired in different studies in order to increase sample size and range of biological variation, and eventually enhance the statistical power of the analysis. However, pooling metabolomics data across studies raises methodological challenges as several preanalytical and analytical factors can induce differences in metabolite measurements and unwanted variability between datasets. Specifically, sample types (e.g., serum versus plasma), fasting status of the participant, and any other elements related to sampling conditions, sample treatment, and storage represent preanalytical factors, while analytical factors include information on the organization of samples in batches, the acquisition instrument, the acquisition time (i.e., time at which the sample was assayed), and the laboratory [17]. Correcting for these sources of variations is crucial in order to conduct accurate statistical analyses on pooled datasets.
Data on common quality controls assayed in all studies and/or reference assay data from a subset of samples in each study can be used for normalization [17,18]. However, these data are not always available in large international investigations and consortia. Accordingly, we developed a pipeline for the normalization and pooling of metabolomics data acquired in different studies that does not require data on quality controls or reference assay data, which covers four main steps. First, data cleaning identified and removed features and samples exceeding certain thresholds of missingness and outlying samples [11,16]. Second, the remaining missing values were imputed within each study using information on limits of detection and quantification when available and appropriate, and measurements were log-transformed to reduce skewness. Third, the principal component partial R-square (PC-PR2) technique was implemented to identify sources of variation in the metabolomics data [13]. Last, mixed effect models were used to correct for unwanted variability while preserving biological variability [14]. The ComBat method [19] implemented in the R sva package [20] and a PCA-based method [21,22] were also implemented for comparison. Our pipeline was applied to targeted metabolomics data acquired in eight case-control studies nested within the European Prospective Investigation into Cancer and Nutrition (EPIC) [23]. Comprehensive analytical and graphical examinations of measurements were performed to assess whether different normalization approaches improved the comparability of metabolomics data. For illustration, metabolomics data were pooled and related to study participants' BMI.

Description of the Study Population
Targeted metabolomics data acquired within the EPIC study and centralized at the International Agency for Research on Cancer (IARC) included 16,060 pre-diagnostic blood samples originating from eight case-control studies nested within EPIC (details in Section 4.1) on seven types of cancer: breast cancer (one study denoted by BREA; n = 3172 samples) [8], endometrial cancer (ENDO; n = 1706) [24], gallbladder cancer (GLBD; n = 112), liver cancer (LIVE; n = 596) [25], kidney cancer (KIDN; n = 1213) [26], prostate cancer (PROS; n = 6020) [9,27], and colorectal cancer (two studies denoted by CLRT1 and CLRT2; n = 946 and n = 2295, respectively). As displayed in Table 1, samples collected at recruitment were  assayed at IARC for BREA, LIVE, KIDN, PROS, and CLRT1, at the Helmholtz Zentrum  (München, Germany) for CLRT2 and GLBD, and at the Imperial College London (UK) for ENDO. Across all studies, measurements of a total of 171 metabolites were acquired using either the AbsoluteIDQ p180 or the AbsoluteIDQ p150 (for CLRT2 only) commercial kit (Biocrates Life Science AG, Innsbruck Austria), following the procedure recommended by the vendor. As displayed in Table 1, samples were assayed on different liquid chromatography (LC) and mass spectrometry (MS) instruments across the different studies, but each study used one single pair of LC-MS instruments for all samples. Samples were mostly either serum or citrate plasma, and samples within one study all originated from the same type of blood matrix, except in BREA and GLBD where samples from Swedish participants originated from a different blood matrix compared to the other participants. For these two studies, samples assayed within each batch all originated from the same blood matrix (not shown). Samples were assayed between 2013 and 2018. The pipeline detailed in Section 4.2 was applied to the (n × p) matrix with n = 16,060 samples and the p = 118 metabolites measured in all studies. Specifically, they included 13 metabolites (amino acids) measured by a quantitative LC-MS/MS method (liquid chromatography coupled with tandem mass spectrometry) and 105 metabolites (76 glycerophospholipids, 12 sphingolipids, 16 acylcarnitines, and 1 hexose, the sum of six-carbon sugars) acquired by a semi-quantitative FIA-MS/MS method (flow injection analysis coupled with tandem mass spectrometry, one-point calibration, no individual internal standards).

Data Cleaning and Imputation
For the exclusion of metabolites and samples exceeding a given threshold of missingness, we applied the method described in Section 4.2.1 with a threshold set to 20% and with missing values defined as "fully missing" values only, i.e., not including values out of measurable range. Among the 118 metabolites originally retained for the analysis, the acylcarnitine C4-OH (C3-DC) was the only one with a fully missing value in more than 20% of the samples of at least one study (PROS), and was excluded. Among the 16,060 samples originally retained for the analysis, none was excluded because of exceeding 20% of fully missing values, eight were excluded because they were measured in batches with less than 10 samples, and two were excluded because they were considered outliers after a principal component analysis (PCA). Thus, the final study population included 16,050 samples, for which measurements of 117 metabolites were included (Supplementary Table S1). Out of the 1,877,850 corresponding measurements, 1066 were fully missing and 63,564 were out of the measurable range: specifically, 63,044 were below a known LOD (limit of detection, applicable to acylcarnitines, glycerophospholipids, hexose, and sphingolipids), 517 below a known LLOQ (lower limit of quantification, applicable to amino acids), 2 above a known ULOQ (upper limit of quantification), and 1 below an unknown LOD. All these 1066 + 63,564 = 64,630 missing values were imputed as described in Section 4.2.2, and concentration values were log-transformed.

Identification of Major Sources of Variations
As displayed in Figure 1 (left panel), the projection of the measurements on the first two principal components of the PCA were strongly clustered by study, suggesting the presence of systematic sources of heterogeneity across studies. The PC-PR2 method was applied to assess the proportion of the overall variation in the metabolomics data that was explained by a predefined list of variables, including (i) participants' characteristics, i.e., study center, gender, case-control indicator, age, BMI, alcohol intake, smoking status, and (ii) three variables describing possible preanalytical and analytical sources of unwanted variations: fasting status, time of the day of blood collection, study, and batch, with batch nested within study. As shown in Figure 2 (top panel), the PC-PR2 analysis indicated that these variables together explained more than 55% of the total variation of the metabolomics measurements before normalization. Study explained 31% of the total variation, while batch within study explained about 8%. Study center explained about 9% of the total variation, and gender, BMI, and alcohol intake explained about 2%, 2%, and 1%, respectively. Fasting status, time of blood collection, age at recruitment, smoking status, and case-control status all explained less than 1% of the total variation.
Metabolites 2021, 11, x FOR PEER REVIEW 5 of 18 less than 10 samples, and two were excluded because they were considered outliers after a principal component analysis (PCA). Thus, the final study population included 16,050 samples, for which measurements of 117 metabolites were included (Supplementary Table S1). Out of the 1,877,850 corresponding measurements, 1066 were fully missing and 63,564 were out of the measurable range: specifically, 63,044 were below a known LOD (limit of detection, applicable to acylcarnitines, glycerophospholipids, hexose, and sphingolipids), 517 below a known LLOQ (lower limit of quantification, applicable to amino acids), 2 above a known ULOQ (upper limit of quantification), and 1 below an unknown LOD. All these 1066 + 63,564 = 64,630 missing values were imputed as described in Section 4.2.2, and concentration values were log-transformed.

Identification of Major Sources of Variations
As displayed in Figure 1 (left panel), the projection of the measurements on the first two principal components of the PCA were strongly clustered by study, suggesting the presence of systematic sources of heterogeneity across studies. The PC-PR2 method was applied to assess the proportion of the overall variation in the metabolomics data that was explained by a predefined list of variables, including (i) participants' characteristics, i.e., study center, gender, case-control indicator, age, BMI, alcohol intake, smoking status, and (ii) three variables describing possible preanalytical and analytical sources of unwanted variations: fasting status, time of the day of blood collection, study, and batch, with batch nested within study. As shown in Figure 2 (top panel), the PC-PR2 analysis indicated that these variables together explained more than 55% of the total variation of the metabolomics measurements before normalization. Study explained 31% of the total variation, while batch within study explained about 8%. Study center explained about 9% of the total variation, and gender, BMI, and alcohol intake explained about 2%, 2%, and 1%, respectively. Fasting status, time of blood collection, age at recruitment, smoking status, and case-control status all explained less than 1% of the total variation.   Results from the PC-PR2 analysis of the imputed data (i.e., before the normalization step; (top)) and the normalized data (bottom). In each plot, the y−axis represents the proportion (expressed as a percentage) of the total variation of the metabolomics measurements explained by each variable represented on the x−axis, together with their total.

Normalization of the Measurements
Based on PC-PR2 analysis, metabolite concentrations were normalized using the method described in Section 4.2.4 to correct for variation due to study and batch, and preserve the variation due to study center, gender, BMI and alcohol intake. These latter four variables were all unequally distributed across studies and batches (not shown). They were included as fixed effects in matrix Z (Equation (2) in Section 4.2.4; otherwise, some of the variation they explain would be removed because of the adjustment for study and batch, while study and batch within study were modeled as random effects in matrix X. Other variables studied in the PC-PR2 analysis were not included in matrix X or Z as they contributed very little to the total variation. Heteroscedastic metabolite-specific mixed models with a study-specific variance component were used, although homoscedastic models produced very similar results (not shown). The PCA of normalized data ( Figure  1; right panel) indicated that the projections on the first two principal components were not clustered by study anymore, and measurements' distribution largely overlapped. Data from PROS (men only) were slightly shifted to the left and data from BREA and ENDO (women only) were shifted to the right, suggesting that the normalization preserved some variation due to gender overall. For illustration, the distribution of one semiquantified metabolite, SM OH C22:1, was computed within batches and across studies, for the imputed and the normalized measurements ( Figure 3). Imputed data displayed a study effect, with concentration levels of SM OH C22:1 in the CLRT2, ENDO, GLBD, KIDN, and LIVE studies substantially larger than those in BREA, CLRT1, and PROS. A remarkable batch effect was observed within some studies, e.g., BREA. After normalization, the distributions were very similar across batches and studies. Again, the distribution was slighty shifted downward for concentration levels in PROS (men only), and upward in BREA and ENDO (women only), compared to the other five studies CLRT1, CLRT2, GLBD, KIDN, and LIVE (which included both men and women), confirming that the normalization preserved some variation due to gender for this particular metabolite. The PC-PR2 analysis of normalized data ( Figure 2, bottom panel) confirmed that Figure 2. Results from the PC-PR2 analysis of the imputed data (i.e., before the normalization step; (top)) and the normalized data (bottom). In each plot, the y−axis represents the proportion (expressed as a percentage) of the total variation of the metabolomics measurements explained by each variable represented on the x−axis, together with their total.

Normalization of the Measurements
Based on PC-PR2 analysis, metabolite concentrations were normalized using the method described in Section 4.2.4 to correct for variation due to study and batch, and preserve the variation due to study center, gender, BMI and alcohol intake. These latter four variables were all unequally distributed across studies and batches (not shown). They were included as fixed effects in matrix Z (Equation (2) in Section 4.2.4; otherwise, some of the variation they explain would be removed because of the adjustment for study and batch, while study and batch within study were modeled as random effects in matrix X. Other variables studied in the PC-PR2 analysis were not included in matrix X or Z as they contributed very little to the total variation. Heteroscedastic metabolite-specific mixed models with a study-specific variance component were used, although homoscedastic models produced very similar results (not shown). The PCA of normalized data ( Figure 1; right panel) indicated that the projections on the first two principal components were not clustered by study anymore, and measurements' distribution largely overlapped. Data from PROS (men only) were slightly shifted to the left and data from BREA and ENDO (women only) were shifted to the right, suggesting that the normalization preserved some variation due to gender overall. For illustration, the distribution of one semi-quantified metabolite, SM OH C22:1, was computed within batches and across studies, for the imputed and the normalized measurements ( Figure 3). Imputed data displayed a study effect, with concentration levels of SM OH C22:1 in the CLRT2, ENDO, GLBD, KIDN, and LIVE studies substantially larger than those in BREA, CLRT1, and PROS. A remarkable batch effect was observed within some studies, e.g., BREA. After normalization, the distributions were very similar across batches and studies. Again, the distribution was slighty shifted downward for concentration levels in PROS (men only), and upward in BREA and ENDO (women only), compared to the other five studies CLRT1, CLRT2, GLBD, KIDN, and LIVE (which included both men and women), confirming that the normalization preserved some variation due to gender for this particular metabolite. The PC-PR2 analysis of normalized data ( Figure 2, bottom panel) confirmed that normalization removed unwanted sources of variation (batch and study), but kept most variability attributed to participants' characteristics. Complementary PC-PR2 analysis showed that blood matrix and LC-MS instruments contributed to less than 0.1% of the total variation after normalization (results not shown). Compared to our approach, the PCA-based method produced rather different results for many metabolites, while ComBat [19] produced very similar results for all metabolites with the exception of most acylcarnitines and the glycerophospholipid PC aa C40:1 ( Supplementary Figures S1 and S2).
Metabolites 2021, 11, x FOR PEER REVIEW 7 of 18 normalization removed unwanted sources of variation (batch and study), but kept most variability attributed to participants' characteristics. Complementary PC-PR2 analysis showed that blood matrix and LC-MS instruments contributed to less than 0.1% of the total variation after normalization (results not shown). Compared to our approach, the PCA-based method produced rather different results for many metabolites, while ComBat [19] produced very similar results for all metabolites with the exception of most acylcarnitines and the glycerophospholipid PC aa C40:1 ( Supplementary Figures S1 and S2). . Dots indicate measurements out of the interval (q1 − 1.5 × IQR, q3 + 1.5 × IQR) with q1 and q3 being the first and third quartile, respectively, and IQR = q3 − q1 the interquartile range. In each plot, the y−axis represents the value of the measurement and the y−axis represents each batch, arranged by study.

Technical Reproducibility of Measurements before and after Normalization
Intraclass correlation coefficients (ICC) were computed for each metabolite to assess their technical reproducibility, using measurements from 2 × 219 = 438 duplicate samples, i.e., samples measured once in two different studies (2 × 147 samples; see Supplementary  Table S2) or in two different batches of the prostate study (2 × 72 samples), as detailed in Section 4.3. Figure 4 shows the distributions of ICCs for the semi-quantified (lipids, acylcarnitines, and hexose) and quantified (amino acids) metabolites, before and after normalization. Normalization shifted the distribution of ICCs upward for semi-quantified metabolites. The distribution of quantified metabolites did not shift as much, but the variability narrowed down, with no ICC value lower than 0.50. Figure 5 shows the effect of the normalization on the ICC of each individual metabolite (top), and on the average ICC for each class of metabolites (bottom). Before normalization, 101 (86%) metabolites (92 semiquantified, 9 quantified) had ICC values lower than 0.75, among which 38 (32%: 35 semiquantified, 3 quantified) had ICC values lower than 0.5. After normalization, only twelve metabolites (10%: 9 semi-quantified, 3 quantified) had ICC values lower than 0.75, among which only two semi-quantified metabolites had ICC lower than 0.50. Moreover, classspecific averaged ICC values were consistently improved by the normalization, in particular for glycerophospholipids and sphyngomyelins. Similar results were observed when normalization was performed with ComBat [19], yet ICCs were lower than those obtained when using our approach, especially for acylcarnitines (Supplementary Figure S3). ICCs were even lower when normalization was performed with the PCA-based method (Supplementary Figure S4). The same analysis was restricted to the 2 × 57 = 114 duplicate samples acquired in two studies from serum and citrate plasma. As displayed in . Dots indicate measurements out of the interval (q1 − 1.5 × IQR, q3 + 1.5 × IQR) with q1 and q3 being the first and third quartile, respectively, and IQR = q3 − q1 the interquartile range. In each plot, the y−axis represents the value of the measurement and the y−axis represents each batch, arranged by study.

Technical Reproducibility of Measurements before and after Normalization
Intraclass correlation coefficients (ICC) were computed for each metabolite to assess their technical reproducibility, using measurements from 2 × 219 = 438 duplicate samples, i.e., samples measured once in two different studies (2 × 147 samples; see Supplementary  Table S2) or in two different batches of the prostate study (2 × 72 samples), as detailed in Section 4.3. Figure 4 shows the distributions of ICCs for the semi-quantified (lipids, acylcarnitines, and hexose) and quantified (amino acids) metabolites, before and after normalization. Normalization shifted the distribution of ICCs upward for semi-quantified metabolites. The distribution of quantified metabolites did not shift as much, but the variability narrowed down, with no ICC value lower than 0.50. Figure 5 shows the effect of the normalization on the ICC of each individual metabolite (top), and on the average ICC for each class of metabolites (bottom). Before normalization, 101 (86%) metabolites (92 semi-quantified, 9 quantified) had ICC values lower than 0.75, among which 38 (32%: 35 semi-quantified, 3 quantified) had ICC values lower than 0.5. After normalization, only twelve metabolites (10%: 9 semi-quantified, 3 quantified) had ICC values lower than 0.75, among which only two semi-quantified metabolites had ICC lower than 0.50. Moreover, class-specific averaged ICC values were consistently improved by the normalization, in particular for glycerophospholipids and sphyngomyelins. Similar results were observed when normalization was performed with ComBat [19], yet ICCs were lower than those obtained when using our approach, especially for acylcarnitines (Supplementary Figure S3). ICCs were even lower when normalization was performed with the PCA-based method (Supplementary Figure S4). The same analysis was restricted to the 2 × 57 = 114 duplicate samples acquired in two studies from serum and citrate plasma. As displayed in Supplementary Figure S5, ICC values were lower than 0.5 for 69 metabolites (59%) and 4 metabolites (3%), before and after normalization, respectively, with ICC values greater than 0.75 for 91 metabolites (78%) after normalization. Supplementary Figure S5, ICC values were lower than 0.5 for 69 metabolites (59%) and 4 metabolites (3%), before and after normalization, respectively, with ICC values greater than 0.75 for 91 metabolites (78%) after normalization.

Impact of Normalization When Relating a Given Phenotype to the Metabolites
The relationship between the metabolites and BMI was assessed. The analysis was restricted to control samples to reduce collider bias, and one sample was randomly chosen from among duplicates. For each of the 117 metabolites, Pearson correlation coefficients were computed between BMI and, in turn, the imputed measurements, the normalized measurements, and the normalized measurements produced by a simpler normalization approach, which corrected for study and batch effects without attempting to preserve variation due to study center, BMI, gender, and alcohol intake. As displayed in Figure 6, most correlation values were above the line y = x, especially for values greater than 0.1: associations with BMI were stronger when using normalized data implementing our approach compared to those observed with both non-normalized data and normalized data implementing a simple, yet incomplete, normalization approach.

Impact of Normalization When Relating a Given Phenotype to the Metabolites
The relationship between the metabolites and BMI was assessed. The analysis was restricted to control samples to reduce collider bias, and one sample was randomly chosen from among duplicates. For each of the 117 metabolites, Pearson correlation coefficients were computed between BMI and, in turn, the imputed measurements, the normalized measurements, and the normalized measurements produced by a simpler normalization approach, which corrected for study and batch effects without attempting to preserve variation due to study center, BMI, gender, and alcohol intake. As displayed in Figure 6, most correlation values were above the line y = x, especially for values greater than 0.1: associations with BMI were stronger when using normalized data implementing our approach compared to those observed with both non-normalized data and normalized data implementing a simple, yet incomplete, normalization approach.   , and normalized measurements produced by the simpler normalization approach (right), which corrected for study and batch effects without specifically attempting to preserve variation due to study center, BMI, gender, and alcohol intake.

Discussion
In this work, a pipeline for the normalization of metabolomics data acquired in different studies was described. After a screening of informative metabolites and samples, the PC-PR2 method was used to identify major sources of variation in metabolomics data and linear mixed effect models were used to correct for unwanted sources of variation, while attempting to preserve biological variation and accounting for potential heteroscedasticity. The pipeline was applied to targeted metabolomics data acquired in eight cancer-specific case-control studies nested within EPIC. Substantial inter-study and interbatch heterogeneity was observed in the original data. Accordingly, the technical reproducibility was low-to-moderate for many metabolites with ICC values lower than 0.50, especially for the semi-quantified metabolites (e.g., glycerophospholipids), suggesting that quantified metabolites might be less prone to unwanted variations due to analytical factors. Our normalization approach eliminated most of the inter-study and inter-batch variability and improved the technical reproducibility of a large proportion of semi-quantified and quantified metabolites, with most ICC values greater than 0.75. Normalization using the ComBat approach [19], which relies on a similar model but uses empirical Bayes estimation, performed similarly for all metabolites except acylcarnitines, for which ICC values were larger with our approach than ComBat. Normalization using the PCA-based method produced lower ICC values for most metabolites. All together, these results suggested that our approach outperformed ComBat and the PCA-based method on the EPIC metabolomics data. ICC values estimated from the duplicate samples originating from different blood matrices (serum versus citrate plasma) were generally larger than 0.75 after normalization. However, they were also generally lower than values estimated from all duplicate samples. In particular, the ICC for methionine was 0.39 (95% confidence interval, CI: 0.14-0.57), as compared to 0.71 (95% CI: 0.64-0.77) when ICC estimation used all duplicate samples. This result calls for caution when pooling samples originating from different blood matrices, as large differences were reported for specific metabolite concentrations in serum and plasma samples [28].
As samples within each individual EPIC study were all assayed in the same laboratory with the same LC-MS instruments, and mostly originated from the same blood matrix , and normalized measurements produced by the simpler normalization approach (right), which corrected for study and batch effects without specifically attempting to preserve variation due to study center, BMI, gender, and alcohol intake.

Discussion
In this work, a pipeline for the normalization of metabolomics data acquired in different studies was described. After a screening of informative metabolites and samples, the PC-PR2 method was used to identify major sources of variation in metabolomics data and linear mixed effect models were used to correct for unwanted sources of variation, while attempting to preserve biological variation and accounting for potential heteroscedasticity. The pipeline was applied to targeted metabolomics data acquired in eight cancer-specific case-control studies nested within EPIC. Substantial inter-study and inter-batch heterogeneity was observed in the original data. Accordingly, the technical reproducibility was low-to-moderate for many metabolites with ICC values lower than 0.50, especially for the semi-quantified metabolites (e.g., glycerophospholipids), suggesting that quantified metabolites might be less prone to unwanted variations due to analytical factors. Our normalization approach eliminated most of the inter-study and inter-batch variability and improved the technical reproducibility of a large proportion of semi-quantified and quantified metabolites, with most ICC values greater than 0.75. Normalization using the ComBat approach [19], which relies on a similar model but uses empirical Bayes estimation, performed similarly for all metabolites except acylcarnitines, for which ICC values were larger with our approach than ComBat. Normalization using the PCA-based method produced lower ICC values for most metabolites. All together, these results suggested that our approach outperformed ComBat and the PCA-based method on the EPIC metabolomics data. ICC values estimated from the duplicate samples originating from different blood matrices (serum versus citrate plasma) were generally larger than 0.75 after normalization. However, they were also generally lower than values estimated from all duplicate samples. In particular, the ICC for methionine was 0.39 (95% confidence interval, CI: 0.14-0.57), as compared to 0.71 (95% CI: 0.64-0.77) when ICC estimation used all duplicate samples. This result calls for caution when pooling samples originating from different blood matrices, as large differences were reported for specific metabolite concentrations in serum and plasma samples [28].
As samples within each individual EPIC study were all assayed in the same laboratory with the same LC-MS instruments, and mostly originated from the same blood matrix (except for GLBD that included serum and heparin plasma samples and BREA that included EDTA and citrate plasma samples), the variability due to these factors was encompassed into the inter-study variability and could not be assessed by the PC-PR2 analysis. In particular, although the large inter-study variability in the non-normalized data supported the presence of inter-laboratory and inter-instrument variability, as previously reported for the AbsoluteIDQ p180 kit [17], correction for batch and study effects also corrected for effects due to blood matrix and LC-MS instruments, which were both observed to contribute to less than 0.1% of the total variation in the normalized data. However, the inter-study and inter-batch variability also reflected biological variability, because factors like study center, gender, BMI, and alcohol intake were not equally distributed across studies and batches. Consequently, some of the biological variation due to these factors would be removed if the normalized data were simply computed as the residuals in linear mixed models adjusted for study and batch. Conversely, by accounting for study center, gender, BMI, and alcohol intake in the mixed models and by computing the normalized residuals using the step described in expression (2) in Section 4.2.4, the normalization preserved (some of) the variation due to these factors. This was illustrated by the distribution of normalized data that was shifted in opposite directions for studies including only men or women, and by the stronger associations with BMI observed when using the complete model for normalization compared to the simpler version that only included batch and study as random effects.
A critical step of normalization procedures that use linear mixed models, or more generally models with location/scale adjustments [19], is the choice of (i) factors that may generate unwanted variation, for which a correction should be implemented, and (ii) factors that represent biological variability, which should be preserved after normalization. As illustrated in Section 4.2.4, while the list of variables in (i) should be included in matrix X (like study and batch), variables in (ii) should be included in matrix Z, and the choice depends on the study design and on the ultimate objective of the analysis. If the objective is to identify metabolites associated with a given phenotype, e.g., BMI, it is crucial to include BMI in matrix Z, particularly if BMI is associated with specific variables included in matrix X. Conversely, if the ultimate objective of the study is to identify metabolites associated with, say, alcohol, while controlling for BMI, then alcohol should be included in matrix Z (particularly if it is associated with specific variables included in matrix X), but BMI could be included in matrix X, so that the associations are adjusted for BMI. In any case, performing sensitivity analyses with normalized data generated including different sets of variables in matrices X and Z is a good practice.
In multicenter investigations like EPIC, study center is a sensitive variable as it expresses technical (preanalytical) variation, likely the result of specific procedures for blood collection, sample treatment, and storage, as well as biological variation reflecting specific lifestyle exposures, often characterized by geographical gradients. In addition, in a multicenter context, the relationship between two sets of variables could be evaluated at the overall level, at the center level, or at the individual level [29]. In this study, to use the whole variability in metabolomics and BMI data, center was initially included in matrix Z. In the sensitivity analysis, study center was included in matrix X and the center-specific variability was removed. As shown in Supplementary Figure S6, results were similar to the overall analysis suggesting that group-level correlations were similar to individual-level correlations [29]. Alternative methods, like SVA [20,28] and RRmix [29], use linear (mixed) models with latent variables to estimate variability attributed to unspecified sources of variation, ultimately to be removed. These methods do not require prior knowledge of the sources of unwanted variation, but require the identification of sources of biological variation, as their effects would likely be removed if not properly accounted for in the linear predictor of the model.
The decision to implement data normalization largely depends on the ultimate objectives of the analysis. As the relationship between metabolites and cancer risk is generally quantified in conditional logistic regression models for matched case-control studies, metabolite measurements are compared within each matched case-control pair. If cases and controls are assayed within the same batch (as was the case in the EPIC metabolomics data), the effects of study and batch on the means of the measurements are not a concern and normalization is not required unless the variances of the measurements also vary across studies or batches. However, if the evaluation focuses on the investigation of lifestyle determinants of metabolomics data, as for example in mediation analysis, the matching is "broken" and control for inter-batch and inter-study variability is required [7].
Although originally developed for the normalization of metabolomics data acquired in different studies, our pipeline could be used for data acquired in a single study, for example to correct for inter-batch variability while preserving biological variability and to correct for potential heteroscedastic structures of concentration levels across batches. Our pipeline could also be adapted to the normalization of biomarker data and other molecular data, possibly with some modifications. In particular, for the normalization of untargeted LC-MS metabolomics data, a step to exclude features based on comparison with blank samples should be added to the data cleaning [16], and a K-nearest neighbors approach has been shown to perform particularly well for the imputation of missing data [15,30] in the context of untargeted metabolomics data. Importantly, when processing untargeted metabolomics data from individual studies separately, different feature identifiers (e.g., mass to charge ratio and retention time) would characterize the same molecule in each study. Therefore, the pooling of several untargeted datasets would generally require an additional feature alignment step consisting of identifying the features present in the different datasets, which might be particularly challenging with data acquired in different laboratories [31].
With the increasing availability of metabolomics data in large scale epidemiological investigations, such as those participating in the COnsortium of METabolomics Studies (COMETS) [32], pooling will be more and more relevant as a strategy for increasing the statistical power when investigating the relationship between metabolomics data with disease indicators, environmental exposures, and/or other-omics and biomarker data. Combined with analytical and graphical inspection of the data to determine sources of unwanted variability to be removed and sources of biological variability to be preserved, linear mixed models provide a flexible tool to normalize metabolomics data, and possibly other -omics and biomarker data, prior to pooling data from different studies. As the comparability of measurements across studies is improved, our normalization approach could also be useful for studies that aim at the meta-analysis of individual patient data from different studies, in particular if heteroscedastic patterns of variability were observed.

The EPIC Study
EPIC is a large prospective study of over 500,000 men and women recruited in 1992-2000 in 23 centres in 10 European countries [23], originally designed to investigate the relationship between diet and cancer risk. Incident cancer cases were identified through a combination of methods including linkage to health insurance records, cancer, and pathology registries and active follow-up through study participants and their next-ofkin [23]. Around 386,000 participants from all countries provided a blood sample at recruitment. Fasting before blood withdrawal was not required. Blood was collected according to a standardized protocol in France, Germany, Greece, Italy, the Netherlands, Norway, Spain, and the UK [23]. Serum (except in Norway), plasma, erythrocytes, and buffy coat aliquots were stored in liquid nitrogen (−196 • C) in a centralized biobank at IARC. In Denmark, blood fractions were stored locally in the vapor phase of liquid nitrogen containers (−150 • C), and in Sweden, they were stored locally at −80 • C in standard freezers. Our analyses used targeted metabolomics data collected within the EPIC study and generated through the AbsoluteIDQ p180 or p150 commercial kit (Biocrates Life Science AG, Innsbruck Austria).
All participants provided written informed consent to participate in the EPIC study. This study was approved by the ethics committee of the International Agency for Research on Cancer (IARC) and all centers.

The Pipeline to Normalize Data
Given a matrix of p metabolites acquired on n samples, our pipeline implemented four main steps, as summarized in Figure 7 and detailed hereafter for the EPIC targeted metabolomics data. R scripts implementing these four steps, along with Rmarkdown and html documents that can be used to reproduce the analysis of the EPIC data, are available at https://code.iarc.fr/viallonv/pipeline_biocrates, accessed on 14 September 2021.
All participants provided written informed consent to participate in the EPIC study. This study was approved by the ethics committee of the International Agency for Research on Cancer (IARC) and all centers.

The Pipeline to Normalize Data
Given a matrix of p metabolites acquired on n samples, our pipeline implemented four main steps, as summarized in Figure 7 and detailed hereafter for the EPIC targeted metabolomics data. R scripts implementing these four steps, along with Rmarkdown and html documents that can be used to reproduce the analysis of the EPIC data, are available at https://code.iarc.fr/viallonv/pipeline_biocrates, accessed on 14 September 2021.

Step1: Data Cleaning
The objective of data cleaning was to remove the least informative metabolites and samples, using a number of (subjective) criteria. First, the pipeline excluded metabolites and samples exceeding a certain threshold of missingness (e.g., 20%) in each study separately. Missing values were here defined as fully missing values, for which no information on the real value was available. In particular, they did not include out of measurable range values, which corresponded to values that were missing because they were below the batch-specific limit of detection (LOD), below the kit-specific lower limit of quantification (LLOQ), or above the kit-specific upper limit of quantification (ULOQ). An extra step was implemented to exclude outlying samples within each batch based on principal component analysis (PCA) [11], using a 20% proportional expansion of the Hotellings T2 distribution ellipse, with the level of the ellipse set to 100 × (1 − 0.05)/Nb % and Nb the total number of batches. Samples assayed in batches with less than 10 samples were also excluded to ensure enough information during batch-specific data imputation (Section 4.2.2) and normalization (Section 4.2.4).

Step 2: Data Imputation
All missing values, including the out of measurable range values, were imputed in the cleaned dataset in each batch separately. Values below batch-specific LOD, below kitspecific LLOQ, or above kit-specific ULOQ were set to LOD/2, LLOQ/2, and ULOQ, respectively. Values below an unknown batch-specific LOD were set to LOD/2 after setting batch-specific LOD to study-specific medians of known LOD values. Fully missing values were set to the batch-specific median of non-missing values if less than 50% of the measurements in the batch were missing and to the study-specific median of the batch-specific medians otherwise. Measurements were log-transformed to reduce skewness.

Step 3: Data Normalization, Part 1: Identification of Sources of Variation
The PC-PR2 technique was used to identify main sources of variation in the metabolomics data [13]. The PC-PR2 is a multivariate technique that combines PCA with multiple linear regression to assess the proportion of the variability of the full metabolomics dataset explained by a set of explanatory variables, including samples characteristics (age, sex, BMI, alcohol consumption, study center), as well as preanalytical and analytical factors

Step 1: Data Cleaning
The objective of data cleaning was to remove the least informative metabolites and samples, using a number of (subjective) criteria. First, the pipeline excluded metabolites and samples exceeding a certain threshold of missingness (e.g., 20%) in each study separately. Missing values were here defined as fully missing values, for which no information on the real value was available. In particular, they did not include out of measurable range values, which corresponded to values that were missing because they were below the batch-specific limit of detection (LOD), below the kit-specific lower limit of quantification (LLOQ), or above the kit-specific upper limit of quantification (ULOQ). An extra step was implemented to exclude outlying samples within each batch based on principal component analysis (PCA) [11], using a 20% proportional expansion of the Hotellings T2 distribution ellipse, with the level of the ellipse set to 100 × (1 − 0.05)/N b % and N b the total number of batches. Samples assayed in batches with less than 10 samples were also excluded to ensure enough information during batch-specific data imputation (Section 4.2.2) and normalization (Section 4.2.4).

Step 2: Data Imputation
All missing values, including the out of measurable range values, were imputed in the cleaned dataset in each batch separately. Values below batch-specific LOD, below kit-specific LLOQ, or above kit-specific ULOQ were set to LOD/2, LLOQ/2, and ULOQ, respectively. Values below an unknown batch-specific LOD were set to LOD/2 after setting batch-specific LOD to study-specific medians of known LOD values. Fully missing values were set to the batch-specific median of non-missing values if less than 50% of the measurements in the batch were missing and to the study-specific median of the batchspecific medians otherwise. Measurements were log-transformed to reduce skewness.

Step 3: Data Normalization, Part 1: Identification of Sources of Variation
The PC-PR2 technique was used to identify main sources of variation in the metabolomics data [13]. The PC-PR2 is a multivariate technique that combines PCA with multiple linear regression to assess the proportion of the variability of the full metabolomics dataset explained by a set of explanatory variables, including samples characteristics (age, sex, BMI, alcohol consumption, study center), as well as preanalytical and analytical factors (fasting status, sample processing protocol, blood matrix, study, batch, laboratory instrument).
While the former set of factors likely determined biological variability, the latter set likely introduced sources of unwanted variation in metabolomics data. PCA was conducted on metabolite measurements and a number K ≥ 1 of components sufficient to explain more than 80% of total variability was retained. Component scores were, in turn, regressed on the list of aforementioned independent variables, say W 1 , . . . , W Q , in multiple linear regression models, and the partial R 2 for each covariate W q was estimated for each component (C k ). For example, the partial-R 2 for W 1  In order to correct for unwanted sources of variability while preserving biological variability, a random effects model was used for each metabolite separately [14], as where y is the n-vector of the measurements for the metabolite/feature under consideration (all studies combined), the matrix X expresses variables corresponding to sources of variations that should be corrected for, and the optional matrix Z expresses variables corresponding to biological variations that should be preserved. Variables expressed in matrices X and Z typically include some of the variables W 1 , . . . , W Q of the PC-PR2 analysis with largest R 2 partial . The vector of parameters β associated to matrix X may include both fixed-and random-effects, while the vector θ associated to matrix Z contains fixed effects only. Parameter α is the intercept and vector ε~N n (0, Σ) corresponds to the random error of the model. Residuals ε are independent of the random effects of the model. Random effects are Gaussian, centered, and were further assumed to have diagonal covariance matrix in our illustration.
Parameters α, β, θ, and the vector of residuals ε under model (1) are estimated by, say, a, b, c, and e. Normalized residual measurements are computed as In this way, normalization preserves the association between the metabolite and variables in Z, while any association with variables in X is eliminated. As mentioned in the Discussion, variables describing biological variations of interest should be included in matrix Z if they are associated with variables included in matrix X (e.g., sources of biological variations that are unequally balanced across studies or batches); otherwise, some of the variation they explain would be removed because of the adjustment for X. In our illustration, study center indicators, gender, body mass index, and alcohol intake were included in matrix Z, while batch and study indicators were included in matrix X.
In the simple homoscedastic random effect models, each component of the vector ε of residuals has the same variance: Σ = σ 2 I n for some σ 2 > 0, where we denote by I p the identity matrix of size p for any positive integer p. However, in practice, pre-analytical and/or analytical factors may not only influence the means of the measurements via the term Xβ in model (1), but also their variance. For example, variances of components of ε may vary across studies. This was accounted for by working under heteroscedastic random effect models with a specified structure for the variance matrix Σ of the residuals, e.g., Σ was made of blocks of the form σ s 2 I ns for observations corresponding to study s (with ns the number of observations in study s). Then, residuals e were replaced by the Pearson residuals in Equation (2), after rescaling them to ensure that their overall variance equals that of the standard residuals. Homoscedastic models were implemented with the lmer function of the lme4 R package, while heteroscedastic models were implemented with the lme function of the nlme R package, using the weights instruction to specify the within-group heteroscedasticity structure.
For comparison, we also considered the ComBat method [19] of the sva R package [20], under which a fixed-effects version of model (1) is estimated using an empirical Bayes approach, to leverage the fact that sources of variation may affect many metabolites in similar ways. In our illustration, ComBat was applied to correct for batch effect (which also accounts for study effect), while attempting to preserve variations due to study center, gender, body mass index, and alcohol intake. We also considered a PCA-based method [21,22], where (i) a PCA was performed on the full metabolomics data matrix, and (ii), normalized measurements were the residuals obtained after regressing each metabolite on the first K = 2 principal components of the PCA; other choices for the value of parameter K were considered and led to similar or lower reproducibility (quantified by the intraclass correlation coefficient; see Section 4.3).

Computation of the Intraclass Correlation Coefficient Using Duplicated Samples
The EPIC data included duplicate samples corresponding to aliquots of a baseline blood sample from the same subject measured twice in different batches or in different studies. These duplicated samples were used to assess the technical reproducibility of metabolomics measurements, and in particular to compare technical reproducibility before and after normalization. In sensitivity analyses, ICC values were estimated using only duplicate samples originating from distinct blood matrices (serum and citrate plasma). For each metabolite, we estimated its ICC using a linear mixed effects model of the form [16] where m ik is the k-th replicate measurement of subject i, k = 1, 2, γ i~N (µ, σ γ 2 ) is a subjectspecific random effect (with µ corresponding to the general mean of m ik ), ξ ik~N (0, σ ξ 2 ) is the residual random error for replicate k of subject i, and Cov(ξ ik , γ i ) = 0. Under this model, ICC = Var(γ I )/Var(γ I + ξ ik ), so the ICC estimate was defined as the ratio of the estimated between-subject variance to the estimated total variance (between-and within-subject). Model (3) above can be estimated even if only a portion of the subjects have replicated samples. It was implemented using the lmer function of the lme4 R package, and 95% confidence intervals (CI) of the ICC values were derived using the parametric bootstrap implemented by the bootMer function of the lme4 R package.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10 .3390/metabo11090631/s1, Figure S1: Correlations between normalized measurements produced by ComBat and our approach, Figure S2: Correlations between normalized measurements produced by the PCA-based method and our approach, Figure S3: Average ICC values for each class of metabolites after normalization using ComBat and our approach, Figure S4: Average ICC values for each class of metabolites after normalization using the PCA-based method and our approach, Figure S5: Metabolite-specific ICC values before and after normalization and average ICC values for each class of metabolites before and after normalization, Figure S6: Correlations (absolute values) between BMI and the 117 metabolites in control samples, Table S1: List of the 117 metabolites retained after the data cleaning step, Table S2: Study origin of duplicate samples in the EPIC targeted metabolomics data.