Preanalytical Pitfalls in Untargeted Plasma Nuclear Magnetic Resonance Metabolomics of Endocrine Hypertension

Despite considerable morbidity and mortality, numerous cases of endocrine hypertension (EHT) forms, including primary aldosteronism (PA), pheochromocytoma and functional paraganglioma (PPGL), and Cushing’s syndrome (CS), remain undetected. We aimed to establish signatures for the different forms of EHT, investigate potentially confounding effects and establish unbiased disease biomarkers. Plasma samples were obtained from 13 biobanks across seven countries and analyzed using untargeted NMR metabolomics. We compared unstratified samples of 106 PHT patients to 231 EHT patients, including 104 PA, 94 PPGL and 33 CS patients. Spectra were subjected to a multivariate statistical comparison of PHT to EHT forms and the associated signatures were obtained. Three approaches were applied to investigate and correct confounding effects. Though we found signatures that could separate PHT from EHT forms, there were also key similarities with the signatures of sample center of origin and sample age. The study design restricted the applicability of the corrections employed. With the samples that were available, no biomarkers for PHT vs. EHT could be identified. The complexity of the confounding effects, evidenced by their robustness to correction approaches, highlighted the need for a consensus on how to deal with variabilities probably attributed to preanalytical factors in retrospective, multicenter metabolomics studies.


Introduction
Arterial hypertension (HT) was prevalent in approximately 22% of the global population in 2015 [1] and was found to be the leading cause of death in 2017 [2]. Three types of secondary HT, specifically endocrine HT (EHT), are primary aldosteronism (PA), pheochromocytoma and paraganglioma (PPGL), and Cushing's syndrome (CS). All these diseases are associated with increased morbidity and mortality [3][4][5][6][7]. Tailored medical treatments, in specific surgical resection of the underlying hormone-producing lesion, result in decreased morbidity and mortality of patients [3,[5][6][7], therefore correct diagnosis is key for patient survival. However, diagnosing EHT can be challenging, due to the dependence of the PA screening test on numerous factors [8], highly variable clinical presentations and a lack of routine cost-effective biochemical screening for PPGLs [9], and heavy reliance on medical experts' experience in CS screening [10,11]. As a result, many cases of PA, PPGL and CS remain undetected [8,11,12]. By employing a more general, objective and cost-effective screening method, the chances for a long-term cure as well as blood pressure control should be increased [13]. In addition, early disease identification may shed light on pathophysiological mechanisms allowing for personalized treatment approaches [13]. Metabolomics has led to the discovery of potential biomarkers for several diseases [14][15][16][17]. The ENSAT-HT consortium aims to establish multi-omics diagnostic biomarkers, including metabolomics, for classifying patients as having either primary HT (PHT) or a form of EHT [18]. Erlic et al. [19] applied a targeted Liquid Chromatography-Mass Spectrometry approach to differentiate ENSAT-HT plasma samples collected from patients with EHT from those collected from patients with PHT and delineated a biomarker signature with high classification accuracy.
Metabolomics studies, however, when applied to clinical studies, are prone to limitations such as confounding effects [20]. For example, the initial report on the high accuracy of metabolomics in predicting coronary artery disease [21], was contrasted by the follow-up report by Kirschenlohr et al., in which patient gender and medication were identified as sources of unwanted variation and lower accuracies were found by analyzing subgroups [22]. According to the 2005 review by David Ransohoff, "Bias can be so powerful in non-experimental observational research that a study should be presumed 'guilty'-or biased-until proven innocent" [23], therefore it is advisable to investigate and correct the influence of confounders in metabolomics studies. Confounding effects can be corrected using methods such as Analysis Of Variance Simultaneous Component Analysis (ASCA) [24], Covariate-Adjusted Projection to Latent Structures [25] and regularized Multivariate Analysis Of Variance [26].
In this paper, we analyzed bio-banked plasma samples sent from various centers that are part of the ENSAT-HT consortium, using proton nuclear magnetic resonance spectroscopy ( 1 H-NMR)-based untargeted metabolomics as an alternative and possibly complementary approach to targeted metabolomics [19]. We aimed to establish unbiased EHT biomarkers, so we initially compared each EHT disease group to PHT and obtained signatures of each comparison from the analyses. Subsequently, we applied three approaches to investigate and correct the influence of confounders on the obtained signatures, as center heterogeneity is a well-known challenge in multicenter studies [27], particularly when disease groups are inadequately represented in each center's patient cohort [28]. Table 1 summarizes the characteristics of all patient samples analyzed. Briefly, our final dataset consisted of 231 plasma samples collected from patients with EHT (104 PA + 94 PPGL + 33 CS) and 106 samples from patients with PHT. PATIENT CHARACTERISTICS PATIENT AGE 55 [18-79] * 49  48  50  47    Results from the peak picking procedure are shown in Figure 1. Results from the peak picking procedure are shown in Figure 1. The Principal Component Analysis (PCA) score plot of the complete dataset including all 86 peaks ( Figure S1) showed Quality Control (QC) samples clustering closely The Principal Component Analysis (PCA) score plot of the complete dataset including all 86 peaks ( Figure S1) showed Quality Control (QC) samples clustering closely together, indicating a low analytical variation compared to the sum of biological and preanalytical variation. The result shown in Figure 2a appeared promising for the successful separation of the two main disease groups (EHT and PHT), as the two disease groups resulted in slightly different scores along the second principal component (PC2). The classification analyses of these datasets resulted in high accuracies of approximately 80% when attempting to distinguish PHT from EHT forms (Table S2 and Table 2). Lower balanced accuracies (e.g., 65%, Table 2) resulted from separating all groups from each other. Table 3 depicts the signatures obtained from scenario EHT-PHT.

Initial Approach: Establishing Possible EHT-PHT Biomarkers
(GYLU), München (GYMU), Würzburg (GYWU), Torino (ITTU3), Nijmegen (NLNI) and Warsaw (PLWW), samples in cluster Paris (FRPA1) PHT had low glutamine and high glutamate, whereas samples from centers FRPA1 (non-PHT), FRPA2, Glasgow (GBGL2), Galway (IRGA) and Padova (ITPD/ITPD3) were characterized by high lactate and ornithine, GBGL2 samples were further characterized by high glutamate and low glutamine, and ITPD/ITPD3 samples had high methanol. These metabolites, as well as those found in signatures, were identified as described in the Methods section of this paper and shown in Figure 3.  In score plot (a), samples were colored according to disease group (CS, PA, PHT or PPGL), whereas in score plot (b), samples were colored according to the centers in which they were collected and score plot (c) depicts samples colored according to sample age, with the median value as the cutoff. In all scores plots, the percentage of explained variance per component is depicted in the plot axes. Though PC2 scores are slightly higher for PHT samples compared to EHT, samples were strikingly different from center to center. The 95% ellipses in plot (b) were calculated based on a score plot colored according to the two main clusters according to PC1, i.e., cluster 1 (orange) and cluster 2 (blue), and were included here to highlight what seems to be the most important source of variation in these data. Plot (d) is the loadings plot of the same PCA, with NMR peaks in blue and the corresponding metabolite names in red. Only peaks with a correlation cutoff above 0.5 are shown, as these arise from metabolites that most affect sample distribution in the scores plots (a-c).
By investigating the various sources of variation present in our data using PCA, sample center of origin, as well as sample age, appeared to be major sources of variation in the data. As illustrated in Figure 2b, clusters of centers were apparent. Figure 2c demonstrates a tendency for separation of samples based on their sample age, which was significantly different between disease groups, as well as amongst centers, according to a Kruskal-Wallis test (p < 2 × 10 −16 ). Effects of other factors, such as analytical batch and patient age, were not observed to coincide with tendencies in the first two principal components ( Figure S2a and Padova (ITPD/ITPD3) were characterized by high lactate and ornithine, GBGL2 samples were further characterized by high glutamate and low glutamine, and ITPD/ITPD3 samples had high methanol. These metabolites, as well as those found in signatures, were identified as described in the Methods section of this paper and shown in Figure 3.  * Peaks were excluded either because they were found to be important in discriminating samples in a Partial Least Squares Discriminant Analysis (PLSDA) of center cluster, i.e., the separation of centers according to the first dimension in the PCA score plot of Figure 2b, in a PLSDA of sample age (with the median sample age as a cutoff), or because they were found to be higher or lower in the FRPA1 PHT group of samples.

Correcting for Confounders: Approach A (ASCA Correction)
We could not use Analysis of Variance Simultaneous Component Analysis (ASCA) to correct for the center directly, as there was not an adequate representation of each center in all investigated groups. Therefore, we attempted to correct our data for possible confounders using ASCA, performed according to the observed clustering of centers along the first principal component (Figure 2b, with centers GYDR, GYLU, GYMU, GYWU, ITTU3, NLNI, PLWW, as well as the FRPA1 PHT group included in cluster 1 and FRPA1 (non-PHT), FRPA2, GBGL2, IRGA, ITPD/ITPD3 in cluster 2. We did not use ASCA to correct for sample age, as our implementation did not allow continuous variables to be considered as design factors. As shown in Table S3 and Table 2, classification analyses after using ASCA for correcting the data resulted in overall similar results to the previous approach. ASCA correction was not possible on the CS-PHT or the all vs. all dataset, as there were no samples represented by cluster 1 in the CS group.
According to Table 3, glycine and lactate were the differences between the ASCA and Initial Approach signatures of scenario EHT-PHT, indicating that their levels were normalized by ASCA, diminishing their importance. In the other scenarios, ASCA similarly normalized levels of metabolites that correlated to the first principal component of Figure 2.

Correcting for Confounders: Approach B (Metabolite Exclusions)
As shown in Figure 2, center and sample age appeared to be major sources of variation in our data. Table S4 provides an overview of all analyses carried out to assess the effects of these confounders on the various datasets. Overall, accuracies were higher when comparing groups defined by confounders than they were based on disease groups (Initial Approach). We employed PLSDA to establish the signatures of these effects (Table 4) and compared cluster 1 (excluding the FRPA1 PHT group due to a distinct signature) to cluster 2, as well as samples with a sample age below the median to those with a sample age above the median (calculated including all samples). We used PCA, due to limited sample sizes, to compare the FRPA1 PHT group to the rest of cluster 1. Metabolites acetylcarnitine, creatine, dimethyl sulfone, glucose, glutamate, glutamine, glycerol, glycine, lactate, methanol, methionine, ornithine, pyruvate and the unknown signal at 3.284 ppm were all found to be related to these confounders. In relation to disease groups, all scenarios resulted in signatures that included confounder-related metabolites. Notably, 9/13 metabolites making up the EHT-PHT signature depicted in Table 3 were also found to be related to a confounder. cutoff. In all scores plots, the percentage of explained variance per component is depicted in the plot axes. Though PC2 scores are slightly higher for PHT samples compared to EHT, samples were strikingly different from center to center. The 95% ellipses in plot (b) were calculated based on a score plot colored according to the two main clusters according to PC1, i.e., cluster 1 (orange) and cluster 2 (blue), and were included here to highlight what seems to be the most important source of variation in these data. Plot (d) is the loadings plot of the same PCA, with NMR peaks in blue and the corresponding metabolite names in red. Only peaks with a correlation cutoff above 0.5 are shown, as these arise from metabolites that most affect sample distribution in the scores plots (a-c).  Analyses of data after we excluded confounder-related peaks yielded lower accuracy estimates than those calculated from the complete datasets (Table S5 and Table 2).
Regarding signatures, Approach B resulted in a unique set of metabolites compared to the other approaches (with 8/12 metabolites not found with any other approach in the EHT-PHT scenario, Table 3). Some signals arising from the same metabolite were high in one group while others were high in the other, which may be the result of the low statistical power of the data, as peaks from the same metabolite correlated in the overall loadings plot in Figure 2d and Figure S2e (except for the lysine peaks at 2.997 and 3.013 ppm). As a result, the resulting signatures were unreliable, though probably less affected by the confounding effects.

Correcting for Confounders: Approach C (Whole Center Exclusions)
As a final approach to illustrate the putative effects of confounders on our initial signature, we investigated disease group metabolome differences in the cluster which comprised only the centers shown in Table 5 (GYDR, GYLU, GYMU, GYWU, ITTU3, NLNI, PLWW), as these clustered closely compared to other centers (Figure 2b). We included a total of 40 patients with PHT, along with 118 patients with EHT. Of the latter, 54 had PA and 64 had PPGL. Though center selection led to lesser confounding effects compared to the Initial Approach, it also led to lower accuracies for all investigated scenarios. Tables S6 and 2 summarize the resulting accuracies obtained from this approach; accuracies of approximately 60% were found when comparing EHT to PHT, or PA-PPGL-PHT, and approximately 70% via the PA-PHT and PPGL-PHT scenarios. When predicting the excluded samples' disease groups using these models, the EHT-PHT model was 56% accurate, with a sensitivity of 26% and a specificity of 86%, the PA-PHT model was 50% accurate, with a sensitivity of 6% and a specificity of 94%, and the PPGL-PHT model was 65% accurate, with a sensitivity of 57% and a specificity of 74%. To evaluate any remaining confounding effects, we also compared center GYDR to ITTU3, resulting in higher accuracies than the analyses comparing disease groups (Table S7), indicating the possible presence of an additional source of center-related bias. Considering whole center removal as the most effective way of diminishing the influence of the possible confounders seen in Figure 2, we used the Approach C signature to evaluate the other three. According to Table 3, glutamine and glutamate were not found to be related to disease group separation by Approach C, contrasting both the Initial Approach and Approach A. Additionally, seven confounder-related metabolites, which were excluded for Approach B, were selected by Approach C as predictors. Another group of seven metabolites selected only by Approach B was not corroborated by Approach C. Similar observations made in the other scenarios comparing approaches show that Approach A did not correct for some effects at all, several metabolites were related to both confounders and disease groups, and Approach B may have led to overfitted results.

Discussion
We found that our untargeted NMR metabolomics signatures of endocrine hypertension were statistically related to confounders. The metabolomes of our samples were organized in distinct clusters, primarily defined by their center of origin. Glutamate, glutamine, lactate, ornithine and methanol appeared to have a strong relation to this centerrelated clustering, while also being a part of the EHT-PHT signature. In the targeted metabolomics study [19], both glutamate and glutamine were included in the reported biomarkers. Glutamine was reported to correlate positively to atherogenesis [29,30], and high levels of the amino acid were hypothesized to protect vasculature [31]. Lower levels of the amino acid were found in hypertensive men compared to healthy controls [15], whereas higher levels were found in hypertensive women compared to controls [32]. Glutamate was included in a panel for detecting albuminuria in hypertensive patients [33] and, along with proline, has shown a statistically different concentration distribution in hypertensive nephrosclerosis compared to controls [34]. In our previous work [35], proline and methanol were found to be higher in preoperative PPGL patients. To investigate possible confounders in our EHT-PHT signatures, we employed three correction approaches to compare results. We used ASCA [24] (Approach A) but found full correction of possibly confounding effects impossible, due to the lack of adequate representation of each center in each disease group.
We, therefore, established a signature for the center-related clustering, to formulate a hypothesis as to the reason for this source of variation and to determine its potentially confounding effect on the initially established EHT-PHT signature. We found lactate increased and glucose decreased in samples originating from cluster 2 centers compared to those from cluster 1, which was found before in plasma samples harvested from whole blood after a pre-centrifugation delay due to prolonged red blood cell glycolysis [36][37][38][39][40][41][42][43][44][45][46][47]; this would mean that cluster 2 included samples collected after such a delay, compared to the rest of the samples. The coinciding lower levels of pyruvate in cluster 2 compared to cluster 1 samples were shown before when there is a delay in cold temperatures [37,38,42,44,46] (Brunius et al. report the contrary [39]), whereas, at room temperature, pyruvate was found to increase [37,39,40,44,46] (contrary results in [42]). A delay before whole blood centrifugation could also explain the lower levels of glutamine in cluster 2 compared to cluster 1 [40,43] and higher glutamate [40,[46][47][48] (contrary results in [42]), possibly due to the conversion of glutamine to glutamate by glutaminase [48], and higher ornithine [40,42,46,47], possibly due to the activity of erythrocyte arginase [40]. Interestingly, cluster 2 s higher glycerol is not supported by the literature [40,42], in which it is reported to be lower in samples collected after a pre-centrifugation delay. It was unclear why methanol was higher in ITPD/ITPD3 samples compared to the rest, but it can be speculated that it could have originated as an impurity. The separate cluster of FRPA1 PHT samples can be explained by a delay between plasma harvesting and storage at room temperature, which would result in the observed increase in glutamate and decrease in glutamine, and in the lack of effects on glycolysis-related metabolite levels [42].
Sample age may have also influenced the resulting metabolic signature, especially given that there were significantly different sample ages amongst centers. We report results obtained from control (PHT) samples collected one year to 17 years before analysis (averaging at 6.5 years), and EHT samples ranging from less than two weeks to 9.5 years (averaging at 3 years), which show a similar sample age signature as that obtained from analyzing the metabolite differences between the two main clusters of samples. Though Pinto et al. [49] reported minimal metabolic impact of storage at −80 • C for a period up to 2.5 years, a recent study [50] reported increasing levels of glutamate and decreasing glutamine and pyruvate with storage time up to 16 years at −80 • C, as well as higher levels of ornithine, lactate and glycerol after 11 years of storage, in line with our results for these metabolites. Another study [51] found altered levels of several of these metabolites due to sample age. Specifically, increased levels of ornithine, methionine, glycine, glutamine and lower glutamate were found in samples analyzed after 5 years in −80 • C storage, of which we only found a similar direction for ornithine herein.
These findings, linking metabolites from the signatures obtained via the Initial Approach to any confounder, render our EHT-PHT signatures unreliable. In the targeted metabolomics study [19], which also used samples from the ENSAT-HT retrospective cohort, preanalytical confounders were not addressed, but GBGL2 samples were excluded.
The exclusion of confounder-related peaks (Approach B) resulted in analyses with lower statistical power, resulting also in low classification accuracies and signatures that could not be corroborated by any other approach. Metabolites such as lactate were excluded but were still found to play a role in discriminating samples by the final approach. Exclusion of whole centers (Approach C), specifically those we deemed compromised most by confounders, resulted in signatures that we used to evaluate the initially established ones. We expected this approach to yield the most reliable signature from our dataset and research questions, as it was the only one that both provided a complete view of the NMR plasma metabolome and kept confounding effects to a minimum by excluding compromised samples. Even so, we do not claim that Approach C signatures have merit in clinically discriminating PHT from any EHT form, as these analyses were restricted to certain centers and so included far fewer samples than originally planned. Approach C models predicted excluded PHT sample groups accurately, but sensitivity was low, while the highest accuracy achieved was 65% (PPGL-PHT model). Moreover, analyses comparing centers GYDR to ITTU3 were more accurate than those comparing disease groups in Approach C, which demonstrates another potentially confounding effect, albeit weaker than those already discussed. Approach C signatures differed substantially from the Initial Approach signatures, with several metabolites related to confounders not being selected as predictors or being selected with a coefficient of an opposite direction.
These differences underscore the uncertainty associated with the signatures from the Initial Approach, given the simultaneous relationship some metabolites have with both the confounders and the disease group separation. Furthermore, Approaches A and B do not seem to resolve the issue of confounding effects. Other approaches, such as multilevel analysis [52], which could be used to center the centers, would not be appropriate either, as EHT-PHT signatures varied by the center. Notably, the PA-PHT and PPGL-PHT signatures obtained from analyzing ITTU3 and GYDR data, respectively, were different from FRPA1 signatures (result not shown), as expected from the distinct cluster of FRPA1 PHT samples.
Untargeted NMR metabolomics of the ENSAT-HT retrospectively collected plasma samples described in this work was not suitable for obtaining a biomarker that discriminates EHT forms from PHT, due to the method's sensitivity to the sample set's probable confounders. Still, there were additional limitations in our work. Specifically, sample hemolysis [53] was not considered, but it was shown to not have a significant impact on the plasma metabolome with NMR [38], though the opposite was shown with MS [42]. Additionally, diet and medication at the time of sampling, as well as additional clinical parameters, such as patient BMI and disease severity, can all affect the metabolome [53] but were not considered here. There may also be additional confounders, e.g., analytical bias or patient characteristics, but their effect, if any, was less than that of the confounding effects already addressed, since the latter was the strongest source of variation within our data. Methodologically, our NMR method was limited by a relatively high detection limit, resulting in 33 identified metabolites, of which we excluded the ketone bodies acetoacetate and 3-hydroxybutyrate due to their dependence on fasting levels as well as acetone, acetate and choline due to their dependence on run order. There were five peaks to which we could not assign any known metabolites, and four metabolites (dimethylamine, dimethylglycine, dimethyl sulfone and formate) were identified only by visual inspection and comparison to databases. Our internal standard, which was used for peak scaling, was recently found to be attenuated by the presence of macromolecules [54], possibly resulting in lower peak intensities in samples with high concentrations of macromolecules. However, given that in our comparison of methods in our own paper 19/25 (76%) peaks in the final (LED) signature were also found in the ultrafiltration signature [55], we concluded that any bias, e.g., from the binding of maleic acid to protein, represented a minority of variation in a dataset, compared to the differences between groups. Permutation testing, though invaluable for assessing model accuracy, was not performed for the models presented, as these were deemed confounded and their resulting biomarkers unreliable.

Patient and Sample Characteristics
We analyzed Lithium Heparin blood plasma samples collected from 356 patients sampled at 13 ENSAT-HT centers, along with the post-operative counterparts of a subset of PPGL samples, which were used in our previous study [35]. After excluding technical outliers, 337 patients were finally included.

Untargeted 1H-NMR Metabolomics
We recorded spectra according to our NMR method as reported and previously applied [35,55]. Study samples, along with 146 Quality Control (QC) samples, were analyzed over 46 batches, whereas 99 Healthy Volunteer (HV) samples, were analyzed over 43 batches. A maximum of 15 samples per batch to limit intra-batch variability was selected, based on robust Principal Component Analysis (PCA) [56] results (R package "rospca" [57], version 1.0.4, Tom Reynkens), which showed QC samples outlying after 18 NMR experiments. NMR spectra recorded after this limit of 18 samples (n = 4, due to recording delays) were excluded. Samples were thawed and centrifuged at room temperature for the first three batches and at 4 • C for all the rest. A Bruker DRX AVANCE spectrometer equipped with a triple resonance inverse 5 mm probe head operating at 500.13 MHz was employed for analyzing samples. Both the sequence and batch number of all samples were recorded for estimating technical variability. Longitudinal Eddy-Current Delay (LED) spectra were recorded and processed as previously described [35,55]. Spectra with high line width were not used for further analysis (maleic acid peak width > 1.2 Hz, n = 4), along with spectra recorded from samples that had large chemical shift differences from other samples (n = 4) or peaks corresponding to EDTA (n = 2) or citrate (intensities corresponding to blood collection, n = 1). Four samples were retrospectively found to not correspond to any of the four disease groups (PA, PPGL, CS or PHT) and were excluded from the analysis. Areas corresponding to macromolecules, water, glucose and noise were excluded from data processing (areas above 10, 5.35 to 5.24, 5.15 to 4.40, 3.91 to 3.68, 3.54 to 3.36, 3.26 to 3.19, 1.30 to 1.10, 0.90 to 0.75, and below 0 ppm). R studio version 1.1.463 (J. J. Allaire, Boston, MA, USA) [58] running R version 3.4.4 (R core team, Vienna, Austria) [59] was used for loading the R package "batman" [60] version 1.2.1.03 (Jie Hao), which was used to extract the spectra as tables. R package "SPEAQ" [61] version 2.0.0 (Charlie Beirnaert) was used for obtaining the peak table and was applied according to the script "SPEAQ pipeline 3". Due to their high number, SPEAQ was not possible to perform on all samples simultaneously. Hence, a script for aligning different SPEAQ batches was developed ("Combining1 2 PLASMA 3 new"). These two scripts can be found on the first author's GitHub page (https://github.com/NickBliz/PPGL-PRE-VS-POST, last update on 5 June 2021) and data can be provided upon request. All subsequent data processing steps, which can be found in scripts on the first author's GitHub page (https://github.com/NickBliz/ENSAT-HT-PLASMA-NMR, last update on 28 September 2021), were performed on R version 3.6.3, loaded on R studio version 1.2.5033. The ethanol peaks at 3.65 ppm, the highly variable histidine peaks at 7.06 and 7.81 ppm, the macromolecule peak at 2.02 ppm and the peaks with a strong correlation with the run order of samples at 3.185, 2.21 and 1.904 ppm were also excluded, along with peaks at 2.26, 2.28, 2.31, 2.37, 2.39 and 4.13 ppm, corresponding to ketone bodies. These latter metabolites were not taken into account as, giving rise to outliers, would not be valuable biomarkers. HV samples could be separated based on the batch in which they were analyzed, and QC samples based on their run order within batches, but these analytical factors were not found to have a major impact on our dataset, as seen in Figure S2. As the first step in data processing, features not present in at least 80% of samples belonging to either the QC or HV group [62], were removed. Probabilistic Quotient Normalization (PQN [63]) was applied as a normalization method, using as reference the median spectrum ignoring non-detects of a set of samples that were drawn from 98 HVs from the GYDR center. HV samples obtained from center FRPA1 were not used for normalization, as these only included healthy young male subjects. Next, peaks with a coefficient of variation of more than 30% in QC samples were removed. Finally, after missing value estimation with the k-nearest neighbors (k-NN) method [64] (k = 10), using R package "impute" version 1.58.0 (Balasubramanian Narasimhan) [65], the generalized log transformation (GLOG [66]) based on 133 QC samples was applied using R package "LMGene" version 2.40.0 (Blythe Durbin-Johnson) [67]. The QC samples were prepared by pooling plasma obtained from 390 anonymized plasma samples to a total volume of about 450 mL, which was subsequently aliquoted into 1 mL batches and stored at −80 • C until analysis. As a normalization and GLOG transformation optimization step, the process was repeated after the HV and QC sample groups were relieved of outliers detected by means of robust PCA [56]. The resulting dataset was directly used for multivariate statistics. We assigned peaks to metabolites using Chenomx evaluation v. 8.4 [68] and Bruker Topspin v. 4.0.6. We used the human metabolome database [69], along with the Madison Metabolomics Consortium Database [70], as references. To aid in peak assignment, additional methods were employed. Specifically, 2D NMR, namely correlation spectroscopy (COSY, cosyprqf) and J-resolved (JRES, jresgpprqf) pulse sequences allowed for investigating correlations between peaks and peak multiplicity, respectively. These experiments, along with 1D NMR on filtered samples in pH 2.5 [71], allowed for a number of peak identities to be confirmed, and some unknowns to be assigned to metabolites. COSY spectra were recorded using 4 K data points in F2 direction, 128 in F1, 16  Receiver gain was set to the optimum value. Spectra were processed using Fourier transform in both directions, tilting and symmetrizing phase-sensitive spectra, and calibrated TSP to 0 ppm. F Furthermore, a spiking experiment was carried out, utilizing a QC sample and a non-related-to-the-study control sample, which were spiked with pyruvic acid, succinic acid, choline, carnitine, acetylcarnitine, methanol and 5-methylthioadenosine, along with histidine, serine, phenylalanine (Sigma) and tyrosine (Merck). A stock solution (#1) with concentrations of 0.31 mM of pyruvic acid, 0.15 mM succinic acid, 0.12 mM choline, 0.18 mM carnitine, 0.041 mM acetylcarnitine, 0.024 mM 5-methylthioadenosine and 0.99 mM methanol, another (#2) with 1.1 mM histidine, 1.5 mM serine, 0.60 mM phenylalanine, 0.42 mM tyrosine, along with two stock solutions (#3, #4) with half these concentrations, were prepared. A volume of 100 µL of stock #1 was added to 400 µL of a QC, another volume of 100 µL of stock #3 to another 400 µL of a QC, and a volume of 100 µL of stock #2 was added to 400 µL of the unrelated control, another volume of 100 µL of stock #4 to another 400 µL of the unrelated control. All these spiked samples underwent the same procedure as the study plasma samples to acquire as similar results as possible. Table S1 summarizes all peaks included for our analyses, along with their assigned metabolite names and the level of assignment rigor.

Data Analysis and Statistics
Based on our research aims, we aimed to separate the following disease groups:  [73], as an unsupervised method to discover trends and outliers within the data. Partial Least Squares Discriminant Analysis (PLSDA) [74] was used as a supervised method for establishing the signature for separating samples grouped by center as well as sample age (median sample age cutoff). Important PLSDA variables were determined based on their Variable Importance in the Projection (VIP) score [75] (median outer-loop VIP score above 1 after double cross-validation, or CV2) in PLSDA models. Analysis of Variance Simultaneous Component Analysis (ASCA [24]) was used for removing the interaction between PCA cluster and disease group, present within the dataset (as observed in the PCA score plot of Figure 2). Sparse PLSDA (sPLSDA [76,77]), was employed for investigating the metabolic signature (regression coefficients) of the differences between samples grouped by diagnostic category (PHT, PA, PPGL or CS, with EHT as a pool of the latter three and the first as controls). As an alternative to sPLSDA, l1-norm regularized logistic regression [78] was employed, using R package "glmnet" version 4.0-2 (Trevor Hastie). Parameters for glmnet included alpha = 1, family = binomial and the minimum lambda, which was determined by cross-validation, was used for prediction, whereas "class" was selected as a prediction measure. Metabolomics data were not scaled, whereas when including age and gender as variables data were autoscaled (unit variance). Optimal model parameters (number of latent variables in sPLSDA, number of included variables in sPLSDA and glmnet) were determined by cross-validation. Each model was assessed by double cross-validation (CV2) [79,80], by leaving out a number of samples determined by the k-fold method, with k = 8 and 7 for outer and inner loops, respectively. These values for k were switched to k = 6 and k = 5 (respectively) for approach 4 samples, where a higher class imbalance was present. In scenario CSVPHT, k = 7 and k = 6 (respectively), where a higher class imbalance was also present. A single repeat of the process was chosen for the inner loop and 50 repeats for the outer loop. Finally, the left-out sample labels were predicted based on the model and using the Mahalanobis distance. Each model's signature was determined based on the total set of samples, after single CV and the same parameters as with CV2 (e.g., 50 repeats). The maximum number of latent variables for (s)PLSDA models was 10, and for each inner cross-validation loop, the minimum number of components with the maximum accuracy was chosen, based on predictions using the Mahalanobis distance. All parameters and scripts for multivariate data analysis can be found on the first author's GitHub page (https://github.com/NickBliz/ENSAT-HT-PLASMA-NMR) and data can be provided upon request. Univariate statistical analysis methods were used to assess confounder effects on multivariate models and were performed using the "stats" R package [59], to check for data normality and to discover significant differences between variables. Tests included the ANOVA/Krusall-Wallis test, as well as the t/Wilcoxon test [81] to compare averages, at a significance level of 5%. For model accuracies, sensitivities and specificities reported in tables, one-sample t-tests were performed in R and the means and 95% confidence intervals were obtained. The Pearson/Kendall/Spearman correlation [82] estimates were used for investigating univariate correlations between various factors (technical, clinical and biological) and each model's number of misclassifications. A p-value of less than 0.05 was accepted as statistically significant.

Confounders
Several approaches were used to investigate trends found by PCA: (A) Using ASCA [24], we estimated the confounder effects and used the Simultaneous Component Analysis residuals for sPLSDA. (B) We analyzed the data with PLSDA, based on groups defined by a specific confounder, to establish an inclusive list of all potentially relevant metabolites (i.e., without searching for most predictive metabolites, but rather all related metabolites, in our attempts to prove the study "innocent" of confounding effects). These metabolites were thereafter excluded from the dataset, in the initial phase of data processing (i.e., during the peak exclusion step), and the new dataset was used for classifying samples based on their disease groups. (C) Samples most related to the strongest confounder were removed, and classification models were calculated based on the remaining samples.

Conclusions
Our data reveal obvious between-center differences. Probably the most important factor is the retrospective multicenter character of the study. Specifically, for untargeted metabolomics studies, this stresses the importance of the preanalytical conditions (vena puncture-and sampling specifics, storage conditions and storage time, medication and dietary influences, transport and handling conditions, etc.). In the present study, the most likely explanation for the large variation in samples per center is a combination of precentrifugation and pre-storage delay, as well as variations in storage time, which form a triangle of effects with the disease group variation. These factors are equally important in prospective multicenter studies. In such studies, investigators can catch such variables in protocols and can use control samples as quality control means. In retrospective studies, multicenter quality controls may be more challenging to obtain, therefore, the study design should include a harmonized protocol, which would allow for these factors to be considered. This protocol should be followed by all participating centers [83] ensuring that preanalytical confounders are kept to a minimum. Samples should only be selected for inclusion if they fulfill certain preanalytical criteria. In fact, it would be beneficial for any multicenter or retrospective study to carry out a preliminary experiment on samples collected to assess the quality of the selected or prospective cohort. In addition, quality markers could help detect samples of low quality to be discarded [40]. Finally, normalization methods could be used for adjusting any residual preanalytical effects, provided that samples are selected in such a way that ensures their applicability [24][25][26]52]. Implementation of these procedures may prove vital in the discovery of robust disease biomarkers. Our study did not result in robust EHT biomarkers, due to the lack of adequate solutions and international consensus for containing the bias caused by preanalytical factors. This need should be covered by decisions on study design requirements for future multicenter metabolomics studies, with respect to future as well as published research findings on the effects of preanalytical conditions. Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/metabo12080679/s1, Figure S1: PCA score plot of the complete dataset. This plot includes Healthy Volunteers (HV), Quality Controls (QC), as well as samples collected from all disease groups (PA, PHT, PPGL and CS). One patient was found to have adrenocortical carcinoma (ACC) and was left out of subsequent analyses. Principal components 1 and 2 were used for the plot. QC samples were technical replicates and were aliquoted from pooled plasma collected from patients not included in the present study. The spread of QC samples is indicative of technical variation associated to the data, which is significantly lower than biological. Figure S2: PCA score plots of the dataset that included all disease groups. Samples colored by (a) patient age, (b) patient sex, (c) analytical batch, (d) run order within batches. For continuous variables patient age, analytical batch and run order, groups were defined by the corresponding medians. Figure (e) is the loadings plot of the PCA score plot depicted in Figure 1. The clusters of samples can be explained by outstanding metabolites lactate (peaks at 1.3 and 4 ppm), methanol (3.346 ppm), glutamine (peaks at 2.1 and 2.4 ppm) and glutamate (peaks at 2.325 and 2.05 ppm). Table S1: Detectable metabolites, based on their corresponding peaks, PubMed CID, level of identification rigor (according to MSI guidelines (Sumner et al., 2007)) and metabolites that also correspond to at least one peak. L2* identification rigor resulted from visual inspection along with 2D NMR and experiments on filtered plasma at pH 2.5. Table S2: Summary of accuracies from analyses done via the Initial Approach, without correcting for confounders. Table S3: Summary of accuracies from analyses done via Approach A (ASCA correction). Table S4: Summary of PLSDA analyses done to compare groups of samples defined by confounders. Table S5: Summary of accuracies from analyses done via Approach B (after the exclusion of peaks related to confounders). Table S6: Summary of accuracies from analyses done via approach C (whole center exclusions). Table S7: Summary of accuracies from analyses done to evaluate datasets used in approach C (whole center exclusions). Reference [84]