Screening for inborn errors of metabolism using untargeted metabolomics and out-of-batch controls

Motivation Untargeted metabolomics is an emerging technology in the laboratory diagnosis of inborn errors of metabolism (IEM). In order to judge if metabolite levels are abnormal, analysis of a large number of reference samples is crucial to correct for variations in metabolite concentrations resulting from factors such as diet, age and gender. However, a large number of controls requires the use of out-of-batch controls, which is hampered by the semi-quantitative nature of untargeted metabolomics data, i.e. technical variations between batches. Methods to merge and accurately normalize data from multiple batches are urgently needed. Methods & results Based on six metrics, we compared existing normalization methods on their ability to reduce batch effects from eight independently processed batches. Many of those showed marginal performances, which motivated us to develop Metchalizer, a normalization method which uses 17 stable isotope-labeled internal standards and a mixed effect model. In addition, we propose a regression model with age- and sex as covariates fitted on control samples obtained from all eight batches. Metchalizer applied on log-transformed data showed the most promising performance on batch effect removal as well as in the detection of 178 known biomarkers across 45 IEM patient samples and performed at least similar to an approach using 15 within-batch controls. Furthermore, our regression model indicates that 10-24% of the considered features showed significant age-dependent variations. Conclusions Our comprehensive comparison of normalization methods showed that our Log-Metchalizer approach enables the use out-of-batch controls to establish clinically-relevant reference values for metabolite concentrations. These findings opens possibilities to use large scale out-of-batch control samples in a clinical setting, increasing throughput and detection accuracy. Availability Metchalizer is available at https://github.com/mbongaerts/Metchalizer/

metabolomics can also reveal new biomarkers or increase our understanding of disease mechanism when 53 exploited in epidemiological studies (Glinton, et al., 2019). 54

55
In traditional targeted diagnostic laboratory tests hundreds of reference samples are required to establish 56 robust reference intervals. When using untargeted metabolomics the establishment of reference values is 57 complicated due to the semi-quantitative nature of the data owing to several sources of variation like 58 injection volume, retention time, temperature, or ionization efficiency in the mass spectrometer that cannot 59 easily be amended. Moreover, these variations are even larger between different measurement runs in which 60 a batch of samples is being measured simultaneously, hampering the resemblance between different 61 batches. As a result, within-batch variation is smaller than between-batch variation. Therefore, to conquer 62 these batch effects, current approaches include reference samples in each single batch of measurements 63 Clearly, this reduces the throughput efficiency of IEM screening as the number of patient samples that can 68 be included in a batch is considerably lower when the reference samples need to be measured as well. But, 69 more importantly, the number of reference samples in one batch might fall short in the establishment of 70 adequate reference ranges as variations in certain metabolites are not captured well enough in the relatively 71 small reference panel. For example, factors like age, sex and BMI can affect abundancies of metabolites, 72 and, to establish reliable reference ranges, one thus needs to correct for these factors by using a large number 73 of reference samples (Chaleckis, et al., 2016) (Rist, et al., 2017) (Yu, et al., 2012). Consequently, for reliable 74 untargeted metabolomics in clinical testing, a large set of reference samples is needed, while for throughput 75 efficiency a small set is preferred. Altogether, this calls for an approach that can establish reference values 76 based on reference samples being measured in several batches (out-of-batch controls). 77 78 When relying on reference samples from different batches, one needs to correct for the batch effects to 79 obtain reliable estimates for the reference ranges. This is generally solved by normalization methods and 80 some have already been proposed within the context of untargeted metabolomics and mass spectrometry 81 (Veselkov, et al., 2011) (Li, et al., 2017) (Välikangas, et al., 2016). Only a few groups have used out-of-82 batch controls to determine the reference values and used relatively simple normalization techniques like 83 median scaling (Miller, et al., 2015), using a reference internal standard per metabolite (Körver-Keularts, 84 et al., 2018) or using anchor samples (Glinton, et al., 2019). However, there has not been an extensive 85 exploration of normalization techniques within the context of diagnostic testing for IEM's. 86

87
We explore several known normalization methods on their ability to remove batch effects and to detect 88 biomarkers from patients with known IEM. Furthermore, we introduce a new normalization method, which 89 we called Metchalizer, which uses internal standards and a mixed effect model to remove batch effects. As 90 this allows for a large set of (out-of-batch) reference samples, we also explore a regression model that uses 91 age and sex as covariates to correct for potential age and sex effects on the reference values. Using the 92 regression model combined with the Metchalizer normalization, we achieve similar performances in 93 biomarker detection compared to the use of within-batch controls. Hence, this opens the possibility to 94 increase the throughput of untargeted metabolomics in IEM screening as well as including more complex 95 confounder strategies. 96 97 98 99 1) When features were annotated in reference batch and the batch being merged, these features were 131 pooled to the merged dataset. 132 2) When MS/MS spectra were present for a potential matching pair of features, the cosine similarity 133 metric was calculated and had to be > 0.8. 134 3) Retention time difference in percentage was calculated between potential matches, and had to be < 135 2.5%. 136 4) Progenesis QI determined per feature an isotope distribution and we required sufficient overlap of 137 these distributions between potential matching pairs. This was determined by calculating a difference 138 in percentage between each bin of this distribution. The maximum difference of these bins had to be < 139 50%. 140 5) As we expect matching features to have similar within-batch median abundancies (despite of batch 141 effects), we calculated the differences between these medians in percentages, which had to be < 300%. 142 6) Neutral masses were known for the matching pair but not the MS/MS spectra, the ppm-error had to be 143 < 1. 144 7) m/z-values were known for the matching pair but not the MS/MS spectra and neutral masses, the ppm-145 error of between the m/z-values had to be < 1. arginine (+/-). Amino acids were determined by ion-exchange chromatography according to protocols 156 described by the manufacturer (Biochrom). Free carnitine and acylcarnitines analysis was performed as 157 described by Vreken et al. (Vreken, et al., 2002).

Initial transformations 164
Prior to normalization raw abundancies were for some methods transformed using a log-transform or Box-165 Cox transformation. The latter was given by: The internal standard, m, that best correlates with a feature j is being used to normalize the abundances of 202 feature j. The correlation is measured within each batch using the spearman correlation between feature j 203 and each internal standard individually across all samples and subsequently averaged across all eight 204 batches. The internal standard which (positively) correlated the best was used for normalization according:

231
Six metrics were used to evaluate the performance of normalization methods. 232 WTRj score: The WTR score (Within variance Total variance Ratio) calculates the ratio between the 233 'overall' within-batch variance and the total variance from the QC samples: 234 where is the variance of all eight batch averages for metabolite j in the QC samples, and 236 the 'overall' variance based on all QC samples. The WTR score is between 0 and 1. As we would like batch 237 averages to be similar for the QC samples (resulting in approaching zero), we are interested in 238 WTR scores close to one. Reference values for metabolites were determined by using a Z-score methodology: a set of reference values 276 was Z-transformed (corrected for mean and divided by the standard deviation) which was then assumed to 277 be normally distributed. Aberrations can then be called by considering significant Z-scores using a chosen 278 cutoff level. We use four different methods to determine the Z-scores. where is the predicted (normalized) abundancy of feature j for sample I, is an intercept. , 298 (interaction) and indicate slopes. P is the degree of the polynomial used for regression on age 299 and set to P=3 in this study.
is 1 for women and 0 for men. is the estimated error. The latter 300 expression is the model in vector notation with .

P-values from Welch's t-test 344
As an alternative to using the (average) Z-scores we also considered the p-values obtained from the Welch's 345 t-test to be informative, as it indicates whether the mean of triplicates differs significantly from the 346 population average. Note that the triplicate was expected to have only technical variance whereas the 347 reference population has variance consisting of technical-plus biological variance. For every Z-score 348 method (15in, 15out, All controls, Regression) these p-values were obtained per feature (and patient).

Comparing normalization methods 406
We investigated the performance of several normalization methods on batch effect removal by evaluating 407 multiple metrics based on quantitative measurements, the Quality Control (QC) samples and PCA analysis 408 were expected to separate from the human plasma samples (squares vs circles in Figure 2) in the first four 420 Principle Components (PC) due to overall abundancy differences for several metabolites. Normalization 421 should maintain this separation which was measured by the QC prediction score ( Figure 3B). Log-CRMN 422 conserved QC/plasma separation, with a median QC prediction score of 1.00 (1.00) for positive (negative) 423 ion mode, but was less able to reduce batch effects since it had a median batch prediction score of 0.76 424 Reduced between-batch variation in QC samples: Next, we compared the within-batch variance of the QC 445 samples with respect to the total variance which is expressed by the WTR score (Methods) for each 446 normalization method. WTR scores close to 1 indicate the absence of batch effects. None-Raw and Log-447 Raw had low WTR scores and after normalizing these scores increased ( Figure 2E). BC-Metchalizer and 448 Log-Metchalizer scored among the highest on this WTR score. None-Anchor had high WTR scores, but 449 since None-Anchor uses the QC samples for normalization the WTR scores are biased towards higher 450

values. 451 452
Preserved feature ranks in QC samples: Removal of variation results in higher WTR scores but potentially 453 removes also variation(s) of interest. Therefore, we investigated whether the ranks of the abundancies of 454 the different features in the QC samples remained the same as in the raw data (expressed as the QC rank 455 differences, , see Methods for details). A lower rank difference indicates that metabolic differences 456 present in the QC samples were conserved after normalization. Figure 2F shows the QC rank differences 457 for each normalization method. These results confirm the previous believe that Log-NOMIS and Log-RUV 458 also removed non-batch related variations (higher ), since they had relatively high 's.

BC-459
Metchalizer and Log-Metchalizer showed rank differences but were lower than most other competing 460 methods. None-Anchor showed high QC rank differences, but this is again the result of the fact that None-461 Anchor uses the QC samples for normalization.

471
The squares indicate QC samples whereas the circles indicate patients and controls samples.

Confounder effects of age and sex 483
To explore confounding effects of age and sex on metabolite abundancies, we developed a regression model 484 with sex as linear covariates and age as a polynomial (p=1,2,3) covariate (see Methods). After 485 normalization, we fitted the model parameters for every feature using all control samples present in the 486 eight batches and determined the significance of the coefficients in the regression model (see Methods). 487 Table 1 shows the percentages of (strong) significant coefficients ( ) per ion mode and (selected) 488 normalization methods. Our findings suggest that 6-24% of all features showed age dependency when 489 looking at coefficient (i.e. the linear term in the model). It is noteworthy that more age-related features 490 were found in the negative ion mode. 491 492 Age-dependent metabolites (supplement S3 Table S3), when using normalization by BC-Metchalizer, 493 include known IEM biomarkers, such as: guanidinoacetic acid(+), homoarginine(-) and N-acetyltyrosine(-494 ) , 2-ketoglutaric acid(-), citrulline(-) and ornithine(-). As an example, we plotted the regression model for 495 guanidinoacetic acid (Figure 3), illustrating that the Z-score for a fixed abundancy depends on age (and 496 slightly on sex at later ages). This also shows a non-linear trend with age. Our analyses showed that more 497 metabolites have significant non-linear trends over age ( and in Table 1). Moreover, age dependent 498 features have the tendency to increase/decrease in abundancy faster for decreasing age, implying that a 499 matching reference population on younger ages seems to be more important (supplement S3 Figure 5). 500

516
The first standard deviation is indicated by the thin(ner) line whereas the second standard deviation ends at the shaded region.

Detection of the expected IEM biomarkers 519
Next we investigated the impact of normalization and using out-of-batch controls on expected biomarker 520 Operator Curve (AUC). We decided to take the method that uses 15 within-batch controls and raw 529 abundancies (15in&None-Raw) as the reference approach, where the performance was expressed as a 530 percentage of this reference AUC, named . Here indicates if the AUC was created from 531 the average Z-scores or p-values. These p-values were obtained from the Welch's t-test which tests whether 532 the average Z-score of an expected biomarker or feature across the triplicate significantly differs from the 533 average Z-score of the reference population (Methods). 534 535 Log-transform improves biomarker detection for p-values: Our first observation is that, when considering 536 the Z-scores, the log-transformed raw abundancies (15in&Log-Raw) has an AUC approximately equal to 537 (Figure 4), implying that this transformation hardly affected this performance metric. 538 However, when using the p-values, the log-transformation improved the detection of the expected 539 biomarkers, as is 8% higher than the (Figure 4). 540 541 Reduced performance with age/sex matched out-of-batch controls: When comparing the performance of 542 using 15 out-of-batch controls (15out&Raw) to the 15in&Raw reference, the performance for 15out was 543 clearly reduced (Figure 4 A), achieving only 80% of the reference . This difference was 544 also present when looking at the p-values, resulting in a clear reduction of the (74%). 545 Hence, potential improved age/sex matching for 15out, due to the increased number of available controls 546 (supplement S4 Figure 6), did not result in improved performance, most likely due to the dominance of 547 batch effects.  and, sometimes, sex-dependent reference ranges are established using hundreds of reference samples. 584 Untargeted metabolomics is a promising alternative enabling the determination of many metabolites in one 585 analysis. This can speed up the diagnostic process and extends the number of different IEMs that can be 586 screened in a single run. A major drawback of current approaches is that reference samples need to be 587 included in the same experimental batch to ensure proper reference ranges (or Z-score transformations). 588 Some methods do exist that use reference samples measured in different batches (out-of-batch controls) to 589 determine age and sex corrected Z-scores, and they are based on normalizing methods that remove the batch 590 effects. There has not been a comprehensive comparison of the different normalization methods with 591 approaches that use out-of-batch controls, which we have set out in this work. Moreover, we developed a 592 new normalization method, Metchalizer, that makes use of isotope-stable internal standards, an approach 593 that has been shown to be useful when mapping specific metabolites to specific internal standards (Körver-594 Keularts, et al., 2018) which we generalize to all features measured. Because more reference samples are 595 available when using the out-of-batch controls, we additionally propose a regression model that 596 incorporates sex and age effects as (non-linear) covariates. Alltogether, we have shown that our 597 methodology has biomarker detection performances at least similar to using 15 within-batch controls. 598 599 Typically, around 20,000 features in both negative and positive mode were detected per batch. When we 600 require a feature to have been measured (and matched) in all eight batches, we retained 598 positive and 601 773 negative ionized features, respectively. As some normalization methods use a statistical approach 602 (PQN, Fast Cyclic Loess), the reduction in features might explain the reduced performance of these 603 methods. In addition, the requirement of features being measured (and matched) across all eight batches 604 also resulted in the loss of some biomarkers, which hampered the performance of all out-of-batch methods 605 with respect to the within-batch methods. As an alternative, we could have made the inclusion of features 606 dependent on fewer batches (for example being present in >5 out of 8 batches). We decided not to do that 607 as this would have resulted in an unequal number of control samples for the different features, leading to 608 inconsistent results between the out-of-batch methods. The availability of more batches could have solved 609 this issue because an equal number of control samples could likely be obtained per feature, even when these 610 features were not present/matched in some batches. It is interesting to note that our proposed normalization 611 method (Metchalizer) showed consistent performances when data from various number of batches is being 612 used (supplement Figure S7). Some biomarkers, for example isobutyrylglycine, were only detected in the 613 batches having patients with elevated levels of these specific metabolites. We anticipate that for this kind 614 of biomarkers out-of-batch strategies are not useful since abundancies in (healthy) controls are (very) low, 615 thereby making Z-score calculation unsuitable. 616

617
Anchor uses an anchor (fixed) samples, measured in all batches, to normalize the features. Anchor 618 normalization on none-transformed data performed well when compared to most of the other normalization 619 methods explored, but slightly less than BC-Metchalizer and Log-Metchalizer when considering the 620 performance metrics Spearman score, R 2 score, batch prediction score and performance on biomarker 621 detection. We anticipate that the anchor samples may not correlate with all types of variation like, for 622 example, injection volume which is a source of variation at the sample level, whereas the abundancy of the 623 Metchalizer assumes that this relationship holds across batches and with that assumption determines (batch) 634 intercepts that correct for the 'unexplained' batch/technical variations. Consequently, when such linear 635 relationship between internal standards and features does not exist, the normalization would be fully based 636 on the (batch) intercepts, deteriorating the power of this approach. Alternatively, when batch differences 637 (represented by the intercepts) differ from each other due to biological variations between batches, this will 638 be interpreted as 'unexplained' batch/technical variations, and consequently wrongly removed by 639 Metchalizer. For this reason, it is important to use randomized control samples in every batch (in terms of 640 age, sex etc) to minimize the possibility of biological variations between batches. 641 642 Log-Metchalizer log transforms the abundancies before applying Metchalizer, whereas BC-Metchalizer 643 uses a less strong Box-Cox transformation. The effect of this stronger transformation on most investigated 644 metrics in this study was small, although we did observe that a stronger initial transformation led to 645 improved biomarker detection performances when considering the p-values. 15in&None-Raw had a lower 646 than 15in&Log-Raw and could therefore also explain the improved performance of Log-Metchalizer 647