# Bucket Fuser: Statistical Signal Extraction for 1D ^{1}H NMR Metabolomic Data


## Abstract


^{1}H NMR experiments offer good sensitivity at reasonable measurement times. Their subsequent data analysis requires sophisticated data preprocessing steps, including the extraction of NMR features corresponding to specific metabolites. We developed a novel 1D NMR feature extraction procedure, called Bucket Fuser (BF), which is based on a regularized regression framework with fused group LASSO terms. The performance of the BF procedure was demonstrated using three independent NMR datasets and was benchmarked against existing state-of-the-art NMR feature extraction methods. BF dynamically constructs NMR metabolite features, the widths of which can be adjusted via a regularization parameter. BF consistently improved metabolite signal extraction, as demonstrated by our correlation analyses with absolutely quantified metabolites. It also yielded a higher proportion of statistically significant metabolite features in our differential metabolite analyses. The BF algorithm is computationally efficient and can deal with small sample sizes. In summary, the Bucket Fuser algorithm, which is available as supplementary Python code, facilitates the fast and dynamic extraction of 1D NMR signals for the improved detection of metabolic biomarkers.

## 1. Introduction

^{1}H NMR experiments, which simultaneously detect all proton-containing metabolites present at sufficient concentrations in a sample, offer good sensitivity at reasonable measurement times. Numerous studies have already demonstrated the ability of 1D ^{1}H NMR to reveal novel biomarkers, e.g., in the context of kidney [1,2,3,4] and heart diseases [5,6], as well as all-cause mortality [7,8].

^{1}H NMR spectra requires sophisticated data preprocessing strategies. Prior to any statistical evaluation, NMR signals that correspond to specific metabolites need to be extracted from the spectra. Each extracted NMR signal or feature should ideally represent the same metabolite across the complete sample cohort. This requirement, which is of paramount importance for subsequent statistical and bioinformatic data analyses, is challenged by the fact that NMR signal positions can vary across specimens due to differences in sample pH, ionic strength, and measurement temperature, as well as metabolite–protein interactions. Potentially the most popular NMR feature extraction method that is used to compensate for NMR signal shifts across sets of spectra is equidistant bucketing or binning. Each spectrum is split into buckets/bins of equal size and the signals within each bucket are summed or integrated. This method is able to substantially reduce the high dimensionality of 1D ^{1}H NMR spectral data and can thus lower both the burden of multiple tests and the problem of overfitting in the subsequent statistical hypothesis testing and machine learning data analysis. However, this method is not able to resolve strongly overlapping NMR signals in crowded regions, which are typically present in the 1D ^{1}H NMR spectra of complex biofluids, such as urine or plasma. Over recent years, several more sophisticated methods for 1D ^{1}H NMR metabolic feature extraction have been proposed, including Gaussian binning [9], adaptive binning [10], adaptive intelligent binning [11], and dynamic adaptive binning [12], as well as the statistical recoupling of variables (SRV) [13] and the pJRES binning algorithm (JBA) [14]. Especially the latter two approaches, which perform clustering of adjacent spectral regions based on covariance to correlation ratios, are computationally feasible even in the case of large metabolomics data sets [14].
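The equidistant bucketing described above can be illustrated with a minimal sketch. It assumes that each spectrum is given as two equally long lists (chemical shifts in ppm and intensities) and that all spectra share the same bucket edges; the function name `equidistant_binning` and its parameters are hypothetical and are not taken from any of the cited tools.

```python
def equidistant_binning(ppm, intensities, start, stop, bin_width):
    """Sum the intensities of one spectrum into equal-width ppm buckets.

    ppm, intensities: equally long lists describing one spectrum.
    start, stop, bin_width: shared bucket grid (in ppm) for all spectra.
    Returns a list with one summed intensity per bucket.
    """
    n_bins = round((stop - start) / bin_width)
    buckets = [0.0] * n_bins
    for shift, value in zip(ppm, intensities):
        # Index of the bucket that contains this data point.
        idx = int((shift - start) / bin_width)
        # Ignore points that fall outside the bucket grid.
        if shift >= start and 0 <= idx < n_bins:
            buckets[idx] += value
    return buckets


# Two data points fall into each of the first two buckets, one into the third.
example = equidistant_binning(
    ppm=[1.000, 1.004, 1.012, 1.018, 1.025],
    intensities=[1, 2, 3, 4, 5],
    start=1.0, stop=1.03, bin_width=0.01,
)
```

Because the bucket edges are fixed in advance, a signal that shifts across an edge between two specimens is split differently in each spectrum, which is the limitation that the adaptive methods aim to address.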

^{1}H NMR feature extraction procedure that is based on a regularized regression framework, which uses fused group Least Absolute Shrinkage and Selection Operator (LASSO) terms. BF dynamically constructs NMR features, which predominantly comprise the same NMR signals across one dataset. We demonstrated its performance using three different NMR datasets and benchmarked it against existing state-of-the-art NMR feature extraction methods, including equidistant binning, SRV, and JBA. Extensive performance evaluations were carried out via hypothesis testing and correlation analyses using absolutely quantified metabolite concentrations, including a thorough investigation of sample size dependence. The BF algorithm is freely available as a Python implementation.

## 2. Methods

^{1}H NMR spectra due to baseline distortions, we replaced the corresponding numbers by their absolute values. This procedure is not unique, but it only affects regions that do not contain clear metabolite signals. We modeled the data matrix Y using a penalized linear regression model:
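For orientation, a generic fused group LASSO objective of the kind studied for multiple change-point detection [15] can be evaluated as in the sketch below. This is an illustration under stated assumptions rather than the BF objective itself: the squared-error loss, the unweighted penalty, and the names `take_absolute_values`, `group_fused_lasso_objective`, `Y`, `B`, and `lam` are all hypothetical, and the exact model used by BF may differ in its loss, scaling, and data transformation.

```python
import math


def take_absolute_values(Y):
    """Replace negative intensities (e.g., from baseline distortions) by their absolute values."""
    return [[abs(v) for v in row] for row in Y]


def group_fused_lasso_objective(Y, B, lam):
    """Evaluate a generic fused group LASSO objective (assumed form).

    Y: data matrix, one spectrum per row and one bin per column.
    B: candidate piecewise-constant fit with the same shape as Y.
    lam: regularization parameter; larger values favor fewer jumps.
    """
    n_bins = len(Y[0])
    # Squared-error loss between the data and the fit.
    loss = 0.5 * sum(
        (y - b) ** 2 for y_row, b_row in zip(Y, B) for y, b in zip(y_row, b_row)
    )
    # One Euclidean norm per pair of adjacent bins, taken across all spectra.
    # Grouping the spectra this way encourages jumps at shared positions.
    penalty = sum(
        math.sqrt(sum((row[j + 1] - row[j]) ** 2 for row in B))
        for j in range(n_bins - 1)
    )
    return loss + lam * penalty


Y = take_absolute_values([[1, 1, -3], [2, 2, 6]])
perfect_fit = group_fused_lasso_objective(Y, Y, lam=1.0)
flat_fit = group_fused_lasso_objective(Y, [[1, 1, 1], [2, 2, 2]], lam=1.0)
```

In this toy case, the fit that reproduces the data pays only for its single shared jump, whereas the completely flat fit pays no penalty but a larger loss.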

#### 2.1. Algorithm

#### 2.2. Metabolomic Data Acquisition and Processing

#### 2.2.1. Datasets

^{1}H NMR metabolic datasets, all of which were acquired using a 600 MHz Bruker Avance III spectrometer (Bruker BioSpin GmbH, Rheinstetten, Germany) that was equipped with a cryogenic probe head and an automatic cooled sample changer. The first dataset consisted of 1D ^{1}H NMR spectra from 106 urine specimens that were collected from patients 24 h after cardiac surgery with cardiopulmonary bypass (CPB) use [1]. Of these 106 patients, 34 were diagnosed with postoperative acute kidney injury (AKI). 400 $\mathsf{\mu}$L of each urine specimen were mixed with 200 $\mathsf{\mu}$L of 0.1 mol/L phosphate buffer, pH 7.4, and 50 $\mathsf{\mu}$L of 29.02 mmol/L 3-trimethylsilyl-2,2,3,3-tetradeuteropropionate (TSP) dissolved in deuterium oxide as internal standard (Sigma-Aldrich, Taufkirchen, Germany), and the 1D ^{1}H NMR spectra were acquired using a 1D nuclear Overhauser enhancement spectroscopy (NOESY) pulse sequence with solvent signal suppression by presaturation during relaxation and mixing time [20]. The NMR spectra are available via the publicly accessible MetaboLights database at https://www.ebi.ac.uk/metabolights/ (accessed on 21 October 2012; accession ID: MTBLS24).

^{1}H NMR spectra from 85 EDTA plasma specimens that were collected from patients 24 h after cardiac surgery with CPB use, who were a subcohort of the 106 patients mentioned above [2]. Out of these 85 patients, 33 were diagnosed with postoperative AKI. Each EDTA plasma specimen was subjected to 10 kDa cut-off filtration to remove macromolecules; subsequent sample preparation and NMR spectral data acquisition were performed as described above.

^{1}H NMR spectra from 223 EDTA plasma specimens that were collected at the baseline time point of the German Chronic Kidney Disease (GCKD) study [3,21]. For this dataset, 400 $\mathsf{\mu}$L of each unfiltered EDTA plasma specimen were mixed with 200 $\mathsf{\mu}$L of 0.1 mol/L phosphate buffer, pH 7.4, 50 $\mathsf{\mu}$L of 0.75% (w/v) TSP that was dissolved in deuterium oxide, and 10 $\mathsf{\mu}$L of 81.97 mmol/L formic acid (Sigma-Aldrich, Taufkirchen, Germany), which served as the internal standard for referencing and quantification. The 1D ^{1}H NMR spectra were acquired using a Carr–Purcell–Meiboom–Gill (CPMG) pulse sequence to suppress unspecific macromolecular signals. The absolute concentrations of 25 unique metabolites were quantified from these 1D ^{1}H NMR spectra, according to the method in [22] and using the Chenomx software suite (Chenomx Inc., Edmonton, AB, Canada). The NMR signals were identified through comparison to reference spectra from pure compounds, which were available from the Chenomx software suite. The NMR spectra are available from the MetaboLights database at https://www.ebi.ac.uk/metabolights/ (accessed on 26 June 2019; accession ID: MTBLS798).

#### 2.2.2. Feature Extraction

^{1}H NMR dataset was split into 9999 even bins. The spectral regions from 6.5 to 4.5 ppm, which corresponded to the remaining water and broad urea signals, and the TSP region from 0.5 to −0.5 ppm were excluded, which resulted in a total of 7000 bins. A probabilistic quotient normalization (PQN) [24] was applied to reduce sample-to-sample variations that were caused by differences in fluid intake.
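The core of a probabilistic quotient normalization can be sketched as follows. This is a simplified illustration of the general idea in [24], not the exact processing used here: it assumes that the bin-wise median spectrum serves as the reference, it omits the initial integral normalization that is often applied first, and the function name `pqn_normalize` is hypothetical.

```python
from statistics import median


def pqn_normalize(spectra):
    """Simplified probabilistic quotient normalization.

    spectra: list of spectra, each a list of non-negative bin intensities
    over the same bins. Returns the rescaled spectra.
    """
    n_bins = len(spectra[0])
    # Reference spectrum: bin-wise median across all samples (an assumption).
    reference = [median(s[j] for s in spectra) for j in range(n_bins)]
    normalized = []
    for s in spectra:
        # Quotients against the reference; bins with a zero reference are skipped.
        quotients = [s[j] / reference[j] for j in range(n_bins) if reference[j] > 0]
        # The median quotient estimates the most probable dilution factor.
        dilution = median(quotients)
        normalized.append([v / dilution for v in s])
    return normalized


# Three copies of the same profile at different dilutions collapse onto one.
diluted = [[1, 2, 4], [2, 4, 8], [4, 8, 16]]
rescaled = pqn_normalize(diluted)
```

Using the median quotient rather than the total intensity makes the estimated dilution factor robust against a few bins that change strongly between specimens.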

^{1}H NMR dataset, the spectral region from 9.5 to −0.5 ppm was split into 9999 even bins (the raw bucket table has been published in [25]). The spectral intensities were normalized to the internal standard TSP to correct for variations in spectrometer performance [2]. The spectral region from 6.2 to 4.6 ppm, which corresponded to the remaining water and broad urea signals, and the TSP region from 0.5 to −0.5 ppm were excluded, which resulted in a total of 7400 bins. After the application of the different binning methods, the regions of 3.82–3.76 ppm, 3.68–3.52 ppm, 3.23–3.20 ppm, and 0.75–0.72 ppm, which corresponded to filter residues and free EDTA, were excluded prior to further statistical analysis.

^{1}H NMR spectra were referenced and normalized to the internal standard formic acid to correct for variations in spectrometer performance [3,21]. The spectral region from 9.5 to 0.5 ppm was evenly split into 9000 bins and the spectral region from 6.0 to 4.5 ppm, which corresponded to the remaining water and broad urea signals, was excluded, which resulted in a total of 7499 bins.

## 3. Results

#### 3.1. Bucket Fuser Dynamically Constructs NMR Metabolite Features

- (1) BF fits plateaus, as shown by the thick blue and red lines;
- (2) The plateaus start and end at the same position for all spectra;
- (3) The regularization parameter $\lambda $ calibrates the plateau width: $\lambda =5$ yields larger plateaus than $\lambda =2.5$ and $\lambda =2.5$ yields larger plateaus than $\lambda =1$.
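The second observation, that plateaus start and end at the same positions for all spectra, suggests a simple way to read consensus plateaus off a set of fitted spectra. The sketch below is an illustration only and is not taken from the BF implementation: it assumes the fits are given as rows over the same bins, and the function name `consensus_plateaus` and the parameters `tol` and `min_bins` are hypothetical.

```python
def consensus_plateaus(fits, tol=1e-9, min_bins=2):
    """Find runs of adjacent bins on which every fitted spectrum is constant.

    fits: list of fitted spectra (rows) over the same bins.
    tol: differences up to this size are treated as "no jump".
    min_bins: shortest run that is reported as a plateau.
    Returns (start, end) bin index pairs with an exclusive end.
    """
    n_bins = len(fits[0])
    plateaus = []
    start = 0
    for j in range(1, n_bins + 1):
        # A run ends at the last bin or where any spectrum's fit changes value.
        jump = j == n_bins or any(abs(row[j] - row[j - 1]) > tol for row in fits)
        if jump:
            if j - start >= min_bins:
                plateaus.append((start, j))
            start = j
    return plateaus


fits = [
    [1, 1, 1, 2, 3, 3],
    [5, 5, 5, 5, 7, 7],
]
regions = consensus_plateaus(fits)
```

In this toy example, bin 3 is not part of a plateau because the first fit jumps both before and after it, even though the second fit stays flat there.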

#### 3.2. Bucket Fuser Improved Signal Extraction

#### 3.3. Metabolite Identification

#### 3.4. The Bucket Fuser Can Deal with Small Sample Sizes

#### 3.5. The Bucket Fuser Improved the Detection of Metabolic Biomarkers for Acute Kidney Injury after Cardiac Surgery

## 4. Discussion and Conclusions

^{1}H NMR metabolic data. We compared the developed method to other state-of-the-art approaches and demonstrated its superior performance using absolutely quantified metabolite concentrations. Moreover, we studied two realistic applications in which metabolite concentrations in urine and blood from individuals who suffered an AKI event after cardiac surgery were compared to the metabolite concentrations in samples from patients who did not develop AKI. From this comparison, the p-value distributions also indicated the superior performance of BF compared to the other state-of-the-art methods. In this context, it was interesting to observe that the p-value distributions of the plateau regions showed the strongest peaks. The peaks were less pronounced for non-plateau regions. This finding was not surprising since BF predominantly built consensus plateau regions around peaks that occurred at the same position across multiple spectra. Thus, a plateau that was determined by BF also provided additional evidence of consistent signals within this region across multiple spectra. BF depends on a single parameter, i.e., the regularization parameter $\lambda $. Similarly, the equidistant binning method depends on the bin size, and JBA and SRV depend on correlation thresholds; in both cases, these are chosen by the user. To improve the usability of BF, the regularization parameter was defined so that it yielded stable plateau widths for different sample sizes. Consequently, the regularization parameter might not need to be recalibrated for new datasets. In practice, we observed that values on the order of 1 performed reasonably well for an initial bin width of $0.001$ ppm.

^{1}H NMR spectroscopic data as there are numerous data types that require segmentation steps for analyses (e.g., array CGH data). BF could be straightforwardly applied to such data and offers the additional benefit that data segmentation would be chosen consistently across all measurements, which could be particularly valuable for machine learning applications. Moreover, BF could be modified to account for 2D data, which could be a promising future direction of research, e.g., for the analysis of 2D NMR spectroscopic data.

## Supplementary Materials

^{1}H NMR spectra of the (a) urinary and (b) plasma AKI datasets using the different binning approaches, Section S4: Supplementary Figures, Figure S1: The exemplary NMR spectral regions that ranged from 4 ppm to 3.5 ppm, together with their corresponding BF fits for $\lambda =1$, Figure S2: The average plateau widths versus the number of training samples for $\lambda =1$, $\lambda =2.5$, and $\lambda =5$ using the GCKD dataset, Figure S3: The comparison of the integrals of the spectral features that were constructed by Bucket Fuser using $\lambda =1$ and the absolutely quantified metabolite concentrations for the 25 metabolites, Figure S4: The comparison of the integrals of the spectral features that were constructed by Bucket Fuser using $\lambda =2.5$ and the absolutely quantified metabolite concentrations for the 25 metabolites, Figure S5: The comparison of the integrals of the spectral features that were constructed by Bucket Fuser using $\lambda =5$ and the absolutely quantified metabolite concentrations for the 25 metabolites, Figure S6: The comparison of the integrals of the spectral features that were constructed by JBA and the absolutely quantified metabolite concentrations for the 25 metabolites, Figure S7: The comparison of the integrals of the spectral features that were constructed by SRV and the absolutely quantified metabolite concentrations for the 25 metabolites, Figure S8: The comparison of the integrals of the spectral features that were constructed by the equidistant binning method (bin size = $0.01$ ppm) and the absolutely quantified metabolite concentrations for the 25 metabolites, Figure S9: The comparison of the integrals of the spectral features that were constructed by the equidistant binning method (bin size = $0.02$ ppm) and the absolutely quantified metabolite concentrations for the 25 metabolites, Figure S10: The exemplary NMR spectral region that ranged from 3.29 to 3.25 ppm, comprising the metabolite signals of D-glucose, betaine, 
trimethylamine-N-oxide, and myo-inositol, Figure S11: The Spearman’s correlations between the absolutely quantified metabolite concentrations and the integrals of the corresponding spectral features that were constructed by the different binning approaches, according to the number of training samples n for the 25 absolutely quantified metabolites from the GCKD dataset, Figure S12: The Spearman’s correlations between the absolutely quantified metabolite concentrations and the integrals of the corresponding spectral features that were constructed by the different binning approaches, according to the number of training samples n for the 25 absolutely quantified metabolites from the GCKD dataset, Figure S13: The p-value distributions of AKI versus non-AKI after cardiac surgery, based on the permuted urine data, Figure S14: The p-value distributions of AKI versus non-AKI after cardiac surgery, based on permuted plasma data, Figure S15: The hyperparameter calibration for the plasma AKI dataset, Figure S16: The hyperparameter calibration for the urinary AKI dataset. Section S5: The relationship between the choice of logarithmic base and the regularization parameter $\lambda $, Section S6: The convergence analysis of the BF algorithm.

## References

1. Zacharias, H.U.; Schley, G.; Hochrein, J.; Klein, M.S.; Köberle, C.; Eckardt, K.U.; Willam, C.; Oefner, P.J.; Gronwald, W. Analysis of human urine reveals metabolic changes related to the development of acute kidney injury following cardiac surgery. Metabolomics **2013**, 9, 697–707.
2. Zacharias, H.U.; Hochrein, J.; Vogl, F.C.; Schley, G.; Mayer, F.; Jeleazcov, C.; Eckardt, K.U.; Willam, C.; Oefner, P.J.; Gronwald, W. Identification of plasma metabolites prognostic of acute kidney injury after cardiac surgery with cardiopulmonary bypass. J. Proteome Res. **2015**, 14, 2897–2905.
3. Zacharias, H.U.; Altenbuchinger, M.; Schultheiss, U.T.; Samol, C.; Kotsis, F.; Poguntke, I.; Sekula, P.; Krumsiek, J.; Köttgen, A.; Spang, R.; et al. A novel metabolic signature to predict the requirement of dialysis or renal transplantation in patients with chronic kidney disease. J. Proteome Res. **2019**, 18, 1796–1805.
4. Gronwald, W.; Klein, M.S.; Zeltner, R.; Schulze, B.D.; Reinhold, S.W.; Deutschmann, M.; Immervoll, A.K.; Böger, C.A.; Banas, B.; Eckardt, K.U.; et al. Detection of autosomal dominant polycystic kidney disease by NMR spectroscopic fingerprinting of urine. Kidney Int. **2011**, 79, 1244–1253.
5. Brindle, J.T.; Antti, H.; Holmes, E.; Tranter, G.; Nicholson, J.K.; Bethell, H.W.; Clarke, S.; Schofield, P.M.; McKilligin, E.; Mosedale, D.E.; et al. Rapid and noninvasive diagnosis of the presence and severity of coronary heart disease using 1H-NMR-based metabonomics. Nat. Med. **2002**, 8, 1439–1445.
6. Delles, C.; Rankin, N.J.; Boachie, C.; McConnachie, A.; Ford, I.; Kangas, A.; Soininen, P.; Trompet, S.; Mooijaart, S.P.; Jukema, J.W.; et al. Nuclear magnetic resonance-based metabolomics identifies phenylalanine as a novel predictor of incident heart failure hospitalisation: Results from PROSPER and FINRISK 1997. Eur. J. Heart Fail. **2018**, 20, 663–673.
7. Fischer, K.; Kettunen, J.; Würtz, P.; Haller, T.; Havulinna, A.S.; Kangas, A.J.; Soininen, P.; Esko, T.; Tammesoo, M.L.; Mägi, R.; et al. Biomarker profiling by nuclear magnetic resonance spectroscopy for the prediction of all-cause mortality: An observational study of 17,345 persons. PLoS Med. **2014**, 11, e1001606.
8. Deelen, J.; Kettunen, J.; Fischer, K.; van der Spek, A.; Trompet, S.; Kastenmüller, G.; Boyd, A.; Zierer, J.; van den Akker, E.B.; Ala-Korpela, M.; et al. A metabolic profile of all-cause mortality risk identified in an observational study of 44,168 individuals. Nat. Commun. **2019**, 10, 3346.
9. Anderson, P.E.; Reo, N.V.; DelRaso, N.J.; Doom, T.E.; Raymer, M.L. Gaussian binning: A new kernel-based method for processing NMR spectroscopic data for metabolomics. Metabolomics **2008**, 4, 261–272.
10. Davis, R.A.; Charlton, A.J.; Godward, J.; Jones, S.A.; Harrison, M.; Wilson, J.C. Adaptive binning: An improved binning method for metabolomics data using the undecimated wavelet transform. Chemom. Intell. Lab. Syst. **2007**, 85, 144–154.
11. De Meyer, T.; Sinnaeve, D.; Van Gasse, B.; Tsiporkova, E.; Rietzschel, E.R.; De Buyzere, M.L.; Gillebert, T.C.; Bekaert, S.; Martins, J.C.; Van Criekinge, W. NMR-based characterization of metabolic alterations in hypertension using an adaptive, intelligent binning algorithm. Anal. Chem. **2008**, 80, 3783–3790.
12. Anderson, P.E.; Mahle, D.A.; Doom, T.E.; Reo, N.V.; DelRaso, N.J.; Raymer, M.L. Dynamic adaptive binning: An improved quantification technique for NMR spectroscopic data. Metabolomics **2011**, 7, 179–190.
13. Blaise, B.J.; Shintu, L.; Elena, B.; Emsley, L.; Dumas, M.E.; Toulhoat, P. Statistical recoupling prior to significance testing in nuclear magnetic resonance based metabonomics. Anal. Chem. **2009**, 81, 6242–6251.
14. Rodriguez-Martinez, A.; Ayala, R.; Posma, J.M.; Harvey, N.; Jiménez, B.; Sonomura, K.; Sato, T.A.; Matsuda, F.; Zalloua, P.; Gauguier, D.; et al. pJRES Binning Algorithm (JBA): A new method to facilitate the recovery of metabolic information from pJRES 1H NMR spectra. Bioinformatics **2019**, 35, 1916–1922.
15. Bleakley, K.; Vert, J.P. The group fused lasso for multiple change-point detection. arXiv **2011**, arXiv:1106.4199.
16. Tibshirani, R.; Saunders, M.; Rosset, S.; Zhu, J.; Knight, K. Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. **2005**, 67, 91–108.
17. Tibshirani, R.; Wang, P. Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics **2008**, 9, 18–29.
18. Meier, L.; Van De Geer, S.; Bühlmann, P. The group lasso for logistic regression. J. R. Stat. Soc. Ser. B Stat. Methodol. **2008**, 70, 53–71.
19. Boyd, S.; Parikh, N.; Chu, E. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers; Now Publishers: Boston, MA, USA, 2011.
20. Zacharias, H.U.; Hochrein, J.; Klein, M.S.; Samol, C.; Oefner, P.J.; Gronwald, W. Current experimental, bioinformatic and statistical methods used in nmr based metabolomics. Curr. Metabolomics **2013**, 1, 253–268.
21. Altenbuchinger, M.; Zacharias, H.U.; Solbrig, S.; Schäfer, A.; Büyüközkan, M.; Schultheiß, U.T.; Kotsis, F.; Köttgen, A.; Spang, R.; Oefner, P.J.; et al. A multi-source data integration approach reveals novel associations between metabolites and renal outcomes in the German Chronic Kidney Disease study. Sci. Rep. **2019**, 9, 13954.
22. Wallmeier, J.; Samol, C.; Ellmann, L.; Zacharias, H.U.; Vogl, F.C.; Garcia, M.; Dettmer, K.; Oefner, P.J.; Gronwald, W.; GCKD study Investigators. Quantification of Metabolites by NMR Spectroscopy in the Presence of Protein. J. Proteome Res. **2017**, 16, 1784–1796.
23. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2019.
24. Dieterle, F.; Ross, A.; Schlotterbeck, G.; Senn, H. Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. Application in 1H NMR metabonomics. Anal. Chem. **2006**, 78, 4281–4290.
25. Zacharias, H.U.; Rehberg, T.; Mehrl, S.; Richtmann, D.; Wettig, T.; Oefner, P.J.; Spang, R.; Gronwald, W.; Altenbuchinger, M. Scale-invariant biomarker discovery in urine and plasma metabolite fingerprints. J. Proteome Res. **2017**, 16, 3596–3605.
26. Hedjazi, L.; Gauguier, D.; Zalloua, P.A.; Nicholson, J.K.; Dumas, M.E.; Cazier, J.B. mQTL.NMR: An integrated suite for genetic mapping of quantitative variations of 1H NMR-based metabolic profiles. Anal. Chem. **2015**, 87, 4377–4384.
27. Rodriguez-Martinez, A.; Posma, J.M.; Ayala, R.; Neves, A.L.; Anwar, M.; Petretto, E.; Emanueli, C.; Gauguier, D.; Nicholson, J.K.; Dumas, M.E. MWASTools: An R/bioconductor package for metabolome-wide association studies. Bioinformatics **2018**, 34, 890–892.
28. Lin, W.; Shi, P.; Feng, R.; Li, H. Variable selection in regression with compositional covariates. Biometrika **2014**, 101, 785–797.
29. Altenbuchinger, M.; Rehberg, T.; Zacharias, H.; Stämmler, F.; Dettmer, K.; Weber, D.; Hiergeist, A.; Gessner, A.; Holler, E.; Oefner, P.J.; et al. Reference point insensitive molecular data analysis. Bioinformatics **2017**, 33, 219–226.
30. Kadkhodaie, M.; Christakopoulou, K.; Sanjabi, M.; Banerjee, A. Accelerated alternating direction method of multipliers. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 10–13 August 2015; pp. 497–506.
31. Krishnamurthy, K. CRAFT (complete reduction to amplitude frequency table)—Robust and time-efficient Bayesian approach for quantitative mixture analysis by NMR. Magn. Reson. Chem. **2013**, 51, 821–829.

**Figure 1.** (**a**,**b**) Exemplary NMR spectral regions that ranged from 4 ppm to 3.5 ppm, together with their corresponding BF fits for $\lambda =2.5$ and $\lambda =5$, respectively. The thin blue and red lines show the spectral intensities of two exemplary NMR spectra from the urinary AKI dataset. The black dotted lines show the corresponding BF fits, along which the plateaus of the fits are additionally highlighted as thick blue and red lines. The ticks in the middle of the figures correspond to the standard equidistant binning with bin sizes of 0.01 ppm, 0.02 ppm, and 0.04 ppm (from top to bottom, respectively). The lower parts of the figures display the detected consensus plateaus in cyan, which start and end at the same positions for all included spectra. The yellow blocks represent the regions that did not correspond to plateaus but were also retained for subsequent analysis. For both the yellow and cyan regions, we neglected those with widths ≤0.002 ppm, which are plotted as white blocks. The inserted figures show these dotted regions in detail.

**Figure 2.** The p-value distributions of AKI versus non-AKI patients after cardiac surgery in urine specimens: (**a**–**g**) the different binning approaches (BF with $\lambda =1$, BF with $\lambda =2.5$, BF with $\lambda =5$, JBA, SRV, equidistant binning with a bin size of $0.01$ ppm, and equidistant binning with a bin size of $0.02$ ppm, respectively). For the BF method, the same color-coding was applied as that in Figure 1, for example. Note that the light green bars correspond to the overlapping regions of the cyan and yellow bars.

**Figure 3.** The p-value distributions of AKI versus non-AKI patients after cardiac surgery in plasma specimens: (**a**–**g**) the different binning approaches (BF with $\lambda =1$, BF with $\lambda =2.5$, BF with $\lambda =5$, JBA, SRV, equidistant binning with a bin size of $0.01$ ppm, and equidistant binning with a bin size of $0.02$ ppm, respectively). For the BF method, the same color-coding was applied as that in Figure 1, for example. Note that the light green bars correspond to the overlapping regions of the cyan and yellow bars.

**Figure 4.** The binomial deviances (y-axis) from the leave-one-out cross-validation using the different binning approaches (x-axis) for the plasma (**a**) and urinary (**b**) AKI data. The dotted horizontal lines correspond to the best method for the plasma (EB ($0.01$ ppm)) and urine (BF ($\lambda =1$)) data. The error bars correspond to $\pm 1$ standard deviation.

**Figure 5.** The binomial deviances (y-axis) from the leave-one-out cross-validation using BF with regularization parameter ($\lambda $) values between $0.25$ and $7.0$ (x-axis) for the plasma (**a**) and urinary (**b**) AKI data. The dotted horizontal line corresponds to the lowest observed binomial deviance across all $\lambda $ values. The error bars correspond to $\pm 1$ standard deviation.

**Table 1.** The number of metabolite features that were extracted from the 1D ^{1}H NMR spectra of 223 plasma samples from the GCKD cohort using different binning approaches. The BF results are presented in the form “number of plateau regions + number of non-plateau regions = number of features”.

| BF ($\mathit{\lambda}$ = 1) | BF ($\mathit{\lambda}$ = 2.5) | BF ($\mathit{\lambda}$ = 5) | SRV | JBA | EB (0.01 ppm) | EB (0.02 ppm) |
|---|---|---|---|---|---|---|
| 360 + 261 = 621 | 398 + 234 = 632 | 507 + 301 = 808 | 531 | 538 | 749 | 375 |

**Table 2.** The Spearman’s correlations to the absolutely quantified metabolite concentrations using BF with $\lambda =1$, BF with $\lambda =2.5$, BF with $\lambda =5$, JBA, SRV, and the equidistant binning method with bin sizes of $0.01$ ppm and $0.02$ ppm. The highest correlations for each of the 25 metabolites are highlighted in bold.

| Metabolite | BF ($\mathit{\lambda}$ = 1) | BF ($\mathit{\lambda}$ = 2.5) | BF ($\mathit{\lambda}$ = 5) | JBA | SRV | EB (0.01 ppm) | EB (0.02 ppm) |
|---|---|---|---|---|---|---|---|
| 3-Hydroxybutyrate | 0.768 | 0.720 | 0.689 | 0.768 | 0.600 | 0.547 | 0.498 |
| Acetate | 0.757 | 0.983 | 0.968 | 0.966 | 0.908 | 0.946 | 0.892 |
| Acetoacetate | 0.670 | 0.664 | 0.610 | 0.670 | 0.611 | 0.614 | 0.603 |
| Acetone | 0.528 | 0.748 | 0.568 | 0.530 | 0.472 | 0.455 | 0.350 |
| Alanine | 0.680 | 0.927 | 0.947 | 0.722 | 0.915 | 0.926 | 0.905 |
| Asparagine | 0.685 | 0.662 | 0.635 | 0.563 | 0.698 | 0.683 | 0.659 |
| Betaine | 0.509 | 0.637 | 0.480 | 0.689 | 0.481 | 0.630 | 0.221 |
| Carnitine | 0.678 | 0.419 | 0.427 | 0.692 | 0.412 | 0.444 | 0.445 |
| Creatine | 0.929 | 0.907 | 0.584 | 0.909 | 0.842 | 0.763 | 0.626 |
| Creatinine | 0.772 | 0.893 | 0.881 | 0.443 | 0.749 | 0.703 | 0.619 |
| Dimethylamine | 0.895 | 0.649 | 0.649 | 0.598 | 0.684 | 0.587 | 0.592 |
| Glucose | 0.990 | 0.990 | 0.989 | 0.987 | 0.989 | 0.990 | 0.988 |
| Glutamine | 0.873 | 0.870 | 0.905 | 0.612 | 0.779 | 0.897 | 0.878 |
| Glycine | 0.838 | 0.808 | 0.741 | 0.332 | 0.778 | 0.655 | 0.543 |
| Histidine | 0.548 | 0.764 | 0.523 | 0.453 | 0.571 | 0.558 | 0.631 |
| Isobutyrate | 0.845 | 0.793 | 0.568 | 0.445 | 0.687 | 0.618 | 0.522 |
| Isoleucine | 0.866 | 0.911 | 0.808 | 0.735 | 0.762 | 0.792 | 0.790 |
| Lactate | 0.988 | 0.989 | 0.989 | 0.979 | 0.983 | 0.990 | 0.985 |
| Phenylalanine | 0.850 | 0.874 | 0.888 | 0.819 | 0.818 | 0.810 | 0.800 |
| Proline | 0.754 | 0.937 | 0.699 | 0.676 | 0.871 | 0.680 | 0.609 |
| Pyruvate | 0.884 | 0.968 | 0.941 | 0.884 | 0.931 | 0.857 | 0.692 |
| Threonine | 0.494 | 0.502 | 0.490 | 0.426 | 0.472 | 0.490 | 0.474 |
| TMAO | 0.282 | 0.401 | 0.403 | 0.449 | 0.279 | 0.400 | 0.231 |
| Tyrosine | 0.931 | 0.941 | 0.947 | 0.816 | 0.935 | 0.939 | 0.924 |
| Valine | 0.811 | 0.947 | 0.961 | 0.725 | 0.924 | 0.952 | 0.952 |
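A Spearman's correlation of the kind reported in Table 2 compares the rank order of a feature's integrals with the rank order of the quantified concentrations across samples. The following minimal sketch shows the computation under the usual definition (Pearson correlation of average ranks); the function names `average_ranks` and `spearman` and the example values are hypothetical and do not reproduce any entry of the table.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a block of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical feature integrals and concentrations for four samples.
integrals = [1.0, 2.0, 3.0, 4.0]
concentrations = [1.0, 4.0, 9.0, 16.0]
rho = spearman(integrals, concentrations)
```

Because only ranks enter the calculation, any monotonic relationship between integrals and concentrations yields a perfect correlation, which makes the measure insensitive to the scale of the extracted feature.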

**Table 3.** The three different BF approaches (with $\lambda =1$, $\lambda =2.5$, and $\lambda =5$) individually compared to JBA, SRV, and the equidistant binning method. The numbers indicate how often each method was selected as the “best performing”.

| | BF ($\mathit{\lambda}$ = 1) | BF ($\mathit{\lambda}$ = 2.5) | BF ($\mathit{\lambda}$ = 5) | JBA | SRV | EB (0.01 ppm) | EB (0.02 ppm) |
|---|---|---|---|---|---|---|---|
| BF ($\lambda =1$) | 11 | - | - | 7 | 3 | 5 | 1 |
| BF ($\lambda =2.5$) | - | 14 | - | 6 | 2 | 3 | 0 |
| BF ($\lambda =5$) | - | - | 11 | 6 | 5 | 2 | 1 |


© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Altenbuchinger, M.; Berndt, H.; Kosch, R.; Lang, I.; Dönitz, J.; Oefner, P.J.; Gronwald, W.; Zacharias, H.U.; Investigators GCKD Study.
Bucket Fuser: Statistical Signal Extraction for 1D ^{1}H NMR Metabolomic Data. *Metabolites* **2022**, *12*, 812.
https://doi.org/10.3390/metabo12090812
