# A Conversation on Data Mining Strategies in LC-MS Untargeted Metabolomics: Pre-Processing and Pre-Treatment Steps

## Abstract

^{TM} software (Waters Corporation, Manchester, UK). Here, two parameters were varied: the intensity threshold (50–100 counts) and the mass tolerance (0.005–0.01 Da). After pre-processing, the datasets were imported into SIMCA (Umetrics, Umea, Sweden) for further data cleaning and statistical modeling. In addition, different scaling (unit variance, Pareto, etc.) and data transformation (log and power) methods were explored. The results showed that the pre-processing parameters (or algorithms) influence the output dataset with regard to the number of defined features. Furthermore, the study demonstrates that the pre-treatment of data prior to statistical modeling affects the subspace approximation outcome, e.g., the amount of variation in X-data that the model can explain and predict. The pre-processing and pre-treatment steps subsequently influence the number of statistically significant extracted/selected features (variables). Thus, as informed by the results, to maximize the value of untargeted metabolomic data, understanding the data structures and exploring different algorithms and methods (at different steps of the data analysis pipeline) might currently be the best trade-off, and possibly an epistemological imperative.

## 1. Introduction

## 2. Results and Discussion

#### 2.1. Data Processing Parameters: Mass Tolerance and Intensity Threshold

^{TM} Application Manager for MassLynx^{TM} software (Waters Corporation, Manchester, UK) for data processing. As described in the experimental section, the MarkerLynx^{TM} application uses the patented ApexTrack peak detection algorithm to perform accurate peak detection and alignment. Following peak detection, the associated ions are analyzed (maximum intensity, Rt, and exact m/z) and captured for all samples. The data matrix is then generated [37,38]. The data pre-processing steps and the relevant parameter settings are detailed in the experimental section.

^{TM} are possible. In practice, processing a single combination of parameters can take hours, depending on the size of the datasets. Furthermore, an understanding of the underlying algorithms and of the steps involved in the data processing is essential for deciding which parameters to vary. As indicated in the experimental section, parameters such as the mass tolerance and the intensity threshold (which distinguish real peaks from background noise) can be changed within certain limits: for instance, the mass tolerance can be set between the mass accuracy of the acquired data (4.9 mDa in this study) and twice this value; hence, in this study, the mass tolerance was varied between these limits (0.005 and 0.01 Da). The mass tolerance (mass accuracy) parameter is the basis on which the ApexTrack algorithm determines the regions of interest in the m/z domain, whereas the intensity threshold parameter is used in the peak removal step, defining the resultant noise level and redundancy in the data matrix. Because both parameters are essential, this study explored their impact on the creation of the data matrix.

^{2}, were retained. Although, visually, the sample clustering in the PCA scores space (constructed from the first two PCs) shows no significant difference across the four datasets (Figure 1A,B and Figure S1A,B), the model quality was clearly affected. This can be assessed by inspecting the PCA parameters and diagnostic tools, which are computed and displayed graphically or numerically. In computing a PC model, strong and moderate outliers (observations that are extreme or do not fit the model) are often encountered. Strong outliers have a high leverage on the model, shifting it significantly and reducing its predictability, whereas moderate outliers correspond to temporary perturbations (in the process/study), indicating a shift in the process/study behavior [53,54].

^{2} range plots. The latter is a multivariate generalization of Student’s t-test, providing a check for observations adhering to multivariate normality [53]. When used in conjunction with a scores plot, the Hotelling’s T^{2} defines the normality area, corresponding to 95% confidence in this study. Inspecting the scores and Hotelling’s T^{2} range plots for the four calculated PC models (Figure 1A,B and Figure S2), no strong outliers were observed. The moderate outliers, on the other hand, are identified by inspecting the model residuals (X-variation that was not captured by the PC model). The detection tool for the moderate outliers is the distance to the model in X-space (DModX), with a maximum tolerable distance (Dcrit) [53].
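For orientation, the two outlier diagnostics discussed above are commonly written as follows in the chemometrics literature. This is a sketch under stated assumptions, not a transcription of the SIMCA implementation: the symbols (N observations, K variables, A components, scores t, residuals e) are introduced here for illustration, and software packages may apply additional correction or normalization factors to DModX.

$$
T_{i}^{2} = \sum_{a=1}^{A} \frac{t_{ia}^{2}}{s_{t_{a}}^{2}}, \qquad
T_{\mathrm{crit}}^{2} = \frac{A\,(N^{2}-1)}{N\,(N-A)}\; F_{0.95}(A,\, N-A)
$$

$$
\mathrm{DModX}_{i} \propto \sqrt{\frac{\sum_{k=1}^{K} e_{ik}^{2}}{K-A}}
$$

Here t_{ia} is the score of observation i on component a, s_{t_a}^{2} is the variance of that score vector, and e_{ik} is the residual of observation i for variable k. An observation whose T^{2} exceeds the critical value lies outside the 95% ellipse (a candidate strong outlier), whereas a DModX above Dcrit flags a candidate moderate outlier.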

^{2}X) and predictive power (Q^{2}) diagnostic parameters were evaluated for the four computed PC models. The model fit indicates how well the data of the training set can be mathematically reproduced, giving a quantitative measure of the goodness of fit for the computed model. The R^{2}X, thus, quantitatively describes the explained variation in the modeled X-space [25,55]. The predictive ability of the model, on the other hand, was estimated using cross-validation, providing a quantitative measure of the predicted variation in X-space. A change in the data processing parameters (mass tolerance and intensity threshold) clearly affected the PCA, altering the model quality. Increasing both the mass tolerance and the intensity threshold resulted in an increase in R^{2}X and Q^{2}, with a substantial difference observed in the predicted variation, Q^{2} (Table 2). These results demonstrate that upstream metabolomic data processing and treatment affect the outcome of the statistical analyses, which in turn would impact, both quantitatively and qualitatively, the mining of “what the data says” [49].
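As a reminder of what these two diagnostics measure, the usual textbook definitions are given below. The notation is an assumption made for illustration (x_{ik} denotes the centered and scaled data, e_{ik} the residuals after A components, and \hat{x}_{ik}^{(\mathrm{cv})} the value predicted when the corresponding data were left out during cross-validation); the exact cross-validation scheme and cumulative Q^{2} calculation used by a given software package may differ in detail.

$$
R^{2}X = 1 - \frac{\sum_{i}\sum_{k} e_{ik}^{2}}{\sum_{i}\sum_{k} x_{ik}^{2}}, \qquad
Q^{2} = 1 - \frac{\mathrm{PRESS}}{\sum_{i}\sum_{k} x_{ik}^{2}}, \qquad
\mathrm{PRESS} = \sum_{i}\sum_{k} \left( x_{ik} - \hat{x}_{ik}^{(\mathrm{cv})} \right)^{2}
$$

Because PRESS is computed from left-out data, Q^{2} is generally lower than R^{2}X, and a large gap between the two suggests that the model fits noise rather than systematic structure.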

^{2} and Q^{2} values of the true model are compared with those of the permutated models. The test is carried out by randomly assigning the samples to the two different groups, after which OPLS-DA models are fitted to each permutated class variable. The R^{2} and Q^{2} values are then computed for the permutated models and compared to the values of the true models [57,58].
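The logic of a response permutation test can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SIMCA procedure: instead of refitting an OPLS-DA model, it uses the squared Pearson correlation between a single hypothetical feature and the class labels as a stand-in for the model statistic. The function names and the data are invented for the example.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation between a feature x and a 0/1 class vector y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy ** 2 / (sxx * syy)


def permutation_test(x, y, n_permutations=50, seed=0):
    """Compare the statistic for the true labels with label-shuffled versions."""
    rng = random.Random(seed)
    true_stat = r_squared(x, y)
    permuted_stats = []
    for _ in range(n_permutations):
        shuffled = list(y)
        rng.shuffle(shuffled)  # break the link between samples and classes
        permuted_stats.append(r_squared(x, shuffled))
    # Number of permutations that do at least as well as the true labels
    n_as_good = sum(s >= true_stat for s in permuted_stats)
    return true_stat, permuted_stats, n_as_good


# Hypothetical intensities of one feature: 10 control and 10 treated samples
x = [1.0 + 0.1 * i for i in range(10)] + [3.0 + 0.1 * i for i in range(10)]
y = [0] * 10 + [1] * 10

true_stat, permuted_stats, n_as_good = permutation_test(x, y, n_permutations=50)
print(f"true statistic: {true_stat:.3f}")
print(f"largest permuted statistic: {max(permuted_stats):.3f}")
print(f"permutations at least as good as the true labels: {n_as_good} of 50")
```

If the true statistic is clearly larger than the permuted ones, the class separation is unlikely to have arisen by chance. In a real workflow the stand-in statistic would be replaced by the R^{2} and Q^{2} of a model refitted to each permuted class variable.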

^{2} and Q^{2} values (Figure 2B and Table 2) and, thus, the computed true OPLS-DA models are statistically far better than the 50 permutated models for each dataset. Assessing the total variation in X-space (predictive and orthogonal) explained by the models, the results show that the R^{2}X values differed: a change in mass tolerance and intensity threshold affects the amount of variation explained by the computed models (Table 2). For variable selection, the OPLS-DA loading S-plots were evaluated (Figure 2C). This loading plot has an S-shape provided the data are centered or Pareto-scaled, and it aids in identifying variables that differ between groups (discriminating variables), i.e., variables situated in the upper right or lower left sections of the S-plot. The p_{1}-axis describes the influence of each X-variable on the group separation (modeled covariation), and the p(corr)_{1}-axis represents the reliability of each X-variable for accomplishing the group separation (modeled correlation). Variables that combine high model influence (high covariation/magnitude) with high reliability (i.e., a smaller risk of spurious correlation) are statistically relevant as possible discriminating variables [25,59]; the cut-offs used in this study were |p[1]| ≥ 0.05 and |p(corr)| ≥ 0.5.
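The selection rule stated above amounts to a simple two-threshold filter. The following sketch applies the cut-offs quoted in the text (|p[1]| ≥ 0.05 and |p(corr)| ≥ 0.5) to a handful of made-up loading values; the feature labels and numbers are hypothetical and only serve to show the logic.

```python
def select_discriminating(variables, p1_min=0.05, pcorr_min=0.5):
    """Keep variables with |p[1]| >= p1_min and |p(corr)| >= pcorr_min."""
    return [
        name
        for name, p1, pcorr in variables
        if abs(p1) >= p1_min and abs(pcorr) >= pcorr_min
    ]


# Hypothetical (feature label, p[1], p(corr)[1]) triples read off an S-plot
loadings = [
    ("feat_A", 0.12, 0.85),     # upper right: high influence, high reliability
    ("feat_B", -0.09, -0.72),   # lower left: high influence, high reliability
    ("feat_C", 0.20, 0.30),     # influential but unreliable
    ("feat_D", 0.01, 0.95),     # reliable but negligible influence
    ("feat_E", -0.002, -0.10),  # neither
]

selected = select_discriminating(loadings)
print(selected)  # ['feat_A', 'feat_B']
```

Only variables that pass both thresholds are retained, which is why features C and D are dropped even though each scores highly on one axis.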

#### 2.2. Data Scaling and Transformation Influence

^{2} and Q^{2} [2,25,55]. The inspection of these diagnostic metrics shows that scaling and/or transformation markedly affected the amount of variation explained by the model (the goodness of fit) and its predictive ability (Table 3).

^{2} and Q^{2} metrics, the CV-ANOVA was used to assess the reliability of the obtained models [56] and the response permutation test (with n = 50) was used to validate the predictive capability of the computed OPLS-DA models [57,58]. Furthermore, in both Section 2.1 and Section 2.2, predictive testing was also employed to assess the best pre-processing and pre-treatment workflow (Figures S6 and S7). The results tabulated in Table 3 demonstrate that the scaling and transformation methods significantly affected not only the explained variation R^{2} (both predictive and orthogonal) but also the classification accuracy, reliability, and predictive capability of the model and, subsequently, the extracted variables (Figure 5). The supervised learning models computed following, for instance, UV-scaling and/or log-transformation (particularly in this case) would not be chemometrically/statistically trusted, as the classification by these models could be due to chance, as indicated by the permutation validation tests (lower R^{2} values compared to the permutated models, Table 3).

## 3. Materials and Methods

#### 3.1. Dataset and Raw Data Processing

^{+} = 556.2766 and [M − H]^{−} = 554.2615, was used as the lock mass, being continuously sampled every 15 s, thus producing an average intensity of 350 counts scan^{−1} in centroid mode. By using a lock mass spray as a reference and continuously switching between sample and reference, the MassLynx^{TM} software can automatically correct the centroid mass values in the sample for small deviations from the exact mass measurement.

#### 3.2. Dataset Matrix Creation and Data Pre-Treatment

^{TM} 4.1 software (Waters Corporation, Manchester, UK). Only the centroid electrospray ionization (ESI) positive raw data were used in this study. The MarkerLynx^{TM} application manager of the MassLynx software was used for data pre-processing (matrix creation). Four dataset matrices (hereafter referred to as Methods) were created by changing the mass tolerance and intensity threshold settings: Method 1 (mass tolerance of 0.005 Da and intensity threshold of 10 counts), Method 2 (mass tolerance of 0.005 Da and intensity threshold of 100 counts), Method 3 (mass tolerance of 0.01 Da and intensity threshold of 10 counts), and Method 4 (mass tolerance of 0.01 Da and intensity threshold of 100 counts). For all of the Methods, the parameters of the MarkerLynx^{TM} application were set to analyze the 1–15 min retention time (Rt) range of the mass chromatogram and the mass range 100–1000 Da, with alignment of peaks across samples within mass and Rt windows of ±0.05 Da and ±0.20 min, respectively.
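The four Methods are simply the full factorial combination of two mass tolerance values and two intensity threshold values. A short sketch of how such a parameter grid could be enumerated is shown below; the dictionary keys and function name are illustrative choices, not MarkerLynx settings names.

```python
from itertools import product

# Parameter values as stated in the text
MASS_TOLERANCES_DA = (0.005, 0.01)
INTENSITY_THRESHOLDS_COUNTS = (10, 100)


def build_methods():
    """Return {method number: settings} for every parameter combination."""
    methods = {}
    combos = product(MASS_TOLERANCES_DA, INTENSITY_THRESHOLDS_COUNTS)
    for number, (tolerance, threshold) in enumerate(combos, start=1):
        methods[number] = {
            "mass_tolerance_da": tolerance,
            "intensity_threshold_counts": threshold,
        }
    return methods


METHODS = build_methods()
for number, settings in METHODS.items():
    print(f"Method {number}: {settings}")
```

Enumerating the grid explicitly makes it easy to extend the comparison to further parameter values, at the cost of one full pre-processing run per combination.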

^{TM} application uses the patented ApexTrack (also termed ApexPeakTrack) algorithm to perform accurate peak detection and alignment. MarkerLynx^{TM} initially determines the regions of interest in the m/z domain based on mass accuracy (mass tolerance). The ApexTrack algorithm controls peak detection through the peak width (peak width at 5% height) and baseline threshold (peak-to-peak baseline ratio) parameters. In this study, these parameters were calculated automatically by MarkerLynx^{TM}. ApexTrack also calculates the baseline noise level using the slope at the inflection points. Thus, for peak detection, the ApexTrack algorithm takes the second derivative of a chromatogram and locates the inflection points, the local minima, and the peak apex for each peak in order to determine the peak area and height. A “corrected” Rt is then assigned and the data are aligned, with the alignment of peaks across samples carried out within user-defined mass and Rt windows. Following peak detection, the associated ions are analyzed (maximum intensity, Rt, and exact m/z) and captured for all samples.
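The second-derivative idea can be illustrated on a synthetic peak. The sketch below is not the patented ApexTrack algorithm; it only shows, under simplifying assumptions (a single noise-free Gaussian peak sampled at unit spacing), how the apex appears as the minimum of the second derivative and how the inflection points appear where the second derivative changes sign. All names and values are hypothetical.

```python
import math


def second_derivative(y):
    """Central-difference second derivative for equally spaced points.

    The returned list is two elements shorter than y: element j
    corresponds to y[j + 1].
    """
    return [y[i - 1] - 2 * y[i] + y[i + 1] for i in range(1, len(y) - 1)]


def find_apex_and_inflections(y):
    """Return (apex, left inflection, right inflection) as indices into y."""
    d2 = second_derivative(y)
    # The apex is where the curvature is most negative
    j_min = min(range(len(d2)), key=d2.__getitem__)
    # Walk outward until the second derivative is no longer negative
    left = j_min
    while left > 0 and d2[left] < 0:
        left -= 1
    right = j_min
    while right < len(d2) - 1 and d2[right] < 0:
        right += 1
    # Shift by one to convert d2 indices back to indices into y
    return j_min + 1, left + 1, right + 1


# Synthetic chromatographic peak: Gaussian centered at index 50, sigma = 5
CENTER, SIGMA = 50, 5.0
signal = [math.exp(-((i - CENTER) ** 2) / (2 * SIGMA ** 2)) for i in range(101)]

apex, left_inflection, right_inflection = find_apex_and_inflections(signal)
print(f"apex index: {apex}")
print(f"inflection points near: {left_inflection} and {right_inflection}")
```

For a Gaussian, the inflection points lie one standard deviation either side of the apex, so the indices reported here fall close to 45 and 55. Real chromatograms are noisy, which is why smoothing and thresholds on peak width and baseline are needed before such a rule can be applied reliably.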

^{TM} also performs data normalization. In this study, normalization was done using the total ion intensities of each defined peak. Prior to calculating intensities, the software performs a patented modified Savitzky-Golay smoothing and integration.

^{TM}-generated data matrices were exported into SIMCA software, version 14 (Umetrics, Umea, Sweden) for statistical analyses. An unsupervised method, principal component analysis (PCA), and a supervised method, orthogonal projection to latent structures-discriminant analysis (OPLS-DA), were employed. The data pre-treatment methods used included scaling and transformation. These two types of data pre-treatment were explored as described in Section 2.2. The scaling methods examined were centering (Ctr), autoscaling (also known as unit variance, UV), and Pareto scaling, and the transformation methods used were the logarithmic and power transformations. The formulae (or mathematical descriptions of these methods) can be found in the cited literature [27] and in the SIMCA version 13 manual (User’s Guide to SIMCA 13, 2012). In this study, the logarithmic transformation was 10Log(C1 × X + C2), i.e., the base-10 logarithm, where C1 = 1 and C2 = 0; and the power transformation was (C1 × X + C2)^{C3}, where C1 = 1, C2 = 0, and C3 = 2. As described in the results, the computed models were validated.
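The pre-treatment methods named above can be written out for a single variable (one column of the data matrix) as follows. This is a minimal sketch using the commonly cited definitions (centering subtracts the mean, UV scaling additionally divides by the standard deviation, and Pareto scaling divides by its square root) and the transformation constants given in the text. The function names are illustrative, and the use of the sample standard deviation is an assumption.

```python
import math
from statistics import mean, stdev


def center(column):
    """Ctr: subtract the column mean."""
    m = mean(column)
    return [v - m for v in column]


def uv_scale(column):
    """UV (autoscaling): mean-center, then divide by the standard deviation."""
    m, s = mean(column), stdev(column)
    return [(v - m) / s for v in column]


def pareto_scale(column):
    """Pareto: mean-center, then divide by the square root of the standard deviation."""
    m, s = mean(column), stdev(column)
    return [(v - m) / math.sqrt(s) for v in column]


def log_transform(column, c1=1.0, c2=0.0):
    """Base-10 logarithm of (c1 * x + c2); values must be positive."""
    return [math.log10(c1 * v + c2) for v in column]


def power_transform(column, c1=1.0, c2=0.0, c3=2.0):
    """(c1 * x + c2) raised to the power c3."""
    return [(c1 * v + c2) ** c3 for v in column]


# Hypothetical intensities of one feature across three samples
intensities = [10.0, 100.0, 1000.0]

print("Ctr:   ", center(intensities))
print("UV:    ", uv_scale(intensities))
print("Pareto:", pareto_scale(intensities))
print("Log:   ", log_transform(intensities))
print("Power: ", power_transform(intensities))
```

UV scaling gives every variable unit variance, so low-intensity features (including noise) weigh as much as intense ones, whereas Pareto scaling only partly reduces the dominance of intense features. Note that a constant column has zero standard deviation and would need to be excluded before UV or Pareto scaling.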

## 4. Conclusions and Perspectives

## Supplementary Materials

^{2} range plots of the four PCA models (Methods 1 to 4 in Table 2), Figure S3: DModX and typical contribution plots (of PCA models for the Method 1 data set), Figure S4: OPLS-DA scores plots, Figure S5: DModX plots for the detection of moderate outliers, Figure S6: Predicted scores plots and DModXPS, Figure S7: The Coomans’ plots, i.e., distance to model predicted (DModXPS+) of two models.

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## References

1. Sévin, D.C.; Kuehne, A.; Zamboni, N.; Sauer, U. Biological insights through nontargeted metabolomics. Curr. Opin. Biotechnol. **2015**, 34, 1–8.
2. Tugizimana, F.; Piater, L.A.; Dubery, I.A. Plant metabolomics: A new frontier in phytochemical analysis. S. Afr. J. Sci. **2013**, 109, 18–20.
3. Okazaki, Y.; Saito, K. Recent advances of metabolomics in plant biotechnology. Plant Biotechnol. Rep. **2012**, 6, 1–15.
4. Bartel, J.; Krumsiek, J.; Theis, F.J. Statistical methods for the analysis of high-throughput metabolomics data. Comput. Struct. Biotechnol. J. **2013**, 4, 1–9.
5. Worley, B.; Powers, R. Multivariate analysis in metabolomics. Curr. Metabol. **2013**, 1, 92–107.
6. Choi, Y.H.; Verpoorte, R. Metabolomics: What you see is what you extract. Phytochem. Anal. **2014**, 25, 289–290.
7. Duportet, X.; Aggio, R.B.M.; Carneiro, S.; Villas-Bôas, S.G. The biological interpretation of metabolomic data can be misled by the extraction method used. Metabolomics **2012**, 8, 410–421.
8. Yanes, O.; Tautenhahn, R.; Patti, G.J.; Siuzdak, G. Expanding coverage of the metabolome for global metabolite profiling. Anal. Chem. **2011**, 83, 2152–2161.
9. Sumner, L.W.; Mendes, P.; Dixon, R.A. Plant metabolomics: Large-scale phytochemistry in the functional genomics era. Phytochemistry **2003**, 62, 817–836.
10. Allwood, J.W.; Ellis, D.I.; Goodacre, R. Metabolomic technologies and their application to the study of plants and plant-host interactions. Physiol. Plant. **2008**, 132, 117–135.
11. Goeddel, L.C.; Patti, G.J. Maximizing the value of metabolomic data. Bioanalysis **2012**, 4, 2199–2201.
12. Boccard, J.; Rudaz, S. Harnessing the complexity of metabolomic data with chemometrics. J. Chemom. **2014**, 28, 1–9.
13. Beisken, S.; Eiden, M.; Salek, R.M. Getting the right answers: Understanding metabolomics challenges. Expert Rev. Mol. Diagn. **2015**, 15, 97–109.
14. Misra, B.B.; van der Hooft, J.J.J. Updates in metabolomics tools and resources: 2014–2015. Electrophoresis **2016**, 37, 86–110.
15. Kell, D.B.; Oliver, S.G. Here is the evidence, now what is the hypothesis? The complementary roles of inductive and hypothesis-driven science in the post-genomic era. BioEssays **2004**, 26, 99–105.
16. Boccard, J.; Veuthey, J.-L.; Rudaz, S. Knowledge discovery in metabolomics: An overview of MS data handling. J. Sep. Sci. **2010**, 33, 290–304.
17. Goodacre, R.; Vaidyanathan, S.; Dunn, W.B.; Harrigan, G.G.; Kell, D.B. Metabolomics by numbers: Acquiring and understanding global metabolite data. Trends Biotechnol. **2004**, 22, 245–252.
18. Cicek, A.E.; Roeder, K.; Ozsoyoglu, G. MIRA: Mutual information-based reporter algorithm for metabolic networks. Bioinformatics **2014**, 30, i175–i184.
19. Toubiana, D.; Fernie, A.R.; Nikoloski, Z.; Fait, A. Network analysis: Tackling complex data to study plant metabolism. Trends Biotechnol. **2013**, 31, 29–36.
20. Brown, M.; Dunn, W.B.; Ellis, D.I.; Goodacre, R.; Handl, J.; Knowles, J.D.; O’Hagan, S.; Spasić, I.; Kell, D.B. A metabolome pipeline: From concept to data to knowledge. Metabolomics **2005**, 1, 39–51.
21. Sumner, L.W.; Amberg, A.; Barrett, D.; Beale, M.H.; Beger, R.; Daykin, C.A.; Fan, T.W.-M.; Fiehn, O.; Goodacre, R.; Griffin, J.L.; et al. Proposed minimum reporting standards for chemical analysis. Metabolomics **2007**, 3, 211–221.
22. Gromski, P.S.; Xu, Y.; Hollywood, K.A.; Turner, M.L.; Goodacre, R. The influence of scaling metabolomics data on model classification accuracy. Metabolomics **2015**, 11, 684–695.
23. Yang, J.; Zhao, X.; Lu, X.; Lin, X.; Xu, G. A data preprocessing strategy for metabolomics to reduce the mask effect in data analysis. Front. Mol. Biosci. **2015**, 2, 1–10.
24. Boccard, J.; Rudaz, S. Mass spectrometry metabolomic data handling for biomarker discovery. In Proteomic and Metabolomic Approaches to Biomarker Discovery; Elsevier: Amsterdam, The Netherlands, 2013; pp. 425–445.
25. Trygg, J.; Holmes, E.; Lundstedt, T. Chemometrics in Metabonomics. J. Proteome Res. **2007**, 6, 469–479.
26. De Livera, A.M.; Sysi-Aho, M.; Jacob, L.; Gagnon-Bartsch, J.A.; Castillo, S.; Simpson, J.A.; Speed, T.P. Statistical methods for handling unwanted variation in metabolomics data. Anal. Chem. **2015**, 87, 3606–3615.
27. Van den Berg, R.A.; Hoefsloot, H.C.J.; Westerhuis, J.A.; Smilde, A.K.; Van der Werf, M.J. Centering, scaling, and transformations: Improving the biological information content of metabolomics data. BMC Genom. **2006**, 7, 1–15.
28. Goodacre, R.; Broadhurst, D.; Smilde, A.K.; Kristal, B.S.; Baker, J.D.; Beger, R.; Bessant, C.; Connor, S.; Capuani, G.; Craig, A.; et al. Proposed minimum reporting standards for data analysis in metabolomics. Metabolomics **2007**, 3, 231–241.
29. Saccenti, E.; Hoefsloot, H.C.J.; Smilde, A.K.; Westerhuis, J.A.; Hendriks, M.M.W.B. Reflections on univariate and multivariate analysis of metabolomics data. Metabolomics **2013**, 10, 361–374.
30. Buydens, L. Towards tsunami-resistant chemometrics. Anal. Sci. **2013**, 813, 24–29.
31. Di Guida, R.; Engel, J.; Allwood, J.W.; Weber, R.J.M.; Jones, M.R.; Sommer, U.; Viant, M.R.; Dunn, W.B. Non-targeted UHPLC-MS metabolomic data processing methods: A comparative investigation of normalisation, missing value imputation, transformation and scaling. Metabolomics **2016**, 12, 93.
32. Godzien, J.; Ciborowski, M.; Angulo, S.; Barbas, C. From numbers to a biological sense: How the strategy chosen for metabolomics data treatment may affect final results. A practical example based on urine fingerprints obtained by LC-MS. Electrophoresis **2013**, 34, 2812–2826.
33. Defernez, M.; Le Gall, G. Strategies for data handling and statistical analysis in metabolomics studies. In Advances in Botanical Research; Elsevier Ltd.: Amsterdam, The Netherlands, 2013; Volume 67, pp. 493–555.
34. Moseley, H.N.B. Error analysis and propagation in metabolomics data analysis. Comput. Struct. Biotechnol. J. **2013**, 4, 1–12.
35. Trutschel, D.; Schmidt, S.; Grosse, I.; Neumann, S. Experiment design beyond gut feeling: Statistical tests and power to detect differential metabolites in mass spectrometry data. Metabolomics **2015**, 11, 851–860.
36. Moco, S.; Vervoort, J.; Bino, R.; Devos, R. Metabolomics technologies and metabolite identification. TrAC Trends Anal. Chem. **2007**, 26, 855–866.
37. Idborg, H.; Zamani, L.; Edlund, P.-O.; Schuppe-Koistinen, I.; Jacobsson, S.P. Metabolic fingerprinting of rat urine by LC/MS Part 2. Data pretreatment methods for handling of complex data. J. Chromatogr. B **2005**, 828, 14–20.
38. Stumpf, C.L.; Goshawk, J. The MarkerLynx application manager: Informatics for mass spectrometric metabonomic discovery. Waters Appl. Note **2004**. 720001056EN KJ-PDF.
39. Veselkov, K.A.; Vingara, L.K.; Masson, P.; Robinette, S.L.; Want, E.; Li, J.V.; Barton, R.H.; Boursier-Neyret, C.; Walther, B.; Ebbels, T.M.; et al. Optimized preprocessing of ultra-performance liquid chromatography/mass spectrometry urinary metabolic profiles for improved information recovery. Anal. Chem. **2011**, 83, 5864–5872.
40. Cook, D.W.; Rutan, S.C. Chemometrics for the analysis of chromatographic data in metabolomics investigations. J. Chemom. **2014**, 28, 681–687.
41. Peters, S.; Van Velzen, E.; Janssen, H.G. Parameter selection for peak alignment in chromatographic sample profiling: Objective quality indicators and use of control samples. Anal. Bioanal. Chem. **2009**, 394, 1273–1281.
42. Godzien, J.; Alonso-Herranz, V.; Barbas, C.; Armitage, E.G. Controlling the quality of metabolomics data: New strategies to get the best out of the QC sample. Metabolomics **2014**, 11, 518–528.
43. Misra, B.B.; Assmann, S.M.; Chen, S. Plant single-cell and single-cell-type metabolomics. Trends Plant Sci. **2014**, 19, 1–10.
44. Kohli, A.; Sreenivasulu, N.; Lakshmanan, P.; Kumar, P.P. The phytohormone crosstalk paradigm takes center stage in understanding how plants respond to abiotic stresses. Plant Cell Rep. **2013**, 32, 945–57.
45. Vidal, M. A unifying view of 21st century systems biology. FEBS Lett. **2009**, 583, 3891–3894.
46. Makola, M.M.; Steenkamp, P.A.; Dubery, I.A.; Kabanda, M.M.; Madala, N.E. Preferential alkali metal adduct formation by cis geometrical isomers of dicaffeoylquinic acids allows for efficient discrimination from their trans isomers during ultra-high-performance liquid chromatography/quadrupole time-of-flight mass s. Rapid Commun. Mass Spectrom. **2016**, 30, 1011–1018.
47. Masson, P.; Spagou, K.; Nicholson, J.K.; Want, E.J. Technical and biological variation in UPLC-MS-based untargeted metabolic profiling of liver extracts: Application in an experimental toxicity study on galactosamine. Anal. Chem. **2011**, 83, 1116–1123.
48. Hawkins, D.M. The problem of overfitting. J. Chem. Inf. Comput. Sci. **2004**, 44, 1–12.
49. Broadhurst, D.I.; Kell, D.B. Statistical strategies for avoiding false discoveries in metabolomics and related experiments. Metabolomics **2006**, 2, 171–196.
50. Armitage, E.G.; Godzien, J.; Alonso-Herranz, V.; López-Gonzálvez, Á.; Barbas, C. Missing value imputation strategies for metabolomics data. Electrophoresis **2015**, 36, 3050–3060.
51. Ilin, A.; Raiko, T. Practical approaches to principal component analysis in the presence of missing values. J. Mach. Learn. Res. **2010**, 11, 1957–2000.
52. Nelson, P.R.C.; Taylor, P.A.; MacGregor, J.F. Missing data methods in PCA and PLS: Score calculations with incomplete observations. Chemom. Intell. Lab. Syst. **1996**, 35, 45–65.
53. Wikström, C.; Albano, C.; Eriksson, L.; Fridén, H.; Johansson, E.; Nordahl, Å.; Rännar, S.; Sandberg, M.; Kettaneh-Wold, N.; Wold, S. Multivariate process and quality monitoring applied to an electrolysis process. Part I. Process supervision with multivariate control charts. Chemom. Intell. Lab. Syst. **1998**, 42, 221–231.
54. Eriksson, L.; Trygg, J.; Wold, S. A chemometrics toolbox based on projections and latent variables. J. Chemom. **2014**, 28, 332–346.
55. Hawkins, D.M.; Basak, S.C.; Mills, D. Assessing model fit by cross-validation. J. Chem. Inf. Comput. Sci. **2003**, 43, 579–586.
56. Eriksson, L.; Trygg, J.; Wold, S. CV-ANOVA for significance testing of PLS and OPLS^{®} models. J. Chemom. **2008**, 22, 594–600.
57. Triba, M.N.; Le Moyec, L.; Amathieu, R.; Goossens, C.; Bouchemal, N.; Nahon, P.; Rutledge, D.N.; Savarin, P. PLS/OPLS models in metabolomics: The impact of permutation of dataset rows on the K-fold cross-validation quality parameters. Mol. BioSyst. **2015**, 11, 13–19.
58. Westerhuis, J.A.; Hoefsloot, H.C.J.; Smit, S.; Vis, D.J.; Smilde, A.K.; Velzen, E.J.J.; Duijnhoven, J.P.M.; Dorsten, F.A. Assessment of PLSDA cross validation. Metabolomics **2008**, 4, 81–89.
59. Wiklund, S.; Johansson, E.; Sjöström, L.; Mellerowicz, E.J.; Edlund, U.; Shockcor, J.P.; Gottfries, J.; Moritz, T.; Trygg, J. Visualization of GC/TOF-MS-based metabolomics data for identification of biochemically interesting compounds using OPLS class models. Anal. Chem. **2008**, 80, 115–122.
60. Ambroise, C.; McLachlan, G.J. Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl. Acad. Sci. USA **2002**, 99, 6562–6566.
61. Smilde, A.K.; Westerhuis, J.A.; Hoefsloot, H.C.J.; Bijlsma, S.; Rubingh, C.M.; Vis, D.J.; Jellema, R.H.; Pijl, H.; Roelfsema, F.; van der Greef, J. Dynamic metabolomic data analysis: A tutorial review. Metabolomics **2010**, 6, 3–17.
62. Chong, I.-G.; Jun, C.-H. Performance of some variable selection methods when multicollinearity is present. Chemom. Intell. Lab. Syst. **2005**, 78, 103–112.
63. Mehmood, T.; Liland, K.H.; Snipen, L.; Sæbø, S. A review of variable selection methods in Partial Least Squares Regression. Chemom. Intell. Lab. Syst. **2012**, 118, 62–69.
64. Wilkinson, L. Dot plots. Am. Stat. **1999**, 53, 276–281.
65. Bro, R.; Smilde, A.K. Centering and scaling in component analysis. J. Chemom. **2003**, 17, 16–33.
66. Van Der Greef, J.; Smilde, A.K. Symbiosis of chemometrics and metabolomics: Past, present, and future. J. Chemom. **2005**, 19, 376–386.
67. Breiman, L. Statistical modeling: The two cultures. Stat. Sci. **2001**, 16, 199–215.
68. T’Kindt, R.; Morreel, K.; Deforce, D.; Boerjan, W.; Van Bocxlaer, J. Joint GC-MS and LC-MS platforms for comprehensive plant metabolomics: Repeatability and sample pre-treatment. J. Chromatogr. B **2009**, 877, 3572–3580.
69. Tugizimana, F.; Steenkamp, P.A.; Piater, L.A.; Dubery, I.A. Multi-platform metabolomic analyses of ergosterol-induced dynamic changes in Nicotiana tabacum cells. PLoS ONE **2014**, 9, e87846.
70. Sangster, T.; Major, H.; Plumb, R.; Wilson, A.J.; Wilson, I.D. A pragmatic and readily implemented quality control strategy for HPLC-MS and GC-MS-based metabonomic analysis. Analyst **2006**, 131, 1075–1078.
71. Sangster, T.P.; Wingate, J.E.; Burton, L.; Teichert, F.; Wilson, I.D. Investigation of analytical variation in metabonomic analysis using liquid chromatography/mass spectrometry. Rapid Commun. Mass Spectrom. **2007**, 21, 2965–2970.
72. Dunn, W.B.; Broadhurst, D.; Begley, P.; Zelena, E.; Francis-McIntyre, S.; Anderson, N.; Brown, M.; Knowles, J.D.; Halsall, A.; Haselden, J.N.; et al. Procedures for large-scale metabolic profiling of serum and plasma using gas chromatography and liquid chromatography coupled to mass spectrometry. Nat. Protoc. **2011**, 6, 1060–1083.
73. Jenkins, H.; Hardy, N.; Beckmann, M.; Draper, J.; Smith, A.R.; Taylor, J.; Fiehn, O.; Goodacre, R.; Bino, R.J.; Hall, R.; et al. A proposed framework for the description of plant metabolomics experiments and their results. Nat. Biotechnol. **2004**, 22, 1601–1606.
74. Fiehn, O.; Sumner, L.W.; Rhee, S.Y.; Ward, J.; Dickerson, J.; Lange, B.M.; Lane, G.; Roessner, U.; Last, R.; Nikolau, B. Minimum reporting standards for plant biology context information in metabolomic studies. Metabolomics **2007**, 3, 195–201.
75. Salek, R.M.; Haug, K.; Conesa, P.; Hastings, J.; Williams, M.; Mahendraker, T.; Maguire, E.; Gonzalez-Beltran, A.N.; Rocca-Serra, P.; Sansone, S.-A.; et al. The MetaboLights repository: Curation challenges in metabolomics. Database **2013**, 2013, bat029.
76. Haug, K.; Salek, R.M.; Conesa, P.; Hastings, J.; de Matos, P.; Rijnbeek, M.; Mahendraker, T.; Williams, M.; Neumann, S.; Rocca-Serra, P.; et al. MetaboLights--an open-access general-purpose repository for metabolomics studies and associated meta-data. Nucleic Acids Res. **2013**, 41, D781–D786.
77. Rocca-Serra, P.; Salek, R.M.; Arita, M.; Correa, E.; Dayalan, S.; Gonzalez-Beltran, A.; Ebbels, T.; Goodacre, R.; Hastings, J.; Haug, K.; et al. Data standards can boost metabolomics research, and if there is a will, there is a way. Metabolomics **2016**, 12, 14.
78. Zhang, J.; Gonzalez, E.; Hestilow, T.; Haskins, W.; Huang, Y. Review of peak detection algorithms in liquid-chromatography-mass spectrometry. Curr. Genom. **2009**, 10, 388–401.
79. Rafiei, A.; Sleno, L. Comparison of peak-picking workflows for untargeted liquid chromatography/high-resolution mass spectrometry metabolomics data analysis. Rapid Commun. Mass Spectrom. **2015**, 29, 119–127.
80. Coble, J.B.; Fraga, C.G. Comparative evaluation of preprocessing freeware on chromatography/mass spectrometry data for signature discovery. J. Chromatogr. A **2014**, 1358, 155–164.

**Figure 1.** PCA score scatterplots and distance to the model (DModX) plots. (**A**) Score scatterplot of the PCA model of data X (processed with Method 1: Table 1): a five-component model, explaining 78.6% of the variation in the Pareto-scaled data; the amount of variation predicted by the model, according to cross-validation, is 74.6%; (**B**) Score scatterplot of the PCA model of data X (processed with Method 4: Table 1): a six-component model, explaining 93.4% of the variation in the Pareto-scaled data X; the amount of variation predicted by the model, according to cross-validation, is 91.7%; (**C**) The DModX plot of the PCA model in (**A**) showing the moderate outliers (in red); and (**D**) The DModX plot of the PCA model in (**B**) showing the moderate outliers (in red).

**Figure 2.** OPLS-DA model for data X (processed with Method 4, Table 1). The labels C and T refer to control (green) and treated (blue), respectively. (**A**) A score plot showing group separation in the OPLS-DA score space; (**B**) the response permutation test plot (n = 50) for the OPLS-DA model in (**A**): the R^{2} and Q^{2} values of the permuted models are represented on the left-hand side of the plot, corresponding to the y-axis intercepts (Table 2): R^{2} = (0.0, 0.271) and Q^{2} = (0.0, −0.340); (**C**) an OPLS-DA loading S-plot for the “Method 4” model. The x-axis is the modeled covariation and the y-axis is the loading vector of the predictive component (modeled correlation). Variables situated far out in the S-plot are statistically relevant and represent possible discriminating variables; and (**D**) the dot plot of the variable selected (marked) in the S-plot (**C**), showing that this variable is a very strong discriminating variable, as its values do not overlap between groups.
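The two S-plot axes described in the caption can be illustrated with a short Python sketch. It assumes the predictive score vector of the model is already available and computes, for one variable, its covariance and its correlation with that score vector. The function name `s_plot_coordinates` and the numbers are hypothetical, and the exact normalization used by SIMCA may differ from this simplified version.

```python
from statistics import mean, stdev

def s_plot_coordinates(scores, column):
    """Return (covariance, correlation) between a score vector and one variable.

    `scores` holds the predictive-component score of each sample and
    `column` holds the intensity of a single variable in the same samples.
    """
    n = len(scores)
    t_mean, x_mean = mean(scores), mean(column)
    # Sample covariance (n - 1 denominator): the x-axis of the S-plot.
    cov = sum((t - t_mean) * (x - x_mean) for t, x in zip(scores, column)) / (n - 1)
    sd_t, sd_x = stdev(scores), stdev(column)
    # Correlation: the y-axis of the S-plot; 0.0 for constant vectors.
    corr = cov / (sd_t * sd_x) if sd_t > 0 and sd_x > 0 else 0.0
    return cov, corr

# Hypothetical scores (two control, two treated samples) and one variable.
example_scores = [-2.0, -1.0, 1.0, 2.0]
example_variable = [10.0, 20.0, 40.0, 50.0]
cov, corr = s_plot_coordinates(example_scores, example_variable)
print(f"covariance = {cov:.2f}, correlation = {corr:.2f}")
```

A variable with both a large covariance (high contribution) and a correlation close to ±1 (high reliability) would sit far out in the plot, which is the region the caption describes as containing possible discriminating variables.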

**Figure 3.** Venn diagram comparing the statistically selected discriminating variables from the four OPLS-DA models (one per pre-processing method; Table 1 and Table 2). The four pre-processing methods, applied to the same raw data, generated four different data matrices, and the statistical analyses of these matrices led to different discriminating variables (with some overlap), as depicted in the diagram.
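The comparison summarized by such a Venn diagram amounts to set operations on the lists of selected variables. The following is a minimal Python sketch, assuming each model's discriminating variables are available as a set of feature identifiers. The function name `venn_regions` and the identifiers `f001` to `f006` are hypothetical placeholders, not features from this study.

```python
from itertools import combinations

def venn_regions(selections):
    """Map each combination of methods to the features found in exactly those methods."""
    names = list(selections)
    regions = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Features selected by every method in this combination...
            inside = set.intersection(*(selections[n] for n in combo))
            # ...minus anything also selected by a method outside it.
            outside = set().union(*(selections[n] for n in names if n not in combo))
            exclusive = inside - outside
            if exclusive:
                regions[combo] = exclusive
    return regions

# Hypothetical selections, one set of feature IDs per pre-processing method.
selections = {
    "Method 1": {"f001", "f002", "f003"},
    "Method 2": {"f002", "f003", "f004"},
    "Method 3": {"f003", "f005"},
    "Method 4": {"f003", "f004", "f006"},
}
regions = venn_regions(selections)
for combo, features in regions.items():
    print(" & ".join(combo), "->", sorted(features))
```

Each key of the returned dictionary corresponds to one region of the diagram, so unique and shared variables can be listed without reading them off the figure.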

**Figure 4.** PCA score scatterplots. PCA models of the same data X, but with different scaling methods. (**A**) A five-component model explaining 78.6% of the variation in the Pareto-scaled data X; the variation predicted by the model, according to cross-validation, is 74.6%. (**B**) A five-component model explaining 44.3% of the variation in the unit variance (UV)-scaled data X; the variation predicted by the model, according to cross-validation, is 35.0%.

**Figure 5.** Venn diagram comparing the statistically selected discriminating variables from the four OPLS-DA models that are statistically valid (Table 3). As indicated in the diagram, there are unique and shared discriminating variables among the four models; i.e., different data pre-treatment (scaling and transformation) methods led to different discriminating variables.

**Figure 6.** Flowchart displaying an overview of a typical LC-MS data mining pipeline. The post-acquisition steps involved in data analysis are data pre-processing and pre-treatment (the focus of this study) and machine learning/multivariate data analysis (MVDA). Each step follows a typical workflow, and different methods and algorithms can be employed at each.

**Table 1.** Parameters associated with the different datasets generated from MarkerLynx^{TM} processing (Section 3.2).

Data Set | Mass Tolerance (Da) | Intensity Threshold (counts) | X-Variable | Noise Level (%) |
---|---|---|---|---|
Method 1 | 0.005 | 10 | 6989 | 24 |
Method 2 | 0.005 | 100 | 720 | 9 |
Method 3 | 0.01 | 10 | 7309 | 23 |
Method 4 | 0.01 | 100 | 765 | 8 |
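To make the effect of each parameter in Table 1 concrete, the relative change in the number of X-variables can be computed directly from the table. The short Python sketch below does this. The dictionary layout and the helper name `relative_change` are illustrative assumptions and are not part of the MarkerLynx workflow.

```python
# Feature counts (X-variables) copied from Table 1, keyed by
# (mass tolerance in Da, intensity threshold in counts).
x_variables = {
    (0.005, 10): 6989,   # Method 1
    (0.005, 100): 720,   # Method 2
    (0.01, 10): 7309,    # Method 3
    (0.01, 100): 765,    # Method 4
}

def relative_change(before, after):
    """Return the signed percentage change from `before` to `after`."""
    return 100.0 * (after - before) / before

# Effect of raising the intensity threshold from 10 to 100 counts.
for tol in (0.005, 0.01):
    change = relative_change(x_variables[(tol, 10)], x_variables[(tol, 100)])
    print(f"tolerance {tol} Da: {change:+.1f}% features")

# Effect of widening the mass tolerance from 0.005 to 0.01 Da.
for thr in (10, 100):
    change = relative_change(x_variables[(0.005, thr)], x_variables[(0.01, thr)])
    print(f"threshold {thr} counts: {change:+.1f}% features")
```

With the Table 1 values, raising the intensity threshold removes close to 90% of the features at either mass tolerance, whereas doubling the mass tolerance increases the feature count by only a few percent.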

**Table 2.** Generated PCA and OPLS-DA models of the four dataset matrices described as Methods 1–4 (Section 3.2).

Data Set | PCA #PC | PCA R^{2}X (cum) | PCA Q^{2} (cum) | OPLS-DA R^{2}X (cum) | OPLS-DA R^{2}Y (cum) | OPLS-DA Q^{2} (cum) | CV-ANOVA p-Value | Permutation (n = 50) R^{2} | Permutation (n = 50) Q^{2} |
---|---|---|---|---|---|---|---|---|---|
Method 1 | 5 | 0.786 | 0.746 | 0.740 | 0.997 | 0.995 | 0.000 | (0.0, 0.573) | (0.0, −0.330) |
Method 2 | 5 | 0.926 | 0.902 | 0.857 | 0.988 | 0.987 | 0.000 | (0.0, 0.0552) | (0.0, −0.212) |
Method 3 | 6 | 0.793 | 0.744 | 0.689 | 0.989 | 0.986 | 0.000 | (0.0, 0.304) | (0.0, −0.358) |
Method 4 | 6 | 0.934 | 0.917 | 0.894 | 0.997 | 0.997 | 0.000 | (0.0, 0.271) | (0.0, −0.340) |

**Table 3.** Statistics of computed PCA and OPLS-DA models illustrating the effect of scaling and transformation on the dataset matrix for Method 1.

Scaling | Transformation | PCA R^{2}X (cum) | PCA Q^{2} (cum) | OPLS-DA R^{2}X (cum) | OPLS-DA R^{2}Y (cum) | OPLS-DA Q^{2} (cum) | CV-ANOVA p-Value | Permutation (n = 50) R^{2} | Permutation (n = 50) Q^{2} |
---|---|---|---|---|---|---|---|---|---|
None | None | 0.995 | 0.981 | 0.981 | 0.852 | 0.849 | 5.34 × 10^{−23} | (0.0, 0.128) | (0.0, −0.213) |
Center | None | 0.959 | 0.923 | 0.923 | 0.991 | 0.988 | 0.000 | (0.0, 0.161) | (0.0, −0.329) |
UV | None | 0.443 | 0.350 | 0.337 | 0.992 | 0.986 | 0.000 | (0.0, 0.650) | (0.0, −0.294) |
Pareto | None | 0.786 | 0.746 | 0.740 | 0.997 | 0.995 | 0.000 | (0.0, 0.573) | (0.0, −0.330) |
UV | Log | 0.641 | 0.517 | 0.548 | 0.998 | 0.996 | 0.000 | (0.0, 0.665) | (0.0, −0.222) |
Pareto | Log | 0.667 | 0.517 | 0.548 | 0.998 | 0.996 | 0.000 | (0.0, 0.633) | (0.0, −0.184) |
UV | Power | 0.435 | 0.336 | 0.307 | 0.994 | 0.988 | 0.000 | (0.0, 0.649) | (0.0, −0.311) |
Pareto | Power | 0.948 | 0.900 | 0.922 | 0.993 | 0.990 | 0.000 | (0.0, 0.267) | (0.0, −0.480) |
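The scaling and transformation options compared in Table 3 can be sketched in a few lines of Python. This is an illustrative sketch only, using the textbook definitions of mean centering, unit variance (UV) scaling, and Pareto scaling applied to a single variable (column). The function names, the log offset of 1, and the power exponent of 0.5 are assumptions made for the example and are not the settings used in SIMCA for this study.

```python
import math
from statistics import mean, stdev

def transform_column(values, method="none", offset=1.0, exponent=0.5):
    """Transform raw, non-negative intensities before scaling.

    `offset` (avoids log of zero) and `exponent` are assumed defaults.
    """
    if method == "none":
        return list(values)
    if method == "log":
        return [math.log10(v + offset) for v in values]
    if method == "power":
        return [v ** exponent for v in values]
    raise ValueError(f"unknown transformation: {method}")

def scale_column(values, method="pareto"):
    """Scale one variable with 'none', 'center', 'uv', or 'pareto'."""
    if method == "none":
        return list(values)
    mu = mean(values)
    centered = [v - mu for v in values]
    if method == "center":
        return centered
    sd = stdev(values)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        return centered  # constant variable: nothing to scale
    if method == "uv":
        return [c / sd for c in centered]  # divide by the SD
    if method == "pareto":
        return [c / math.sqrt(sd) for c in centered]  # divide by sqrt(SD)
    raise ValueError(f"unknown scaling method: {method}")

# Hypothetical peak intensities of one variable across five samples.
intensities = [120.0, 95.0, 4300.0, 150.0, 110.0]
pareto_scaled = scale_column(intensities, "pareto")
log_uv_scaled = scale_column(transform_column(intensities, "log"), "uv")
print([round(v, 2) for v in pareto_scaled])
print([round(v, 2) for v in log_uv_scaled])
```

UV scaling gives every variable the same variance, so low-intensity (and potentially noisy) variables weigh as much as intense ones. Pareto scaling divides by the square root of the standard deviation and therefore keeps part of the original intensity structure.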

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Tugizimana, F.; Steenkamp, P.A.; Piater, L.A.; Dubery, I.A.
A Conversation on Data Mining Strategies in LC-MS Untargeted Metabolomics: Pre-Processing and Pre-Treatment Steps. *Metabolites* **2016**, *6*, 40.
https://doi.org/10.3390/metabo6040040
