Pure Ion Chromatograms Combined with Advanced Machine Learning Methods Improve Accuracy of Discriminant Models in LC–MS-Based Untargeted Metabolomics

Untargeted metabolomics based on liquid chromatography coupled with mass spectrometry (LC–MS) can detect thousands of features in samples and produce highly complex datasets. The accurate extraction of meaningful features and the building of discriminant models are two crucial steps in the data analysis pipeline of untargeted metabolomics. In this study, pure ion chromatograms were extracted from a liquor dataset and a left-sided colon cancer (LCC) dataset by the K-means-clustering-based Pure Ion Chromatogram extraction method version 2.0 (KPIC2). Then, nonlinear low-dimensional embedding by uniform manifold approximation and projection (UMAP) showed the separation of samples from different groups in the reduced dimensions. Discriminant models were established by extreme gradient boosting (XGBoost) based on the features extracted by KPIC2. Results showed that the features extracted by KPIC2 achieved 100% classification accuracy on the test sets of both the liquor and LCC datasets, compared with 92% and 96% for XCMS, respectively, which demonstrates the rationality of the XGBoost model based on KPIC2. Finally, XGBoost achieved better performance than the linear method and traditional nonlinear modeling methods on these datasets. UMAP and XGBoost are integrated into the KPIC2 package to extend its performance in complex situations; they not only effectively handle nonlinear datasets but also greatly improve the accuracy of data analysis in untargeted metabolomics.


Introduction
Metabolomics aims at the unbiased and comprehensive quantification of metabolites in organisms, tissues, or cells [1,2]. For untargeted metabolomics, the goal is the simultaneous detection of as many metabolites as possible in samples and the discovery of metabolomic changes between groups [3]. At present, the combination of chromatography and mass spectrometry has become the key technology for the analysis of metabolites in biological systems [4,5]. Compared with gas chromatography coupled to mass spectrometry (GC-MS) [6][7][8], high-performance liquid chromatography-mass spectrometry (LC-MS) can analyze semi-polar and less volatile compounds over a wider mass range without derivatization [9][10][11]. The samples are separated on the chromatographic column after injection and identified by analyzing the spectra acquired by the mass spectrometer. Since each eluted metabolite produces multiple mass signals, such as fragments, adducts and isotope peaks, LC-MS data contain thousands of metabolomic features for complex samples. Therefore, pre-processing methods are needed to extract meaningful features for further statistical analysis.
The t-distributed stochastic neighbor embedding (t-SNE) is a non-linear dimensionality reduction method that converts the similarity of data points into joint probabilities and minimizes the Kullback-Leibler divergence between the low-dimensional and the high-dimensional data [33]. However, t-SNE suffers from limitations such as loss of global data structure, slow calculation speed and inability to meaningfully represent big datasets [34]. Uniform manifold approximation and projection (UMAP) has been constructed from a theoretical framework based on Riemannian geometry and algebraic topology [35,36]. It can achieve a better balance between local and global data structures than t-SNE and PCA.
The discriminant models should be built to learn the decision rules, predict new samples and screen the biomarkers. Therefore, advanced machine learning methods and variable selection methods are required. Partial least squares discriminant analysis (PLS-DA) [37], orthogonal partial least squares discriminant analysis (OPLS-DA) [38] and Random Forest (RF) [39] were implemented in KPIC2 for pattern recognition and biomarker selection. PLS-DA is widely used in chemometrics and metabolomics, and it is based on group membership encoding and partial least squares regression [40]. OPLS-DA [41] divides the variables into predictive and orthogonal information with orthogonal signal correction (OSC) technology [42]. Compared with PLS-DA, it provides better visualization and interpretation and has been widely used in modeling and biomarker discovery for metabolomics [43,44]. However, PLS-DA and OPLS-DA are suitable for linear and collinear datasets and may fail on nonlinear datasets. RF is an ensemble machine learning method based on classification and regression trees (CART), and it can classify high-dimensional and nonlinear datasets well [39]. In 2014, extreme gradient boosting (XGBoost) was released by Chen et al., which achieves both fast calculation and excellent performance [45,46]. It sequentially builds weak learners, each fitting the residuals of the previous predictions, and aggregates them to reduce both the bias and variance of the prediction.
In this study, advanced machine learning methods, UMAP and XGBoost, are integrated into KPIC2 to improve its performance on complex LC-MS-based metabolomic datasets. The KPIC2 framework was applied to perform extraction of PICs, peak detection, peak alignment, grouping and missing value filling for the liquor and LCC datasets. Peak tables containing retention time, m/z and intensity were obtained. The features were visualized by PCA, t-SNE and UMAP, respectively, and XGBoost was used to build the discriminant models. Since XCMS is still widely used, it is necessary to study the performance of the XGBoost model based on KPIC2 compared with XCMS. Therefore, the proposed pipeline comprising KPIC2 and XGBoost was compared with XCMS by evaluating the accuracy of the models. In order to evaluate the advantages of the XGBoost model, the XGBoost model based on the features extracted by KPIC2 was compared with PLS-DA, support vector machine (SVM) and Random Forest (RF). The schematic diagram of this proposed data analysis pipeline is depicted in Scheme 1.

Scheme 1. Schematic diagram of the proposed data analysis pipeline for complex liquid chromatography-mass spectrometry (LC-MS)-based untargeted metabolomics. It can be divided into five parts: the extraction of metabolites from samples, LC-MS analysis, data preprocessing, visualization and statistical analysis. In the study, the K-means-clustering-based Pure Ion Chromatogram extraction method version 2.0 (KPIC2) is used to extract pure ion chromatograms. The samples from different groups are visualized through uniform manifold approximation and projection (UMAP). Extreme gradient boosting (XGBoost) is used to build discriminant models and screen differential metabolites.

Theory of KPIC2
The details of the theoretical parts of KPIC2 have been described in reference [29]. KPIC2 is an integrated framework developed for metabolomics studies. It can extract pure ions by optimal k-means clustering, detect pure ions accurately, align PICs across samples, group PICs to identify isotope and potential adduct PICs, fill missing peaks and perform multivariate pattern recognition. KPIC2 is an effective analytical framework for metabolomics datasets that integrates the concept of pure ion chromatograms, which can improve the accuracy of quantification, classification and biomarker identification. In addition, KPIC2 is implemented in the R programming language and is available as an open-source software package.

Visualization Methods
UMAP is a novel manifold learning technique for visualization and non-linear dimensionality reduction [35]. The algorithm is based on three assumptions about the data: (1) the data are uniformly distributed on a Riemannian manifold; (2) the Riemannian metric is locally constant; (3) the manifold is locally connected. Based on these assumptions, UMAP can model the manifold with a fuzzy topology. The embedding is found by searching for the low-dimensional projection whose fuzzy topological structure is the closest equivalent to that of the data. Compared with t-SNE and PCA, its dimensionality reduction is fast and achieves a better balance between the local and global data structures. This allows high-quality embeddings of larger datasets to be generated in two or three dimensions for visualization.

XGBoost
XGBoost is an optimized and highly efficient gradient boosting decision tree (GBDT) implementation and can solve problems with beyond billions of samples [45]. It performs a second-order Taylor expansion of the loss function and adds a regular term, which effectively avoids overfitting and speeds up convergence. XGBoost continuously creates new decision trees that fit the residuals of previous predictions to improve the accuracy. The model can be expressed in additive form as Equation (1):

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F} \tag{1}$$

where $\hat{y}_i$ is the predicted value of the model, $K$ is the number of trees, $f_k$ represents the $k$-th sub-model, $x_i$ is the $i$-th sample in the data and $\mathcal{F}$ represents the set of all trees.
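Equation (1) says only that the ensemble prediction is the sum of the outputs of the $K$ sub-models. A toy sketch (the two stump functions below are hypothetical weak learners, not part of the paper's models):

```python
# Toy illustration of Equation (1): the prediction for a sample x is the
# sum of the outputs of K sub-models f_k. Both stumps are hypothetical.
def f1(x):  # first weak learner: a decision stump on feature 0
    return 0.5 if x[0] > 0 else -0.5

def f2(x):  # second weak learner: a decision stump on feature 1
    return 0.2 if x[1] > 1 else -0.1

trees = [f1, f2]

def predict(x):
    return sum(f(x) for f in trees)

print(predict([1.0, 2.0]))  # 0.5 + 0.2 = 0.7
```

In XGBoost the trees are not fixed in advance; each new $f_t$ is chosen to reduce the objective described next.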
First, a set of training samples consisting of input vectors $x$ and output variables $y$ is given: $T = \{(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N)\}$. The XGBoost model is trained by optimizing an objective function that consists of a loss function and a regular term, as in Equation (2):

$$\mathrm{obj}^{(t)} = \sum_{i=1}^{N} l\left(y_i, \hat{y}_i^{(t)}\right) + \sum_{k=1}^{t} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \tag{2}$$

where $\mathrm{obj}^{(t)}$ is the objective function of the $t$-th iteration, $\hat{y}_i^{(t)}$ is the predicted value of the $t$-th iteration and $\Omega(f_k)$ represents the regular term of the model. The $\gamma$ and $\lambda$ are the parameters of the regular term that control the complexity of the decision trees, $T$ is the number of leaf nodes in the decision tree and $w_j$ is the weight of the $j$-th leaf. The regular term $\Omega(f_k)$ simplifies the model and prevents over-fitting.

A second-order Taylor expansion is applied to this objective function, with $l(y_i, \hat{y}_i^{(t-1)})$ as the loss of the $i$-th sample and $\hat{y}_i^{(t-1)}$ as the independent variable:

$$\mathrm{obj}^{(t)} \approx \sum_{i=1}^{N} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t), \qquad g_i = \partial_{\hat{y}^{(t-1)}}\, l\left(y_i, \hat{y}^{(t-1)}\right), \quad h_i = \partial^2_{\hat{y}^{(t-1)}}\, l\left(y_i, \hat{y}^{(t-1)}\right) \tag{3}$$

where $g_i$ and $h_i$ are the first and second derivatives of the loss, respectively. This makes the value of the objective function depend only on $g_i$ and $h_i$, and the minimum of the objective function can be found by derivation. Then, important features are selected by finding the optimal split point, and the scoring formula used for node splitting in the tree model can be derived as Equation (4):

$$\mathrm{Gain} = \frac{1}{2}\left[ \frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma \tag{4}$$

Here, $I$ is the subset of the available observations in the current node and $I_L$, $I_R$ are the subsets of the available observations in the left and right nodes after the split. Finally, the boosted tree model is obtained through successive iterations.
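The split-scoring formula of Equation (4) can be evaluated with plain arithmetic. A minimal sketch, assuming squared-error loss (so that $g_i$ is the residual and $h_i = 1$ for every sample); the gradient values below are illustrative:

```python
# Split gain from Equation (4). For squared-error loss, g_i is the
# residual of sample i and h_i = 1.
def split_gain(g, h, left_idx, lam=1.0, gamma=0.0):
    """Score a candidate split: left_idx lists samples sent to the left child."""
    GL = sum(g[i] for i in left_idx)
    HL = sum(h[i] for i in left_idx)
    G, H = sum(g), sum(h)
    GR, HR = G - GL, H - HL
    return 0.5 * (GL**2 / (HL + lam)
                  + GR**2 / (HR + lam)
                  - G**2 / (H + lam)) - gamma

g = [-1.0, -0.8, 0.9, 1.1]   # residuals: two clearly different groups
h = [1.0, 1.0, 1.0, 1.0]
# Splitting between the groups scores higher than a mixed split:
print(split_gain(g, h, left_idx=[0, 1]), split_gain(g, h, left_idx=[0, 2]))
```

XGBoost evaluates this gain for every candidate split and keeps the one with the highest score; a negative gain (driven by $\gamma$) prunes the split.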
There are three reasons XGBoost has better performance and efficiency than GBDT: (1) The regular term is introduced in the XGBoost model to control the complexity of the model, which makes the trained model simpler and prevents overfitting. (2) Before the XGBoost model is trained, the data are sorted in advance and then saved as a block structure. Therefore, the gain calculation of each feature can be performed in multiple threads. (3) Column-sampling works similarly to random forests, which not only reduces overfitting but also reduces calculations.

Datasets

Liquor Dataset
This dataset is publicly available, and it was described in reference [47]. Six types of liquor samples (BJ-1, BJ-2, BJ-3, BJ-4, BJ-5, BJ-6) were purchased from the local market, and 6 repeats were sampled for each type. Each type of liquor was brewed through a special winemaking process. The specific experimental procedure is as follows: 500 µL of liquor sample was added to the sample tube, and the sample was diluted to 1 mL using water/methanol (1/1, v/v) solution. The samples were then vortexed for 5 min and analyzed by Agilent 1290-6545 UPLC-QTOF (Agilent Technologies, Santa Clara, CA, USA).

LCC Dataset
This dataset is also publicly available on the MetaboLights repository with the identifier MTBLS1129. Samples of the left-sided colon cancer (LCC) dataset were obtained from surgical colectomy specimens, and they were selected from male patients who were ≥55 years old. All normal colon tissues were selected from stage I-IV colorectal cancer (CRC) patients (n = 27), and tumor tissue samples were selected from LCC stage I-III (n = 54). The detailed experimental protocol of this dataset was described in reference [48]. A UPLC system (H-Class ACQUITY, Waters Corporation, MA, USA) coupled to a quadrupole time-of-flight (QTOF) mass spectrometer (Xevo G2-XS QTOF, Waters Corporation, MA, USA) was used for MS data acquisition.

Comparison of KPIC2 and XCMS
The raw data of LC-MS instruments are often stored in a proprietary format. Therefore, raw files were exported in mzXML format and converted from mzXML to mzML format using OpenMS (version = 2.4.0, Python package) [19]. The mzML files were imported into XCMS (version = 3.11.3, R package) for data preprocessing. The mzXML files were imported into KPIC2 (version = 2.4.0, R package) for data preprocessing.
KPIC2 is an integrated framework for metabolomics research. It can accurately detect PICs, align PICs between samples, group PICs to identify isotopes and potential adducts, and fill in missing peaks. XCMS is the de facto standard to process untargeted metabolomic data, which comprises nonlinear retention time alignment, matching filtering, peak detection, peak matching and missing value filling. CAMERA [49] (version = 1.44.0, R package) is used to annotate the isotope peaks and adduct ions in peak lists detected by XCMS. KPIC2 and XCMS were applied to the liquor dataset and the LCC dataset. Feature detection, peak alignment, grouping between samples and missing value filling were performed on these datasets. The peak tables containing retention time, m/z and intensity were obtained for subsequent statistical analysis.

Pattern Recognition
With the features extracted in the previous section, advanced machine learning methods were applied for pattern recognition of the LC-MS-based metabolomic datasets, including visualization and modeling. The features were normalized by removing the mean and scaling to unit variance. Samples in the liquor dataset and the LCC dataset were dimensionally reduced and visualized through PCA, t-SNE and UMAP; the low-dimensional plots can reveal the patterns in the datasets. Then, each dataset was divided randomly into a training set (67%) and a test set (33%). The training set was used to build the XGBoost models, and the test set was used to evaluate their performance. In order to establish reliable and accurate models, some important parameters of XGBoost were optimized. In addition, some traditional methods, including PLS-DA, SVM (kernel = "rbf") and RF, were applied to the classification of the samples in each dataset, and their parameters were optimized through grid search. Finally, the classification results were assessed with the evaluation criteria for discriminant models.
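The preprocessing steps described above (z-score scaling to zero mean and unit variance, followed by a random 67/33 split) can be sketched in plain Python. The tiny peak table below is illustrative, not from the paper's datasets:

```python
import random
import statistics

def standardize(X):
    """Scale each feature (column) to zero mean and unit variance."""
    cols = list(zip(*X))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard constant columns
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]

def train_test_split(X, y, test_frac=0.33, seed=42):
    """Random split into training and test portions."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_test = round(len(X) * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    return ([X[i] for i in train], [X[i] for i in test],
            [y[i] for i in train], [y[i] for i in test])

# Illustrative 6-sample, 2-feature peak table with binary labels.
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0], [6.0, 60.0]]
y = [0, 0, 0, 1, 1, 1]
Xs = standardize(X)
X_tr, X_te, y_tr, y_te = train_test_split(Xs, y)
print(len(X_tr), len(X_te))  # 4 2
```

In practice, the scaling statistics should be computed on the training set only and then applied to the test set, to avoid information leakage.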

Evaluation Criteria
The performance of the XGBoost models must be evaluated by the proper evaluation criteria. In this study, the accuracy, precision, recall, F1 score (F1_score) and receiver operating characteristic (ROC) curve were used as the criteria to evaluate the XGBoost models.
The accuracy is the ratio of the number of samples correctly classified by the classifier to the total number of samples in the dataset, as shown in Equation (5):

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{5}$$

where TP is the number of positive samples predicted to be positive by the model, TN is the number of negative samples predicted to be negative, FP is the number of negative samples predicted to be positive and FN is the number of positive samples predicted to be negative. The precision is the ratio of the number of positive samples predicted correctly to the number of samples predicted to be positive, as shown in Equation (6):

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{6}$$

The recall (or sensitivity) is the ratio of the number of positive samples predicted correctly to the total number of positive samples, as shown in Equation (7):

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{7}$$

The specificity is the ratio of the number of negative samples predicted correctly to the total number of negative samples, as shown in Equation (8):

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{8}$$

F1_score is the harmonic mean of precision and recall, as shown in Equation (9):

$$\mathrm{F1\_score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{9}$$
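The five criteria above follow directly from the four confusion-matrix counts; a minimal sketch with illustrative counts (not taken from the paper's results):

```python
# Equations (5)-(9) computed from confusion-matrix counts.
def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f1

# Illustrative counts for a binary classifier:
acc, prec, rec, spec, f1 = metrics(tp=18, tn=9, fp=1, fn=2)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(spec, 3), round(f1, 3))
# 0.9 0.947 0.9 0.9 0.923
```

Note that F1 equals 2TP / (2TP + FP + FN), so it ignores TN; this is why specificity is reported separately.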
The ROC curve is a comprehensive criterion that reflects the sensitivity and specificity of continuous variables. For binary classification problems, each point of the curve represents a threshold, and the classifier gives each sample a score. If the score is greater than the threshold, the sample is considered positive; if the score is less than the threshold, it is considered negative. The horizontal axis of the curve is the false positive rate, that is, the ratio of the number of negative samples predicted incorrectly to the total number of negative samples. The vertical axis of the curve is the true positive rate, that is, the ratio of the number of positive samples predicted correctly to the total number of positive samples.

Liquor Dataset
As shown in Figure 1, the numbers of features extracted by KPIC2 and XCMS were 763 and 735, respectively, and 504 common features were extracted by both methods. Therefore, approximately 75% of the features can be extracted by both KPIC2 and XCMS, which means that the PICs detected by the KPIC2 method are reliable. There are also some features unique to each method because of their different principles. A detailed description of the rationality of KPIC2 has been given in reference [29].

LCC Dataset
The LCC dataset was processed by KPIC2 and XCMS. Similarly, the parameters of XCMS were optimized by IPO, and the parameters of KPIC2 were set according to XCMS. The numbers of extracted features are shown in Figure 2. It can be seen from the figure that approximately 70% of the features can be detected by both KPIC2 and XCMS: the numbers of features unique to KPIC2 and XCMS were 433 and 287, respectively, and 958 common features were extracted by both methods. The remaining features are unique to each method because of their different principles. More features were extracted by KPIC2 than by XCMS, which may be because KPIC2 can eliminate noise signals so that true features are not masked by noise [29].

Discriminant Models for Liquor and LCC Datasets
The XGBoost model is an advanced machine learning algorithm with fast calculation speed and excellent performance, and it is robust to overfitting. Therefore, XGBoost was used to build discriminant models based on the liquor and LCC datasets. The discriminant model was trained with the training set to learn decision rules for future prediction. Since XCMS is still widely used, it is necessary to study the performance of the XGBoost model based on KPIC2 compared with XCMS. In this study, the training sets of the liquor and LCC datasets were used for training the XGBoost models. With the established models, we can compare the discriminant ability of features extracted by XCMS and KPIC2.
The parameters of XGBoost directly affect the performance of the model. The basis of the XGBoost algorithm is the gradient boosting algorithm, so the parameters related to the boosting process (including n_estimators, learning_rate, silent and subsample) were optimized first. Then, the parameters related to the weak learners (including max_depth, booster, gamma, min_child_weight and colsample_bytree) were optimized through cross-validation. The optimal parameters were used to build the XGBoost model, and the performance of the models was evaluated by the evaluation criteria in Section 2.6. The optimized parameters and evaluation criteria of the models are shown in Table 1. The results show that the XGBoost models based on KPIC2 have high accuracy, precision, recall and F1 score on the test sets, which means that the features extracted by KPIC2 have a higher discriminant ability. Therefore, the combination of KPIC2 and XGBoost is reasonable and can classify different sample groups more efficiently and accurately. Moreover, this lays the foundation for the subsequent accurate analysis of differential metabolites.

The confusion matrices of the XGBoost models on the test sets of the liquor and LCC datasets are shown in Tables 2 and 3, respectively. From the results of the liquor dataset listed in Table 2, it can be seen that the KPIC2-based XGBoost model achieves 100% testing accuracy, while the XCMS-based model reaches only 92%. The classification results for the LCC dataset are shown in Table 3, and the classification performance based on the features extracted by KPIC2 is better than that based on XCMS.

The trained model assigns a probability score to each sample. At each threshold cut-off value of the score, the true positive rate and false positive rate were determined, and the ROC curve was used to investigate the performance of the machine learning model at each threshold.
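The staged parameter search described above can be sketched as an exhaustive grid search with cross-validated scoring. This is a schematic only: the scoring function below is a hypothetical stand-in, whereas in practice it would train an XGBoost model with the given parameters on each fold and return the mean validation accuracy:

```python
import itertools

# Schematic grid search over a subset of the XGBoost parameters
# named in the text. The grid values are illustrative.
param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 6],
}

def cv_score(params, n_folds=3):
    # Hypothetical scorer standing in for k-fold cross-validation of an
    # XGBoost model; here it simply prefers the smaller learning rate/depth.
    base = 0.9 - 0.5 * params["learning_rate"] - 0.01 * params["max_depth"]
    return sum(base for _ in range(n_folds)) / n_folds

best_params, best_score = None, float("-inf")
for values in itertools.product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    score = cv_score(params)
    if score > best_score:
        best_params, best_score = params, score

print(best_params)
```

Searching the boosting parameters first and the tree parameters second, as the text describes, keeps the grid small compared with one joint search over all parameters.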
In the liquor dataset, the ROC curves of the XGBoost models with features extracted by KPIC2 and XCMS are displayed in Figure 3A,B. The AUC on the test set for KPIC2 is 1.00, indicating that the model can fully separate samples from different groups. Since this is a multi-class problem, the true positive rate and false positive rate of each class at each score threshold are calculated and shown in Figure 3. In the LCC dataset, the AUC of each class is shown in Figure 3C,D, and the AUC values based on the features extracted by KPIC2 are better than those of XCMS on the whole.
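The ROC construction described above (sweep the score threshold, record the false and true positive rates, then integrate for the AUC) can be sketched in plain Python. The labels and scores below are illustrative:

```python
# ROC curve from scores: at each threshold, TPR is the fraction of
# positives scored at or above it and FPR the fraction of negatives.
def roc_points(y_true, scores):
    thresholds = sorted(set(scores), reverse=True)
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    points.append((1.0, 1.0))
    return points

def auc(points):
    # Trapezoidal integration over the (FPR, TPR) curve.
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

y = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]   # a perfect ranking of the positives
print(round(auc(roc_points(y, scores)), 6))  # 1.0
```

For the multi-class liquor dataset, this one-vs-rest computation would be repeated per class, as in Figure 3.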

Visualization of Liquor and LCC Datasets
Here, PCA, t-SNE and UMAP were used to reduce dimensionality and visualize the samples in the liquor and LCC datasets. KPIC2 was used to extract the peak tables from these datasets, and the peak tables were then visualized by PCA, t-SNE and UMAP. Since real metabolomic data contain uncertainties and their ground-truth global and local structures are unknown, the Woolly Mammoth dataset was chosen to evaluate the differences between PCA, t-SNE and UMAP (Figures S1 and S2). The results show that, compared to PCA and t-SNE, UMAP achieves the best visualization in preserving both local and global structures in the reduced dimensions.
In Figure 4, the PCA, t-SNE and UMAP analyses of the liquor dataset are shown. In Figure 4A, BJ-3, BJ-5 and BJ-6 are not well separated. Compared with the results of PCA, t-SNE can separate the different liquor samples (Figure 4B). In Figure 4C, UMAP separates the different liquor samples within a tighter coordinate range, and each group is more compactly clustered. This is because UMAP can balance the global and local data structures through the n_neighbors parameter, while min_dist adjusts the tightness between samples. Therefore, the separation between different clusters of the liquor dataset is better with UMAP than with PCA and t-SNE. For the LCC dataset, the separation boundary between the two groups is not obvious, and the samples within each group are too scattered in the PCA plot (Figure 5A). Compared with the PCA plot, the t-SNE analysis makes the trend of separation between the two groups more obvious, but the groups are still dispersed (Figure 5B). In Figure 5C, the two sets of samples again show a clear trend of separation, and UMAP achieves better aggregation of the samples within a smaller axis range. Therefore, UMAP better shows the aggregation tendency of the samples, with smaller within-group spread. This indicates that the features extracted by KPIC2 lay a solid foundation for subsequent statistical analysis, and the introduction of UMAP better visualizes the aggregation trends between different groups. Compared to PCA, UMAP is a non-linear dimensionality-reduction method, which gives it wider applicability to datasets.

Comparison of Classification Models Based on KPIC2
PLS-DA is a linear discriminant model based on PLS regression, which can be used for predictive and descriptive modeling as well as for discriminative variable selection. SVM can use different kernel functions in its decision function. RF is an ensemble learning method that works by constructing multiple decision trees. Both SVM and RF can be applied to nonlinear modeling. Based on the feature detection results of KPIC2, different classification algorithms (PLS-DA, SVM, RF and XGBoost) were compared. The classification results were evaluated in terms of the accuracy, precision, recall, specificity and F1 score of the machine learning models, and the results for the two datasets are shown in Table 4. It can be seen from Table 4 that the traditional linear discriminant model PLS-DA is obviously not applicable, as its five evaluation indicators are significantly lower than those of the other models. SVM and RF can deal with nonlinear problems and therefore have a wider range of applications. The kernel type of the SVM model can be chosen, and adding a regular term can avoid overfitting; therefore, SVM can sometimes classify better than RF. It can be seen from Figures 4A and 5A that samples of different groups do not cluster well under the linear method PCA, so these two datasets have a certain degree of nonlinearity, which again explains the poor performance of the PLS-DA model. For the liquor dataset, the classification accuracy, precision, recall, specificity and F1 score of SVM are all higher than those of RF, and the performance of the XGBoost model is much higher than that of PLS-DA, SVM and RF. This shows that the introduction of XGBoost can classify different samples well and greatly improves the accuracy of the discriminant model.
This may be because the XGBoost model iteratively learns weak classifiers to reduce both the bias and variance of the ensemble model and introduces regular terms to avoid overfitting. For the LCC dataset, the classification performance of the XGBoost model is still higher than that of PLS-DA, SVM and RF, and the nonlinear discriminant models outperform PLS-DA. Among them, the classification performance of RF is better than that of SVM. This may be because the data are not well linearly separable in the kernel-mapped space of SVM, which reduces the generalizability of the model and results in lower accuracy than RF. Therefore, the introduction of the XGBoost model can greatly improve the accuracy of data analysis, and it can be widely used on linear and non-linear datasets. In the future, the screening and identification of differential metabolites will become more accurate as a result.
In addition, the quality of the model is crucial for accurately extracting differential metabolites. After the boosted trees in XGBoost are created, the importance score of each feature can be obtained directly, so the trained XGBoost model can automatically calculate feature importance, and differential metabolites can be screened by the feature importance of XGBoost. In this study, the higher accuracy obtained by introducing XGBoost will improve the accuracy of the screened differential metabolites.

Conclusions
In this study, we extended KPIC2 with UMAP and XGBoost to analyze complex LC-MS-based untargeted metabolomic datasets. KPIC2 was used to extract PICs from the liquor and LCC datasets, and UMAP and XGBoost were used to visualize and discriminate the complex samples, respectively. The performance of the XGBoost model based on the extraction results of KPIC2 was compared with that based on XCMS. The results show that the features extracted by KPIC2 achieve better classification accuracy (100%) on the test set of the liquor dataset than those extracted by XCMS (92%); the result based on KPIC2 is also better than that of XCMS on the LCC dataset. Therefore, the combination of KPIC2 and the XGBoost model is reasonable, and samples can be accurately classified based on the features extracted by KPIC2. By visualizing the Woolly Mammoth dataset, it was shown that UMAP gives the best visualization results in preserving both local and global structures in the reduced dimensions. The PCA, t-SNE and UMAP methods were used to visualize the liquor and LCC datasets, which also shows that UMAP can generate reasonable visualizations in reduced dimensions for metabolomic datasets. Finally, the performance of the XGBoost model with features extracted by KPIC2 exceeds that of PLS-DA, SVM and RF on these two datasets. The combination of KPIC2, UMAP and XGBoost has the potential to be a promising pipeline for analyzing LC-MS-based untargeted metabolomic datasets of complex samples. In addition, this method can effectively avoid overfitting due to the introduction of regular terms in the XGBoost model. In the future, we will use the feature importance of XGBoost to screen biomarkers for further pathway analysis and mechanism investigations.