1. Introduction
Direct infusion mass spectrometry (DIMS) makes it possible to characterize human metabolomes in just a few minutes. By comparison, liquid or gas chromatography mass spectrometry (LC-MS/GC-MS) typically requires 15 to 30 min per analysis. This time cost significantly limits screening for disease risks. To increase the throughput of mass spectrometry, the DIMS method was proposed as an alternative to GC-MS or LC-MS. Here, samples are sent directly to the mass detector without prior chromatographic separation [
1,
2,
3].
The limitation of the DIMS method is related to the interpretation of the resulting mass spectra. It is often extremely difficult to reliably identify which metabolites are present in the biological sample under study. According to the Human Metabolome Database (HMDB [
4]), about 3000 endogenous metabolites with the status of “Detected and Quantified” or “Detected but not Quantified” have been detected in human blood via mass spectrometry using electrospray ionization. More than 20,000 metabolites in HMDB carry the status of “Expected but not Quantified” or “Predicted”. Thus, the compounds that have been found experimentally in human plasma are only a small fraction of all the compounds involved in cellular metabolism.
In order to increase the number of metabolites identified in plasma, a pipeline for processing direct infusion mass-spectrometric data—DataAnalysis/MatLab—was proposed by Lokhov and colleagues [
5]. The pipeline involves obtaining lists of masses in the COMPASS DataAnalysis program and aligning the masses found in a series of samples using the authors’ alignment algorithm implemented in MatLab [
6]. The algorithm created in 2011 involved building a matrix of mass-spectrometric peaks, then calculating the correlation coefficient for each row of the matrix and combining rows with a negative correlation. After alignment, the peaks were annotated, taking into account the biological context. The annotation algorithm was implemented in MatLab by Rogers and coauthors [
7], and later improved [
8].
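The alignment idea described above (a peak matrix, row-wise correlation, and merging of negatively correlated rows) can be illustrated with a minimal sketch. The source does not give the details of the published algorithm, so the choices here are assumptions: rows are m/z bins sorted by m/z, only adjacent rows are compared, a threshold of zero is used, and merged intensities are summed. The function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def merge_anticorrelated_rows(matrix, threshold=0.0):
    """Merge a row into the previous one when the two are negatively correlated.

    matrix: list of rows, one per m/z bin (sorted by m/z); each row holds one
    intensity per sample, with 0 where no peak was detected in that sample.
    Assumption: only adjacent rows are compared and merged rows are summed.
    """
    merged = [list(matrix[0])]
    for row in matrix[1:]:
        if pearson(merged[-1], row) < threshold:
            merged[-1] = [a + b for a, b in zip(merged[-1], row)]
        else:
            merged.append(list(row))
    return merged


# Toy peak matrix: 3 m/z bins x 4 samples (hypothetical intensities).
peaks = [
    [5.0, 0.0, 6.0, 0.0],  # bin 1: peak seen in samples 1 and 3
    [0.0, 4.0, 0.0, 5.0],  # bin 2: same peak, shifted, in samples 2 and 4
    [1.0, 2.0, 3.0, 4.0],  # bin 3: an unrelated peak
]
aligned = merge_anticorrelated_rows(peaks)
```

In this toy case the first two bins never contain a peak in the same sample, so they are negatively correlated and are merged into one row, while the third bin is kept separate.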
The DataAnalysis/MatLab approach [
5] for processing metabolomic direct infusion mass spectra is primarily based on the capabilities of the commercial program COMPASS DataAnalysis (Bruker Daltonics, Billerica, MA, USA). Usually, the COMPASS program allows the detection of peaks that can correspond to low-molecular-weight compounds. The peak annotation of mass spectra has also been performed with the web tool MASSTrix [
9]. However, for the subsequent statistical analysis of annotations, the MatLab development environment (MathWorks, MA, USA) is required. Some home-made scripts implemented in the MatLab environment are not available for installation through repositories. We collected some of the original DIMS data from open repositories, including DI-ESI-QTOF datasets in MetaboLights [
10] or the Metabolomics Workbench [
11]. These data show that preprocessing methods are limited, either by the application of commercial programs or due to the lack of laboratory scripts shared on GitHub. As a consequence, it is currently impossible to repeat the computational experiment for determining the metabolomic profile obtained by direct infusion mass spectrometry. This significantly distinguishes metabolomics from proteomics and genomics, for which examples of public software for data processing and analysis have been developed, such as MaxQuant [
12], SearchGUI [
13] and Guppy basecaller [
14].
This work aims to develop an alternative approach for processing direct infusion mass spectra based on freely available software. We propose the MALDIquant/Mummichog pipeline based on the functionality of the publicly available MALDIquant package [
15]. Our method, similar to that described above [
5], implements the following steps (see
Figure 1 and
Figure S1). First, mass-spectrometric scans are combined into one consensus mass spectrum, and then intensity smoothing, baseline removal, peak detection and equalization are performed. The m/z values are then matched to metabolites and metabolic pathways using the MetaboAnalystR–Mummichog module.
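The first two steps listed above, building a consensus spectrum and removing the baseline, can be sketched in a few lines. This is an illustration in Python rather than the R code of the pipeline, and it rests on assumptions not stated in the text: all scans share one m/z grid, the consensus is a plain mean, and the baseline is estimated with a simple morphological opening (moving minimum followed by moving maximum). The function names and the toy data are hypothetical.

```python
def consensus_spectrum(scans):
    """Average the intensities of several scans that share one m/z grid."""
    n = len(scans)
    return [sum(column) / n for column in zip(*scans)]


def moving_extreme(values, hws, func):
    """Apply func (min or max) over a centered window of half-width hws."""
    return [func(values[max(0, i - hws): i + hws + 1]) for i in range(len(values))]


def opening_baseline(intensity, hws):
    """Baseline estimate: moving minimum (erosion) followed by moving maximum (dilation)."""
    eroded = moving_extreme(intensity, hws, min)
    return moving_extreme(eroded, hws, max)


def remove_baseline(intensity, hws):
    """Subtract the estimated baseline from the spectrum."""
    baseline = opening_baseline(intensity, hws)
    return [y - b for y, b in zip(intensity, baseline)]


# Three toy scans with a constant offset of 2 and one peak in the middle.
scans = [
    [2, 2, 8, 2, 2],
    [2, 2, 10, 2, 2],
    [2, 2, 9, 2, 2],
]
consensus = consensus_spectrum(scans)
corrected = remove_baseline(consensus, hws=1)
```

The window half-width must be larger than the peak half-width, otherwise the opening follows the peak itself and part of the signal is subtracted together with the baseline.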
In this work, we rely on the equivalence of the algorithms for processing the mass spectra obtained on MS platforms with two different types of ionization: matrix-assisted laser desorption/ionization (MALDI) and electrospray. The MALDIquant program, a freely distributed package for the R language, is applicable to MALDI mass spectra. The mass spectra obtained by electrospray ionization (ESI) with direct infusion into the ion source of the QTOF instrument are similar to MALDI-MS spectra: in both cases, the sample is not subjected to chromatographic separation.
For a comparative analysis of the DataAnalysis/MatLab and MALDIquant/Mummichog approaches, data on the metabolomic profiling of blood plasma previously obtained by Lokhov and colleagues have been used [
5]. In that study the mass spectra were acquired on a high-resolution hybrid quadrupole time-of-flight mass analyzer (QTOF, electrospray ionization), maXis Impact II (Bruker Daltonics, Billerica, MA, USA).
3. Results and Discussion
We aimed to determine whether the MALDI spectra processing tool was applicable to the data obtained by direct infusion with the electrospray ion source on a hybrid quadrupole time-of-flight mass analyzer (DI-ESI-QTOF). A routinely used pipeline for processing mass-spectrometric data (peak detection, annotation of compounds, mapping to metabolic pathways) was implemented using open-source programs [
15,
17,
19,
20,
22]. The most highly represented (by the number of annotated compounds) metabolic pathways were compared between two approaches: DataAnalysis/MatLab (commercial software) and MALDIquant/Mummichog (open-source solution).
When using the open-source solution, the essential criterion was comparability with the published approach [
5], in terms of the detected number of peaks. Recently, the DEIMoS package appeared in the open-source domain [
25]. This represents a significant step in metabolomics from proprietary data and algorithms to reproducible pipelines. We evaluated DEIMoS on our dataset. Because of the wide functionality of the package, we were unable to set up baseline correction and peak assembly, which are simple in MALDIquant. XCMS, which was originally developed for processing LC-MS data, has also gained well-deserved popularity. We tested the XCMS pipeline for processing direct infusion FT-ICR mass spectrometry data [
26]. Using the XCMS package, we detected 447 ± 141 peaks per sample, which is substantially lower than when processing DIMS spectra in the DataAnalysis program (see
Table 1). The MALDIquant package, which was originally developed for two-dimensional mass spectrometry data, detected 9274 ± 297 peaks, so MALDIquant was chosen for further analysis.
Table 1 presents the results of employing combinations of moving average and Savitzky–Golay filters with two noise-estimation methods: median absolute deviation (MAD) and SuperSmoother. Using the moving average method as an example, we have shown how the number of detected peaks depends on the smoothing half-window size parameter (hws). With a narrow window (hws = 1), a relatively larger number of peaks was obtained, up to 10,000 on average. With a wider window (hws = 4), the number of peaks was 30–40% lower than the reference value. However, when we smoothed with the Savitzky–Golay algorithm instead of a moving average, the average number of peaks (9274 ± 297) almost exactly matched the number obtained using DataAnalysis (9333 peaks). We therefore chose the Savitzky–Golay method in the MALDIquant approach for further analysis (indicated in
Table 1). Note that a matching number of peaks does not mean that the peaks obtained by the different tools have identical m/z values. Therefore, in order to correctly compare the previously published approach [
5] with the proposed pipeline, it was necessary to map the metabolomic profiles to biochemical pathways.
Figure 2a shows that the DataAnalysis/MatLab approach made it possible to annotate the chemical names for 390 metabolites (
p-value < 0.01) in plasma samples from volunteers with the third stage of obesity. The developed MALDIquant/Mummichog algorithm made it possible to annotate 920 metabolites (with a relaxed
p-value = 0.99) in the same data. When setting a cutoff
p-value < 0.01 (Wilcoxon test result when comparing volunteers and patients) in our pipeline, we obtained only seven metabolites (see below and
Table S2), so when building the Venn diagram, the set of 920 metabolites in MALDIquant/Mummichog, annotated with a relaxed
p-value, was used (
Figure 2a).
To compare the results of the approaches, the metabolites involved in the biosynthesis of steroid hormones and found in samples from patients in the third stage of obesity were selected as an example (see
Figure 2b). The intersection of the sets of metabolites obtained by DataAnalysis/MatLab (
n = 40) and MALDIquant/Mummichog (
n = 59) equaled 50% (
n = 33) of the total number of metabolites found by the two approaches (
n = 66). Importantly, all nodal metabolites, i.e., the metabolites with the most edges, were successfully annotated using both approaches (
Figure 2c,d).
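The overlap figures quoted above follow from simple set arithmetic, which can be checked directly. The snippet only restates the counts given in the text (40, 59 and 33); the function name is hypothetical.

```python
def overlap_summary(n_a, n_b, n_shared):
    """Return the union size and the shared fraction of the union for two sets."""
    n_union = n_a + n_b - n_shared
    return n_union, n_shared / n_union


# Counts from the text: DataAnalysis/MatLab, MALDIquant/Mummichog, shared.
n_union, shared_fraction = overlap_summary(40, 59, 33)
recovered_fraction = 33 / 40  # share of the DataAnalysis/MatLab set found again

print(n_union, shared_fraction, recovered_fraction)  # 66 0.5 0.825
```

Relative to the union of both lists the overlap is 50%, whereas relative to the DataAnalysis/MatLab list alone 33 of 40 metabolites (82.5%) were recovered.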
One of the possible reasons for the incomplete intersection in the lists of metabolites is error in MALDIquant peak detection. When using MALDI, it is common to collect signals of 10,000 laser shots to compute a consensus spectrum. Unlike MALDI, DIMS records only 60 spectra per minute, and this can affect the quality of peak recognition when building a consensus spectrum.
Another possible reason for the discrepancies in the sets of metabolites observed in
Figure 2a,b is the difference between the approaches in how an m/z value is matched to a compound identifier in KEGG; in Mummichog, the range of admissible m/z values is expanded by considering combinations of adducts [
20].
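Why adducts widen the set of candidate matches can be shown with a small calculation. The sketch below is not the Mummichog implementation: it assumes positive-mode ionization, a short list of four common adducts with their standard mass shifts, and a 10 ppm tolerance. The observed m/z value is hypothetical, and the neutral mass is the monoisotopic mass of glutamic acid.

```python
# Mass shifts (Da) of a few common positive-mode adducts. The list actually
# used by Mummichog is longer; this subset is an assumption for illustration.
ADDUCT_SHIFTS = {
    "[M+H]+": 1.007276,
    "[M+NH4]+": 18.033823,
    "[M+Na]+": 22.989218,
    "[M+K]+": 38.963158,
}


def adduct_mz(neutral_mass):
    """Theoretical m/z of each singly charged adduct of a neutral molecule."""
    return {name: neutral_mass + shift for name, shift in ADDUCT_SHIFTS.items()}


def ppm_error(observed, theoretical):
    """Relative mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def match_adducts(observed_mz, neutral_mass, tol_ppm=10.0):
    """Names of the adducts whose theoretical m/z lies within tol_ppm of the peak."""
    return [
        name
        for name, mz in adduct_mz(neutral_mass).items()
        if abs(ppm_error(observed_mz, mz)) <= tol_ppm
    ]


glutamic_acid = 147.05316  # monoisotopic mass of C5H9NO4, Da
observed = 170.0424        # hypothetical peak position

print(match_adducts(observed, glutamic_acid))  # ['[M+Na]+']
```

A matcher that considers only the protonated ion would miss this peak, while one that also considers sodium adducts assigns it. The same mechanism also increases the number of compounds that compete for a single m/z value.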
Using the Statistical Analysis module of the MetaboAnalyst 5.0 web platform, we conducted a comparative statistical analysis to search for differential metabolic markers of obesity. A comparison was carried out using the Wilcoxon test regarding the values of the intensities of metabolites in healthy volunteers compared with, in one case, all stages of obesity, and in the second case, only the third stage.
Table S2 includes 96 m/z values that differ significantly in intensity for the two listed cases. The table shows that the number of annotated metabolites is negligible compared to what was published in the previous article [
5]. A total of nine m/z values were mapped to KEGG IDs. Among the values that arose when comparing the norm and all stages of obesity, there were seven metabolites, including glycans; orotic, glutamic and arachidonic acids; and nicotinamide ribotide (see
Table S3). Comparing the norm with the third stage of obesity, only four metabolites with statistically significant intensity differences were identified.
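For readers unfamiliar with the test, a two-group Wilcoxon comparison of peak intensities can be sketched as follows. The analysis in the text was run in MetaboAnalyst, not with this code. The sketch assumes the rank-sum (Mann–Whitney) form of the test for independent groups and a normal approximation without continuity or tie correction, so its p-values are approximate for small samples. The intensity values are hypothetical.

```python
from math import erfc, sqrt


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def rank_sum_test(group_a, group_b):
    """Return (U statistic of group_a, two-sided p-value, normal approximation)."""
    n1, n2 = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    return u1, erfc(abs(z) / sqrt(2))


# Hypothetical intensities of one m/z feature in controls and patients.
controls = [1.0, 2.0, 3.0, 4.0, 5.0]
patients = [6.0, 7.0, 8.0, 9.0, 10.0]
u_stat, p_value = rank_sum_test(controls, patients)
```

Because the test uses ranks rather than raw intensities, it is insensitive to monotone rescaling within a comparison, but it still requires that intensities be comparable between samples.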
As can be seen from
Table S2, several metabolites may give the same m/z values when searched in the KEGG database. For example, the Mummichog program matched six different metabolites to a statistically significant peak (
p-value < 0.01) characterized by m/z = 148.059. Among them, 7,12-Dimethylbenz[a]anthracene 5,6-oxide most likely has an exogenous origin, entering the body with plant foods (see
Table S2). Considering the sampling conditions (in the morning, on an empty stomach), we can assume that this is an example of a false positive result, since the presence of plant metabolites in the studied human blood samples is unusual.
The negligible number of metabolites identified by statistical analysis is unsurprising for two reasons. Firstly, the absolute value of mass spectrum intensities in different samples is incomparable due to the absence of external or internal calibration. Incidentally, in the previous work [
5], such calibration was carried out. Secondly, it is necessary to take into account the key feature of how Mummichog works [
20]. This algorithm is not intended for metabolite annotation, i.e., the matching of putative analytes to KEGG identifiers is essentially a by-product. The main problem solved by Mummichog is the prediction of metabolic pathways. Therefore, our further study did not compare m/z values or lists of KEGG identifiers, but instead analyzed the enrichment of certain metabolic pathways.
The results of the enrichment assessment obtained using the MetaboAnalyst 4.0/5.0 web platform are shown in
Figure 3. The ranked lists of metabolic pathways obtained by the MSEA tool [
27] are shown for two compared approaches: DataAnalysis/MatLab (
Figure 3a) and MALDIquant/Mummichog (
Figure 3b).
Figure 3a shows the MSEA result obtained using the COMPASS DataAnalysis approach and the annotation algorithm in MatLab [
6].
Figure 3b shows a diagram constructed in a similar way for a case in which MALDIquant was used in combination with Mummichog instead of DataAnalysis and MatLab.
The application of the DataAnalysis/MatLab approach made it possible to detect metabolic pathways with high-significance values (
p-value < 0.01). The first of the metabolic pathways, steroid biosynthesis, differs between the obese and normal cohorts by more than fourfold (see the Fold Enrichment parameter). The MALDIquant/Mummichog pipeline, however, did not allow for the identification of enriched pathways associated with obesity: statistical differences in the norm compared with patients were not significant (
p-value > 0.2). For example, in
Figure 3a, the metabolic pathway of steroid biosynthesis shows the highest confidence. However,
Figure 3b shows that when using our approach to process DIMS data, this pathway has no statistical significance and barely reaches Fold Enrichment = 0.6.
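The two quantities discussed above, fold enrichment and the enrichment p-value, can be illustrated with an over-representation calculation. The sketch assumes that fold enrichment is the ratio of observed to expected hits and that significance is assessed with a one-sided hypergeometric test, which is a common choice for this kind of analysis. All counts are hypothetical and were chosen only to give fold values similar to those mentioned in the text.

```python
from math import comb


def fold_enrichment(hits, query_size, pathway_size, universe_size):
    """Observed hits divided by the hits expected by chance."""
    expected = query_size * pathway_size / universe_size
    return hits / expected


def hypergeom_pvalue(hits, query_size, pathway_size, universe_size):
    """One-sided P(X >= hits) for drawing query_size compounds without replacement."""
    total = comb(universe_size, query_size)
    upper = min(query_size, pathway_size)
    favourable = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, query_size - k)
        for k in range(hits, upper + 1)
    )
    return favourable / total


# Hypothetical setting: 1000 compounds in the library, 50 of them in the
# pathway, 100 compounds in the query list (5 hits expected by chance).
strong = (20, 100, 50, 1000)  # 20 hits observed
weak = (3, 100, 50, 1000)     # 3 hits observed

fold_strong, p_strong = fold_enrichment(*strong), hypergeom_pvalue(*strong)
fold_weak, p_weak = fold_enrichment(*weak), hypergeom_pvalue(*weak)
print(fold_strong, fold_weak)  # 4.0 0.6
```

A fold enrichment below 1 means that the pathway contains fewer hits than expected by chance, so its one-sided p-value is necessarily large.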
In addition to MSEA, the MetaboAnalyst platform provides another tool for characterizing metabolomes.
Figure 4 shows the results of the Pathway Analysis module, obtained by analyzing the results of processing the DIMS spectra in the frame of our MALDIquant/Mummichog pipeline.
The Y-axis in
Figure 4a,b characterizes how reliably a group of compounds detected by mass spectrometry is associated with a given metabolic pathway. Two situations are depicted: people with a normal body mass index (
Figure 4a) and patients suffering from stage III obesity (
Figure 4b). The dots on the scatter plot correspond to pathways in which metabolites are found relatively more frequently than would be expected. It can be seen that the compounds measured using DIMS are more likely to be related to the primary bile acid metabolism.
Figure 4 indicates that the proposed data analysis solution does not reveal significant differences in the profiles of metabolic processes between the normal and obese cohorts; in both studied cases (
Figure 4a,b), first place is held by the formation of primary bile acids, the second is the biosynthesis of steroids, the third is the biosynthesis of unsaturated fatty acids and then the metabolism of retinol and steroid hormones follows.
Although our method did not reveal the features of metabolic pathways characteristic of obesity, it is effective for exploring the metabolic pathways in normal and obese patients. This can be seen from a comparison at the level of individual metabolic pathways (
Figure 5), in which the published results of MSEA are matched to the results of the Pathway Analysis. It is important to recall that a list of metabolites annotated by the Mummichog program was loaded into Pathway Analysis, provided that each m/z value was given a
p-value = 0.99. The results of the Pathway Analysis are ranked in ascending order of
p-value, which in this case characterizes the probability of the misidentification of the metabolic pathway (see
Figure 5b).
Figure 5a,b shows that, for example, bile acid biosynthesis is related to metabolic disorders in obesity (
p-value < 10⁻⁸). If we compare the first five metabolic pathways shown in
Figure 5b with the results of MSEA (see
Figure 5a), taking into account the peculiarities of the biochemical origin of metabolites in the body, then steroid biosynthesis (“steroidogenesis”) can be attributed to “primary bile acid biosynthesis” and “steroid biosynthesis”, since the compounds included in both metabolic pathways are included by their chemical nature in the class of steroids and are derivatives of cholesterol.
Similarly, using the structural formulas of metabolites, the other two entries from the metabolite set enrichment analysis (MSEA) were matched. Using the MALDIquant/Mummichog method, the compounds involved in the metabolism of androgens, estrogens, androstenedione and estrone were identified. The listed compounds are steroid hormones and can be related to the biosynthesis of steroid hormones (
Figure 5a,b). The statistical significance of the pathways determined using the Pathway Analysis module of the MetaboAnalyst 5.0 web platform is characterized by a
p-value < 0.003. Such pathways include 16 to 60 metabolites, as shown in the “Match Status” column in
Figure 5b. Interestingly, 16 compounds fell into the pathway labeled “Retinol metabolism” in the KEGG database. Retinol metabolism includes biochemical processes associated with the conversion of vitamin A in the human body. A number of articles [
28,
29] indicate that retinoic acids affect the differentiation of adipocytes in both white and brown adipose tissue. Returning to
Figure 2c,d, it can be seen that the MALDIquant/Mummichog approach covered 19 more metabolites of the steroid hormone biosynthesis pathway than the DataAnalysis/MatLab approach.
4. Conclusions
Direct infusion ESI-QTOF mass spectrometry has made it possible to obtain a “snapshot” of a person’s phenotype through the blood, via its metabolomic profile. The blood metabolome reflects the genetically determined features of metabolism and changes in the biochemical processes of organ systems, as well as the lifestyle and potential habits of the individual. We have demonstrated the DIMS–data processing pipeline, using the example of a problem related to the metabolomics of obesity.
One of the more recently published data processing algorithms used for direct electrospray infusion mass spectrometry in connection with the problem of obesity involved preprocessing mass spectra using the COMPASS DataAnalysis software developed by Bruker Daltonics (Billerica, MA, USA) [
5]. Previously, the same group showed [
5,
6,
30] that the data obtained by the DIMS method reflect the human molecular phenotype, so this approach can be used for the laboratory diagnostics of socially significant diseases.
The process of obtaining mass spectra by direct infusion is similar to obtaining spectra by matrix-assisted laser desorption/ionization (MALDI-TOF). When using the DIMS method, a set of mass spectra separated by fixed time intervals (1 s) is generated, similar to the way in which MALDI fires a series of laser shots at a target, each of which is also recorded with a time stamp. With both direct electrospray infusion and MALDI, the system for recording the mass-to-charge characteristics of ions is the same: time-of-flight (TOF). Although the QTOF mass analyzer (maXis) operates in a tandem mode, when using DIMS, the second quadrupole (q2) is not used, and metabolite ions enter the TOF analyzer without fragmentation in exactly the same way as in experiments using MALDI-TOF. We have shown that the computer processing method developed for the MALDI proteomic platform can also be used in the analysis of the metabolome obtained by direct infusion.
To investigate the metabolome, we applied free MALDIquant processing programs to DIMS data and demonstrated the following:
(a) MALDIquant makes it possible to detect a comparable number of peaks in mass spectra (9333 peaks in COMPASS DataAnalysis compared to 9274 peaks in MALDIquant);
(b) the use of MALDIquant together with the Mummichog module of the MetaboAnalystR package provided > 50% coverage of metabolites from previously published results [
5], in which metabolites were identified with commercial packages and home-made scripts;
(c) our approach, based on a public bank of programs, made it possible to determine the previously obtained metabolic pathways [
5] associated with the biochemical features of obesity development.
However, our results differ significantly from those previously published [
5]. When explaining the differences, it is important to note that we applied a non-standard method, using the same algorithm for processing mass spectra for laser and electrospray ionization. We used the Mummichog module as a peak annotation algorithm, which is designed for the direct transition from the mass-to-charge characteristics of ions to metabolic pathways. The problem of the unambiguous correct annotation of metabolites using the m/z value has not been solved. Our results also demonstrate the different degrees of reliability of the annotations made using the statistical approach (Mummichog) and the Bayesian approach [
5]. Annotation based on biological context makes it possible to detect differences in metabolomic profiles more accurately than the Mummichog statistical approach. However, such discrepancies are common in metabolite identification, and the two different approaches (Mummichog and the Bayesian approach) can be considered complementary.
With the appropriate normalization of the peak intensities, the proposed approach will allow the use of the DIMS data obtained with the QTOF instrument to determine differences associated with overweight and obesity. The approach has been implemented using a cloud service, which allows for replicating a virtual machine and reproducibly reanalyzing proprietary collections of DIMS data in order to build a human digital image [
31].