
Metabolites 2013, 3(3), 623-636; doi:10.3390/metabo3030623

Article
Tackling CASMI 2012: Solutions from MetFrag and MetFusion
Christoph Ruttkies *,†, Michael Gerlich *,† and Steffen Neumann
Leibniz Institute of Plant Biochemistry, Department of Stress and Developmental Biology, Weinberg 3, DE-06120 Halle (Saale), Germany; E-Mail: sneumann@ipb-halle.de
† These authors contributed equally to this work.
* Authors to whom correspondence should be addressed; E-Mails: cruttkie@ipb-halle.de (C.R.); mgerlich@ipb-halle.de (M.G.); Tel.: +49-345-5582-1471 (C.R.); Fax: +49-345-5582-1409 (C.R.).
Received: 24 April 2013; in revised form: 29 July 2013 / Accepted: 30 July 2013 / Published: 5 August 2013

Abstract

The task in the critical assessment of small molecule identification (CASMI) contest category 2 was to determine the identity of (initially) unknown compounds for which high-resolution tandem mass spectra were published. We focused on computer-assisted methods that tried to correctly identify the compound automatically and entered the contest with MetFrag and MetFusion to score candidate structures retrieved from the PubChem structure database. MetFrag was combined with the metabolite-likeness score, which helped to improve the performance for the natural product challenges. We present the results, discuss the performance, and give details of how to interpret the MetFrag and MetFusion output.
Keywords: mass spectrometry; metabolite identification; MetFrag; MetFusion; metabolite likeness; molecular formula

1. Introduction

The critical assessment of small molecule identification contest (CASMI) was organised in 2012 by Emma Schymanski and Steffen Neumann, to call upon the computational mass spectrometry community and demonstrate the performance of compound identification from mass spectrometry data.

At the Leibniz Institute of Plant Biochemistry (IPB), we are developing several tools for metabolite identification. The MetFrag system [1] is able to perform in silico fragmentation of candidate structures, which can be retrieved from compound databases or obtained through structure generation [2]. The IPB is also part of the MassBank consortium [4], which collects a large number of reference spectra, particularly of soft electrospray ionisation (ESI) spectra. Our MetFusion system [5] integrates these two strategies to obtain a more reliable identification compared to each individual approach taken alone.

In the CASMI contest, our tools did not officially take part because one author was in the organisation team and some of the challenge spectra were obtained at the IPB. Nevertheless, we tried to approach the challenges in as unbiased a manner as possible, and did not use our inside knowledge to tune any parameters in order to obtain better results. We also restricted the participation to category 2 (“best structure identification for high resolution liquid chromatography/mass spectrometry (LC/MS) data”) and did not submit the molecular formulas to category 1 (“best molecular formula for high resolution LC/MS data”).

2. Methods

The spectra preprocessing steps and the elimination of redundant candidate structures are the same for both MetFrag and MetFusion.

2.1. Spectra Processing and Neutral Mass Heuristics

All of the challenges were measured in a single ionization mode, but with multiple ionization energies. If a challenge provided two or more spectra, the spectra were merged to create a corresponding composite spectrum. This processing step was recommended by the MassBank consortium [4] for a more reliable identification. Challenges 2, 10 and 12 each consisted of only one spectrum, so the spectra merging was not applied to them. We used the mzClust grouping algorithm in xcms (version 1.37.0) [6,7]. The composite spectrum contains the unique peaks where m/z values are averaged and the maximum intensity across all spectra is used. The R-code for the merging is shown in Appendix B.
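The merging rule described above (average the m/z values of grouped peaks, keep the maximum intensity) can be illustrated with a minimal Python sketch. This is not the R code from Appendix B and it does not reproduce the mzClust algorithm of xcms; the greedy gap-based grouping and the tolerance `mz_tol` are assumptions chosen only for illustration.

```python
from statistics import mean


def merge_spectra(spectra, mz_tol=0.005):
    """Merge peak lists into one composite spectrum.

    spectra: list of peak lists, each a list of (mz, intensity) tuples.
    mz_tol:  hypothetical grouping tolerance in Da (not taken from the paper).
    """
    # Pool the peaks of all spectra and sort them by m/z.
    peaks = sorted(peak for spectrum in spectra for peak in spectrum)

    groups = []
    for mz, intensity in peaks:
        # Extend the current group while the gap to the previous peak is
        # within the tolerance; otherwise open a new group.
        if groups and mz - groups[-1][-1][0] <= mz_tol:
            groups[-1].append((mz, intensity))
        else:
            groups.append([(mz, intensity)])

    # One composite peak per group: mean m/z, maximum intensity.
    return [(mean(m for m, _ in g), max(i for _, i in g)) for g in groups]


# Two made-up spectra of the same compound at different energies.
low_energy = [(100.000, 10.0), (150.000, 50.0)]
high_energy = [(100.002, 30.0), (200.000, 5.0)]
composite = merge_spectra([low_energy, high_energy])
```

In this toy input the two peaks near m/z 100 fall into one group, so the composite spectrum has three peaks.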

To determine the neutral mass of a compound, we used a simple heuristic which located the lowest m/z in the isotope pattern as a monoisotopic peak and then removed the adduct, taking the polarity of the measurement into account to automatically deduce the neutral exact mass of the compounds for the candidate search.
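A minimal Python sketch of such a heuristic is given below. It assumes that the adduct is a single proton, i.e., [M+H]+ in positive mode and [M-H]- in negative mode; the text above only states that the adduct was removed, so this choice, the function name and the example values are illustrative assumptions.

```python
PROTON_MASS = 1.007276  # Da; mass of a proton


def neutral_mass(isotope_pattern_mz, polarity):
    """Estimate the neutral exact mass from an isotope pattern.

    isotope_pattern_mz: m/z values of the peaks in the isotope pattern.
    polarity:           "positive" or "negative".
    Assumes a protonated or deprotonated molecule (an assumption).
    """
    # The lowest m/z in the pattern is taken as the monoisotopic peak.
    monoisotopic = min(isotope_pattern_mz)
    if polarity == "positive":
        return monoisotopic - PROTON_MASS  # remove the proton of [M+H]+
    if polarity == "negative":
        return monoisotopic + PROTON_MASS  # add back the proton lost in [M-H]-
    raise ValueError("polarity must be 'positive' or 'negative'")


# Made-up isotope pattern measured in positive mode.
mass = neutral_mass([182.0740, 181.0707], "positive")
```

A heuristic of this kind necessarily fails when the assumed ion is not observed, for example when only an in-source fragment is present in the spectrum.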

2.2. Eliminating Redundant Candidates

Both MetFrag and MetFusion obtain candidate structures from chemical databases. These databases often contain redundant structures, which lengthen the candidate lists without adding chemical diversity. In addition, mass spectrometry cannot, in general, distinguish between the stereoisomers of a compound, and the identification methods we use assign identical scores to such isomers. Therefore, we eliminate redundant candidate structures with an InChIKey-based filtering.

The InChIKey is a string that is characteristic of the molecular structure, where the first block of 14 characters is determined by the molecular skeleton (or connectivity). More information regarding both InChI and InChIKey can be found elsewhere [8]. We calculate the InChIKey for each candidate and keep only candidates with a unique first InChIKey block.
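The filtering step can be sketched in a few lines of Python. The candidate records and the InChIKey strings below are hypothetical placeholders, not real database entries; only the rule (keep one candidate per first InChIKey block) is taken from the text.

```python
def filter_by_first_block(candidates):
    """Keep the first candidate seen for each InChIKey first block."""
    seen = set()
    unique = []
    for candidate in candidates:
        # The first block (14 characters) precedes the first hyphen.
        first_block = candidate["inchikey"].split("-")[0]
        if first_block not in seen:
            seen.add(first_block)
            unique.append(candidate)
    return unique


# Placeholder keys: the first two share a first block (same connectivity).
candidates = [
    {"id": "cand1", "inchikey": "AAAAAAAAAAAAAA-BBBBBBBBSA-N"},
    {"id": "cand2", "inchikey": "AAAAAAAAAAAAAA-CCCCCCCCSA-N"},
    {"id": "cand3", "inchikey": "DDDDDDDDDDDDDD-UHFFFAOYSA-N"},
]
filtered = filter_by_first_block(candidates)
```

Which of the redundant entries is kept does not matter for the ranking here, because the scoring methods assign identical scores to such isomers.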

2.3. In silico Fragmentation with MetFrag

We used MetFrag as described in Wolf et al. [1], with the composite spectra as explained in Section 2.1, to submit candidates for all challenges in CASMI category 2. We queried a local PubChem [3] mirror (created September 2010) for the candidate retrieval and filtered as explained in Section 2.2. For the candidate selection we used the putative neutral exact mass and a mass window of 5 ppm and 0.001 Da mass deviation for the fragment matching. For later resubmissions for Challenge 5, we adapted the mass window to 10 ppm and 0.002 Da, due to the higher mass error. For this paper, we additionally performed a molecular formula candidate search using the correct formulas, which were not known during the contest but given in the solutions. This allows estimation of the MetFrag performance when the correct molecular formulas are used as input.
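To make the effect of the ppm window concrete, the following Python sketch selects candidates whose exact mass lies within a relative window around the putative neutral mass. It assumes that the ppm value applies symmetrically to the query mass; the function names and the candidate masses are invented for illustration and do not come from MetFrag.

```python
def ppm_window(mass, ppm):
    """Return the (low, high) bounds of a symmetric ppm window."""
    delta = mass * ppm / 1e6
    return mass - delta, mass + delta


def select_candidates(candidates, query_mass, ppm=5.0):
    """Keep candidates whose exact mass lies inside the ppm window."""
    low, high = ppm_window(query_mass, ppm)
    return [c for c in candidates if low <= c["exact_mass"] <= high]


# Invented candidates around a query mass of 300.0 Da.
candidates = [
    {"id": "a", "exact_mass": 300.0010},  # about 3.3 ppm away
    {"id": "b", "exact_mass": 300.0020},  # about 6.7 ppm away
    {"id": "c", "exact_mass": 300.0100},  # about 33 ppm away
]
narrow = select_candidates(candidates, 300.0, ppm=5.0)
wide = select_candidates(candidates, 300.0, ppm=10.0)
```

Candidate "b" is only retrieved with the wider window, which mirrors how a correct structure can be lost when the measured mass deviates by more than the chosen margin.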

The score calculated by MetFrag evaluates the match of in silico-generated fragments of the candidate molecules to the given challenge tandem mass spectra. Both the mass and the intensity of the peak matched by a fragment are considered in the score.

Compounds for challenges 1 to 6 were known to be natural products, as explained on the CASMI website. Because large compound databases, such as PubChem [9], contain many non-natural compounds, several filtering strategies have been developed for metabolomics data. While Kind and Fiehn [10] proposed filter criteria based on the molecular formula, Peironcely et al. [11] used machine learning to train a random forest model [12] on metabolite structures from the Human Metabolome Database (HMDB) [13] and structures from the ZINC database [14] to predict a metabolite-likeness score (MLS) based on structural fingerprints.

We used the MLS to prefer biological compounds for challenges 1–6. For those challenges, we used the adapted version of the final score:

Score_final = Score_MetFrag + ω · MLS
to obtain the ranking, where ω represents the weight of the MLS, which we arbitrarily set to 0.5 to give it a lower influence in the final score than the MetFrag score. In the future we plan to optimise ω by learning from given data. The influence of the metabolite-likeness score on the rankings of candidates was investigated by comparing the rankings of results with ω = 0 and ω = 0.5.
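The weighted combination and the comparison of ω = 0 with ω = 0.5 can be sketched as follows in Python. The candidate identifiers and score values are made up; only the formula is taken from the text.

```python
def rank_candidates(candidates, omega=0.5):
    """Rank candidates by MetFrag score plus omega times the MLS."""
    scored = [(c["metfrag"] + omega * c["mls"], c["id"]) for c in candidates]
    # Higher final score means better rank.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate_id for _, candidate_id in scored]


# Made-up scores: A fragments slightly better, B is more metabolite-like.
candidates = [
    {"id": "A", "metfrag": 0.80, "mls": 0.10},
    {"id": "B", "metfrag": 0.70, "mls": 0.60},
]
without_mls = rank_candidates(candidates, omega=0.0)
with_mls = rank_candidates(candidates, omega=0.5)
```

In this toy case the MLS term reverses the order of the two candidates; the same mechanism can also push a correct compound with a low MLS further down the list.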

2.4. MetFusion: Integration of MetFrag with Spectral Libraries

We also applied MetFusion [5] to generate submissions for all Category 2 challenges. We used the MassBank spectral library and PubChem compound database, which in this case was queried online in January and March 2013. For the candidate selection we used the putative neutral mass and a mass window of 10 ppm. A mass window of 10 ppm is sufficient as all Category 2 challenges promise an accuracy of <10 ppm. For the fragment matching, we applied a window of 0.002 Da and 10 ppm. As explained above, we used composite query spectra and the InChIKey-based candidate filtering.

MassBank provides separate search forms for either a precursor mass search or a peak list search. A combination of both types of information is currently not available, although it would be possible to search MassBank with an MS/MS spectrum and then explicitly apply the precursor neutral mass as a filter. This search strategy is used by, e.g., the Metlin database. Instead, MetFusion invokes only the peak list search, so that MassBank also returns compounds with similar MS/MS spectra, which may be structurally similar to the unknown. MetFusion then implicitly combines the fragmentation similarity from MassBank with the exact mass hit from PubChem.

All challenges were queried against all available ESI spectra in MassBank [4]. For the resubmissions, we also included instruments with other atmospheric-pressure ion sources, namely atmospheric-pressure chemical ionization (APCI) and atmospheric-pressure photoionization (APPI). This instrument selection covers triple quadrupole (QqQ), quadrupole time-of-flight (QTOF) and Orbitrap devices, i.e., both nominal and accurate mass spectra were queried.

Besides the peak list and instrument selection, the number of result hits and the intensity cut-off are the only parameters for the MassBank peak search. The result limit was set to 100 hits and the intensity cut-off was set to 5. The intensity cut-off determines which peaks are ignored due to having a lower intensity than the specified cut-off. MassBank internally applies a fixed 0.3 Da mass window when matching peaks. MassBank also utilizes the intensity information for spectra comparison, i.e., low intensity peaks have less weight in the resulting scores.

For the MassBank query results, we also performed an InChIKey-based filtering where among the duplicates only the entry with the best MassBank score, i.e., the highest spectral similarity, was kept. The MetFusion workflow and the scoring have been described earlier [5].
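For the MassBank results the filter differs slightly from the candidate filter in Section 2.2, because the best-scoring entry of each group of duplicates is retained. A minimal Python sketch, with hypothetical record identifiers, placeholder InChIKeys and made-up scores:

```python
def best_hit_per_structure(hits):
    """Keep, per InChIKey first block, the hit with the highest score."""
    best = {}
    for hit in hits:
        first_block = hit["inchikey"][:14]
        if first_block not in best or hit["score"] > best[first_block]["score"]:
            best[first_block] = hit
    # Return the surviving hits ordered by score, best first.
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)


# Placeholder records: the first two describe the same molecular skeleton.
hits = [
    {"record": "REC001", "inchikey": "AAAAAAAAAAAAAA-BBBBBBBBSA-N", "score": 0.55},
    {"record": "REC002", "inchikey": "AAAAAAAAAAAAAA-CCCCCCCCSA-N", "score": 0.81},
    {"record": "REC003", "inchikey": "DDDDDDDDDDDDDD-UHFFFAOYSA-N", "score": 0.62},
]
kept = best_hit_per_structure(hits)
```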

In the next section we also discuss the chemical similarity, e.g., between the correct solution and the most similar MassBank record. We used the Tanimoto similarity based on the fingerprints of the structures as implemented in the CDK [15]. A Tanimoto score of 0 indicates that no structural features are shared in both structures. Conversely, a Tanimoto score of 1 indicates that all investigated structural features (determined by the fingerprint) are present in both structures. A Tanimoto score ≥0.8 indicates reasonable structural similarity, whereas scores ≥0.95 indicate very high structural similarity.
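The Tanimoto similarity itself is simple to compute once fingerprints are available. The Python sketch below represents each fingerprint as the set of its on-bit positions; it does not reproduce the CDK fingerprinter, and the bit sets are invented for illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Convention chosen here: two empty fingerprints share nothing.
        return 0.0
    return len(a & b) / len(union)


identical = tanimoto({1, 5, 9}, {1, 5, 9})
disjoint = tanimoto({1, 2}, {3, 4})
partial = tanimoto({1, 2, 3, 4}, {3, 4, 5, 6})
```

Two fingerprints sharing two of six distinct on-bits give a similarity of about 0.33, far below the 0.8 threshold mentioned above.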

The whole set of challenges was processed with the command line version of MetFusion. Results were stored in a structure data file (SDF), which is better known by the *.sdf file extension. This file keeps the molecular structure and associated information, like compound name, score, and additional properties, for each candidate. In addition to the integrated result list as an SD file, we also keep the individual intermediate result lists and create a spreadsheet file containing the result lists and the coloured similarity matrices which can be used to examine the results in more detail.

3. Results and Discussion

In this section we discuss the results of our resubmissions and note where and why they differ from the original submissions. The challenges 2, 4, 5 and 6 from category 2 were not calibrated when initially offered to the participants, resulting in higher than stated ppm deviations. This was recognised after the contest closed, and the data of these challenges were recalibrated and made available to the participants online for the articles in the proceedings. Each participant was allowed to resubmit their findings. Additionally, our hypotheses for the neutral mass of challenges 11 and 12 were wrong in the first submission. The correct neutral mass for challenge 12 could be extracted from the available meta-data that all participants had access to. Challenge 11 did not provide [M+H]+ ions; instead, the [M-H2O]+ fragment was the major ion suitable for back-tracking the neutral mass by an experienced mass spectrometrist. We used the correct neutral mass from the published CASMI solution for challenge 11.

For both MetFrag and MetFusion we report the number of candidates and the absolute rank for each challenge, and the median rank broken down into the natural compound and environmental challenges. The median is used because the distribution of ranks is heavy-tailed and a few challenges with very poor ranking severely skew the mean values. In addition to the absolute rank, we also report the relative ranking position (RRP_CASMI), defined as

RRP_CASMI = 1/2 · (1 − (BC − WC) / (TC − 1))

where BC and WC are the number of candidates ranked better and worse than the correct solution, respectively, and TC is the total number of candidates. See [16] for more details.
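The two summary measures can be illustrated with a short Python sketch. It assumes the definition RRP = 1/2 · (1 − (BC − WC)/(TC − 1)), which reproduces tabulated values in this paper, e.g., an RRP of 1.0 for a first rank among 2229 candidates and 0.500 for rank 5 among 9 candidates without ties; the function name is hypothetical. The rank list is the MetFrag column for Challenges 1–6 in Table 1.

```python
from statistics import mean, median


def rrp_casmi(better, worse, total):
    """Relative ranking position; 1.0 is best, 0.0 is worst.

    better/worse: candidates scoring better/worse than the correct one.
    total:        total number of candidates. Tied candidates count as
                  neither better nor worse.
    """
    if total < 2:
        raise ValueError("RRP needs at least two candidates")
    return 0.5 * (1 - (better - worse) / (total - 1))


top_hit = rrp_casmi(better=0, worse=2228, total=2229)
middle = rrp_casmi(better=4, worse=4, total=9)
tied_pair = rrp_casmi(better=0, worse=0, total=2)

# A single very poor rank dominates the mean but not the median.
ranks = [5, 3, 12, 547, 988, 1860]
median_rank = median(ranks)
mean_rank = mean(ranks)
```

Because tied candidates are counted neither as better nor as worse, a correct compound that shares its score with all remaining candidates can have the worst absolute rank while its RRP stays well above zero.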

3.1. MetFrag

In the initial submission, the correct solution was missing for Challenges 2, 4 and 6 because the measured mass was outside the 5 ppm margin. In addition, the simple precursor heuristics described in Section 2.1 missed the neutral mass of challenges 11 and 12. These cases were corrected with the updated information for the resubmissions.

Table 1 shows the number of candidates obtained from the PubChem snapshot with a search for the neutral mass and the absolute rank of the correct solution. For Challenges 1 to 6 we also show the ranks with the MLS score included.

Table 1. MetFrag results with neutral exact mass filter after resubmission. Shown are the number of candidates per challenge (#Cand.), the InChIKey-filtered MetFrag rank and the relative ranking position (RRP). Additionally, for Challenges 1–6 the InChIKey-filtered MetFrag rank with the metabolite-likeness score (MLS) included is shown, together with the corresponding RRP.

Natural Product Challenges

| Chall. | #Cand. | Rank | RRP | Rank (MLS) | RRP (MLS) |
|---|---|---|---|---|---|
| 1 | 994 | 5 | 0.996 | 4 | 0.997 |
| 2 | 248 | 3 | 0.992 | 3 | 0.992 |
| 3 | 1094 | 12 | 0.990 | 9 | 0.993 |
| 4 | 2234 | 547 | 0.757 | 454 | 0.797 |
| 5 | 2891 | 988 | 0.679 | 1238 | 0.573 |
| 6 | 1860 | 1860 | 0.439 | 281 | 0.850 |
| Median | 1477 | 280 | 0.874 | 145 | 0.921 |

Environmental Challenges

| Chall. | #Cand. | Rank | RRP |
|---|---|---|---|
| 10 | 447 | 260 | 0.441 |
| 11 | 465 | 23 | 0.976 |
| 12 | 1531 | 36 | 0.978 |
| 13 | 1031 | 5 | 0.998 |
| 14 | 125 | 27 | 0.810 |
| 15 | 1825 | 173 | 0.907 |
| 16 | 1948 | 1948 | 0.453 |
| 17 | 475 | 15 | 0.970 |
| Median | 753 | 32 | 0.939 |

The results achieved with the molecular formula database query are shown in Table A1. For every challenge, the correct compound was among the candidates retrieved with both types of query, and the mass window result sets contain roughly twice as many candidates (median). The formula query improves the median absolute rank (Challenges 1–6: 280⇒270; Challenges 10–17: 32⇒22.5) compared to the mass query, but the median RRP is lower (Challenges 1–6: 0.874⇒0.607; Challenges 10–17: 0.939⇒0.917). The reason is that compounds within the mass search window but with the wrong molecular formula often obtain a lower MetFrag score than the correct solution. The molecular formula filter eliminates these worse candidates (WC) from the outset, which reduces the RRP.

Next, we describe the outcome if the metabolite-likeness score is considered together with the MetFrag score for the Challenges 1–6. The number of candidates remains unchanged, but natural compounds (including the correct solution) should obtain better scores and improve both the absolute rank and the RRP.

Indeed, except for Challenge 5, all ranks are better or equal with the MLS contribution in the score, as shown in Table 1. The median absolute rank improves from 280 to 145 (RRP: 0.874⇒0.921), and even more for the molecular formula candidate search, where the median rank improves from 270 to 119 (RRP: 0.607⇒0.797).

Reticuline (the correct candidate of Challenge 5) has the lowest metabolite-likeness score (0.296) among all challenge compounds and hence the worst rank (1209) based solely on the MLS (see Table 2). This explains why the final result for reticuline was even worse with the MLS.

Table 2. The metabolite-likeness score (MLS) of the compounds of Challenges 1–6 and their rankings among the retrieved candidates based on the MLS alone, while Table 1 uses the combined score.

| Challenge | Trivial name | InChIKey (first block) | MLS | MLS rank |
|---|---|---|---|---|
| 1 | Kanamycin A | SBUJHOSQTJFQJX | 0.508 | 47 |
| 2 | 1,2-Bis-O-sinapoyl-beta-D-glucoside | KQDOTXAUJBODDM | 0.716 | 35 |
| 3 | Glucolesquerellin | ZAKICGFSIJSCSF | 0.474 | 3 |
| 4 | Escholtzine | PGINMPJZCWDQNT | 0.436 | 439 |
| 5 | Reticuline | BHLYRWXGMIUIHG | 0.296 | 1209 |
| 6 | Rhoeadine | XRBIHOLQAKITPP | 0.374 | 132 |

Challenges 6 and 16 were very problematic for MetFrag, which could assign only a single fragment of the correct molecule to the given spectrum in the first case and no fragments of the correct molecule in the second case. Although the MLS improved the final rank for Challenge 6, this improvement is based on an MLS of only 0.374, the second lowest among all challenges. Figure A1 shows the rankings related to the calculated scores of all candidates of Challenges 1 to 6.

The results show that MetFrag is able to rank four molecules of the total 14 challenges among the top ten hits when applying mass filtering. The number can be increased to five by including knowledge of the molecular formula of the correct compound.

The external participants Dunn et al. [17] and the internal participant Meringer et al. [19] both used MetFrag in conjunction with other methods for the identification. The combined MetFrag and manual interpretation method of Dunn et al. achieved better ranks than MetFrag alone, but missed considerably more challenges because the Kyoto Encyclopedia of Genes and Genomes (KEGG) [18] was used for candidate retrieval, which only contains a subset of the challenge compounds.

3.2. MetFusion

The overall results for MetFusion are shown in Table 3. PubChem has grown considerably over the past two years, and consequently the online query against PubChem yields more candidates: for the first six challenges, MetFrag retrieved 1477 candidates (median) from our PubChem snapshot (September 2010), whereas the corresponding online query against PubChem from January 2013 yields 3582 candidates (median), more than twice as many. For the environmental challenges 10–17, the median increases more than threefold, from 753 to 2596 candidates. The rapid growth of PubChem over even short time periods becomes obvious for Kanamycin A: in January 2013, 37 isomers with an identical first block of their InChIKey were retrieved, whereas only eight weeks later three additional isomers were found.

Table 3. MetFusion results per challenge after resubmission. Shown are the number of candidates per challenge (#Cand.), the InChIKey-filtered MetFusion rank, the maximum Tanimoto similarity (Max. TS) between the candidates and the MassBank results, and the relative ranking position (RRP).

Natural Product Challenges

| Chall. | #Cand. | Rank | Max. TS | RRP |
|---|---|---|---|---|
| 1 | 2229 | 1 | 1.0 | 1.0 |
| 2 | 625 | 4 | 0.93 | 0.995 |
| 3 | 2945 | 14 | 0.99 | 0.995 |
| 4 | 4219 | 74 | 0.84 | 0.983 |
| 5 | 4280 | 1426 | 0.42 | 0.667 |
| 6 | 6175 | 25 | 0.79 | 0.996 |
| Median | 3582 | 20 | 0.89 | 0.995 |

Environmental Challenges

| Chall. | #Cand. | Rank | Max. TS | RRP |
|---|---|---|---|---|
| 10 | 1085 | 981 | 0.40 | 0.096 |
| 11 | 1444 | 170 | 0.28 | 0.883 |
| 12 | 3772 | 136 | 0.28 | 0.964 |
| 13 | 3344 | 1 | 1.0 | 1.0 |
| 14 | 507 | 3 | 1.0 | 0.996 |
| 15 | 3394 | 1 | 1.0 | 1.0 |
| 16 | 4427 | 1351 | 0.33 | 0.695 |
| 17 | 1848 | 88 | 0.35 | 0.953 |
| Median | 2596 | 112 | 0.38 | 0.959 |

The results for challenges 1 to 6 and challenges 10 to 17 show that more similar spectra are present in MassBank for the biological compounds than for the environmental challenges. The median Tanimoto similarity between the challenges and the most similar compound in MassBank is 0.89 for the natural compounds, compared to 0.38 for the environmental challenges, where the reference spectra did not contribute significantly to the integrated MetFusion score in five cases. This can be attributed to a much larger chemical diversity of natural products in MassBank, and is also evident from the low maximum spectral similarity. The lack of reference spectra for diverse non-biological compounds is the major reason for the mediocre performance of MetFusion in these cases. We expect a considerable improvement in this area as contributions to MassBank from the environmental community have recently increased.

In addition to the ranked list of candidates, MetFusion also creates a ranked similarity matrix, where the columns correspond to the result list from MassBank (best hits on the left, ordered by the MassBank score) and the rows correspond to the MetFrag results. Each cell contains the Tanimoto similarity (TS) of the corresponding structures from MassBank and MetFrag. Examples are shown in Figure 1 and Figure 2. Tanimoto similarities are also visualised through a colour code ranging from red via yellow to green with increasing TS.
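The layout of such a matrix can be sketched in Python as follows. The sketch only arranges the data (columns ordered by MassBank score, rows in candidate order, cells holding a Tanimoto similarity on sets of fingerprint on-bits); it does not implement the MetFusion scoring, and all identifiers, scores and bit sets are invented.

```python
def similarity_matrix(candidates, hits):
    """Build a candidate-by-hit matrix of Tanimoto similarities.

    candidates: dicts with "id" and "bits" (set of fingerprint on-bits),
                already in the order of the candidate list.
    hits:       dicts with "id", "score" and "bits" from the spectral search.
    Returns the column ids (best spectral hit first) and the matrix rows.
    """
    columns = sorted(hits, key=lambda h: h["score"], reverse=True)
    matrix = []
    for candidate in candidates:
        row = []
        for hit in columns:
            union = candidate["bits"] | hit["bits"]
            shared = candidate["bits"] & hit["bits"]
            row.append(len(shared) / len(union) if union else 0.0)
        matrix.append(row)
    return [h["id"] for h in columns], matrix


# Invented fingerprints and spectral scores.
candidates = [
    {"id": "cand1", "bits": {1, 2, 3, 4}},
    {"id": "cand2", "bits": {7, 8}},
]
hits = [
    {"id": "hitA", "score": 0.4, "bits": {7, 8}},
    {"id": "hitB", "score": 0.9, "bits": {1, 2, 3, 4}},
]
column_ids, matrix = similarity_matrix(candidates, hits)
max_ts_per_candidate = [max(row) for row in matrix]
```

The maximum of each row corresponds to the "Max. TS" of a candidate: the similarity to its structurally closest spectral hit.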

Figure 1. The top-left part of the reranked similarity matrix from MetFusion for Challenge 6. The correct compound rhoeadine (CID 5318652) is ranked 25th and is highlighted with a green border. The MassBank entry with the maximum Tanimoto similarity (TS) to rhoeadine is bicuculline, with a similarity of 0.79 but a MassBank score of only 0.3 (data not shown). There are other alkaloids with better similarity that are thus ranked higher. Six columns, all with a low maximum TS of 0.4, were removed for better readability.

Overall, MetFusion was able to rank the correct candidate in the top position for the three challenges 1, 13 and 15. Challenges 2 and 14 had the correct compound ranked at position 4 and 3, respectively.

For Challenge 6, using MetFrag alone gave a very poor result because 3812 candidates had an identical score of 0.0. MassBank does not contain spectra for the correct compound rhoeadine, and the most similar spectrum returned is that of palmatine (KOX00837), with a low TS of 0.42 to the correct structure (as shown in Figure 1), while the structurally most similar entry in MassBank (bicuculline, TS = 0.79) has a poor spectral score of only 0.3. The main contribution from the MassBank results comes from three spectra of other alkaloids (allocryptopine, noscapine, and hydrastine) with a similarity between 0.59 and 0.77.

For Challenge 14, shown in Figure 2, MassBank returned a spectrum of carbazole ranked first, an isomer of the correct 1H-Benz[g]indole, followed by three spectra of compounds with both a different molecular formula and lower TS than the MetFrag candidates. During the contest, spectra of the correct 1H-Benz[g]indole measured on the same instrument as the challenge data were submitted to MassBank by one of the MassBank consortium members. The UF011410 hit in MassBank was only ranked fifth, with an unexpectedly low MassBank score of only 0.70, most likely because we used a merged query spectrum and MassBank applies a 5% intensity cut-off. These two factors led to a greater difference between the merged spectrum and the deposited reference spectrum. The available Orbitrap spectra would benefit from a lower cut-off threshold of 2 rather than 5, but we relied on the default cut-off. With this low spectral similarity, the MassBank contribution was unable to lift the correct compound to the first rank, but only to rank 3.

Figure 2. Excerpt of the reranked similarity matrix from MetFusion for Challenge 14. The correct compound (CID 98617) is ranked 3rd and highlighted with a green border. The two better-ranking candidates have slightly higher MetFrag scores that add to their corresponding MetFusion scores. Compound 6854 is carbazole, which is structurally highly similar to the correct 1H-Benz[g]indole. The Tanimoto similarities of 1.0 indicate perfect structural matches to reference spectra available in MassBank for both 1H-Benz[g]indole (UF011410) and carbazole (UF026313).

For challenges 1 to 6 MetFusion performed significantly better than MetFrag, and the median rank of the correct compound was 20, compared to 280 with MetFrag and 145 with MLS. This is even more remarkable because we used the online PubChem query, which returned 3145 candidates (median), whereas the PubChem snapshot only provided 1063 candidates (median) over all challenges.

MetFusion results for challenges 10 to 12 were significantly worse when compared to MetFrag alone. This can be attributed to the low Tanimoto similarity of the correct candidate to any of the spectral hits. For each of these challenges, the MassBank scores are between 0.31 and 0.68 for the top hit, indicating a lack of reference spectra for these compound classes. The missing spectral coverage is expressed in both mediocre spectral scores and almost no Tanimoto similarity, visualised by the red-orange coloured matrix cells with maximum Tanimoto similarity of 0.4. This indicates the case where the spectral library cannot confirm any of the in silico candidates, thus leaving the user with no additional information.

4. Conclusions

The IPB entered the CASMI contest unofficially, because as part of the organising team and challenge data providers we could not be considered independent. However, we entered CASMI as internal participants with MetFrag and MetFusion and did not tune the parameters to obtain optimal results for the initial submission.

The use of small, domain-specific compound databases like KEGG, which focus on natural compounds, increases the risk that the correct compound is missed. While such a compound may be more likely to be found in PubChem or ChemSpider, the number of false positives will increase due to the large number of synthetic compounds. We used the metabolite-likeness score [11] as an additional term in the scoring function of MetFrag. The metabolite-likeness score penalizes synthetic compounds and improved or maintained the rankings for the natural product challenges 1–6 in all but one case. Moreover, we see potential for further improvement of these preliminary results by optimising the weight factor ω and by evaluation on a larger dataset than that available in the CASMI contest.

MetFusion was used without additional scoring terms, such as the metabolite-likeness score. The similarity matrices provide a deeper insight into the integrated MetFusion score to (manually) assess the reliability of the MassBank spectral summary.

Both approaches were applied fully automatically to the challenge data, but the selection of the neutral mass for the candidate failed in two cases, and the scoring did not always rank the correct solution in the top positions. Although expert knowledge is still required for a reliable interpretation, our approaches can reduce the manual effort for small compound identification.

We are looking forward to participating in the next CASMI contest as external participants.

Acknowledgements

We thank Julio Peironcely for releasing the program to calculate the metabolite-likeness score as Open Source, and similarly Sebastian Wolf for his previous work on MetFrag. Christoph Ruttkies acknowledges funding from Deutsche Forschungsgemeinschaft (DFG) grant NE/1396/5-1.

Conflict of Interest

The authors declare no conflict of interest.

References

  1. Wolf, S.; Schmidt, S.; Müller-Hannemann, M.; Neumann, S. In silico fragmentation for computer assisted identification of metabolite mass spectra. BMC Bioinformatics 2010, 11, 148.
  2. Schymanski, E.L.; Gallampois, C.M.J.; Krauss, M.; Meringer, M.; Neumann, S.; Schulze, T.; Wolf, S.; Brack, W. Consensus structure elucidation combining GC/EI-MS, structure generation, and calculated properties. Anal. Chem. 2012, 84, 3287–3295.
  3. Bolton, E.E.; Wang, Y.; Thiessen, P.A.; Bryant, S.H.; Wheeler, R.A.; Spellmeyer, D.C. Chapter 12 PubChem: Integrated platform of small molecules and biological activities. Ann. Rep. Comput. Chem. 2008, 4, 217–241.
  4. Horai, H.; Arita, M.; Kanaya, S.; Nihei, Y.; Ikeda, T.; Suwa, K.; Ojima, Y.; Tanaka, K.; Tanaka, S.; Aoshima, K.; et al. MassBank: A public repository for sharing mass spectral data for life sciences. J. Mass Spectrom. 2010, 45, 703–714.
  5. Gerlich, M.; Neumann, S. MetFusion: Integration of compound identification strategies. J. Mass Spectrom. 2013, 48, 291–298.
  6. Smith, C.; Want, E.; O’Maille, G.; Abagyan, R.; Siuzdak, G. XCMS: Processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching and identification. Anal. Chem. 2006, 78, 779–787.
  7. Kazmi, S.; Ghosh, S.; Shin, D.; Hill, D.; Grant, D. Alignment of high resolution mass spectra: Development of a heuristic approach for metabolomics. Metabolomics 2006, 2, 75–83.
  8. Heller, S.; McNaught, A.; Stein, S.; Tchekhovskoi, D.; Pletnev, I. InChI-the worldwide chemical structure identifier standard. J. Cheminf. 2013, 5, 7.
  9. Bolton, E.E.; Wang, Y.; Thiessen, P.A.; Bryant, S.H.; Wheeler, R.A.; Spellmeyer, D.C. Chapter 12 PubChem: Integrated platform of small molecules and biological activities. Ann. Rep. Comput. Chem. 2008, 4, 217–241.
  10. Kind, T.; Fiehn, O. Seven Golden Rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry. BMC Bioinformatics 2007, 8, 105.
  11. Peironcely, J.E.; Reijmers, T.; Coulier, L.; Bender, A.; Hankemeier, T. Understanding and classifying metabolite space and metabolite-likeness. PLoS One 2011, 6, e28966.
  12. Liaw, A.; Wiener, M. Classification and regression by randomForest. R News 2002, 2, 18–22.
  13. Wishart, D.S.; Tzur, D.; Knox, C.; Eisner, R.; Guo, A.C.; Young, N.; Cheng, D.; Jewell, K.; Arndt, D.; Sawhney, S.; et al. HMDB: The human metabolome database. Nucleic Acids Res. 2007, 35, D521–D526.
  14. Irwin, J.J.; Sterling, T.; Mysinger, M.M.; Bolstad, E.S.; Coleman, R.G. ZINC: A free tool to discover chemistry for biology. J. Chem. Inf. Model. 2012, 52, 1757–1768.
  15. Steinbeck, C.; Han, Y.; Kuhn, S.; Horlacher, O.; Luttmann, E.; Willighagen, E. The chemistry development kit (CDK): An open-source java library for chemo- and bioinformatics. J. Chem. Inf. Comput. Sci. 2003, 43, 493–500.
  16. Schymanski, E.L.; Neumann, S. CASMI: And the winner is …. Metabolites 2013, 3, 412–439.
  17. Allwood, J.W.; Weber, R.J.; Zhou, J.; He, S.; Viant, M.R.; Dunn, W.B. CASMI–The small molecule identification process from a Birmingham perspective. Metabolites 2013, 3, 397–411.
  18. Ogata, H.; Goto, S.; Sato, K.; Fujibuchi, W.; Bono, H.; Kanehisa, M. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 1999, 27, 29–34.
  19. Meringer, M.; Schymanski, E.L. Small molecule identification with MOLGEN and mass spectrometry. Metabolites 2013, 3, 440–462.

Appendix

A. Additional MetFrag Results

Table A1. MetFrag results with molecular formula filter after resubmission. Shown are the number of candidates per challenge (#Cand.), the InChIKey-filtered MetFrag rank and the relative ranking position (RRP). Additionally, for Challenges 1–6 the InChIKey-filtered MetFrag rank with the metabolite-likeness score (MLS) included is shown, together with the corresponding RRP.

Natural Product Challenges

| Chall. | #Cand. | Rank | RRP | Rank (MLS) | RRP (MLS) |
|---|---|---|---|---|---|
| 1 | 9 | 5 | 0.500 | 4 | 0.625 |
| 2 | 43 | 1 | 1.000 | 1 | 1.000 |
| 3 | 2 | 2 | 0.500 | 1 | 1.000 |
| 4 | 2005 | 534 | 0.735 | 444 | 0.779 |
| 5 | 2429 | 754 | 0.714 | 920 | 0.623 |
| 6 | 1250 | 1250 | 0.416 | 234 | 0.814 |
| Median | 646 | 270 | 0.607 | 119 | 0.797 |

Environmental Challenges

| Chall. | #Cand. | Rank | RRP |
|---|---|---|---|
| 10 | 257 | 170 | 0.377 |
| 11 | 104 | 9 | 0.961 |
| 12 | 950 | 26 | 0.975 |
| 13 | 22 | 4 | 0.929 |
| 14 | 111 | 19 | 0.859 |
| 15 | 1789 | 172 | 0.905 |
| 16 | 1397 | 1397 | 0.438 |
| 17 | 415 | 15 | 0.966 |
| Median | 336 | 22.5 | 0.917 |
Figure A1. Scores plot of Challenges 1–6. For each challenge, the MetFrag score, the metabolite-likeness score (MLS) and the final score of the candidates are shown. The green line marks the position of the correct candidate and its score. The width of each line correlates with the value of the score it represents.

B. Spectral Merging

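[R code listing for the spectral merging; provided as an image in the original article and not reproduced in this text version.]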

Metabolites EISSN 2218-1989 Published by MDPI AG, Basel, Switzerland