2.1. Description of Methods Applied
A research team from the University of Birmingham competed in the CASMI open challenge, specifically in categories 1 and 2, which relate to liquid chromatography-mass spectrometry. All challenges were attempted with the exception of challenges 11, 12 and 16; these challenges were assessed, but because of complexity in the data (specifically, in-source fragmentation), it was decided not to submit responses. No results were submitted for categories 3 and 4, which relate to gas chromatography-mass spectrometry.
Workflows previously developed by one of the authors (W.D.) and colleagues were applied to compete in category 1. Workflows 1 and 2 of the PUTMEDID-LCMS workflow series [13] were applied to annotate different metabolite features as the [M+H]+ or [M−H]− ions or as isotopic peaks (for example, 13C and 34S), applying retention time (RT), correlation coefficient analysis, m/z differences and median peak areas. The molecular mass of the uncharged metabolite was calculated from these data and matched to a large reference file containing accurate molecular masses and their associated molecular formulae (13,061 in total, derived from PubChem and containing the elements C, H, N, O, P, S, Br, Cl, F and Si). A mass tolerance of 5 ppm was applied, unless stated otherwise. Where more than one molecular formula was reported, the relative isotopic abundances (RIA) for carbon and sulfur were calculated using response data and accurate mass differences to reduce the number of candidate molecular formulae.
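The mass-matching step described above can be illustrated with a minimal sketch. It is not the PUTMEDID-LCMS implementation: the function names, the two-entry reference table and the example m/z value are assumptions made for illustration, and only singly charged [M+H]+ and [M−H]− ions are handled.

```python
# Minimal sketch (not the PUTMEDID-LCMS code): neutral mass from a singly
# charged ion, followed by a ppm-window lookup in a reference table.
PROTON_MASS = 1.007276  # Da; hydrogen atom mass minus one electron mass


def neutral_mass(mz, mode):
    """Neutral monoisotopic mass from an [M+H]+ ("pos") or [M-H]- ("neg") ion."""
    if mode == "pos":
        return mz - PROTON_MASS
    if mode == "neg":
        return mz + PROTON_MASS
    raise ValueError("mode must be 'pos' or 'neg'")


def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def match_formulae(mass, reference, tol_ppm=5.0):
    """Return (formula, ppm error) pairs within tol_ppm, smallest error first."""
    hits = [(formula, ppm_error(mass, ref_mass))
            for formula, ref_mass in reference.items()
            if abs(ppm_error(mass, ref_mass)) <= tol_ppm]
    return sorted(hits, key=lambda hit: abs(hit[1]))


# Hypothetical reference table: formula -> monoisotopic mass (Da).
REFERENCE = {"C6H12O6": 180.06339, "C9H11NO2": 165.07898}

# Hypothetical [M+H]+ m/z value, chosen only to exercise the lookup.
hits = match_formulae(neutral_mass(181.0707, "pos"), REFERENCE)
print(hits)
```

Widening `tol_ppm` from 5 to 10 corresponds to the repeated searches reported for several challenges below.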
The authors have an interest in performing annotation of metabolites not present in mass spectral libraries. We applied MetFrag [15], which is freely available, to construct in silico fragmentation patterns and compare these data to experimental MS/MS data. Here, the molecular formula or formulae reported in category 1 of the same challenge were entered manually, one at a time, into the online MetFrag software, followed by searching for the molecular formula in the KEGG and/or ChemSpider databases and reporting of all molecular structures with the defined molecular formula. In the second stage, in silico fragmentation of each putative molecular structure was performed applying MetFrag and matched to the experimental MS/MS data provided. The match scores provided by MetFrag were applied to report putative molecular structures after manual assessment by the authors to ensure that the match scores reflected the different structures reported.
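The idea of comparing in silico fragments with experimental product ions can be sketched as a simple peak-matching count. This is an illustration only and does not reproduce MetFrag's scoring; the function names, the 10 ppm fragment tolerance, the candidate names and all m/z values are hypothetical.

```python
# Illustrative peak matching only; MetFrag's actual scoring is more involved.
def count_matches(predicted_mz, observed_mz, tol_ppm=10.0):
    """Count observed peaks lying within tol_ppm of any predicted fragment m/z."""
    matched = 0
    for obs in observed_mz:
        if any(abs(obs - pred) / pred * 1e6 <= tol_ppm for pred in predicted_mz):
            matched += 1
    return matched


def rank_candidates(candidates, observed_mz, tol_ppm=10.0):
    """Return (name, matched peaks, score relative to the best candidate)."""
    counts = {name: count_matches(fragments, observed_mz, tol_ppm)
              for name, fragments in candidates.items()}
    best = max(counts.values()) or 1  # avoid division by zero if nothing matches
    ranked = [(name, n, n / best) for name, n in counts.items()]
    return sorted(ranked, key=lambda row: row[1], reverse=True)


# Hypothetical experimental product ions and predicted fragments.
observed = [85.0284, 127.0390, 163.0601]
candidates = {
    "candidate_A": [85.0284, 127.0390, 163.0601],
    "candidate_B": [85.0285, 99.0441],
}

for name, n_matched, score in rank_candidates(candidates, observed):
    print(name, n_matched, round(score, 2))
```

A count-based ranking of this kind cannot separate candidates that match the same number of product ions, which is why manual assessment remains part of the process.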
2.2. Results
The processes followed to construct the results submitted to the CASMI open contest, for each challenge in categories 1 and 2, are described below. We describe the data provided for challenge 1 to inform the readers of the typical data available.
2.2.1. Challenge 1
For challenge 1, four data files were available: (i) the MS1 raw data (mzXML and netCDF formats), (ii) the MS1 peak list (txt format), (iii) MS2 raw data acquired at three different collision energies (mzXML format) and (iv) the MS2 peak lists for each collision energy at which MS/MS data were acquired. Information on the instrument applied to acquire the data, the mass resolution, the expected mass accuracy and the retention time was also provided to assist the contestants. Similar data and further information were available for all other challenges. The three m/z values defined in the challenge as being detected in positive ion mode were analysed, applying workflows 1 and 2 of the PUTMEDID_LCMS collection of workflows with a mass accuracy of 5 ppm (as defined in the experimental summary). The results showed that the [M+H]+ ion was detected and reported a single molecular formula of C18H36N4O11, which was present in the trimMMD_sortAmass.txt file applied in workflow 2. The correct molecular formula was submitted. This molecular formula and the fragmentation mass spectrum acquired at 30 eV were submitted to MetFrag applying KEGG as the chosen database. MS/MS data were provided at three different collision energies, and the data acquired at 30 eV were chosen, as these MS/MS data provided the greatest number of product ions to allow structural information to be deduced most accurately. Two metabolites were reported, Kanamycin A and C, with in silico fragmentation data matching to 10 experimentally derived product ions for the former and eight for the latter metabolite. As a greater number of product ions were matched for Kanamycin A, this metabolite was submitted as the molecular structure. The correct molecular structure was submitted.
2.2.2. Challenge 2
The four m/z values defined in the challenge as being detected in negative ion mode were analysed applying workflows 1 and 2 of the PUTMEDID_LCMS collection of workflows with a mass accuracy of 5 ppm (as reported for the same instrument in challenge 1). The results showed that the [M−H]− ion was detected and reported no matches to a molecular formula present in the trimMMD_sortAmass.txt file applied in workflow 2. The search was repeated with a mass accuracy of 10 ppm, but no matches were reported. The mass of the uncharged metabolite (592.1969 Da) was manually calculated and submitted to MetFrag with a mass accuracy of 5 ppm; applying the KEGG database provided no hits, whereas applying the ChemSpider database provided 193 hits related to 29 possible molecular formulae. When all molecular formulae containing F, Cl, Si or Br were removed (as it was not expected that the correct metabolite would contain these elements), 12 molecular formulae remained. The data showed no evidence for the presence of sulfur in the molecular formula (as defined by relative isotopic abundance), and eight molecular formulae containing sulfur were removed to leave four molecular formulae. Applying the relative isotopic abundance for carbon showed that 29 carbons were present in the molecular formula, and one molecular formula was removed (C21H32N6O14). Three molecular formulae remained: C32H32O11, C33H28N4O7 and C38H28N2O5. The correct molecular formula was not submitted, as the experimentally derived mass error (>30 ppm) was greater than the mass error reported with the data and expected for the mass spectrometer applied. The CASMI organisers have now provided data following recalibration; this provides an accurate result as defined by them.
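The element-based filtering described for this challenge can be sketched as follows. The four sulfur-free formulae and the neutral mass of 592.1969 Da are taken from the text; the two halogen- or sulfur-containing entries are hypothetical stand-ins for the ChemSpider hits that were removed, and the function names are assumptions. The carbon-count step is left out because the text does not state the acceptance window that was applied.

```python
import re

# Monoisotopic masses (Da) of the most abundant isotope of each element.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146,
        "P": 30.9737616, "S": 31.9720710, "F": 18.9984032,
        "Cl": 34.9688527, "Br": 78.9183371, "Si": 27.9769265}


def parse_formula(formula):
    """Parse a simple formula such as 'C32H32O11' into {element: count}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(number or 1)
    return counts


def monoisotopic_mass(formula):
    """Sum of monoisotopic element masses for a parsed formula."""
    return sum(MONO[el] * n for el, n in parse_formula(formula).items())


def remove_elements(formulae, excluded):
    """Keep only formulae that contain none of the excluded elements."""
    return [f for f in formulae if not set(parse_formula(f)) & set(excluded)]


candidates = [
    "C21H32N6O14", "C32H32O11", "C33H28N4O7", "C38H28N2O5",  # from the text
    "C29H37ClO10", "C30H36N2O7S2",  # hypothetical examples of removed hits
]
observed_mass = 592.1969  # neutral mass reported in the text

# Remove halogen/silicon-containing formulae, then sulfur-containing ones.
survivors = remove_elements(candidates, ["F", "Cl", "Si", "Br"])
survivors = remove_elements(survivors, ["S"])

for formula in survivors:
    mass = monoisotopic_mass(formula)
    ppm = (observed_mass - mass) / mass * 1e6
    print(formula, parse_formula(formula)["C"], round(mass, 4), round(ppm, 2))
```

All four surviving formulae lie within 5 ppm of the calculated neutral mass, which is why additional evidence such as the carbon isotope pattern was needed to narrow the list further.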
Submitting the fragmentation mass spectrum acquired at 20 eV (MS/MS data were provided at a single collision energy of 20 eV) to MetFrag and applying ChemSpider as the chosen database reported six metabolites, and these were submitted to the contest with the MetFrag reported scores. The correct molecular structure was not submitted, because the correct molecular formula was not applied. One important point was observed in this challenge: although the mass accuracy of a specific mass spectrometer can be reported as a specific ppm range (±x ppm), this may not always hold for a subset of metabolites (for example, those with a low response or where ion statistics do not allow an accurate determination of peak shape and apex).
2.2.3. Challenge 3
The five m/z values defined in the challenge as being detected in negative ion mode were analysed applying workflows 1 and 2 of the PUTMEDID_LCMS collection of workflows with a mass accuracy of 5 ppm (as reported for the same instrument in challenge 1). The results showed that the [M−H]− ion was detected and reported no matches to a molecular formula present in the trimMMD_sortAmass.txt file applied in workflow 2. The process was repeated with a mass accuracy of 10 ppm, and one molecular formula was reported, C13H19N7O7S2. On assessing the sulfur relative isotopic abundance, it was calculated that three sulfur atoms were present in the molecular formula and, therefore, that this molecular formula may be incorrect. The mass (or molecular weight) of the uncharged metabolite (449.0826 Da) was manually calculated, assuming a [M−H]− ion was detected (448.0754 + 1.0077 − 0.00055), and submitted to MetFrag with a mass accuracy of 5 ppm, where applying the KEGG database provided one molecular formula, C14H27N1O9S3. This molecular formula matched the experimental relative isotopic abundance for sulfur and was submitted. The correct molecular formula was submitted. Submitting the fragmentation mass spectrum acquired at 20 eV (chosen from data acquired at four collision energies, as these MS/MS data provided the greatest number of product ions to allow structural information to be deduced most accurately) to MetFrag, applying KEGG as the chosen database and performing in silico fragmentation, reported a single metabolite, glucolesquerellin (6-methylthiohexyl glucosinolate), with three product ions being matched to in silico-derived fragmentation ions. This single metabolite was submitted to the contest. The InChI submitted to the contest did not match the correct InChI provided by the organisers. However, the submitted InChI almost matched the correct structure, differing only in the structural configuration of the hexose substructure.
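The use of relative isotopic abundance to estimate the number of carbon and sulfur atoms can be sketched with a first-order approximation. This is not the PUTMEDID_LCMS calculation: the function names and peak intensities are hypothetical, the natural abundances are standard approximate values, and contributions from other isotopes (for example 15N, 33S and 18O) are ignored.

```python
# First-order isotope-ratio estimates; a simplification, not the workflow's code.
C13_RATIO = 0.0107 / 0.9893  # 13C/12C natural abundance ratio (about 0.0108)
S34_RATIO = 0.0425 / 0.9499  # 34S/32S natural abundance ratio (about 0.0447)


def estimate_carbons(intensity_m, intensity_m1):
    """Estimate the carbon count from the (M+1)/M peak intensity ratio."""
    return round((intensity_m1 / intensity_m) / C13_RATIO)


def estimate_sulfurs(intensity_m, intensity_m2, n_carbon=0):
    """Estimate the sulfur count from the (M+2)/M ratio.

    The approximate contribution of molecules carrying two 13C atoms is
    subtracted first; oxygen (18O) contributions are ignored here.
    """
    two_13c = (n_carbon * C13_RATIO) ** 2 / 2
    corrected = max(intensity_m2 / intensity_m - two_13c, 0.0)
    return round(corrected / S34_RATIO)


# Hypothetical peak intensities for the M, M+1 and M+2 peaks.
i_m, i_m1, i_m2 = 100000.0, 15200.0, 15000.0

n_c = estimate_carbons(i_m, i_m1)
n_s = estimate_sulfurs(i_m, i_m2, n_carbon=n_c)
print(n_c, n_s)
```

With estimates of this kind, a candidate formula containing two sulfur atoms would be questioned when the M+2 peak indicates three, as described above; the estimate degrades for low-intensity peaks, where ion statistics are poor.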
2.2.4. Challenge 4
The three m/z values defined in the challenge as being detected in positive ion mode were analysed applying workflows 1 and 2 of the PUTMEDID_LCMS collection of workflows with a mass accuracy of 5 ppm (as defined in the Experimental Summary). The results showed that the [M+H]+ ion was detected and reported a single molecular formula of C16H21NO4S. However, the relative isotopic abundance observed in the data showed no evidence of a sulfur-containing molecular formula. The experimental information provided stated that mass accuracy “should be below 5 ppm”, which, in the view of the authors, did not guarantee this mass accuracy. Therefore, the workflows were operated with a mass accuracy of 10 ppm and produced a second molecular formula (C19H17NO4), which was present in the trimMMD_sortAmass.txt file applied in workflow 2. This molecular formula was submitted. The correct molecular formula was submitted. The fragmentation mass spectrum acquired at 30 eV was submitted to MetFrag, applying KEGG as the chosen database. MS/MS data at three collision energies were provided; the data acquired at 30 eV were chosen, as they provided the greatest number of product ions to most accurately define the structure of the metabolite. Two metabolites were reported, rutacridone epoxide and stylopine, with in silico fragmentation data matching to 20 experimentally-derived product ions for the former and 10 for the latter metabolite. Both of these metabolites were submitted to the challenge with scores of 1.0 and 0.5, respectively. The correct molecular structure was not submitted.
2.2.5. Challenge 5
The four m/z values defined in the challenge as being detected in positive ion mode were analysed applying workflows 1 and 2 of the PUTMEDID_LCMS collection of workflows with a mass accuracy of 10 ppm (the experimental notes defined a mass accuracy of 5 ppm, though the results from challenge 4 showed a mass accuracy of 10 ppm was appropriate). The results showed that the [M+H]+ ion was detected and reported two molecular formulae, C19H23NO4 and C16H27NO4S, which were present in the trimMMD_sortAmass.txt file applied in workflow 2. However, the relative isotopic abundance data showed no evidence of a sulfur-containing molecular formula; so, C16H27NO4S was removed, and C19H23NO4 was submitted to the contest. The correct molecular formula was submitted. This molecular formula and the fragmentation mass spectrum acquired at 10 eV were submitted to MetFrag, applying KEGG as the chosen database. MS/MS data were acquired at two collision energies; the data acquired at 20 eV appeared to be inaccurate, as the highest m/z reported was greater than the molecular weight of the metabolite, and therefore, the data acquired at 10 eV were applied. Five metabolites were reported; four metabolites matched two experimentally-derived product ions (of a possible 16) to in silico-derived product ions, and one metabolite reported one product ion match. The latter metabolite was removed, because of the lower number of matches, and the four metabolites were submitted to the contest. As confidence in these four metabolites was not high, because only two of 16 product ions were matched, all were reported with the same score, as no discrimination in confidence could be obtained. The correct molecular structure was submitted.
2.2.6. Challenge 6
The four m/z values defined in the challenge as being detected in positive ion mode were analysed applying workflows 1 and 2 of the PUTMEDID_LCMS collection of workflows with a mass accuracy of 10 ppm (the experimental notes defined a mass accuracy of 5 ppm, though the results from challenge 4 showed a mass accuracy of 10 ppm was appropriate). The results showed that the [M+H]+ ion was detected and reported two molecular formulae, C21H21NO6 and C14H25NO11, which were present in the trimMMD_sortAmass.txt file applied in workflow 2. An error by our team was not to assess the carbon relative isotopic abundance, as had been performed in other challenges; doing so would have removed the C14H25NO11 option. Instead, C21H21NO6 and C14H25NO11 were submitted to the contest with scores of 0.5 and 1.0, respectively. The correct molecular formula was submitted as the second ranked possible molecular formula. These molecular formulae and the fragmentation mass spectrum acquired at 20 eV were submitted to MetFrag, applying KEGG as the chosen database. MS/MS data for three collision energies were available; the data acquired at 20 eV were chosen, as they included an m/z peak representing the molecular ion, which the authors prefer to observe in MS/MS data. Seven metabolites were reported; four metabolites were reported with a molecular formula of C21H21NO6, and three metabolites were reported with a molecular formula of C14H25NO11. The three latter metabolites matched a greater number of experimentally-derived product ions to in silico-derived product ions. These three metabolites were submitted to the contest with the MetFrag calculated scores. The correct molecular structure was not submitted.
2.2.7. Challenge 10
The three m/z values defined in the challenge as being detected in positive ion mode were analysed applying workflows 1 and 2 of the PUTMEDID_LCMS collection of workflows with a mass accuracy of 5 ppm (as would be expected with a hybrid LTQ-Orbitrap mass spectrometer). The results showed that the [M+H]+ ion was detected and reported a single molecular formula of C14H9NO2, which was present in the trimMMD_sortAmass.txt file applied in workflow 2. The correct molecular formula was submitted. This molecular formula and the fragmentation mass spectrum (MS/MS data were only acquired at one collision energy of 10 eV) were submitted to MetFrag, applying KEGG as the chosen database. Three metabolites were reported; one metabolite matched in silico fragmentation data to two experimentally-derived product ions, whereas two metabolites matched one product ion. As the number of matches was low and none were conclusive, all three metabolites were submitted to the contest. The correct molecular structure was not submitted.
2.2.8. Challenge 13
The three m/z values defined in the challenge as being detected in positive ion mode were analysed applying workflows 1 and 2 of the PUTMEDID_LCMS collection of workflows with a mass accuracy of 5 ppm (as would be expected with a hybrid Orbitrap mass spectrometer). The results showed that the [M+H]+ ion was detected and reported a single molecular formula of C9H16N4O7, which was present in the trimMMD_sortAmass.txt file applied in workflow 2. The correct molecular formula was not submitted. Further research after the results were released showed that the correct molecular formula (C19H17OP) is not present in the reference file applied in workflow 2. The molecular formula C9H16N4O7 and the fragmentation mass spectrum collected applying collision-induced dissociation (CID) at a normalized collision energy (NCE) of 45% were submitted to MetFrag, applying KEGG as the chosen database. Four MS/MS datasets were available: CID at 45 and 75% and higher-energy C-trap dissociation (HCD) at 45 and 75%. CID at 45% was chosen, as it provided at least as many product ions as the other datasets; HCD at 45% provided the same number of product ions. No matches were reported, and the process was repeated applying ChemSpider. Three metabolites were reported; one metabolite matched in silico fragmentation data to five experimentally-derived product ions, whereas two metabolites matched two product ions. The former metabolite (N-hydroxy-6-(hydroxyamino)-5,6-dihydrocytidine) was submitted to the contest, as this showed a significantly better score in MetFrag. The correct molecular structure was not submitted, because the correct molecular formula was not applied.
2.2.9. Challenge 14
The two m/z values defined in the challenge as being detected in positive ion mode were analysed applying workflows 1 and 2 of the PUTMEDID_LCMS collection of workflows with a mass accuracy of 5 ppm (as would be expected with a hybrid Orbitrap mass spectrometer). The results showed that the [M+H]+ ion was detected and reported a single molecular formula of C12H9N, which was present in the trimMMD_sortAmass.txt file applied in workflow 2. The correct molecular formula was submitted. To the team, this appeared to be related to a chemical rather than a metabolite, and therefore, ChemSpider, and not KEGG, was applied in MetFrag. This molecular formula and the fragmentation mass spectrum collected applying HCD at 180 V were submitted to MetFrag applying ChemSpider as the chosen database. MS/MS data were provided at two different collision energies, and the data acquired at 120 V were chosen, as these MS/MS data provided the greatest number of product ions to allow structural information to be deduced most accurately. Sixty-five chemicals were reported; many of these were defined as chemically unusual, as they contained C-N triple covalent bonds, C-C triple covalent bonds, three fused benzene rings or two fused C-C double bonds. These chemicals were removed to leave 23 possible molecular structures. The 23 molecular structures were submitted to the contest with MetFrag scores. The correct molecular structure was submitted and was ranked 12th among the possible molecular structures.
2.2.10. Challenge 15
The two m/z values defined in the challenge as being detected in positive ion mode were analysed applying workflows 1 and 2 of the PUTMEDID_LCMS collection of workflows with a mass accuracy of 5 ppm (as would be expected with a hybrid Orbitrap mass spectrometer). The results showed that the [M+H]+ ion was detected and reported a single molecular formula of C12H13NO2, which was present in the trimMMD_sortAmass.txt file applied in workflow 2. The correct molecular formula was submitted. This molecular formula and the fragmentation mass spectrum collected applying HCD at 120 V were submitted to MetFrag, applying KEGG as the chosen database. MS/MS data were acquired at two different collision energies; both provided the same number of product ions, and the data acquired at 120 V were chosen. Three metabolites were reported; one metabolite matched in silico fragmentation data to 10 experimentally derived product ions, whereas the other two metabolites matched one and no product ions. The former metabolite (indole-3-butyric acid) was submitted to the contest, as this showed a significantly higher score in MetFrag. The correct molecular structure was not submitted. The submitted and correct structures shared the same sub-structure (indole), but their additional sub-structures differed.
2.2.11. Challenge 17
The three m/z values defined in the challenge as being detected in positive ion mode were analysed applying workflows 1 and 2 of the PUTMEDID_LCMS collection of workflows with a mass accuracy of 5 ppm (as would be expected with a hybrid Orbitrap mass spectrometer). The results showed that the [M+H]+ ion was detected and reported a single molecular formula of C13H13N3, which was present in the trimMMD_sortAmass.txt file applied in workflow 2. The correct molecular formula was submitted. This molecular formula and the fragmentation mass spectrum collected applying HCD at 90 V were submitted to MetFrag, applying KEGG as the chosen database. MS/MS data were provided applying two different fragmentation techniques (CID and HCD); the data acquired applying HCD were chosen, as these MS/MS data provided the greatest number of product ions to allow structural information to be deduced most accurately. Three metabolites were reported; only one metabolite matched in silico fragmentation data to experimentally-derived product ions, whereas the other two metabolites matched no product ions. The former metabolite (3-amino-1,4-dimethyl-5H-pyrido[4,3-b]indole) was submitted to the contest. The correct molecular structure was not submitted. The correct structure was similar to the submitted structure, as both contained two identical sub-structures (benzene and aniline); the structures differed in the chemical sub-structure connecting these two sub-structures.
2.3. Discussion
Applying two separate workflows to putatively annotate metabolites was an enjoyable process and tested the authors’ knowledge of chemistry, metabolites and metabolite annotation applying automated workflows and manual interpretation. The results presented here were acquired applying one workflow for each challenge. Other workflows were also assessed, but their results were not submitted to the CASMI contest, because only one result could be submitted for each challenge. The other workflows investigated included MI-Pack [14] and Mass Frontier [27]; both showed good results, but will not be discussed further here. The authors were ranked first in the contest for category 1; they submitted results to 11 challenges, of which their highest probability match was correct in eight challenges, their second highest probability match was correct in one challenge and their submission was not correct in two challenges. Of the two incorrect submissions, one (challenge 2) arose because the mass error of the metabolite was higher than reported in the contest information (>30 ppm compared to an expected mass accuracy of 5 ppm). This highlights an important point: the mass accuracy of any mass spectrometer does not always meet the specifications provided by instrument companies, because of either analyst error (including mass calibration errors) or ion populations that are inadequate to provide accurate determination of the ion peak shape and apex. For the other (challenge 13), the molecular formula of the correct metabolite was not present in the trimMMD_sortAmass.txt file applied in workflow 2 of PUTMEDID_LCMS. The authors did not submit entries for three challenges (11, 12 and 16), as in-source fragmentation was present, and it is known that PUTMEDID_LCMS does not report accurate molecular formulae for metabolites undergoing uncommon in-source fragmentation (though it operates well for loss of H2O, HCO2H and NH3). The results submitted to category 1 have shown that PUTMEDID_LCMS operates very well in defining the molecular formula; in only two of eleven submissions were the results not correct, one due to a limitation of the reported data and one due to a limitation of a reference file applied in PUTMEDID_LCMS.
The authors’ accuracy in defining chemical structures in category 2 was significantly lower than for category 1; they submitted results to 11 challenges, of which their highest probability match was correct in one challenge (challenge 1), their submission was not correct in eight challenges and, in two challenges, the correct structure was ranked by the authors as fourth (challenge 5) and 12th (challenge 14). All challenges were performed with the belief that all chemicals were endogenous or exogenous metabolites, and this logic was applied in the processes employed to define molecular structure. In the eight challenges where the submission was not correct, the authors applied only a metabolite-specific database (KEGG), or, in challenges where there were no matches to KEGG, ChemSpider was applied, but results were filtered (by a single author, W.B.D.) to remove chemicals not believed to be derived from endogenous and exogenous metabolism. This logic is applied in all metabolomics studies by the authors, though the contest does not state anywhere that chemicals are endogenous or exogenous metabolites, and so, applying this logic was not appropriate. Here, the application of chemical rather than metabolite-specific libraries, when integrated with the application of MetFrag, would be expected to provide greater accuracy in the annotation of metabolites, though this has not been experimentally assessed by the authors.
This observation highlights an important aspect of the annotation process. The search space for chemicals is very large; PubChem contains more than 31 million entries [19]. The metabolite search space is a subset of the chemical search space and is therefore smaller, though, depending on the biological sample, it can comprise thousands of unique metabolite structures. For example, yeast, plant and human metabolic reconstructions contain only hundreds or thousands of metabolites [29,30,31], whereas some other databases contain over 40,000 metabolites (e.g., HMDB [21]). Some metabolites are not specific to a single organism or biological sample, whereas other metabolites can be specific to a single organism or biological sample. Some databases are specific to chemicals and are large (for example, PubChem [19]), whereas some databases are metabolite-specific (for example, KEGG [20,32]). To provide accuracy in the annotation process, applying information on the organism and environment will always be beneficial. For example, when performing a metabolite search in human biofluids, you would include drugs and their metabolites in the search space, whereas in plants and microbes, you would not, unless they were specifically added to the environment. Following on from the previous discussion, the choice of database or databases to apply is important. Again, this should be organism-specific if the organism or biological sample is known, so as to reduce the search space and number of returned hits, and a greater number of organism-specific databases are being constructed. However, when information is limited and the complexity of samples is high, then chemical rather than metabolite databases should be applied, reducing the specificity of the search by increasing the number of possible matches. This was the case for the CASMI challenge, as no information was provided on the origin of the biological sample or, for challenges based on single authentic chemical standards, on the biological sample type in which the chemical is expected to be observed. This logic can also be applied for mass spectral library searches, specifically the decision of whether to apply metabolite-specific (for example, METLIN [25], HMDB [21] and MassBank [26]) or chemical-specific mass spectral libraries (for example, the NIST12 MS/MS database [33]).
The authors chose to apply in silico fragmentation to aid in their putative annotation process, as they are interested in this process for the annotation of metabolites not present in mass spectral libraries and in the appropriateness of applying it. MetFrag was chosen to perform in silico fragmentation, as it was freely available to the academic research community. The authors also applied Mass Frontier, a commercial software package available from HighChem, though did not submit any results from these data, as only single submissions were available for each challenge. A second process, which the authors apply, is to submit experimentally-derived MS/MS data to freely available mass spectral MS/MS libraries, including METLIN [25], MassBank [26] and HMDB [21]. However, because of the interests of the authors, we decided to submit only the in silico-derived data from MetFrag for this contest.
The authors would like to emphasise that all data presented here for unknowns (but not challenges based on chemical standards) are provided as putative annotations (level 2, according to the MSI [34]). To provide level 1 identifications, authentic chemical standards would need to be purchased and data acquired applying the same analytical methods.