3. Mass Spectral Database Search for Fast Annotations
Mass spectral database search is currently the fastest and most accurate way to obtain initial compound annotations. Current public and commercial mass spectral databases contain around 1–2 million spectra of one million unique compounds. Most of these spectra are EI mass spectra for GC-MS, while fewer are available for LC-MS/MS analysis. Traditionally, these databases have been derived from authentic experimental reference compounds and were collected from the literature [46]. Lately, computationally generated in silico spectra have also gained in importance, as discussed below. Both the experimentally derived and the in silico generated databases are enriched with metadata such as instrument type, collision energy and ionization mode, and with structural information such as the InChIKey [47] and the SPLASH (spectral hash code) for uniqueness calculations [48]. InChIKey and SPLASH serve as unique identifiers in the structural and spectral domains, respectively. Errors introduced during reference library building can be corrected by software-based or manual curation [49].
Table 3 lists a selection of commonly used mass spectral databases; see recent reviews for complete coverage of mass spectral databases [19,50].
In terms of coverage, up to 400 metabolites were identified from NIST plasma reference standards utilizing multiple platforms and database matching [56]. The NIH Common Fund metabolomics ring trial with the participation of multiple US labs annotated around 1000 metabolites using multiple technologies and reference spectra matching. However, literature references for the plasma or serum metabolome covered up to 5000 compounds by combining targeted and non-targeted metabolomics analysis from five platforms [57]. It is therefore clear that matching experimental spectra to experimental reference databases is a severely limited process and covers only a fraction of the detectable metabolome.
Many modern algorithms for peak detection and mass spectral deconvolution have built-in database search algorithms. These include freely available search tools such as the NIST MS Search GUI (graphical user interface), NIST MS PepSearch or MS-DIAL [58]. Commercial software from mass spectrometry vendors uses similar algorithms.
Mass spectra have traditionally been scored by a number of algorithms such as probability match searching, dot-product search and other similarity measures [19]. Recently, a novel hybrid similarity search method has been introduced that can annotate unknown spectra. The method does not account for the precursor m/z and instead utilizes similar neutral losses and fragmentation patterns [59]. Spectral similarity can also indicate structural similarity, and this information can be used for the annotation of unknown compounds [60]. Clustering approaches that group structurally similar compounds by the cosine similarity of their product ion spectra can improve the annotation of unknown metabolites [61]. Despite the advantages of a fast library search, it is becoming clear that mass spectral scoring algorithms have to be improved, especially for product ion spectra that contain only a few fragments [46] or for libraries that integrate spectra from multiple instrument types. Here, approaches that can calculate false discovery rates (FDR) will be useful to improve spectral match and annotation quality [62,63].
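As a minimal sketch of the dot-product (cosine) scoring mentioned above, the following Python snippet bins two centroided spectra and computes their cosine similarity. The function names, the 1 Da bin width and the absence of any m/z or intensity weighting are illustrative assumptions; library search programs such as NIST MS Search use their own, more elaborate scoring schemes.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum the intensities of (m/z, intensity) peaks into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return binned


def cosine_similarity(query, reference, bin_width=1.0):
    """Cosine of the angle between two binned spectra (0 = no overlap, 1 = identical)."""
    q = bin_spectrum(query, bin_width)
    r = bin_spectrum(reference, bin_width)
    # Only bins populated in both spectra contribute to the dot product.
    dot = sum(q[k] * r[k] for k in q.keys() & r.keys())
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_r = math.sqrt(sum(v * v for v in r.values()))
    if norm_q == 0.0 or norm_r == 0.0:
        return 0.0  # an empty spectrum matches nothing
    return dot / (norm_q * norm_r)


# Hypothetical spectra, given as (m/z, intensity) pairs.
query = [(41.0, 10.0), (55.0, 50.0), (91.0, 100.0)]
reference = [(41.0, 12.0), (55.0, 45.0), (91.0, 100.0), (105.0, 5.0)]
score = cosine_similarity(query, reference)
```

A score near 1 only says that the two peak lists look alike. For spectra with very few fragments, many unrelated compounds can reach high scores, which is one reason why FDR estimates are needed on top of the raw similarity.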
Community efforts have positively impacted the sharing of mass spectra. The MassBank database (http://massbank.jp) is one of the most successful examples, with a wide user base and contributors from many different countries [52]. In a coalition of database servers, the European MassBank efforts (https://massbank.eu/) [64] and MassBank of North America (http://massbank.us/) enable immediate sharing of mass spectra of annotated structures, including autocuration of spectra and chemical structure information (InChIKeys). In comparison, the GNPS [54] spectral database utilizes crowdsourcing approaches to annotate unknown compounds. Commercial libraries such as NIST17 still play an important role because of high levels of manual curation, overall good data quality and wide coverage of substances.
4. In Silico Generation of Mass Spectra and MS/MS Spectra
As described before, scientists today have access to around 100 million known compounds in PubChem and ChemSpider. However, fewer than one million compounds have associated electron ionization (EI) mass spectra (for GC-MS applications), and even fewer LC-MS/MS tandem mass spectra are available. Generating in silico mass spectra, therefore, is a unique opportunity to close this gap. Research into computational generation of mass spectra has gained much traction during the last five years. Four general methods can be distinguished: quantum chemistry, machine learning, heuristic-based methods and chemical reaction-based methods.
Quantum chemistry methods use first principles and purely physical and chemical information to generate mass spectra. In a major breakthrough for computational mass spectrometry, Grimme described in 2013 how Born–Oppenheimer ab initio molecular dynamics can be used to generate in silico electron ionization mass spectra of any given compound [65,66,67,68,69]. An overview of methods for in silico generation of mass spectra, including commercially or freely available algorithms, is given in Table 4.
Machine learning-based methods such as CFM-ID, developed by Allen et al., allow for the computation of CID-MS/MS [70] and EI-MS spectra [71] directly from molecular structures. It is a very versatile approach useful for small molecules and peptides up to 1000 Da [72]. The methodology requires large and diverse training sets, which in turn improve the overall accuracy of the trained models.
Heuristic approaches such as LipidBlast are advantageous for compound classes that have recurring and predictable fragmentations, such as lipids [73]. However, the heuristic approach cannot be expanded to molecules with very diverse structural scaffolds. The libraries themselves can be easily extended to include novel or recently discovered lipid classes [40,45,74].
Reaction-based approaches are implemented in the Mass Frontier software (HighChem Ltd., Bratislava, Slovakia) and are based on thousands of reactions reported in the literature. Novel molecules can be fragmented based on observed reaction pathways. Only bar code spectra can be generated; hence, peak abundances are missing.
The accuracy of in silico generated peaks and their abundances has to be substantially improved. A comparison between QCEIMS and CFM-ID has shown that both algorithms perform well enough to get correct identifications for half of the 61 investigated molecules [75]. However, certain rearrangement reactions, including McLafferty rearrangements, remain underestimated. The highly accurate and fast OM2 and OM3 semiempirical methods [76] have been further improved upon by integrating the GFN-xTB Hamiltonian into QCEIMS [77]. Independent approaches used DFT reaction pathway and transition state modelling to model EI mass spectra [78] or Monte Carlo sampling to obtain EI mass spectra for select cases [79].
Currently, there is no fully automatic software for the generation of in silico MS/MS spectra based on LC-MS collision-induced dissociation (CID). Several groups have shown interest in this challenging topic and have provided steps that can finally lead to a fully automated stand-alone solution. These include workflows to automatically find the correct protonation sites in a molecule [80,81], ways to utilize rotamers, conformers and Boltzmann averaging, and the evaluation of semiempirical and density functional theory (DFT) methods to calculate fragments.
The validation of generated in silico spectra is probably the most crucial aspect, especially when ‘blindly’ applying software models to large molecule repositories. For example, the original CFM-ID models were trained on a number of small metabolites. Therefore, these initial models are focused on lower molecular weight molecules and may not be feasible for the generation of in silico spectra of high molecular weight lipids or large complex secondary metabolites. In order to obtain high accuracy, the CFM-ID models have to be retrained with adequate lipid and secondary metabolite training sets. As always, external validation with mass spectra that were not available during training is highly recommended. For ab initio models, large validation sets with thousands of compounds have to be generated to obtain confidence scores.
Furthermore, regarding in silico spectra, two major problems will arise in the future. First, computational processes follow the normal distribution; hence, a large number of in silico spectra of average accuracy will be observed. The flanks will consist of a small number of inaccurate spectra as well as a small number of high-quality spectra. Here, research needs to focus on ways to improve the average accuracy of in silico spectra predictions, but also to exclude the low-quality in silico spectra. In addition, the community will need to develop improved MS/MS match confidence scores. Otherwise, wrong spectra and publications with false compound annotations will lead to many false-positive annotations in databases. The second problem is the generation of millions of very similar in silico spectra, because compound databases host millions of structurally very similar compounds. This will lead to an effect called database poisoning: mass spectral databases fill up with compound spectra that cannot be easily distinguished by database search alone. Here, research has to focus on orthogonal filtering methods such as ion mobility or retention time filters.
5. In Silico Fragmentation Software
In silico fragmentation approaches for the annotation of unknown molecules are used in those cases where no reference mass spectra are available for database matching [82]. These generally involve matching experimental spectra against a selection of in silico generated fragments calculated for candidates retrieved from known compound databases (see Figure 2). Instead of searching mass spectral databases, which cover only one million compounds, in silico fragmentation algorithms have access to molecular structure databases including ChemSpider and PubChem, covering almost 100 million compounds [83].
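The candidate retrieval step that precedes in silico fragmentation can be sketched as a simple mass-window lookup. The snippet below is only an illustration: the three-entry structure table stands in for a real database such as PubChem, the function name and the 5 ppm tolerance are assumptions, and only the [M+H]+ ion species is considered.

```python
PROTON_MASS = 1.007276  # Da, mass added to the neutral molecule in [M+H]+


def candidates_by_mass(precursor_mz, structure_db, ppm=5.0):
    """Return names of compounds whose monoisotopic mass matches an [M+H]+ precursor."""
    neutral_mass = precursor_mz - PROTON_MASS
    tolerance = neutral_mass * ppm * 1e-6  # ppm window converted to Da
    return [
        name
        for name, mono_mass in structure_db
        if abs(mono_mass - neutral_mass) <= tolerance
    ]


# Hypothetical miniature structure database: (name, monoisotopic mass in Da).
structure_db = [
    ("glucose", 180.0634),
    ("fructose", 180.0634),
    ("caffeine", 194.0804),
]

hits = candidates_by_mass(181.0707, structure_db)
```

Both hexoses are returned because they share the same molecular formula. Accurate mass alone cannot separate such isomers, which is why the retrieved candidates are subsequently fragmented in silico and scored against the experimental MS/MS spectrum.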
These in silico fragmentation approaches aim to identify “known unknowns” (i.e., compounds present in molecular structure databases but without any reference spectra) by calculating a score between the experimental spectra and the predicted spectra (or predicted fragments). The major disadvantage is that “unknown–unknown” compounds cannot be elucidated in such a way. Below, we discuss some of the tools that have participated in structure elucidation challenges and can be used for batch annotations of unknown compounds (see Table 5). Additional software tools including iMet [84], MAGMa [85], MIDAS [86] and Midas-G [87] are discussed elsewhere. Most of the approaches below have been discussed in much greater technical detail in a series of excellent reviews [88,89,90,91].
MetFrag [92] is a combinatorial fragmenter that retrieves candidate structures from PubChem, ChemSpider, KEGG, and a few other more specific compound databases. Candidates are fragmented using a bond dissociation approach and are finally matched to experimentally obtained spectra. MetFrag and MetFusion [93] have been actively developed and improved, allowing local or web-based use. The LipidFrag tool was developed later to increase confidence in lipid annotations [94].
MS-FINDER [84] is a Windows-based GUI software that aids the structure elucidation process by in silico fragmentation of candidate structures for all predicted molecular formulas. The formulas are determined from the accurate mass, isotope ratio, and product ion information [95], and the structures are retrieved from 15 databases that are embedded into MS-FINDER [96,97]. The structures are then ranked by a variety of factors, with nine hydrogen rearrangement rules contributing most to the final score calculations.
CSI:FingerID [98] is a freely available web service that uses a two-step scheme: first, a kernel-based approach is utilized to predict the molecular fingerprint [99] of the unknown from its MS/MS spectrum, and then the predicted molecular fingerprint is matched against a molecular compound database. Included is a module that combines computation and comparison of fragmentation trees for the prediction of molecular properties of the unknowns as well as the molecular formula generation. Novel algorithms such as IOKR (input output kernel regression) [100] are now integrated into the workflow. The stand-alone SIRIUS GUI software [101] is used to calculate fragmentation trees and, subsequently, molecular formulas [102]. SIRIUS is now directly coupled to the CSI:FingerID online server that matches fingerprints against a database and retrieves ranked structure candidates.
CFM-ID (competitive fragmentation modeling) is a suite of software tools that can perform spectra prediction and compound identification. It is based on a machine-learning approach that includes chemical rules and is available for ESI MS/MS data as well as EI mass spectra. CFM-ID can be used as a web server or can be called locally through command line utilities on Windows, Linux and macOS. For larger datasets, the software can be deployed to clusters to reduce computation times.
ChemDistiller [103] is a Python-based tool that uses structural fingerprints and fragmentation patterns together with a machine learning algorithm to annotate unknown compounds. It utilizes multiple target databases covering more than 130 million compounds to annotate unknowns and the output is presented in a web interface for further inspection. It is a very fast and highly parallelized tool that makes use of modern multi-core CPUs.
Mass Frontier [82], developed by HighChem, is based on observed experimental gas-phase fragmentation reactions. It contains basic fragmentation rules as well as an exhaustive library of over 100,000 known fragmentation rules collected from published data, which also allows for fragmentation predictions and annotation of unknowns [104]. The software supports electron ionization (EI) and collision-induced dissociation (CID) ESI MS/MS modes. Mass Frontier can search internal databases or the mzCloud database and is commercially available.
To improve the annotation rates, database type restrictions (for example to environmental, plant or metabolic pathway databases) can be applied. Taxonomy restrictions are also useful when researching specific organisms. Generally, in silico fragmentation algorithms still need to improve tremendously. A comparison of four algorithms using the CASMI test compounds as input has shown that pure in silico algorithms could only identify 17–25% of the compounds correctly [105]. Boosting the output by adding MS/MS search and bio-database focused lookups as well as combining the outputs of multiple software tools led to much higher identification rates of up to 93% [106]. Combining multiple in silico fragmentation software tools with a priori information is a valuable option when facing a structure elucidation challenge [106].
6. Retention Time Prediction
Retention times are important as orthogonal filters during structure determination in metabolic profiling experiments. A number of MS/MS and retention time databases have been developed for metabolic profiling [55]. However, these resources usually contain only a few hundred experimentally obtained retention time values. It is therefore useful to predict theoretical retention times for the millions of existing compounds in compound databases by quantitative structure-retention relationship (QSRR) modelling [107]. This field of research has been active for more than 30 years. Traditionally, group-contribution methods were used for GC-MS modelling by assigning small retention index increments to specific substructures [108]. However, a vast number of different separation columns and a practically unlimited combination of solvent buffer systems and chromatographic conditions exist in LC-MS, locking the predicted models to very specific conditions [109].
Another major reason why there is no universal retention prediction method for LC-MS/MS is the lack of large and diverse training sets. A minimum of a thousand compounds covering all major chemical scaffolds in hydrophilic interaction liquid chromatography (HILIC) or reversed-phase chromatography (RP) are required to generate a robust retention prediction model useful for metabolic profiling.
An additional important consideration for retention time models is the applicability domain or structural space used in model building [110]. In short, if a natural product training set is used, the model should be used for the prediction of natural products and not for drugs. A simple measure would be to perform a principal component analysis on the substructure feature space for training samples and new predictive compounds and to confirm that the space overlaps. However, a recent approach utilized 1955 synthetic screening compounds that cover a similar scaffold space as small metabolites and used artificial neural networks to predict LC-MS retention indices for 202 endogenous metabolites [111]. This approach is particularly interesting because plated screening compounds are commonly less expensive than endogenous metabolites. By massively increasing the structural scaffold space, the retention model can become more robust, even if these molecules will never be annotated in biological samples. Many retention time prediction models are locked to a specific LC column and a solvent and buffer system, unless a “retention projection” method can be applied to transfer data to other chromatographic systems [112,113,114].
Retention times can be predicted by using chemical descriptors as input parameters, which can be computed directly from structures by tools such as Dragon [115], MOLD2 [116] or PaDEL [117]. Dragon 7 now calculates 5270 molecular descriptors, covering fragment counts, topological and geometrical descriptors. Low-energy three-dimensional conformer structures can be generated by a number of tools [118] and even better with quantum chemical methods [119]. Subsequently, regression models can be built using the descriptor data as input and the retention time as a target function. Over 200 machine learning models, preferably with deep neural networks [120] or fast random forest methods [121], are now available. To improve accuracy and prediction power, complex gradient boosting methods (XGBoost/LightGBM) and ensemble methods such as bagging, stacking and averaging are now routinely employed [122]. In the past, a wide variety of retention prediction models have been proposed for HILIC and reversed-phase columns based on different machine learning approaches. These included partial least squares methods [123,124,125], multiple linear regression [126,127,128], support vector regression [129,130], random forests [131] and artificial neural networks [132,133,134].
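The idea of regressing retention time on molecular descriptors can be illustrated with a deliberately small sketch: an ordinary least-squares fit on a single descriptor, combined with a crude range check as a stand-in for an applicability domain test. The descriptor values and retention times are hypothetical numbers chosen for illustration, the function names are not taken from any of the cited tools, and real QSRR models use hundreds to thousands of descriptors with the machine learning methods listed above.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_rt(descriptor, slope, intercept):
    """Predict a retention time from one descriptor value."""
    return slope * descriptor + intercept


def in_domain(descriptor, training_descriptors):
    """Crude applicability check: is the value inside the training range?"""
    return min(training_descriptors) <= descriptor <= max(training_descriptors)


# Hypothetical training set: one lipophilicity-like descriptor per compound
# and the corresponding reversed-phase retention time in minutes.
train_descriptor = [-1.0, 0.5, 2.0, 3.5]
train_rt = [1.0, 4.0, 7.0, 10.0]

slope, intercept = fit_line(train_descriptor, train_rt)
rt_new = predict_rt(1.0, slope, intercept)  # inside the training range
trusted = in_domain(6.0, train_descriptor)  # outside: prediction not trusted
```

A fitted model of this kind is only valid for the column, gradient and buffer system on which the training retention times were measured, and only for compounds that resemble the training set.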
In summary, the success of the retention time modelling depends on the size and the diversity of the compound training data set. Currently, most RT models are locked to specific columns and conditions, unless a retention projection method is used. For useful retention time prediction models, the only remedies are large and diverse training sets covering multiple compound classes to obtain reliable, highly predictive and accurate models.
7. Ion Mobility and the Use of Collision Cross Section (CCS) Values
LC-MS/MS alone will often be unable to discriminate between stereoisomers and regioisomers, unless chiral columns are utilized. It is therefore useful to couple ion mobility analyzers to LC-MS/MS to allow for a higher number of features to be separated and detected [135]. Ion mobility is a technique that separates ions in an inert buffer gas (nitrogen, hydrogen) under the influence of an electric field [136,137]. Several types of ion mobility analyzers are available, among them drift tube ion mobility spectrometry (DTIMS), traveling wave ion mobility spectrometry (TWIMS) and high-field asymmetric waveform ion mobility spectrometry (FAIMS) [138].
For DTIMS and TWIMS, the observed drift times are influenced by relative molecule size and conformational parameters. For DTIMS, collision cross-section (CCS) values can be directly measured and computed [139,140], and for TWIMS the CCS values can be obtained from calibrations with known standards [141]. The FAIMS technology has limited peak capacity [142,143], but can be used as an orthogonal filter to separate different classes of compounds and to improve signal/noise ratios during measurements [144]. For FAIMS, no CCS values can be determined [138].
The experimental CCS values have a very high reproducibility, and CCS values with a relative standard deviation (RSD) of <1–2% can be routinely obtained [139,145,146]. This opens up the LC-IMS-MS/MS technology for orthogonal filtering approaches utilizing CCS values [147] (see Figure 3) and, more importantly, for predictive technologies utilizing CCS values in a manner similar to retention time predictions. Such predictive approaches can include computational and quantum chemical models [148,149] as well as machine learning predictions [150] such as artificial neural networks [132,151]. Prediction errors as low as 3% have been reported for CCS models [152]. Once these models are applied to structures from large metabolomic databases, they can be used for filtering during the compound identification process [138,153], and such predicted values are covered in publicly available databases such as MetCCS [152] or LipidCCS [154,155]. Currently, an estimated total of 3000–4000 experimental small molecule CCS values have been reported in a recent review [150], with the largest single collection containing CCS values for 1420 compounds [145]. Focused collections for sterols [156], metabolites and xenobiotics are also available [139,157].
Several considerations have to be taken into account when working with CCS values and predictive databases. CCS values of individual compounds depend on many additional parameters such as buffer gas, solvents, temperature, pH, ion activation voltage and conformer/rotamer ensembles [158,159]. For example, different ion species such as [M + H]+ and [M + Na]+ have different CCS values, differing on average by ±7 Å² based on values obtained from [139]. This is related to conformational changes and subsequently leads to the conclusion that different adducts have to be modelled and predicted separately. Furthermore, different protonation sites or protomers can lead to different CCS values [145]. Drugs such as benzocaine can have N- or O-protonated forms, leading to different CCS values for the same compound [160]. The different protomers can be determined with the help of quantum chemical methods [161,162] and cheminformatics methods that calculate different protonation sites. Reference standards themselves may not be enantiomerically pure and therefore can lead to the measurement of multiple experimental CCS values. Furthermore, while CCS values measured on the same instrument type have low measurement errors (RSD < 1%) [163], the experimental CCS values may differ between instrumental setups (DTIMS/TWIMS), as well as between prediction models. For example, the proton adduct of the drug indomethacin has a reported CCS value of 183.54 Å² measured on a drift tube IMS (DTIMS) [139]; the same compound has a CCS value of 179.039 Å² measured on a TWIMS setup, and the predicted value is 197.7 Å², which falls outside the 3% median prediction error [164].
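The indomethacin example can be made explicit with a short calculation. The sketch below computes the relative deviation between two CCS values and applies a tolerance filter; the function names and the use of 3% as a hard cut-off are assumptions made for illustration, while the three CCS values are the ones quoted above.

```python
def percent_deviation(value, reference):
    """Absolute deviation of a CCS value from a reference value, in percent."""
    return 100.0 * abs(value - reference) / reference


def passes_ccs_filter(value, reference, tolerance_percent=3.0):
    """True if the two CCS values agree within the given relative tolerance."""
    return percent_deviation(value, reference) <= tolerance_percent


# Indomethacin [M + H]+ values quoted in the text, in square angstroms.
ccs_dtims = 183.54
ccs_twims = 179.039
ccs_predicted = 197.7

dev_predicted = percent_deviation(ccs_predicted, ccs_dtims)  # about 7.7 %
dev_twims = percent_deviation(ccs_twims, ccs_dtims)          # about 2.5 %
keep_predicted = passes_ccs_filter(ccs_predicted, ccs_dtims)
keep_twims = passes_ccs_filter(ccs_twims, ccs_dtims)
```

With these numbers the two experimental values agree within 3% of each other, whereas the predicted value deviates by roughly 7.7% from the DTIMS measurement. A CCS filter with a fixed 3% window would therefore reject the correct structure in this case.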
Because of the IMS capability of separating stereoisomers and other isobaric compounds, the routine use of CCS values will become more and more important. The excellent experimental reproducibility of CCS measurements compared to retention times will also improve identification rates. Once larger CCS datasets become publicly available, they can be combined, average consensus values can be calculated and CCS prediction methods can be retrained with larger compound numbers, which should make them more accurate. Technological advances such as printed circuit board (PCB)-based devices have led to ion elevators and escalators in multilevel structures [165]. Such structures for lossless ion manipulations (SLIM) have demonstrated unprecedented ultra-high resolution ion mobility [166].
8. Compound Identification: Hybrid and Orthogonal Approaches
The following section discusses some general compound identification workflows as well as a few selected examples of single compound identification via mass spectrometry. Workflows are important for highly reproducible and repeatable metabolomics analysis. Among those are Galaxy workflows [161] such as Workflow4metabolomics.org, as well as Taverna and KNIME workflows, although the latter have a considerably smaller user base [167]. A conceptual compound ID workflow has been described that includes in silico metabolic synthesis, in silico fragmentation [168] and finally annotation of compounds via database scoring [169]. The same paper discusses the importance of meta-integration of multiple tools and multiple layers of information to improve confidence in compound identification. Another related review discusses the importance of including MS1 peak relationships such as adducts and neutral losses, MS/MS data and biochemical knowledge, as well as modelling of retention times as an orthogonal filter. A knowledge-based workflow for metabolite annotations that includes ionization rules, adduct formation rules and retention time rules was described in [170].
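Adduct formation rules of the kind used in such workflows can be sketched as fixed mass shifts applied to the neutral monoisotopic mass. The snippet below is an illustrative assumption rather than the rule set of any cited tool: it covers only three singly charged ion species, uses mass shifts rounded to four decimals, and the function names and the 0.005 Da tolerance are hypothetical.

```python
# Mass shifts (Da) relative to the neutral monoisotopic mass, rounded to
# four decimals; only singly charged species are considered here.
ADDUCT_SHIFTS = {
    "[M+H]+": 1.0073,
    "[M+Na]+": 22.9892,
    "[M-H]-": -1.0073,
}


def adduct_mz(neutral_mass, adduct):
    """Expected m/z of a singly charged adduct of a neutral molecule."""
    return neutral_mass + ADDUCT_SHIFTS[adduct]


def find_adduct_pairs(mz_values, first="[M+H]+", second="[M+Na]+", tol=0.005):
    """Find pairs of MS1 peaks whose m/z difference matches two adducts of one compound."""
    expected = ADDUCT_SHIFTS[second] - ADDUCT_SHIFTS[first]
    ordered = sorted(mz_values)
    pairs = []
    for i, low in enumerate(ordered):
        for high in ordered[i + 1:]:
            if abs((high - low) - expected) <= tol:
                pairs.append((low, high))
    return pairs


# Hypothetical co-eluting MS1 peaks; the first two are consistent with the
# [M+H]+ and [M+Na]+ ions of a compound with neutral mass 180.0634 Da.
peaks = [181.0707, 203.0526, 150.0000]
pairs = find_adduct_pairs(peaks)
```

Grouping such peak pairs before database search reduces the number of features treated as separate unknowns and provides an independent check on the assumed ion species.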
However, even in-source fragmentation LC-MS mass spectra, when used together with retention times of authentic compounds, can be sufficient for ‘Level 1’ annotations in metabolomics [171]. A pipeline that uses multicriteria scoring, including retention times, intensity profiles and adduct patterns, was developed for high-resolution mass spectral data [172]. The extraction of commonly occurring substructures from MS/MS data can help during higher level annotations [173]. Another workflow included the use of multiple identification criteria such as accurate mass, retention time, MS/MS spectrum, and product/precursor ion intensity ratios to support reversed-phase and HILIC-based metabolic profiling [174]. Two in silico fragmenters and two retention prediction models were utilized to annotate hydrophobic compounds [175]. A tool for improved and automated adduct detection was discussed in [176], leading to 83% correct annotations of adduct ions. The dereplication of natural products with the help of a fragment database was described in [177]. Pitfalls, limitations and general recommendations for data processing and compound identification were discussed in [24,178,179].
Full structure elucidation of single novel compounds with chromatography and mass spectrometric analysis is possible but is harder than with the isolation of compounds and NMR analysis. A clear benefit of LC-MS/MS approaches is the limited amount of material needed, in comparison to LC-MS/MS-NMR methods. A recent report annotated N1-acetylisoputreanine and N1-acetylisoputreanine-gamma-lactam by metabolic profiling and used custom synthesis to confirm the commercially unavailable metabolite [180]. Another approach used multiple-stage tandem mass spectrometry (MS4) and custom synthesis to identify and confirm N,N,N-trimethyl-L-alanyl-L-proline betaine in human plasma. Novel glycolipids were found in yeast and annotated by combining multiple mass spectrometric platforms and chiral chromatography to ascertain stereoisomer configuration [181]. Another approach showed the combined use of high-resolution MS/MS data and the metabolic in silico network expansion database (MINE) for the discovery of novel methylated epi-metabolites including N-methyl-UMP [45]. Natural products can be manually annotated with high success rates [182], but such approaches require deep mass spectral knowledge. In the future, such manual approaches must be translated into practical expert algorithms and software that allow non-experts to perform such complicated analyses to a certain degree [27]. Finally, all pipelines and workflows must be validated by independent and external benchmark sets such as the CASMI competitions discussed below.
Abbreviations and Glossary
MSn | Multiple stage mass spectrometry |
CASMI | Critical Assessment of Small Molecule Identification |
CCS | Collisional cross-section |
CFM-ID | Competitive Fragmentation Modeling for Metabolite Identification |
FAHFAs | Fatty Acid esters of Hydroxy Fatty Acids |
Fragmentation tree | Mass spectral fragmentation pathway of a compound |
GNPS | Global Natural Products Social molecular networking |
HMDB | Human Metabolome Database |
IM | Ion mobility |
InChIKey | Hash key or short unique structure code |
LipidBlast | In silico generated database for lipid identification |
MassBank | Mass spectral database |
MetaboBASE | Mass spectral library developed by Bruker |
MoNA | MassBank of North America |
NIST | National Institute of Standards and Technology |
NMR | Nuclear Magnetic Resonance |
ReSpect | RIKEN MSn spectral database for phytochemicals |
SPLASH | Hashed code or unique identifier for mass spectra |