De Novo Structure Prediction from Tandem Mass Spectra: Algorithms, Benchmarks, and Limitations
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The review article titled "De Novo Structure Prediction from Tandem Mass Spectra: Algorithms, Benchmarks, and Limitations" offers a comprehensive and timely survey of de novo molecular structure prediction from tandem mass spectrometry (MS/MS), with a particular emphasis on modern machine-learning-based generative approaches. The authors provide a clear historical taxonomy; a technically rigorous discussion of data representations, benchmarking practices, and evaluation metrics; and, most importantly, a critical analysis of data leakage and evaluation realism, which is currently one of the most pressing issues in this field.
While the research topic is interesting and the authors have provided sufficient evidence to establish it, some issues in the current manuscript require revision. Detailed comments are provided below.
Comment-1: The manuscript correctly emphasizes Top-k SMILES accuracy as the most stringent metric. However, the discussion of fingerprint-based similarity metrics (Tanimoto, Fraggle) could benefit from a clearer warning about their limitations in generative settings: high similarity can mask incorrect functional group placement; different fingerprints emphasize different chemistry; and fragment-based metrics may still overestimate practical interpretability.
Recommendation:
Add a short paragraph explicitly cautioning against over-interpreting high Top-k Tanimoto values as “near-correct structures,” especially in metabolomics contexts.
Comment-2: The manuscript acknowledges that stereochemistry is poorly resolved by MS/MS, but this limitation is not consistently reflected in evaluation discussions. Similarly, ion chemistry (adduct formation, rearrangements, in-source fragmentation) is mentioned but not deeply integrated into the critique of current models.
Recommendation:
Clarify whether reported Top-k SMILES accuracy refers to fully specified stereochemistry, or graph-level equivalence ignoring stereo. If stereo is ignored in most benchmarks, this should be stated explicitly to avoid confusion.
Comment-3: The manuscript excels at identifying shortcomings but occasionally underplays what has been learned from earlier generations of models.
For example, decoupled pipelines are discussed primarily through the lens of error propagation, and sequence-to-sequence models are framed mainly as a transitional failure mode.
Recommendation:
Add short summaries at the end of Sections 3.2 and 3.3, highlighting which ideas remain relevant and have been inherited by graph-based models.
Comment-4: Table 1 is useful, but adding a column indicating evaluation benchmark used (GNPS vs MassSpecGym vs NPLIB1) would greatly improve interpretability.
Author Response
Comments 1:
The manuscript correctly emphasizes Top-k SMILES accuracy as the most stringent metric. However, the discussion of fingerprint-based similarity metrics (Tanimoto, Fraggle) could benefit from a clearer warning about their limitations in generative settings: high similarity can mask incorrect functional group placement; different fingerprints emphasize different chemistry; and fragment-based metrics may still overestimate practical interpretability.
Response 1:
We thank the Reviewer for this observation. We agree that while fingerprint-based and fragment-based metrics are important for quantifying structural resemblance, they can be misleading in generative settings if interpreted without caution. In particular, (i) high similarity can mask errors in functional group placement and regiochemistry; (ii) different fingerprints emphasize different chemical features and can yield qualitatively different similarity judgments for the same pair of structures; and (iii) fragment-based metrics may still overestimate practical interpretability by assigning credit to shared fragments even when chemically critical substructures are incorrect.
Accordingly, we added a dedicated cautionary paragraph at the end of the structural evaluation discussion to explicitly summarize these limitations and to recommend interpreting fingerprint/fragment similarity alongside stricter correctness metrics (e.g., Top-k SMILES accuracy) and chemically grounded error analysis.
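As a minimal illustration of point (i), the sketch below computes Tanimoto similarity on sets of "on" fingerprint bits. The bit sets are hypothetical placeholders, not fingerprints of real molecules; they only mimic the situation in which two structures share most features and differ in a few position-specific ones.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 1.0  # convention: two empty fingerprints are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical on-bit sets (assumption for illustration only).
# The candidate shares 38 of 40 bits with the reference and has 2 bits of its own,
# as might happen when a single functional group sits at a different position.
reference = set(range(40))
candidate = (reference - {3, 17}) | {101, 102}

score = tanimoto(reference, candidate)
print(round(score, 3))  # 38 shared bits / 42 bits in the union -> 0.905
```

A score near 0.9 would often be read as "near-correct," even though the two bit sets here stand for structures that are not identical, which is why the similarity value is best reported alongside exact-match accuracy.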
Comments 2:
The manuscript acknowledges that stereochemistry is poorly resolved by MS/MS, but this limitation is not consistently reflected in evaluation discussions. Similarly, ion chemistry (adduct formation, rearrangements, in-source fragmentation) is mentioned but not deeply integrated into the critique of current models.
Response 2:
We thank the Reviewer for this observation and agree that the evaluation discussion must consistently reflect the physical constraints of MS/MS, especially for stereochemistry and ion chemistry. We made two targeted revisions.
- Stereochemistry / what Top-k SMILES accuracy actually measures. We updated Section 5.2 to clarify that Top-k SMILES accuracy, as commonly reported, corresponds to graph-level equivalence and that standard 1D MS/MS often lacks the information content needed to reliably distinguish stereoisomers (e.g., chiral centers). We explicitly note that “correct” predictions under these metrics do not imply stereochemical correctness unless additional evidence is available.
- Ion chemistry integrated into model critique. We expanded Section 5.6 to integrate the effects of adduct formation, in-source fragmentation, and gas-phase rearrangements into the critique of both models and evaluation. We emphasize that standard structural similarity metrics are generally insensitive to mechanistic continuity in ion chemistry, which can lead to misleading penalization when the predicted neutral structure is chemically plausible but the measured fragments arise from expected gas-phase transformations.
To make this limitation concrete, we redesigned Figure 7 to visually demonstrate: (i) metric insensitivity to stereochemistry using an enantiomeric example (L-DOPA vs D-DOPA), and (ii) the discrepancy induced by gas-phase transformations using the cyclization of a linear dipeptide (Ala–Ala) into a cyclic oxazolone-type b2-ion, showing how mechanistically expected rearrangements can cause an apparent collapse in metrics such as Tanimoto similarity, MCES distance, and Fraggle similarity.
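The difference between stereo-aware and graph-level Top-k accuracy can be sketched as follows. This is an illustrative sketch, not the evaluation code of any benchmark. It assumes structures are supplied as InChIKeys and relies on the fact that the first InChIKey block hashes connectivity while stereochemistry is encoded in the second block. The keys shown are made-up placeholders, not the real keys of L-DOPA or D-DOPA.

```python
from typing import Callable, Sequence


def connectivity_block(inchikey: str) -> str:
    """Return the first InChIKey block (connectivity hash, no stereochemistry)."""
    return inchikey.split("-")[0]


def top_k_accuracy(
    predictions: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int,
    key: Callable[[str], str] = lambda s: s,
) -> float:
    """Fraction of targets found among the first k ranked predictions.

    `key` maps an identifier to the form used for comparison: the identity
    function gives a strict match, `connectivity_block` a graph-level match.
    """
    if not targets:
        raise ValueError("targets must not be empty")
    hits = 0
    for ranked, target in zip(predictions, targets):
        if key(target) in {key(p) for p in ranked[:k]}:
            hits += 1
    return hits / len(targets)


# Placeholder keys (assumption): same connectivity block, different stereo block,
# standing in for a pair of enantiomers.
targets = ["AAAAAAAAAAAAAA-BBBBBBBBSA-N"]
predictions = [["AAAAAAAAAAAAAA-CCCCCCCCSA-N", "DDDDDDDDDDDDDD-UHFFFAOYSA-N"]]

strict = top_k_accuracy(predictions, targets, k=1)
graph_level = top_k_accuracy(predictions, targets, k=1, key=connectivity_block)
print(strict, graph_level)  # 0.0 1.0
```

The same prediction counts as a miss under strict matching and as a hit under graph-level matching, so a reported Top-k value is only interpretable once the comparison key is stated.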
Comments 3:
The manuscript excels at identifying shortcomings but occasionally underplays what has been learned from earlier generations of models. For example, decoupled pipelines are discussed primarily through the lens of error propagation, and sequence-to-sequence models are framed mainly as a transitional failure mode.
Response 3:
We thank the Reviewer for this insightful point and agree that our initial framing underemphasized the lasting contributions of earlier approaches. We revised Sections 3.2 and 3.3 to provide a more balanced historical perspective that (i) acknowledges limitations, but (ii) explicitly highlights key ideas inherited by modern graph-native generators.
- Section 3.2 (decoupled pipelines): We now conclude with a paragraph that first summarizes their core limitations (e.g., error propagation and decoupling from raw spectral features), and then highlights three foundational ideas that remain central today: (i) explicit conditioning on molecular formula; (ii) modular pretraining / factorized representations (e.g., separate spectral encoder and structure generator components); and (iii) leveraging large structure-only databases to compensate for scarce spectrum–structure pairs.
- Section 3.3 (sequence-to-sequence models): We restructured the conclusion to acknowledge the key limitation (forcing a linear order on non-sequential molecular graphs) while enumerating durable innovations: (i) transformer-based spectral encoding that influenced later architectures; (ii) evidence that oracle/formula conditioning improves accuracy, motivating explicit formula constraints; and (iii) exploration of robust string representations, which underscored the importance of syntactic validity and error modes later addressed by graph-native methods.
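One simple way to picture an explicit formula constraint is as a post-hoc filter on generated candidates. The sketch below is a simplification chosen for illustration, not how any particular model conditions on formula during generation. It assumes each candidate already carries a molecular formula string, and the parser handles only plain formulas without parentheses, charges, or isotope labels. The candidate names are hypothetical.

```python
import re
from collections import Counter

# One element symbol (capital letter plus optional lowercase) and an optional count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula: str) -> Counter:
    """Parse a plain molecular formula such as 'C9H11NO4' into element counts."""
    counts = Counter()
    for element, number in _TOKEN.findall(formula):
        counts[element] += int(number) if number else 1
    return counts


def filter_by_formula(candidates, target_formula: str):
    """Keep (name, formula) candidates whose element counts match the target."""
    target = parse_formula(target_formula)
    return [name for name, formula in candidates if parse_formula(formula) == target]


# Hypothetical candidate list (assumption): labels are placeholders.
candidates = [
    ("candidate_a", "C9H11NO4"),
    ("candidate_b", "C9H11NO3"),  # one oxygen short of the target
    ("candidate_c", "C9H11O4N"),  # same composition, different element order
]
kept = filter_by_formula(candidates, "C9H11NO4")
print(kept)  # ['candidate_a', 'candidate_c']
```

Comparing element counts rather than raw strings makes the filter insensitive to the order in which elements are written.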
Comments 4:
Table 1 is useful, but adding a column indicating evaluation benchmark used (GNPS vs MassSpecGym vs NPLIB1) would greatly improve interpretability.
Response 4:
We thank the Reviewer for this constructive suggestion and agree that explicitly indicating the evaluation benchmark materially improves Table 1’s interpretability and enables fairer cross-study comparison. We therefore added a dedicated column specifying the benchmark used for evaluation (e.g., GNPS, MassSpecGym, NPLIB1) for each model entry.
Additionally, to avoid redundancy and improve consistency, we removed the “number of training spectra” column, since this information is inconsistently reported across studies and does not reliably support comparative interpretation. This change helps focus Table 1 on the most decision-relevant descriptors: architecture, output type, conditioning, and evaluation benchmark.
Reviewer 2 Report
Comments and Suggestions for Authors
This review provides a comprehensive and in-depth analysis of the rapid progress in the field of de novo structure prediction from tandem mass spectrometry. The article is well-structured, logically rigorous, and detailed in content, offering an objective analysis of the strengths and weaknesses of existing methods, as well as valuable insights into future research directions. It covers a wide range of topics, from data preprocessing and molecular representation to the classification of different architectural approaches, and discusses datasets, benchmarks, and evaluation metrics in detail. In particular, it highlights the impact of data leakage issues on the performance evaluation of early models and emphasizes the importance of standardized preprocessing and leakage-aware dataset splitting. Finally, it proposes promising research avenues such as multimodal generative modeling and uncertainty calibration, providing valuable guidance for the further development of the field. I suggest acceptance of this timely review in Metabolites.
Comments:
1. I suggest that the authors provide more details on how to fuse data from different modalities into generative models when discussing multimodal methods, for instance, regarding model architectures and training strategies.
- Case Studies: It is recommended to include case studies demonstrating the effectiveness of de novo structure prediction methods in practical applications, such as specific examples in drug discovery or metabolomics.
- Consider adding visualizations, such as comparisons of molecular structures generated by different models and distributions of different evaluation metrics, to more intuitively demonstrate model performance.
- The two sentences on the spectra in Figure 1 need to be revised to read as sentences.
- Figures could be made more aesthetically appealing. For instance, in Figure 2, the ionization modes could be visually enhanced. For Figure 5, consider adding algorithm icons. Please try to unify the font styles and sizes across different figures.
In summary, this is a high-quality review article of great significance to the research field of de novo structure prediction from tandem mass spectrometry.
Author Response
Comments 1:
I suggest that the authors provide more details on how to fuse data from different modalities into generative models when discussing multimodal methods, for instance, regarding model architectures and training strategies.
Response 1:
We thank the Reviewer for this insightful suggestion. To address it, we expanded Subsection 6.4 to include a concise technical taxonomy of multimodal fusion architectures and training strategies used in current generative systems. Specifically, we now distinguish two main architectural paradigms: (i) early token-level serialization, where peaks and metadata are converted into tagged sequences (or natural-language-like prompts) processed in a single transformer context; and (ii) modular representation-level fusion, where modality-specific encoders are combined via cross-modal attention or learned fusion layers.
We also added explicit discussion of commonly used training protocols to align heterogeneous signals, including: (i) contrastive learning to map spectra and structures into a shared latent space; (ii) multi-task objectives (e.g., functional group prediction) to emphasize chemically meaningful substructures; and (iii) modality dropout to improve robustness to missing or incomplete experimental modalities.
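Two of the training protocols listed above, contrastive alignment and modality dropout, can be sketched in plain Python. This is a minimal sketch under stated assumptions: the loss is a one-directional InfoNCE objective (spectrum to structure) over a precomputed similarity matrix, the temperature value is arbitrary, and the modality names `msms`, `rt`, and `ccs` are hypothetical. None of this is taken from a specific model discussed in the manuscript.

```python
import math
import random


def info_nce(sim, temperature: float = 0.1) -> float:
    """One-directional InfoNCE loss.

    sim[i][j] is the similarity between spectrum embedding i and structure
    embedding j; matched spectrum-structure pairs lie on the diagonal.
    """
    n = len(sim)
    loss = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        loss += log_z - logits[i]  # negative log-softmax of the matched pair
    return loss / n


def modality_dropout(inputs: dict, p: float, rng: random.Random, keep: str = "msms") -> dict:
    """Replace each optional modality with None with probability p; never drop `keep`."""
    return {
        name: (value if name == keep or rng.random() >= p else None)
        for name, value in inputs.items()
    }


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]])     # matched pairs score highest
misaligned = info_nce([[0.0, 1.0], [1.0, 0.0]])  # mismatched pairs score highest
print(aligned < misaligned)  # True

sample = {"msms": [(91.05, 1.0)], "rt": 4.2, "ccs": 150.3}
print(modality_dropout(sample, p=0.5, rng=random.Random(0)))
```

Minimizing the loss pulls matched spectrum and structure embeddings together relative to the other pairs in the batch, while the dropout step exposes the model to inputs in which optional modalities are missing.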
Comments 2:
Case Studies: It is recommended to include case studies demonstrating the effectiveness of de novo structure prediction methods in practical applications, such as specific examples in drug discovery or metabolomics.
Response 2:
We thank the Reviewer for this valuable suggestion and agree that practical case studies help connect benchmark metrics to real laboratory workflows and clarify realistic use cases. In response, we added a short paragraph explicitly titled “Practical case studies” in Section 6.4, outlining two prototypical application scenarios:
(i) untargeted metabolomics, where de novo generators are integrated into GNPS-style molecular networking workflows and combined with in silico MS/MS prediction and spectral-library search to prioritize candidate structures for unknown features; and
(ii) early drug discovery / ADME and impurity profiling, where de novo models propose plausible metabolites or degradation products consistent with observed MS/MS, which are triaged using pathway-aware metabolome databases and confirmed when needed via orthogonal spectroscopy or synthesis.
These are framed as realistic integration “recipes” based on existing tools and datasets (rather than new experimental claims), consistent with the review scope.
Comments 3:
Consider adding visualizations, such as comparisons of molecular structures generated by different models and distributions of different evaluation metrics, to more intuitively demonstrate model performance.
Response 3:
We thank the Reviewer for this constructive suggestion and agree that comparative visualizations can substantially improve interpretability. A full reproduction across all baseline models and datasets (including distributional metric plots) is currently not feasible because (i) several pipelines rely on proprietary/licensed components (e.g., SIRIUS), limiting reproducibility even when weights exist; and (ii) most works do not release raw model outputs needed to reconstruct comparable metric distributions at scale.
To address the intent of the request in a practical and meaningful way, we added a representative visualization (Figure 7) illustrating how typical structural variations and chemically plausible transformations affect multiple evaluation metrics. This figure provides an intuitive, visually grounded understanding of metric behavior under realistic generative error modes and ion-chemistry-induced discrepancies, complementing the expanded evaluation critique in Section 5.6.
Comments 4:
The two sentences on the spectra in Figure 1 need to be revised to read as sentences.
Response 4:
We thank the Reviewer for pointing this out. We revised the Figure 1 captions to ensure they are complete, grammatically correct sentences. The updated captions now read:
- “Mass spectrum showing supposed molecular ions M1, M2, M3, etc. The spectrum is congested and offers limited structural information.”
- “MS/MS spectrum of M1 with proposed fragment ions F1, F2, F3, etc., which are useful for structural elucidation.”
Comments 5:
Figures could be made more aesthetically appealing. For instance, in Figure 2, the ionization modes could be visually enhanced. For Figure 5, consider adding algorithm icons. Please try to unify the font styles and sizes across different figures.
Response 5:
We thank the Reviewer for the valuable feedback on figure presentation and agree that improved visual consistency enhances readability. In response, we implemented the following updates:
- Figure 2: We enhanced the visual depiction of ionization modes by improving visual separation and emphasis to make the modes easier to distinguish at a glance.
- All figures: We reviewed and unified font styles and sizes across figures to ensure consistent typography throughout the manuscript.
Regarding the suggestion to add algorithm icons in Figure 5, we considered it carefully but opted not to include icons for two reasons: (i) many methods are hybrid and do not map cleanly to a single iconographic category (e.g., diffusion + transformer elements), and (ii) there are no widely standardized visual identifiers for most models, which risks introducing inconsistent or misleading symbolism. We instead prioritized a uniform visual language and clarity of categorization through text and layout.
