Exploring the Bottleneck in Cryo-EM Dynamic Disorder Feature and Advanced Hybrid Prediction Model
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This submission is an original research article in which the author reports new methodological work (the CLTC hybrid model) alongside a large-scale meta-analysis of cryo-EM single-particle datasets. The author compiles and categorizes every cryo-EM SPA entry in the PDB from 2000–2024 into "Legacy," "Span," and "Recent" sets, uses missing-coordinate patterns to annotate structural disorder, benchmarks existing predictors (AlphaFold2 pLDDT, flDPnn, IUPred), and then develops a multi-scale CNN–LSTM–Transformer–CRF model (CLTC) that outperforms these tools in identifying hard- and soft-missing regions. The work provides a curated database of cryo-EM disorder labels and a new hybrid architecture for improved disorder prediction in highly dynamic protein regions. I find this manuscript offers valuable insights into cryo-EM structural absences and presents a framework for disorder prediction. I recommend it for publication pending minor revisions. Please find my comments below.
Introduction and Methods
- Could the author clarify the rationale behind the 2000–2022 and 2022–2024 temporal bins/classification and specify whether particular Cryo-EM advances (e.g., the introduction of direct electron detectors) informed these boundaries?
- It would be helpful to specify the exact criteria used to distinguish “soft missing” from “hard missing” residues. Were any sequence gaps or alternative conformations considered, and how were ambiguous cases handled?
- When calibrating pLDDT, flDPnn, and IUPred scores with logistic regression on the Legacy set, what probability threshold was selected to define the disorder, and how robust is this threshold across the Span, Recent, and Recent_LH datasets?
- The CLTC hybrid model integrates CNN, BiLSTM, Transformer, and CRF layers. Could the authors summarize some experiments performed to justify the multi-stage architecture versus simpler baselines?
- In reporting precision, recall, and F₁, would the inclusion of per-class confusion matrices or ROC curves for Span, Recent, and Recent_LH improve interpretability and allow direct comparison to existing disorder predictors?
Results and Discussion
- Could the author clarify how the entity/alignment ratio was calculated and whether any proteins with unusually high replication skew the soft‐missing rate reported for the Span group?
- In Figure 1D, the resolution bins (<2.5 Å, 2.5–3.5 Å, etc.) appear arbitrary. Would reporting exact resolution thresholds tied to hardware milestones (e.g., pre- and post-direct detector adoption) improve clarity?
- The consistent ~20 % hard-missing rate across datasets is intriguing. Might the author comment on whether certain protein classes (e.g., membrane vs. soluble) deviate from this average?
- In the PFAM annotation analysis (Fig. 2A), what statistical tests were applied to assess significance between groups, and could a heatmap of p-values help readers understand which families truly shift over time?
- For the disorder score distributions in Fig. 3A, would adding ROC curves or AUC values for hard- vs. modeled-residue separation strengthen the claim that pLDDT outperforms other metrics?
- The author highlights “actionable strategies for temporal parameter optimization” in the Discussion. Could a brief example (e.g., adjusting vitrification temperature or particle count) describe how such guidelines might be implemented in practice?
Author Response
Comments 1:
Could the author clarify the rationale behind the 2000–2022 and 2022–2024 temporal bins/classification and specify whether particular Cryo-EM advances (e.g., the introduction of direct electron detectors) informed these boundaries?
Response 1:
Thank you for raising this important question regarding the temporal classification in this study. The rationale for selecting 2022 as a key boundary stems from several considerations.
First, 2022 marks the culmination of a significant period of technological maturation in cryo-EM. Between 2012 and 2020, critical advancements, including the introduction of direct electron detectors (2014), the development of software packages such as RELION (2012) and CryoSPARC (2017), and the 2017 Nobel Prize recognition, drove substantial research interest, improved data collection, and enhanced resolution and accessibility. These advances are reflected in PDB deposition trends: cryo-EM structures were relatively uncommon before 2014 but grew rapidly thereafter, continuing to expand through 2024.
Second, the 2022 boundary enables balanced comparisons between "Legacy" (pre-2022) and "Recent" (post-2022) datasets, with a nearly equivalent ratio (7:6) that minimizes bias from unequal sample sizes.
Furthermore, 2022 coincides with the emergence of AlphaFold, which introduced capabilities distinct from earlier hardware and software advancements. This transition prompted an examination of whether cryo-EM could continue solving structures beyond the reach of prediction methods, even with modern technical refinements.
Regarding specific cryo-EM advances, MANOVA analysis of PFAM domain residue ratio shifts revealed no significant global differences across the three time periods. However, it identified more cases with significant ratio shifts in the Legacy vs. Span pair comparison, followed by Span vs. Recent, and fewer in Legacy vs. Recent.
In summary, this research aims to identify persistent challenges in cryo-EM structure determination following major technological advancements and explore potential solutions. Additional context is provided in the Introduction, and further clarification is gladly offered if needed.
Thank you again for your time and feedback.
Comments 2:
It would be helpful to specify the exact criteria used to distinguish “soft missing” from “hard missing” residues. Were any sequence gaps or alternative conformations considered, and how were ambiguous cases handled?
Response 2:
Thank you for this question.
The criteria distinguishing "soft missing" from "hard missing" residues are specified in the Methods section, and a reference and figure have been attached to illustrate them more clearly.
Regarding sequence gaps and alternative conformations, the following approach was used: Entries with identical sequences were aligned to a reference sequence (first by UniProt ID, then via mmseqs alignment). Any residue position in the final aligned sequence containing gaps or alternative conformations was marked as "undefined"—a category distinct from modeled, soft-missing, or hard-missing residues—and excluded during model input processing.
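To make this concrete, the labeling logic can be sketched as follows; the input layout and symbols are hypothetical placeholders for illustration, and the actual pipeline additionally performs the UniProt/mmseqs alignment step and handles alternative conformations described above.

```python
# Minimal sketch of residue labeling, assuming each entry's aligned sequence is a
# string where an uppercase letter marks a modeled residue, '-' marks a residue
# missing from the deposited coordinates, and '.' marks a gap or alternative
# conformation. The example data and layout are hypothetical.

def label_positions(aligned_entries):
    """Label each reference position as modeled, soft, hard, or undefined."""
    labels = []
    for column in zip(*aligned_entries):          # walk the alignment column-wise
        if '.' in column:                         # gap or alternative conformation
            labels.append("undefined")            # excluded from model input
        elif all(c == '-' for c in column):       # missing in every entry
            labels.append("hard")
        elif any(c == '-' for c in column):       # missing in some entries only
            labels.append("soft")
        else:                                     # observed in every entry
            labels.append("modeled")
    return labels

# Example: three cryo-EM entries sharing one reference sequence.
entries = ["MKT-LLA.", "MKTQ-LA.", "MKTQ-LA."]
print(label_positions(entries))
# ['modeled', 'modeled', 'modeled', 'soft', 'soft', 'modeled', 'modeled', 'undefined']
```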
We hope this addresses the handling of ambiguous cases. Please let us know if further clarification is needed.
Comments 3:
When calibrating pLDDT, flDPnn, and IUPred scores with logistic regression on the Legacy set, what probability threshold was selected to define the disorder, and how robust is this threshold across the Span, Recent, and Recent_LH datasets?
Response 3:
Sorry for any confusion caused previously. Here is a detailed explanation of our logistic regression approach:
We used one-vs-rest (OvR) logistic regression for classification. In this setup, each class has a decision boundary calculated as -intercept/coefficient, and the predicted class for a given value is the one with the highest score, where the score for each class is computed as score = intercept + coefficient × feature value. We did not fix a probability threshold in advance.
Taking pLDDT as an example, the boundaries are as follows: "Modeled (T1)" has a boundary of 92.9, "Soft-missing (T2)" has a boundary of 112.2 (which makes it invalid, since pLDDT ranges from 0 to 100), and "Hard-missing (T3)" has a boundary of 64.6.
To illustrate: a residue with a pLDDT value of 62 falls below 64.6, so the score for T3 is the highest, leading to classification as T3. For a value of 75, which is between 64.6 and 92.9, T3 still has the highest score, so it is classified as T3. For a pLDDT value of 96, which is above 92.9, T1 has the highest score, resulting in classification as T1.
Since T2's boundary exceeds the pLDDT range, only two classes are functionally used. This rule was applied uniformly across all datasets (Legacy, Span, Recent, Recent_LH).
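For illustration only, the decision rule can be sketched in a few lines of Python; the intercepts and coefficients below are hypothetical placeholders chosen so that -intercept/coefficient reproduces the boundaries quoted above, not the fitted parameters from this study.

```python
# Sketch of the one-vs-rest decision rule with hypothetical parameters.
classes = {
    "T1_modeled":      {"intercept": -185.8, "coef": 2.0},   # boundary 92.9
    "T2_soft_missing": {"intercept": -112.2, "coef": 1.0},   # boundary 112.2 (outside 0-100)
    "T3_hard_missing": {"intercept":   64.6, "coef": -1.0},  # boundary 64.6
}

def predict(plddt):
    # score = intercept + coefficient * feature value; predicted class is the argmax
    scores = {name: p["intercept"] + p["coef"] * plddt for name, p in classes.items()}
    return max(scores, key=scores.get)

for value in (62, 75, 96):
    print(value, predict(value))
# 62 -> T3_hard_missing, 75 -> T3_hard_missing, 96 -> T1_modeled
```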
We tested alternative thresholds (selected both automatically and manually) but observed no performance improvement, so we retained this approach. The manuscript has been updated to clarify these details. Thank you for your valuable feedback.
Comments 4:
The CLTC hybrid model integrates CNN, BiLSTM, Transformer, and CRF layers. Could the authors summarize some experiments performed to justify the multi-stage architecture versus simpler baselines?
Response 4:
This question was also raised by another reviewer, and particular attention has been paid to presenting more data with detailed explanations. To address this, the manuscript now includes a summary of ablation tests (new Table 1) that illustrates the step-by-step construction of the model architecture, from single layers (LSTM or Transformer) to two, three, and finally four layers (CLTC). Differences between these architectural variants are further discussed to clarify the contribution of each combination of layers.
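As a purely conceptual illustration of this kind of stacking (not the published CLTC implementation), a minimal PyTorch sketch might compose the layers as follows; dimensions are arbitrary, and the CRF decoding stage (e.g., via the pytorch-crf package) is indicated only by a comment.

```python
import torch
import torch.nn as nn

class HybridTagger(nn.Module):
    """Toy CNN -> BiLSTM -> Transformer -> per-residue emission stack.

    Dimensions are arbitrary; a CRF layer (e.g., from the pytorch-crf package)
    would normally replace the final softmax to enforce label consistency.
    """
    def __init__(self, n_features=21, n_classes=3, hidden=64):
        super().__init__()
        # Local context over the residue axis via 1D convolution.
        self.cnn = nn.Conv1d(n_features, hidden, kernel_size=7, padding=3)
        # Bidirectional LSTM for medium-range sequential dependencies.
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        # Transformer encoder layer for long-range context.
        self.transformer = nn.TransformerEncoderLayer(
            d_model=2 * hidden, nhead=4, batch_first=True)
        # Per-residue class emissions (modeled / soft-missing / hard-missing).
        self.emission = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                      # x: (batch, length, n_features)
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)
        h, _ = self.lstm(h)
        h = self.transformer(h)
        return self.emission(h)                # (batch, length, n_classes) logits

# Usage: a batch of 2 sequences, 50 residues, 21 per-residue features.
logits = HybridTagger()(torch.randn(2, 50, 21))
print(logits.shape)  # torch.Size([2, 50, 3])
```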
Comments 5:
In reporting precision, recall, and F₁, would the inclusion of per-class confusion matrices or ROC curves for Span, Recent, and Recent_LH improve interpretability and allow direct comparison to existing disorder predictors?
Response 5:
Thank you for this valuable suggestion regarding interpretability. To address the question:
First, a supplementary report file has been uploaded containing complete confusion matrices and performance metrics, including TP, TN, FP, FN, Precision, Recall, F₁, and AUC, for all primary models (LR_pLDDT and others, CLTC, CLTC_pLDDT) evaluated across Span, Recent, and Recent_LH datasets.
Next, the manuscript now incorporates Table 1 and a revised Figure 3 in the main text, which summarize key performance indicators (Precision, Recall, F₁, AUC) for these datasets. This addition enables direct benchmarking against existing disorder predictors.
After these revisions, the evaluation framework provides both more details and concise comparative insights. We greatly appreciate your recommendation, which strengthens the methodological rigor of this work.
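For reference, the per-class metrics in the supplementary report relate to the confusion-matrix counts in the usual way; the sketch below uses placeholder counts, not values from the report.

```python
# Per-class metrics from hypothetical confusion-matrix counts (placeholders).
def prf(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

print(prf(tp=800, fp=200, fn=300))  # (0.8, 0.727..., 0.762...)
```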
Comments 6:
Could the author clarify how the entity/alignment ratio was calculated and whether any proteins with unusually high replication skew the soft‐missing rate reported for the Span group?
Response 6:
Thank you for this insightful query regarding methodology.
Entity/alignment ratio calculation:
The entity/alignment ratio was determined as follows: First, protein sequences were grouped by their reference alignment sequence (representing a structural clan). Next, each reference sequence was classified temporally: Legacy (all associated PDB structures pre-2022), Recent (all post-2022), or Span (structures spanning both periods). Finally, the ratio was computed by dividing the total number of entities within a group by the number of reference alignment sequences.
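A minimal sketch of this calculation, assuming a hypothetical list of (reference sequence, deposition year) pairs with one entry per entity, is given below; the exact handling of the 2022 cutoff is illustrative.

```python
# Sketch of the entity/alignment ratio, assuming hypothetical
# (reference_sequence_id, deposition_year) pairs, one per entity.
from collections import defaultdict

def entity_alignment_ratios(entities, cutoff=2022):
    by_ref = defaultdict(list)
    for ref_id, year in entities:
        by_ref[ref_id].append(year)

    groups = defaultdict(lambda: {"entities": 0, "refs": 0})
    for ref_id, years in by_ref.items():
        if max(years) < cutoff:
            group = "Legacy"            # all structures deposited before 2022
        elif min(years) >= cutoff:
            group = "Recent"            # all structures deposited in/after 2022
        else:
            group = "Span"              # structures on both sides of the cutoff
        groups[group]["entities"] += len(years)
        groups[group]["refs"] += 1

    # ratio = total entities in the group / number of reference sequences
    return {g: v["entities"] / v["refs"] for g, v in groups.items()}

demo = [("refA", 2018), ("refA", 2023), ("refB", 2016), ("refC", 2023), ("refC", 2024)]
print(entity_alignment_ratios(demo))   # {'Span': 2.0, 'Legacy': 1.0, 'Recent': 2.0}
```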
Potential replication bias in Span group:
Regarding high-replication concerns, analysis revealed three clans with elevated sequence representation: CL0192 (GPCR_A, 32 sequences), CL0186 (Beta_propeller, 34 sequences), and CL0021 (OB domains, 28 sequences). While these functionally important proteins show higher representation, no statistically significant skew was detected in the ratio shifts across all datasets, although some pairwise comparisons did reach significance. The observed distribution likely reflects sustained research interest in structurally challenging functional proteins throughout cryo-EM's development.
We acknowledge that continuous study of certain protein families may influence longitudinal analyses. Additional methodological details appear in the related Methods section of the manuscript, and further validation would be welcome if specific concerns arise. We appreciate your rigorous scrutiny of this methodological aspect.
Comments 7:
In Figure 1D, the resolution bins (<2.5 Å, 2.5–3.5 Å, etc.) appear arbitrary. Would reporting exact resolution thresholds tied to hardware milestones (e.g., pre- and post-direct detector adoption) improve clarity?
Response 7:
Thank you for this thoughtful suggestion regarding Figure 1D. We acknowledge that the <2.5 Å resolution threshold may initially appear arbitrary, but it was chosen to give readers an overview of the most recent resolution trends. While hardware milestones like the resolution revolution (2013–2014) undoubtedly transformed cryo-EM, we found comparisons across pre/post-2014 periods challenging due to fundamental data imbalances: very few cryo-EM structures exist before 2014, while post-2014 releases show exponential annual growth that continues to this day. We also note that attributing improvements to a single milestone is difficult, as resolution advances may result from multiple interconnected factors.
Our primary goal was to illustrate progressive resolution improvements in the Recent dataset (2022–2024) relative to earlier structures, and to explore current bottlenecks and how to avoid or predict them. As you rightly highlighted with the hardware milestones, we hypothesized that these advances stem from synergistic developments, including direct detector adoption, energy filters enhancing SNR, GPU-accelerated reconstruction in RELION/cryoSPARC, and refined sample preparation techniques.
We also considered tracking resolution evolution within specific protein families at an earlier stage, but found selecting representative families challenging due to structural heterogeneity across targets. While a more detailed statistical analysis over time would offer valuable insights, this approach lies beyond our current study scope and resources, and we agree it represents important future work for deeper classification analysis.
Given these considerations, we opted for broader resolution categories to maintain visual clarity in depicting temporal progress. We would be grateful for your guidance on how we might refine this approach to better serve readers.
Comments 8:
The consistent ~20 % hard-missing rate across datasets is intriguing. Might the author comment on whether certain protein classes (e.g., membrane vs. soluble) deviate from this average?
Response 8:
Thank you for this insightful question. To address potential deviations in protein classes such as membrane versus soluble proteins, we performed a more detailed MANOVA analysis. While we observed no clans exhibiting significant differences across all three groups (Legacy, Span, and Recent), we did identify numerous clans showing significant divergence in pairwise comparisons (Legacy–Span, Span–Recent, and Legacy–Recent). These findings, now included in the manuscript, enhance our understanding of the functional and domain-level shifts underlying the consistent hard-missing rate.
Comments 9:
In the PFAM annotation analysis (Fig. 2A), what statistical tests were applied to assess significance between groups, and could a heatmap of p-values help readers understand which families truly shift over time?
Response 9:
Thank you for this excellent suggestion. We agree that visualizing significance enhances interpretability. For the PFAM analysis in Fig. 2 and new supplementary xlsx, we applied MANOVA to compare shifting ratios (modeled/soft-missing/hard-missing) across the three groups (Legacy/Span/Recent). Following the recommendation, we've added a heatmap of p-values to the figure to highlight families with significant temporal shifts. Additional details about these statistical comparisons are now included in the manuscript.
Comments 10:
For the disorder score distributions in Fig. 3A, would adding ROC curves or AUC values for hard- vs. modeled-residue separation strengthen the claim that pLDDT outperforms other metrics?
Response 10:
Yes, AUC values have been incorporated into Figure 3 to quantitatively compare the separation performance between hard-missing and modeled residues. Additional metrics, including precision, recall, and a per-class confusion matrix, are provided in the supplementary report for comprehensive model evaluation. Thank you for so many insightful suggestions.
Comments 11:
The author highlights “actionable strategies for temporal parameter optimization” in the Discussion. Could a brief example (e.g., adjusting vitrification temperature or particle count) describe how such guidelines might be implemented in practice?
Response 11:
Thank you for this excellent suggestion. To illustrate actionable strategies for temporal parameter optimization, here's a practical implementation example we considered:
First, when detecting soft-missing regions (moderately disordered residues with experimental evidence of potential structure), optimization approaches could include increasing particle counts during data collection to improve the averaged signal-to-noise ratio, performing more detailed 3D classification to separate unique structures from others, or adjusting sample preparation, such as adding ligands to help preserve transient conformational states. Additionally, modifying buffer conditions to stabilize binding interfaces is also effective. These parameter adjustments align with evidence that soft-missing regions frequently represent dynamic regions amenable to stabilization through biochemical optimization.
Conversely, for hard-missing regions (persistently disordered across experimental conditions), optimization strategies shift fundamentally. Rather than parameter adjustments, solutions typically involve protein engineering approaches such as designing stabilizing mutations, or construct redesign efforts. These represent more resource-intensive interventions compared to temporal parameter tuning.
We've revised the manuscript accordingly to include this explicit implementation example and better distinguish between these optimization tiers. Thank you for helping us enhance the practical guidance in our Discussion.
Reviewer 2 Report
Comments and Suggestions for Authors
This manuscript presents an analysis of cryo-EM SPA data to characterize dynamic disorder and proposes a hybrid deep learning model (CLTC) for predicting structurally unresolvable regions. The work is timely and relevant to structural biology and computational modeling. Good work overall, but it requires further clarification and details:
- The manuscript uses the terms soft missing (residues sometimes missing) and hard missing (residues always missing). Can you provide examples or data showing that soft missing regions overlap with known disordered areas, such as those in DisProt? It would help if you could explain whether these labels were compared to other independent disorder annotations or predictors besides the logistic regression you used.
- The study mainly measures prediction performance using cryo-EM missing regions as the “ground truth.” You briefly mention a CAID3 test but only give one F1 score. Could you share more detailed results?
- The CLTC model uses CNN, BiLSTM, Transformer, and CRF layers together. Can you explain why you chose this combination instead of simpler setups like only a Transformer or only a BiLSTM?
- Did you do any tests where you removed parts of the model (ablation studies) to see how each piece contributes? It would be helpful to include a table that compares these different setups so readers can see why this design is best.
- You mention that your Recent_LH dataset has less than 20% sequence identity compared to training. What steps did you take to make sure there were no hidden similarities, like homologous domains, in the AlphaFold pLDDT inputs or training data? Is there a chance that such overlap could have made your model look better than it really is on low-homology tests?
Author Response
Comments 1:
The manuscript uses the terms soft missing (residues sometimes missing) and hard missing (residues always missing). Can you provide examples or data showing that soft missing regions overlap with known disordered areas, such as those in DisProt? It would help if you could explain whether these labels were compared to other independent disorder annotations or predictors besides the logistic regression you used.
Response 1:
Thank you for the valuable suggestion. To address your query about soft missing regions and independent disorder validation, we have enhanced the revised manuscript with a targeted comparison using DisProt data. Specifically, we analyzed cases such as BamC (DisProt ID: DP01339) to demonstrate how residues identified as "soft missing" (those occasionally resolved across cryo-EM maps) overlap with regions experimentally validated as both ordered and disordered via NMR and X-ray crystallography. This external benchmarking extends beyond our initial logistic regression model, providing independent validation against gold-standard disorder annotations.
We believe this addition strengthens the study in two key ways: First, it reinforces the biological significance of soft missing regions as indicators of intrinsic disorder. Second, it underscores our focus on addressing conformational heterogeneity as a critical bottleneck in cryo-EM single-particle analysis, while highlighting the need for hybrid approaches that integrate disorder predictors with cryo-EM data to improve model completeness in dynamic regions. This analysis also compares the similarities and differences between missing residues and protein disorder, which we hope will respond to your thoughtful feedback.
Comments 2:
The study mainly measures prediction performance using cryo-EM missing regions as the “ground truth.” You briefly mention a CAID3 test but only give one F1 score. Could you share more detailed results?
Response 2:
Thank you for raising this important point. We agree that more detailed performance comparisons are necessary for a comprehensive evaluation. In the previous manuscript, we briefly reported the F1 score for the CAID3 benchmark because our CLTC model was still under development at the time of the CAID3 submission deadline (April 2024), and we faced challenges in optimizing the LSTM/Transformer architecture at that time. Since we were unable to submit the model officially, we conducted a post-hoc evaluation using the CAID3 "Disorder" dataset (based on DisProt annotations, comprising 232 sequences with 20,917 positive residues [18%] and 96,136 negative residues [82%]) after its public release.
To ensure full transparency, we have expanded the Results section in the revised manuscript to show more details of the CAID3 benchmark. The reported performance now includes not only the F1 score but also additional key metrics such as precision and recall. Furthermore, we have uploaded another supplementary data file, which contains the CLTC model's prediction results on the CAID3 Disorder dataset along with comprehensive performance metrics. The GitHub link has also been updated to provide more information on these results.
Thank you again for this critical suggestion, which has helped strengthen the rigor of our model evaluation.
Comments 3:
The CLTC model uses CNN, BiLSTM, Transformer, and CRF layers together. Can you explain why you chose this combination instead of simpler setups like only a Transformer or only a BiLSTM?
Response 3:
This research initially evaluated simpler architectures, including standalone Transformer or BiLSTM layers, as well as various combinations of LSTM and Transformer layers. However, these configurations underperformed, even compared to baseline logistic regression models using pLDDT features. Further investigation of prior studies suggested that hybrid architectures could improve performance, motivating the integration of CNN, BiLSTM, Transformer, and CRF layers in the final CLTC model. Ablation studies confirming these comparative analyses have been included in the revised manuscript (Results 3.4, Table 1).
Thank you for raising this important question.
Comments 4:
Did you do any tests where you removed parts of the model (ablation studies) to see how each piece contributes? It would be helpful to include a table that compares these different setups so readers can see why this design is best.
Response 4:
As also noted in the response to Question 3, ablation studies were conducted to evaluate the contribution of each component in the CLTC architecture. These experiments compared the performance of the full model against variants with individual layers (e.g., BiLSTM-only, Transformer-only) and partial combinations (e.g., CNN+BiLSTM, CNN+BiLSTM+CRF). The results, summarized in Table 1 (added in the revised manuscript), demonstrate that the integrated design (combining CNN, BiLSTM, Transformer, and CRF layers) consistently outperforms simplified configurations. This systematic analysis justifies the current architecture and highlights the complementary roles of each layer.
Thank you for the suggestion to clarify this point.
Comments 5:
You mention that your Recent_LH dataset has less than 20% sequence identity compared to training. What steps did you take to make sure there were no hidden similarities, like homologous domains, in the AlphaFold pLDDT inputs or training data? Is there a chance that such overlap could have made your model look better than it really is on low-homology tests?
Response 5:
Thank you for raising this important point. To ensure minimal homology between the training data (the Legacy set, containing sequences before 2022) and the test set (Recent_LH), we applied strict filters using a 20% sequence identity threshold. The Recent_LH set was filtered to retain only sequences with <20% identity to both the Legacy set and the Span subset of Recent.
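As a rough sketch of this filtering step (an assumption about the workflow, not a verbatim excerpt from our pipeline), one could parse a BLAST-style tabular output from an all-vs-all mmseqs search of candidate sequences against the training pool and drop any candidate whose best hit reaches 20% identity:

```python
# Sketch of a <20% identity filter, assuming a BLAST-style tab-separated file
# (query, target, percent identity, ...) from an mmseqs search of candidate
# Recent sequences against the Legacy + Span pool. File name and column layout
# are assumptions for illustration.
def low_homology_ids(hits_path, candidates, max_identity=20.0):
    too_similar = set()
    with open(hits_path) as handle:
        for line in handle:
            query, target, pident = line.split("\t")[:3]
            if float(pident) >= max_identity:
                too_similar.add(query)
    return [seq_id for seq_id in candidates if seq_id not in too_similar]

# recent_lh = low_homology_ids("recent_vs_training.m8", recent_ids)
```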
As you rightly noted, even at this low homology level, domain-level homology can persist; PFAM domain detection revealed some residual homologous domains. To ensure fair evaluation, all three models (LR_pLDDT, CLTC, and CLTC_pLDDT) were tested on datasets ranging from high to low homology (Span, Recent, Recent_LH). The results showed comparable F1 scores for LR_pLDDT and CLTC (0.63), while CLTC_pLDDT achieved a modest improvement (0.68). We attribute this gain to CLTC_pLDDT's hybrid architecture, which better captures sequence features than the simpler logistic regression (LR_pLDDT). For full transparency, we have also updated more detailed validation results (including precision, recall, F1 scores, and ablation metrics) for open review.
Thank you again for your insightful comment.
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
My comments are addressed. Please proceed with publication.