Article

Exploring the Bottleneck in Cryo-EM Dynamic Disorder Feature and Advanced Hybrid Prediction Model

by
Sen Zheng
Bio-Electron Microscopy Facility, iHuman Institution, ShanghaiTech University, Shanghai 201210, China
Biophysica 2025, 5(3), 39; https://doi.org/10.3390/biophysica5030039
Submission received: 16 June 2025 / Revised: 11 July 2025 / Accepted: 14 July 2025 / Published: 29 August 2025

Abstract

Cryo-electron microscopy single-particle analysis (cryo-EM SPA) has advanced three-dimensional protein structure determination, yet resolving intrinsically disordered proteins and regions (IDPs/IDRs) remains challenging due to conformational heterogeneity. This research evaluates cryo-EM’s capacity to map dynamic regions, assesses the adaptability of disorder prediction tools, and explores optimization strategies for dynamic structure prediction. Cryo-EM SPA datasets from 2000 to 2024 were categorized into different periods, forming a database integrating sequence data and disorder indices. Established prediction tools—AlphaFold2 (pLDDT), flDPnn, and IUPred—were evaluated for transferability, while a multi-level CLTC hybrid model (combining CNN, LSTM, Transformer, and CRF architectures) was developed to link local conformational fluctuations with global sequence contexts. Analyses revealed consistent advancements in average resolution and model counts over the past decade, although mapping disordered regions remained technically demanding. Both the adapted AlphaFold pLDDT and the CLTC model demonstrated efficacy in predicting structurally variable and poorly resolved regions. A subset of the cryo-EM missing residues exhibited intermediate conformational features, suggesting classification ambiguities potentially influenced by experimental conditions. These findings systematically outline the evolving capabilities of cryo-EM in resolving dynamic regions, benchmark the adaptability of computational tools, and introduce a hybrid model to enhance prediction accuracy. This study provides a framework for addressing conformational heterogeneity, contributing to methodological advancements in structural biology.

1. Introduction

Understanding three-dimensional protein structures is fundamental to elucidating biological function. The “structure–function” paradigm has driven significant progress in structural biology, with cryo-electron microscopy (Cryo-EM) emerging as a pivotal tool for analyzing macromolecular dynamics. Cryo-EM preserves samples in near-native states through rapid vitrification, and advancements in electron detectors and computational three-dimensional reconstruction [1,2,3] have enabled high-resolution imaging [4]. Among Cryo-EM methodologies, single-particle analysis (SPA) stands out for its ability to resolve macromolecules under near-physiological conditions, thereby revealing dynamic structural changes essential for protein function [5].
Meanwhile, protein structures deposited in the Protein Data Bank (PDB) frequently exhibit regions lacking electron density. These gaps correlate with flexible, disordered segments characterized by elevated B-factors in X-ray crystallography [6,7]. Such observations highlight the intrinsic structural dynamics of proteins, which exhibit a blend of rigid, well-defined domains and flexible, disordered regions. This dynamic interplay enables proteins to adopt diverse conformational states, emphasizing the delicate equilibrium between order and disorder in protein structures under varied conditions [8,9]. Proteins and segments dominated by such disorder are termed intrinsically disordered proteins (IDPs) or regions (IDPRs). These flexible segments facilitate dynamic interactions with binding partners, enabling diversified functions [10,11]. In Cryo-EM SPA, such conformational heterogeneity introduces signal averaging artifacts, resulting in blurred density maps and missing atomic coordinates in reconstructed models [12,13]. These limitations mirror challenges observed in X-ray crystallography, highlighting a persistent gap between capturing biomolecular dynamics and generating static structural representations [14].
To address these gaps in experimental structural data, computational methods have become integral to predicting and evaluating protein disorder. Early representative algorithms, such as IUPred, predicted disordered regions by evaluating the energetics of amino acid interactions [15], with later iterations refining their performance [16,17]. Concurrently, the establishment of curated disorder databases like DisProt [18] facilitated the development of advanced prediction tools, such as flDPnn, which leverages machine learning to predict disordered regions and their functional motifs with enhanced specificity [19]. A significant paradigm shift occurred in 2020–2021 with the advent of AlphaFold2, which transformed structural prediction with its high-level accuracy [20,21]. Subsequent studies revealed that the predicted local distance difference test (pLDDT) from AlphaFold2 serves as an indicator of disordered regions, demonstrating unique capabilities in disorder prediction [22,23]. By 2021, the field had expanded significantly with over 100 disorder prediction tools developed, offering various insights into protein disorder [24].
However, the current state-of-the-art frameworks for evaluating disorder prediction algorithms were initially benchmarked using experimental annotations derived from conventional structural biology methods [25]. These foundational approaches primarily incorporated disorder labels from X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and far-UV CD spectroscopy, supplemented by curated databases like DisProt [18]. At the time, Cryo-EM-related structural data were relatively scarce. Meanwhile, recent breakthroughs in Cryo-EM SPA methodologies, such as optimized cryo-freezing protocols [26,27], enhanced reconstruction algorithms [28,29], and improved hardware capabilities [30], have dramatically increased both the quantity and structural diversity of SPA-derived data. This is also reflected in the growing number of disorder annotations derived from Cryo-EM evidence [31].
The widening gap between conventional disorder prediction benchmarks and modern experimental techniques highlights the necessity for advanced computational strategies that integrate experimental data while developing novel methods optimized for cryo-EM SPA. To address this gap, previous foundational work established key principles for predicting and evaluating dynamic disordered and unstructured regions [32]. Building upon this foundation, this study proposes the following three-phase framework specifically designed to analyze and model these challenging regions: (1) Systematic evaluation of SPA methodologies: Analyzing SPA’s capacity to manage dynamic disordered regions, which generate unstructured experimental data. (2) Adaptation of disorder prediction tools: Optimizing established tools (AlphaFold2 pLDDT, flDPnn, and IUPred) to align with SPA-derived unstructured classifications. (3) Integrative model development: Designing architectures to predict structural absences and synergize with disorder predictions.
Datasets were constructed using Cryo-EM SPA entries from the PDB (2000–2024) grouped to reflect technological advancements in resolving structural absences and distinct features. Existing disorder prediction tools were evaluated for their adaptability to experimental SPA classifications. To address limitations in dynamic and cross-field predictions, the CLTC hybrid model (CNN–LSTM–Transformer–CRF) was developed. Its multi-scale architecture captures local conformational fluctuations and global sequence-structure relationships, demonstrating consistent performance across low-homology targets. Integration of CLTC outputs with disorder features further improved prediction accuracy for unstructured regions in SPA data.
This research establishes a bidirectional feedback loop between computational predictions and experimental refinement. It provides curated SPA datasets with structural absence labels, optimizations on current tools with configurations, and the CLTC hybrid model as modular resources for diverse research applications.

2. Methods

2.1. Data Preparation from Cryo-EM Single-Particle Analysis

To evaluate incomplete structural regions, this research adopted “missing coordinates” in Cryo-EM structures as unstructured labels [18,25]. Firstly, Cryo-EM SPA structures and corresponding non-redundant protein sequences were retrieved from the PDB up to December 2024. The sequences of entities were aligned to their reference sequences using a previously reported method [32,33]. Based on deposition dates, the PDB entries were categorized into the following three chronological groups:
  • Legacy: This category includes all aligned entries deposited between 2000 and 2022, representing historical structural data (Training dataset).
  • Span: This category comprises aligned entries that span the entire period from 2000 to 2024, capturing datasets that have been continuously updated over time (Validation dataset 01).
  • Recent: This category consists of aligned entries deposited between 2022 and 2024, reflecting the latest advancements in SPA (Validation dataset 02).
For each group, the entity-to-alignment ratio was calculated by dividing the total number of entities by the number of successfully aligned sequences. By aligning entities with their unique reference sequences, missing atomic coordinates in these structures were quantified, providing SPA experimental evidence for residue-level classification. The 2022 boundary was selected to ensure balanced dataset sizes for comparative analysis and to account for the emergence of AlphaFold [20,21,34], a development that significantly influenced structural biology by integrating computational and experimental approaches.
To isolate low-homology samples within the Recent group, sequence alignment was performed using MMseqs2 [35] against the Legacy and Span datasets. Entries with ≤20% sequence identity and an e-value ≤ 0.01 [36] were retained, resulting in the Recent_LH group (Validation dataset 03). This subset enabled the final analysis of structural novelty independent of evolutionary homology for validation. This classification ensures a systematic evaluation of structural data across different levels of homology and timeframe, facilitating a comprehensive analysis of trends and developments in the field.
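The low-homology selection can be illustrated with a minimal sketch. It assumes one plausible reading of the criterion: a Recent entry is kept when none of its significant hits (e-value ≤ 0.01) against Legacy or Span exceeds 20% identity. The record layout, the function name `select_low_homology`, and this selection rule are illustrative assumptions, not the study's actual pipeline.

```python
from collections import defaultdict

# Thresholds taken from the text; everything else here is hypothetical.
MAX_IDENTITY = 0.20   # fractional sequence identity
MAX_EVALUE = 0.01

def select_low_homology(queries, hits):
    """Return queries with no significant hit above the identity cutoff.

    hits: iterable of (query_id, target_id, fractional_identity, evalue),
    a hypothetical record layout for search results.
    """
    best_identity = defaultdict(float)
    for query, _target, identity, evalue in hits:
        if evalue <= MAX_EVALUE:  # only significant alignments count as homology evidence
            best_identity[query] = max(best_identity[query], identity)
    # Queries without any significant hit default to 0.0 and are retained.
    return [q for q in queries if best_identity[q] <= MAX_IDENTITY]
```

Under this reading, an entry with no detectable homolog at all is also treated as low-homology, which may or may not match how the Recent_LH group was built.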

2.2. Structural Absence Classification, Functional Annotation, and Statistical Analysis

Residue-level absence labels were assigned based on the presence of α-carbon (Cα) coordinates across Cryo-EM structures of the reference sequence. The classification into three categories adheres to rules from previous studies [32,33], as used in similar alignment methods (see Figure 1A and Figure 2C for examples):
(1)
Modeled: Residues with Cα coordinates consistently present in all corresponding PDB entities at the same residue location.
(2)
Soft Missing: Residues with Cα coordinates present in some, but not all, corresponding PDB entities.
(3)
Hard Missing: Residues entirely lacking Cα coordinates in all corresponding PDB entities.
Non-standard amino acids, sequence mismatches, gaps, or ambiguous residues were classified as ‘undefined’ and excluded from the analysis. Proportions of each classification within temporal groups (Legacy, Span, and Recent) and related details are quantified and provided in Table S1. For functional annotation, protein domains were identified using the Pfam database via InterPro [37]. Homologous domains belonging to the same Pfam clan were clustered across Legacy, Span, and Recent groups to assess conservation patterns in structurally absent regions.
From the clustering results, 78 clans were selected based on the criterion of containing at least three distinct source proteins per temporal group (Legacy, Span, and Recent). Residues in these clans were therefore distinguished by their structural states (modeled, soft missing, or hard missing). The residue ratios were initially analyzed among all temporal groups using a MANOVA at clan level. Subsequently, pairwise comparisons of residue ratio shifts were also performed between Legacy–Span, Span–Recent, and Legacy–Recent groups for clans.
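The three-way labeling rule described in this section can be sketched as follows. This is a simplified illustration: it assumes each aligned PDB entity has already been reduced to a per-position boolean list marking Cα presence, and it treats only non-standard residue codes as undefined (sequence mismatches and alignment gaps are not modeled here). The function and variable names are hypothetical.

```python
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

def classify_residues(reference, entity_observations):
    """Label each reference position from C-alpha presence across entities.

    reference: reference sequence in one-letter codes.
    entity_observations: one list of booleans per aligned entity, True where
        a C-alpha coordinate exists at that reference position.
    """
    if not entity_observations:
        raise ValueError("at least one aligned entity is required")
    labels = []
    for pos, residue in enumerate(reference):
        if residue not in STANDARD_RESIDUES:
            labels.append("undefined")          # excluded from the analysis
            continue
        seen = [obs[pos] for obs in entity_observations]
        if all(seen):
            labels.append("modeled")            # present in every entity
        elif any(seen):
            labels.append("soft_missing")       # present in some entities only
        else:
            labels.append("hard_missing")       # absent from every entity
    return labels
```

Note that with a single entity per alignment the soft-missing class cannot occur, which is consistent with soft absence being tied to the entity/alignment ratio.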

2.3. Disorder Feature Calculation and Calibration with Structure Absence

Disorder propensity scores for each dataset were derived using the following three distinct methods:
(1)
pLDDT: Values were retrieved from downloaded AlphaFold2 predicted structures from the AlphaFoldDB (AlphaFold version 2.0, https://alphafold.ebi.ac.uk/).
(2)
flDPnn: Values were generated using the Docker implementation of flDPnn [19] with default settings (flDPnn docker version December 2021, https://gitlab.com/sina.ghadermarzi/fldpnn_docker, accessed on 13 July 2025).
(3)
IUPred: Values were computed using the latest version of IUPred3 [17] in either “Short” (IUPredS) or “Long” (IUPredL) modes with default settings (version 3.0, https://iupred3.elte.hu/download_new, accessed on 13 July 2025).
To calibrate disorder scores against structural absence labels and enable integration with existing tools, we implemented a one-vs-rest (OvR) logistic regression (LR) framework trained on the Legacy dataset. This approach converts raw scores (pLDDT, flDPnn, and IUPredS/L) into class probabilities by training one classifier for each structural category (modeled, soft-missing, and hard-missing). The predicted class is the one with the highest decision function score (intercept + coefficient × feature value), ensuring consistent classification thresholds across datasets. These models produced adapted scores (LR_pLDDT, LR_flDPnn, LR_IUPredS, and LR_IUPredL), which were subsequently validated using the Span, Recent, and Recent_LH groups (Tables S4 and S5).
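The decision rule above can be made concrete with a small sketch using a single feature (here a pLDDT value). The intercepts and coefficients below are invented placeholders chosen only to make the example readable; they are not the fitted values from this study, which would come from the trained LR models.

```python
import math

# Hypothetical (intercept, coefficient) pairs, one binary classifier per class.
# Real values would be read from models fitted on the Legacy dataset.
OVR_PARAMS = {
    "modeled": (-6.0, 0.08),
    "soft_missing": (-1.0, 0.0),
    "hard_missing": (3.0, -0.07),
}

def decision_scores(feature):
    """Decision function per class: intercept + coefficient * feature value."""
    return {label: b + w * feature for label, (b, w) in OVR_PARAMS.items()}

def class_probability(score):
    """Logistic (sigmoid) transform of one classifier's decision score."""
    return 1.0 / (1.0 + math.exp(-score))

def predict_state(feature):
    """One-vs-rest prediction: the class with the highest decision score."""
    scores = decision_scores(feature)
    return max(scores, key=scores.get)
```

With these placeholder parameters, a high pLDDT value is assigned to the modeled class and a low value to hard missing, while soft missing wins only in a narrow intermediate band, which mirrors the difficulty of separating that class with a single score.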

2.4. Hybrid Model Architecture and Optimization

This research proposes a hybrid architecture that integrates convolutional, recurrent, and self-attention mechanisms to predict protein structural states. The following two model variants were developed: the base CLTC model, which processes amino acid sequences, and the enhanced CLTC_pLDDT model, which incorporates AlphaFold-derived pLDDT scores. The neural framework comprises three computational stages (Figure 3C).
In the first stage, amino acid sequences are encoded into 64-dimensional vectors using trainable character-level embeddings. Parallel convolutional branches then extract multi-scale features: three dilated convolutions (kernel size = 3, dilation rates = 1/2/3) capture hierarchical patterns while two standard convolutions (kernel sizes = 3 and 5) model local residue interactions. These operations produce concatenated 256-dimensional features (64 × 3 from dilated convolutions and 32 × 2 from standard convolutions), preserving multi-scale information.
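As a quick arithmetic check of the first stage, the sketch below computes how many sequence positions each convolutional branch spans and the width of the concatenated feature vector. The channel counts (64 per dilated branch, 32 per standard branch) are those given in the text; the span formula (kernel size − 1) × dilation + 1 is the standard one for a single dilated 1D convolution, and the variable names are illustrative.

```python
def receptive_field(kernel_size, dilation):
    """Sequence positions spanned by one (dilated) 1D convolution."""
    return (kernel_size - 1) * dilation + 1

# (kernel size, dilation rate, output channels) for each parallel branch
dilated_branches = [(3, d, 64) for d in (1, 2, 3)]
standard_branches = [(3, 1, 32), (5, 1, 32)]
branches = dilated_branches + standard_branches

spans = [receptive_field(k, d) for k, d, _ in branches]
total_channels = sum(channels for _, _, channels in branches)

print(spans)           # [3, 5, 7, 3, 5]
print(total_channels)  # 256
```

The dilated branches therefore see windows of 3, 5, and 7 residues with the same three-weight kernel, while the concatenation yields the 256-dimensional features passed to the next stage.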
In the second stage, bidirectional LSTM (BiLSTM) with 64 hidden units per direction processes the features to generate sequence-aware representations through bidirectional context integration. The subsequent Transformer encoder (4 attention heads) models long-range dependencies. Linear transformations ensure consistent input and output dimensions across all Transformer layers.
In the final stage, the Transformer encoder outputs are passed through a projection layer to map the representations to the dimensionality of the target tags. A conditional random field (CRF) layer is then applied to model the dependencies between sequential labels, ensuring globally optimal predictions. The CRF layer leverages the emission scores from the projection layer and computes the most likely sequence of labels by considering transitions between adjacent states.
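The decoding step of a linear-chain CRF is commonly implemented with the Viterbi algorithm. The following is a generic, dependency-free sketch of that algorithm, not the study's code: emission scores stand in for the projection-layer outputs, and the transition matrix is a made-up example.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label path for one sequence.

    emissions: list over positions, each a list of per-label scores.
    transitions: transitions[i][j] is the score of moving from label i to j.
    """
    n_labels = len(emissions[0])
    score = list(emissions[0])          # best path score ending in each label
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_labels):
            best_prev = max(range(n_labels),
                            key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][j] + emit[j])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final label.
    path = [max(range(n_labels), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# Labels: 0 = modeled, 1 = soft missing, 2 = hard missing (illustrative values).
emissions = [[2.0, 0.0, 0.0], [0.9, 0.0, 1.0], [2.0, 0.0, 0.0]]
stay_bonus = [[0.5 if i == j else -0.5 for j in range(3)] for i in range(3)]
print(viterbi_decode(emissions, stay_bonus))  # [0, 0, 0]
```

Taking the per-position maximum of the emissions alone would label the middle residue as hard missing; the transition scores make the isolated switch more costly than the small emission gain, so the jointly optimal path stays in the modeled state.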
Model optimization minimizes the CRF negative log-likelihood using the Adam optimizer (learning rate = 0.001) with gradient clipping (max_norm = 1.0). Through systematic grid search evaluation, the optimized architecture was determined to include a BiLSTM with 2 hidden layers (64 units per layer) and Transformer blocks with 4 encoding layers (128-dimensional representations). Regularization combines layer-wise dropout (rate = 0.2) applied to BiLSTM and Transformer components with a gradient norm constraint. The model was trained for 20 epochs using batches of size 32, with automatic best-parameter checkpointing based on the training loss trajectory.

2.5. Verification of Model Predictions

To evaluate the trained models in structural-state classification, we assessed the predictions obtained with each feature input mode on the Span, Recent, and Recent_LH datasets using precision, recall, and F1-score [25] (Table S5). For each structural category, the counts are defined as follows:
(1)
True positives (TPs): A residue is correctly predicted as belonging to its respective category.
(2)
True negatives (TNs): A residue is correctly predicted as not belonging to a specific category.
(3)
False positives (FPs): A residue is incorrectly predicted as belonging to a specific category when it does not.
(4)
False negatives (FNs): A residue is wrongly predicted as not belonging to a specific category when it actually does.
The formulas for precision, recall, and the F1-score are as follows:
Precision = TP/(TP + FP); Recall = TP/(TP + FN); F1-score = 2 × Precision × Recall/(Precision + Recall)
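A minimal sketch of these per-category metrics is given below. The label strings and the function name are illustrative; the zero-division guards are an added convention for categories that are never predicted or never occur, which the formulas above leave undefined.

```python
def per_class_metrics(true_labels, predicted_labels, target):
    """Precision, recall, and F1-score for one category, treated one-vs-rest."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

true = ["hard", "hard", "hard", "modeled", "modeled", "soft"]
pred = ["hard", "hard", "modeled", "hard", "modeled", "soft"]
precision, recall, f1 = per_class_metrics(true, pred, "hard")
# TP = 2, FP = 1, FN = 1, so precision = recall = F1 = 2/3
```

True negatives do not enter any of the three formulas, which is why these metrics remain informative when one category (here, modeled residues) dominates the dataset.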

2.6. Software

The main scripts used in this study, including data preparation and feature extraction, were written in Python (3.10). Logistic regression and hybrid models were built and trained using the TensorFlow (2.16.1), Keras (3.3.3), scikit-learn (1.5.0), PyTorch (2.1.0), and NumPy (1.26.4) packages. Visualization of the results was accomplished with Matplotlib (3.8.4), seaborn (0.12.2), and basic Office software.

3. Results

3.1. Temporal Analyses of Dynamic Maps in Cryo-EM Single-Particle Analysis

This study systematically evaluates how advancements in SPA have refined the study of protein dynamics and disordered regions. A structural database was developed, encompassing entries from 2000 to 2024 sourced from the PDB. The database contained 22,107 structural entities, which were consolidated into 14,652 alignments (Figure 1A, Table S1). These alignments were categorized into the following three groups: Legacy (entries before 2022), Span (entries crossing multiple years), and Recent (entries after 2022), to track progress in the field.
The annual increase in SPA structures correlates with technological breakthroughs, such as the introduction of direct electron detectors around 2013 and the advent of advanced computational methods [1,2,3]. Notably, the period from 2022 to 2024 produced more new structures than the previous years combined, aligning with the increasing number of DisProt disorder annotations derived from Cryo-EM (Figure 1B,C). The average resolution improved from 4.5 Å (Legacy) to 3.3 Å (Recent), and the proportion of high-precision structures (<2.5 Å) increased from 2.5% to 6.1%, reflecting the combined effect of these technical advances (Figure 1D) [5,38].
Following this, an analysis of “structural absence”, meaning residues lacking resolved coordinates, was conducted, with absence taken as a measure of disorder. By comparing multiple structure determinations of identical protein alignments [18,33,39], the following two types of absence were classified: soft missing, referring to residues inconsistently absent across PDB entries, and hard missing, indicating residues absent in all entries (Figure 1A,E). With an increase in experimental replications (represented by the entity/alignment ratio), the rate of soft absence rose to 15% in Span (entity/alignment, 4.20), as compared to 3% in Legacy (1.34) and 5% in Recent (1.31). This suggests that repeated determinations under diverse experimental settings expose more of the conformational variability underlying such absences.
Interestingly, hard missing remained consistently above 20% across all groups (Figure 1E), unaffected by improvements in resolution, dataset growth, or experimental replications. This implies that hard missing may reflect inherent protein disorder that resists structural determination even at the current stage. These findings underscore the need for complementary computational prediction methods to explore these dynamic regions.

3.2. Characteristics of Structural Absences in Single-Particle Analysis

To investigate the impact of structural features on functional interpretation in SPA, particularly in dynamic regions, the datasets were cross-referenced with the PFAM database to reveal the correlation between structural integrity and functional annotation rate (Figure 1E). Structured (modeled) residues retained PFAM annotation rates of 67–72%, while regions with soft absence exhibited reduced annotation rates (57–63%). For regions with hard missing, annotation rates dropped to 27–30%, suggesting that functional knowledge aligns with structural stability.
Through functional categorization, 78 unique classes were identified, each exhibiting distinct patterns of structural absence across three period groups (Figure 2A, Supplementary File S1). Residues from structurally stable domains, such as N-terminal hydrolases (NTN hydrolases; PFAM CL0052), showed negligible absence rates throughout all periods. In contrast, residues within the ankyrin repeat clan (Ank; PFAM CL0465), characterized by repeats of two beta strands and two alpha helices forming long arrays, displayed increased structural absence in the Recent group. Absence rose from negligible levels (including both soft- and hard-missing residues) in Legacy structures to pronounced levels in the Span and Recent datasets, suggesting progressive challenges in structural resolution for this clan. Separately, functional categories like the Transporter clan (PFAM CL0375) maintained stable absence rates across Legacy, Span, and Recent groups (summed soft/hard missing: 20%/29%/26%), indicating persistent structural resolvability challenges over time in Cryo-EM SPA.
To statistically validate shifts in residue ratios, MANOVA was applied to the 78 classes (Figure 2B, Global). No significant differences emerged across all three period groups. However, pairwise comparisons revealed significant differences, most notably between the Legacy and Span datasets (Figure 2 LvS), with fewer significant cases between the Span and Recent groups (Figure 2 SvR). Only one group (OB, CL0021) exhibited significance in Legacy vs. Recent comparisons (Figure 2 LvR). This anomaly may reflect functional properties, such as oligonucleotide/oligosaccharide binding, that drive heightened variability in ratio shifts across modeled, soft-missing, and hard-missing residues. Such variability suggests a functional–structural linkage. The absence of global significance may correspond to distinct significant case patterns between LvS and SvR comparisons, indicating gradual technological evolution and study-object-specific differences across periods.
Comparative structural analyses revealed pervasive structure-dependent plasticity across experimental conditions, with ordered and disordered states exhibiting frequent transitions, as demonstrated by BamC (UniProt ID: P0A903; Figure 2C) [40]. In this protein, structured–unstructured transitions formed a continuum from the N-terminus to the midpoint, while the C-terminus displayed heightened flexibility. This observation implies context-dependent ordering in lipoprotein-binding regions identified through Cryo-EM SPA [41,42] and Pfam detection (Figure 2C). Notably, DisProt annotations (version 2024_12) highlight methodological discrepancies: residues 1–100 are designated as disordered (based on NMR), while residues 26–98 are classified as ordered (based on X-ray crystallography). Such cross-technique variance in structural evidence partly supports and partly contradicts an exclusive reliance on patterns of structural absence from Cryo-EM SPA, as adopted in this research. Consequently, the BamC case illustrates how technique-specific parameters, particularly resolution constraints in SPA, affect structural determinability assessments.
Structural absence also displayed a positional bias, with termini showing higher hard absence rates than internal regions (Figure 2D). This is likely due to a combination of purification-linked tags and inherent terminal flexibility. The pattern was not observed in regions of soft absence, suggesting distinct biophysical origins for the two types.
The amino acid composition analysis further distinguished between the types of absence (Table S2). Hard-missing residues showed a higher disorder propensity than structured ones, with a higher proportion of polar residues and a lower proportion of hydrophobic residues [43,44]. Soft-missing residues displayed intermediate compositions between order and disorder, aligning with their conditional resolvability.
These comprehensive analyses of Cryo-EM structures highlight how technological advancements enable the mapping of some functionally critical flexible regions. The disorder features, positional patterns, and sequence signatures of structural absences provide predictive benchmarks for interpreting SPA unstructured results. These findings motivated the implementation of two strategies: leveraging current state-of-the-art disorder prediction tools and developing a novel algorithm using advanced machine learning methods. Both approaches were subsequently applied to address the challenges identified.

3.3. Variable Performance of Disorder Prediction Tools in Characterizing Cryo-EM Structural Absences

Despite significant advancements in SPA, persistent challenges remain in interpreting structures with dynamic disorder. This underscores the need to benchmark disorder prediction tools against expanding SPA datasets. Three widely used disorder prediction metrics, including AlphaFold’s pLDDT, flDPnn, and IUPred, were systematically evaluated for their ability to discriminate between structured regions and experimental absences in SPA datasets.
Cross-temporal analysis unveiled distinct disorder score distributions across absence types (Figure 3A, Table S3). For example, hard-missing residues in the Recent dataset displayed significantly higher predicted disorder than modeled residues (pLDDT: 58.4 ± 23.9 vs. 88.6 ± 12.2; flDPnn: 0.09 ± 0.13 vs. 0.23 ± 0.23; IUPredS: 0.42 ± 0.25 vs. 0.19 ± 0.17; p < 0.001). However, soft-missing residues showcased intermediate scores overlapping with both categories (pLDDT: 81.5 ± 15.5; flDPnn: 0.10 ± 0.14; IUPredS: 0.26 ± 0.18), suggesting limited distinguishing capability for this transient disordered state by disorder prediction.
To facilitate structural-state classification, logistic regression (LR) models were trained on Legacy data and validated on hard-missing residues from other datasets. The LR-pLDDT model achieved optimal discrimination, with F1-scores of 0.71 (Span dataset) and 0.65 (Recent dataset) (Figure 3B). This performance advantage may stem from pLDDT’s incorporation of evolutionary constraints via multiple sequence alignments (MSAs) and its reliance on accurate structural coordinates through deep learning [20,21,34].
Interestingly, the physics-based IUPred (“short” and “long” methods) demonstrated greater cross-dataset robustness than flDPnn, achieving F1-scores of 0.59 (Span) and 0.51–0.52 (Recent), compared to flDPnn’s 0.55 and 0.49. This suggests that physics-based methods adapt better to detecting experimental dynamic absences, even when repurposed from their original designs of identifying disordered segments [45].
In the low-homology validation using the Recent_LH dataset (sequence identity ≤ 20%), consistent performance trends were observed. LR-pLDDT achieved an optimal F1-score of 0.63, although this was lower than its performance on the Recent dataset. Meanwhile, IUPred showed improvement, with F1-scores ranging from 0.55 to 0.56, consistently surpassing flDPnn (Figure 3B). These findings highlight fundamental differences in algorithmic design. Methods dependent on homologue searching, such as AlphaFold2, exhibit reduced confidence when template availability is limited [46,47,48]. In contrast, physics-based approaches demonstrate greater resilience and lower computational requirements, albeit with modest peak performance [45].

3.4. Predictive Performance of the Hybrid Model and Advancement

The hybrid CLTC framework was developed to address current challenges in identifying missing residues in SPA structures using adapted disorder prediction models. This innovative model integrates sequence modeling architectures from previous studies [49,50,51,52] through a multi-scale feature fusion mechanism that combines local and global sequence patterns across distinct layers (Figure 3C).
After being trained on Legacy benchmark datasets, the CLTC achieved F1-scores of 0.71 (Span) and 0.62 (Recent) for hard-missing classification validation (Figure 3B). These results matched the performance of the computationally intensive LR_pLDDT benchmark while utilizing fewer parameters compared to AlphaFold2. The model also demonstrated robust stability under low sequence homology conditions (Recent_LH), maintaining an F1-score of 0.63 with precision and recall rates of 0.63 and 0.60, respectively (Table S5).
Integration of the CLTC with pLDDT to form CLTC_pLDDT led to performance improvements across all validation datasets (Figure 3B). This hybrid approach achieved F1-scores 4–8% higher than both the baseline CLTC and LR_pLDDT. Prior studies [53,54] and preliminary testing confirmed each component’s essential role. Replacing the hybrid architecture with single LSTM or Transformer layers reduced performance substantially (Table 1). These results demonstrate a critical synergy among components, with CNN-LSTM for local dynamics, Transformer for global attention, and CRF for positional constraints, which is consistent with ablation outcomes showing performance variations across layer combinations.
Notably, CLTC demonstrated cross-domain generalization capability in predicting protein disorder. In an in-house test using the CAID3 DisProt Disorder benchmark dataset (232 sequences with 18% disordered residues), CLTC achieved an F1-score of 0.51 without fine-tuning. This performance was 4–8% lower than specialized disorder predictors trained explicitly on the CAID3 dataset (pLDDT: 0.55; flDPnn/IUPred3: 0.53), suggesting a significant yet subtle overlap between experimental missing density and disorder annotations [14,55]. These results also underscore CLTC’s potential for both specialized and general applications, while pointing to opportunities for further advancements in this field.
The performance disparity observed among the Span, Recent, and Recent_LH groups highlights the significant reliance of existing models on evolutionary homology signals. This reliance underscores the importance of precisely understanding folding rules and the influence of experimental conditions on local folding environments. Numerous discussions in the literature further underscore this dependency, emphasizing the need to address the limitations of prediction tools across various fields even beyond the homology problem, including highly advanced deep-learning methodology like AlphaFold2 [22,46,47,48,56,57]. This disparity reveals a critical challenge in evaluating novel biomolecules with understudied transient dynamics. To address this challenge, frameworks that integrate sequence-derived evolutionary information with physics-based folding principles are necessary to accommodate diverse structural states across experimental conditions. Such integration is particularly pivotal as Cryo-EM continues to rapidly expand the diversity of resolved structures, demanding methodologies capable of resolving both conserved and emergent conformational landscapes.

4. Discussion

This study establishes a temporally curated database to systematically evaluate historical and recent outcomes in SPA. Despite technological advances in sample preparation, instrumentation, and computational algorithms, the persistent challenge of resolving biomolecular order–disorder dynamics remains a fundamental limitation in the field.
First, transient conformational states, referred to as soft-missing residues, exhibit condition-dependent variability, with increased prevalence in transitional datasets reflecting spatiotemporal heterogeneity. In contrast, persistent structural voids, termed hard-missing residues, maintain stable proportions across datasets, underscoring the inherent difficulty in characterizing disorder during SPA reconstruction. This distinction stems from the inherent tension between biomolecular flexibility, which is often conditional, and SPA’s static averaging process.
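The soft/hard distinction can be made concrete with a small labeling sketch. The rule used here is an assumption made for illustration: a position counts as soft-missing when it is unresolved in some but not all aligned structures of the same protein, and as hard-missing when it is unresolved in all of them. The input encoding and the function name are likewise hypothetical and do not reproduce the study's pipeline.

```python
def classify_residues(observations):
    """Label each aligned position as 'modeled', 'soft', or 'hard'.

    observations: equal-length strings, one per structure, where 'M' marks a
    modeled residue and '-' marks a residue absent from the deposited model.
    Assumed rule: missing in every structure -> 'hard'; missing in only some
    structures -> 'soft'; never missing -> 'modeled'.
    """
    if not observations:
        return []
    length = len(observations[0])
    if any(len(obs) != length for obs in observations):
        raise ValueError("all observation strings must have the same length")
    labels = []
    for column in zip(*observations):
        missing = column.count("-")
        if missing == 0:
            labels.append("modeled")
        elif missing == len(column):
            labels.append("hard")
        else:
            labels.append("soft")
    return labels

# Three hypothetical structures of the same five-residue segment.
labels = classify_residues(["MMM--", "MM---", "MMMM-"])
print(labels)
```

Here the third and fourth positions are resolved in at least one structure and are labeled soft, while the last position is absent everywhere and is labeled hard.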
Second, analytical evaluations demonstrate that structural absences can be predicted using transfer learning strategies based on disorder prediction models and hybrid frameworks. Both AlphaFold2’s pLDDT confidence metric and the CLTC hybrid model achieve the best accuracy among the approaches evaluated, with the latter also demonstrating robust cross-homology performance through the integration of multi-scale local–global features. Additionally, the hybrid model exhibits potential for two distinct applications: precise structural absence detection and context-aware disorder prediction.
Importantly, this research bridges theoretical and practical paradigms by offering guidelines to correlate structural absence patterns with dynamic features. It advocates for hybrid architectures that combine transfer learning efficiencies with adaptive network designs, aiming to steer next-generation tools towards resolving dynamic ensembles.
In summary, this integrative study deciphers the biological relevance of SPA structural absences through temporal analytics, predictive modeling, and functional interpretation. By linking methodological advancements with dynamic ensemble analysis, we advance Cryo-EM beyond static structure determination. Our framework provides actionable strategies for temporal parameter optimization. For instance, when handling soft-missing regions (moderately disordered residues), one could improve reconstruction detail by adjusting sample conditions such as incorporating ligands to stabilize transient states, guided by clues from structural homologs or related references. Conversely, for hard-missing regions (persistently disordered segments), protein engineering or construct redesign may prove necessary. These context-specific approaches, combined with adaptive computational workflows, offer practical pathways for detecting and resolving structural ambiguity during initial investigations.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/biophysica5030039/s1, Supplementary File S1 contains: Table S1: Summary of classifications across temporal groups; Table S2: Compositional summary of datasets; Table S3: Disorder score distributions across datasets; Table S4: Details of logistic regression for disorder predicted scores; Table S5: Model performance summary on dataset; Table S6: Model performance summary on classification of modeled residue. Supplementary File S2 is PFAM_summary.xlsx for functional domain and MANOVA analysis results. Supplementary File S3 is Full Model Reports.txt.

Funding

This research received no external funding.

Data Availability Statement

Supplementary data, the models built in this study, and the full datasets are available at https://github.com/thsformygod/Explore-SPA2024 (accessed on 13 July 2025).

Acknowledgments

I would like to express my sincere gratitude to Zhanyu Ding from Shanghai Yuexin Life Science Information Technology Co., Ltd. and Qiang Huang from Fudan University for their generous support in providing resources and motivation for the data analysis in this study. I also extend my appreciation to my colleagues in the Bio-Electron Microscopy Facility team for their valuable suggestions and generous support under the leadership of team leader Qianqian Sun.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Kuhlbrandt, W. The resolution revolution. Science 2014, 343, 1443–1444.
  2. Hagen, W.J.H.; Wan, W.; Briggs, J.A.G. Implementation of a cryo-electron tomography tilt-scheme optimized for high resolution subtomogram averaging. J. Struct. Biol. 2017, 197, 191–198.
  3. Scheres, S.H. RELION: Implementation of a Bayesian approach to cryo-EM structure determination. J. Struct. Biol. 2012, 180, 519–530.
  4. Cressey, D.; Callaway, E. Cryo-electron microscopy wins chemistry Nobel. Nature 2017, 550, 167.
  5. Nakane, T.; Kotecha, A.; Sente, A.; McMullan, G.; Masiulis, S.; Brown, P.; Grigoras, I.T.; Malinauskaite, L.; Malinauskas, T.; Miehling, J.; et al. Single-particle cryo-EM at atomic resolution. Nature 2020, 587, 152–156.
  6. Radivojac, P.; Obradovic, Z.; Smith, D.K.; Zhu, G.; Vucetic, S.; Brown, C.J.; Lawson, J.D.; Dunker, A.K. Protein flexibility and intrinsic disorder. Protein Sci. 2004, 13, 71–80.
  7. Le Gall, T.; Romero, P.R.; Cortese, M.S.; Uversky, V.N.; Dunker, A.K. Intrinsic disorder in the Protein Data Bank. J. Biomol. Struct. Dyn. 2007, 24, 325–342.
  8. Uversky, V.N. A decade and a half of protein intrinsic disorder: Biology still waits for physics. Protein Sci. 2013, 22, 693–724.
  9. Uversky, V.N. Unusual biophysics of intrinsically disordered proteins. Biochim. Biophys. Acta 2013, 1834, 932–951.
  10. Receveur-Brechot, V.; Bourhis, J.M.; Uversky, V.N.; Canard, B.; Longhi, S. Assessing protein disorder and induced folding. Proteins 2006, 62, 24–45.
  11. Gsponer, J.; Futschik, M.E.; Teichmann, S.A.; Babu, M.M. Tight regulation of unstructured proteins: From transcript synthesis to protein degradation. Science 2008, 322, 1365–1368.
  12. Nwanochie, E.; Uversky, V.N. Structure Determination by Single-Particle Cryo-Electron Microscopy: Only the Sky (and Intrinsic Disorder) is the Limit. Int. J. Mol. Sci. 2019, 20, 4186.
  13. Dodd, T.; Yan, C.; Ivanov, I. Simulation-Based Methods for Model Building and Refinement in Cryoelectron Microscopy. J. Chem. Inf. Model. 2020, 60, 2470–2483.
  14. Uversky, V.N. Intrinsically Disordered Proteins and Their “Mysterious” (Meta)Physics. Front. Phys. 2019, 7, 10.
  15. Dosztanyi, Z.; Csizmok, V.; Tompa, P.; Simon, I. IUPred: Web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics 2005, 21, 3433–3434.
  16. Meszaros, B.; Erdos, G.; Dosztanyi, Z. IUPred2A: Context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic Acids Res. 2018, 46, W329–W337.
  17. Erdos, G.; Pajkos, M.; Dosztanyi, Z. IUPred3: Prediction of protein disorder enhanced with unambiguous experimental annotation and visualization of evolutionary conservation. Nucleic Acids Res. 2021, 49, W297–W303.
  18. Hatos, A.; Hajdu-Soltesz, B.; Monzon, A.M.; Palopoli, N.; Alvarez, L.; Aykac-Fas, B.; Bassot, C.; Benitez, G.I.; Bevilacqua, M.; Chasapi, A.; et al. DisProt: Intrinsic protein disorder annotation in 2020. Nucleic Acids Res. 2020, 48, D269–D276.
  19. Hu, G.; Katuwawala, A.; Wang, K.; Wu, Z.; Ghadermarzi, S.; Gao, J.; Kurgan, L. flDPnn: Accurate intrinsic disorder prediction with putative propensities of disorder functions. Nat. Commun. 2021, 12, 4438.
  20. Senior, A.W.; Evans, R.; Jumper, J.; Kirkpatrick, J.; Sifre, L.; Green, T.; Qin, C.; Zidek, A.; Nelson, A.W.R.; Bridgland, A.; et al. Improved protein structure prediction using potentials from deep learning. Nature 2020, 577, 706–710.
  21. Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Zidek, A.; Potapenko, A.; et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596, 583–589.
  22. Saldano, T.; Escobedo, N.; Marchetti, J.; Zea, D.J.; Mac Donagh, J.; Velez Rueda, A.J.; Gonik, E.; Garcia Melani, A.; Novomisky Nechcoff, J.; Salas, M.N.; et al. Impact of protein conformational diversity on AlphaFold predictions. Bioinformatics 2022, 38, 2742–2748.
  23. Wilson, C.J.; Choy, W.Y.; Karttunen, M. AlphaFold2: A Role for Disordered Protein/Region Prediction? Int. J. Mol. Sci. 2022, 23, 4591.
  24. Zhao, B.; Kurgan, L. Surveying over 100 predictors of intrinsic disorder in proteins. Expert. Rev. Proteom. 2021, 18, 1019–1029.
  25. Necci, M.; Piovesan, D.; CAID Predictors; DisProt Curators; Tosatto, S.C.E. Critical assessment of protein intrinsic disorder prediction. Nat. Methods 2021, 18, 472–481.
  26. Fan, H.; Sun, F. Developing Graphene Grids for Cryoelectron Microscopy. Front. Mol. Biosci. 2022, 9, 937253.
  27. Liu, N.; Wang, H.W. Better Cryo-EM Specimen Preparation: How to Deal with the Air-Water Interface? J. Mol. Biol. 2023, 435, 167926.
  28. He, J.; Lin, P.; Chen, J.; Cao, H.; Huang, S.Y. Model building of protein complexes from intermediate-resolution cryo-EM maps with deep learning-guided automatic assembly. Nat. Commun. 2022, 13, 4066.
  29. Jamali, K.; Kall, L.; Zhang, R.; Brown, A.; Kimanius, D.; Scheres, S.H.W. Automated model building and protein identification in cryo-EM maps. Nature 2024, 628, 450–457.
  30. Yang, Y.; Arseni, D.; Zhang, W.; Huang, M.; Lovestam, S.; Schweighauser, M.; Kotecha, A.; Murzin, A.G.; Peak-Chew, S.Y.; Macdonald, J.; et al. Cryo-EM structures of amyloid-beta 42 filaments from human brains. Science 2022, 375, 167–172.
  31. Aspromonte, M.C.; Nugnes, M.V.; Quaglia, F.; Bouharoua, A.; DisProt Consortium; Tosatto, S.C.E.; Piovesan, D. DisProt in 2024: Improving function annotation of intrinsically disordered proteins. Nucleic Acids Res. 2024, 52, D434–D441.
  32. Zheng, S. Navigating the unstructured by evaluating alphafold’s efficacy in predicting missing residues and structural disorder in proteins. PLoS ONE 2025, 20, e0313812.
  33. Seoane, B.; Carbone, A. Soft disorder modulates the assembly path of protein complexes. PLoS Comput. Biol. 2022, 18, e1010713.
  34. Tunyasuvunakool, K.; Adler, J.; Wu, Z.; Green, T.; Zielinski, M.; Zidek, A.; Bridgland, A.; Cowie, A.; Meyer, C.; Laydon, A.; et al. Highly accurate protein structure prediction for the human proteome. Nature 2021, 596, 590–596.
  35. Steinegger, M.; Soding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 2017, 35, 1026–1028.
  36. Barrio-Hernandez, I.; Yeo, J.; Janes, J.; Mirdita, M.; Gilchrist, C.L.M.; Wein, T.; Varadi, M.; Velankar, S.; Beltrao, P.; Steinegger, M. Clustering predicted structures at the scale of the known protein universe. Nature 2023, 622, 637–645.
  37. Mistry, J.; Chuguransky, S.; Williams, L.; Qureshi, M.; Salazar, G.A.; Sonnhammer, E.L.L.; Tosatto, S.C.E.; Paladin, L.; Raj, S.; Richardson, L.J.; et al. Pfam: The protein families database in 2021. Nucleic Acids Res. 2021, 49, D412–D419.
  38. Yip, K.M.; Fischer, N.; Paknia, E.; Chari, A.; Stark, H. Atomic-resolution protein structure determination by cryo-EM. Nature 2020, 587, 157–161.
  39. DeForte, S.; Uversky, V.N. Resolving the ambiguity: Making sense of intrinsic disorder when PDB structures disagree. Protein Sci. 2016, 25, 676–688.
  40. Hagan, C.L.; Kim, S.; Kahne, D. Reconstitution of outer membrane protein assembly from purified components. Science 2010, 328, 890–892.
  41. Iadanza, M.G.; Higgins, A.J.; Schiffrin, B.; Calabrese, A.N.; Brockwell, D.J.; Ashcroft, A.E.; Radford, S.E.; Ranson, N.A. Lateral opening in the intact beta-barrel assembly machinery captured by cryo-EM. Nat. Commun. 2016, 7, 12865.
  42. Fenn, K.L.; Horne, J.E.; Crossley, J.A.; Bohringer, N.; Horne, R.J.; Schaberle, T.F.; Calabrese, A.N.; Radford, S.E.; Ranson, N.A. Outer membrane protein assembly mediated by BAM-SurA complexes. Nat. Commun. 2024, 15, 7612.
  43. Theillet, F.X.; Kalmar, L.; Tompa, P.; Han, K.H.; Selenko, P.; Dunker, A.K.; Daughdrill, G.W.; Uversky, V.N. The alphabet of intrinsic disorder: I. Act like a Pro: On the abundance and roles of proline residues in intrinsically disordered proteins. Intrinsically Disord. Proteins 2013, 1, e24360.
  44. Zhao, B.; Kurgan, L. Compositional Bias of Intrinsically Disordered Proteins and Regions and Their Predictions. Biomolecules 2022, 12, 888.
  45. Kurgan, L.; Hu, G.; Wang, K.; Ghadermarzi, S.; Zhao, B.; Malhis, N.; Erdos, G.; Gsponer, J.; Uversky, V.N.; Dosztanyi, Z. Tutorial: A guide for the selection of fast and accurate computational tools for the prediction of intrinsic disorder in proteins. Nat. Protoc. 2023, 18, 3157–3172.
  46. Chakravarty, D.; Porter, L.L. AlphaFold2 fails to predict protein fold switching. Protein Sci. 2022, 31, e4353.
  47. Terwilliger, T.C.; Liebschner, D.; Croll, T.I.; Williams, C.J.; McCoy, A.J.; Poon, B.K.; Afonine, P.V.; Oeffner, R.D.; Richardson, J.S.; Read, R.J.; et al. AlphaFold predictions are valuable hypotheses and accelerate but do not replace experimental structure determination. Nat. Methods 2024, 21, 110–116.
  48. Agarwal, V.; McShan, A.C. The power and pitfalls of AlphaFold2 for structure prediction beyond rigid globular proteins. Nat. Chem. Biol. 2024, 20, 950–959.
  49. Hanson, J.; Paliwal, K.K.; Litfin, T.; Zhou, Y. SPOT-Disorder2: Improved Protein Intrinsic Disorder Prediction by Ensembled Deep Learning. Genom. Proteom. Bioinform. 2019, 17, 645–656.
  50. Hanson, J.; Paliwal, K.; Zhou, Y. Accurate Single-Sequence Prediction of Protein Intrinsic Disorder by an Ensemble of Deep Recurrent and Convolutional Architectures. J. Chem. Inf. Model. 2018, 58, 2369–2376.
  51. Walsh, I.; Martin, A.J.; Di Domenico, T.; Tosatto, S.C. ESpritz: Accurate and fast prediction of protein disorder. Bioinformatics 2012, 28, 503–509.
  52. Ullah, I.; Mahmoud, Q.H. Design and Development of RNN Anomaly Detection Model for IoT Networks. IEEE Access 2022, 10, 62722–62750.
  53. Dou, B.; Zhu, Z.; Merkurjev, E.; Ke, L.; Chen, L.; Jiang, J.; Zhu, Y.; Liu, J.; Zhang, B.; Wei, G.W. Machine Learning Methods for Small Data Challenges in Molecular Science. Chem. Rev. 2023, 123, 8736–8780.
  54. Colon-Ruiz, C.; Segura-Bedmar, I. Comparing deep learning architectures for sentiment analysis on drug reviews. J. Biomed. Inform. 2020, 110, 103539.
  55. Uversky, V.N. Protein intrinsic disorder and structure-function continuum. Prog. Mol. Biol. Transl. Sci. 2019, 166, 1–17.
  56. Outeiral, C.; Nissley, D.A.; Deane, C.M. Current structure predictors are not learning the physics of protein folding. Bioinformatics 2022, 38, 1881–1887.
  57. Alderson, T.R.; Pritisanac, I.; Kolaric, D.; Moses, A.M.; Forman-Kay, J.D. Systematic identification of conditionally folded intrinsically disordered regions by AlphaFold2. Proc. Natl. Acad. Sci. USA 2023, 120, e2304302120.
Figure 1. Analyses and summary of dynamic maps in Cryo-EM SPA. (A) Workflow for processing PDB entries: splitting, alignment to reference proteins, and classification into Legacy (2000–2022), Span (2000–2024), and Recent (2022–2024) groups. (B) Structural states in the DisProt database, categorized by experimental evidence: inner circle (2022-03) and outer circle (2024-12). (C) Annual deposition of SPA structures in the PDB (2010–2024). (D) Resolution distribution of non-redundant PDB entries in Legacy (blue), Span (orange), and Recent (yellow) datasets. (E) Distribution of residues with structural absence states: modeled residues (blue), soft missing (yellow), and hard missing (red), annotated with PFAM (full) or without PFAM (dotted).
Figure 2. Functional and positional analyses of structural absence in SPA structures. (A) Three example functional clans (NTN hydrolases, Ank, and Transporter) from 78 classes showcasing distinct patterns of structural absence across Legacy, Span, and Recent groups. (B) Heatmap showing the p-values of statistical analysis results, with significance indicated by markers: *** <0.0001, ** <0.001, * <0.05. Tests: global (MANOVA across Legacy/Span/Recent); LvS (Legacy vs. Span); SvR (Span vs. Recent); and LvR (Legacy vs. Recent). (C) Order–disorder dynamics in BamC (P0A903) across PDB structures from multiple SPA experiments. Sequence alignment of BamC (25–344) from PDB entries 5LJO_2, 6SOC_3, 8PZ1_3, 8Q0G_5, and 8QPU_4 highlights structured regions (blue) and flexible/unresolved regions by SPA experiments (yellow/red, corresponding to soft/hard-missing residues). The Lipoprotein_18-like domain (green bar) spans residues 41–343. DisProt annotations reveal regions with conflicting structural evidence: Disorder predictions (black bar) span positions 1–100 and 214–227, while residues 26–98 (gray bar) exhibit overlapping order–disorder annotations. These discrepancies arise from methodological variations across structural determination sources (NMR, X-ray crystallography, and HSQC spectroscopy). (D) Positional distribution of structural absence proportions.
Figure 3. Analysis and evaluation of disorder prediction on structural state classification. (A) Disorder predicted scores (pLDDT, flDPnn, and IUPredS), with dashed lines indicating the one-vs-rest logistic regression decision boundaries. Distributions across structural absence types: modeled residues (blue), soft missing (orange), and hard missing (red). Distributions are shown for Legacy (left), Span (middle), and Recent (right) datasets. (B) Classification performance for hard-missing residues by models, logistic regression (LR) on disorder scores such as pLDDT (green), flDPnn (yellow), IUPredS, and IUPredL (cyan), hybrid model CLTC, and CLTC_pLDDT (blue), evaluated using F1-scores (columns) and AUC scores (lines) across Span, Recent, and Recent_LH datasets. (C) Schematic representation of the CLTC framework, illustrating the integration of sequence modeling architectures through a multi-scale feature fusion mechanism. Residue structured groups: Modeled (blue), Soft-missing (yellow), Hard-missing (red). Representation style: Ground truth (solid), Predicted residues (translucent).
Table 1. Classification performance of hybrid model ablations across datasets.
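| Dataset | Model | Precision | Recall | F1 | AUC |
|---|---|---|---|---|---|
| Span | L | 0.78 | 0.52 | 0.63 | 0.86 |
| Span | T | 0.39 | 0.58 | 0.47 | 0.70 |
| Span | LT | 0.71 | 0.64 | 0.67 | 0.85 |
| Span | CLT | 0.72 | 0.66 | 0.69 | 0.87 |
| Span | CLC | 0.69 | 0.67 | 0.68 | 0.86 |
| Span | CTC | 0.46 | 0.65 | 0.54 | 0.77 |
| Span | CLTC | 0.74 | 0.67 | 0.71 | 0.88 |
| Recent | L | 0.77 | 0.42 | 0.54 | 0.82 |
| Recent | T | 0.38 | 0.59 | 0.46 | 0.71 |
| Recent | LT | 0.63 | 0.57 | 0.60 | 0.83 |
| Recent | CLT | 0.61 | 0.61 | 0.61 | 0.84 |
| Recent | CLC | 0.63 | 0.58 | 0.60 | 0.82 |
| Recent | CTC | 0.45 | 0.59 | 0.51 | 0.76 |
| Recent | CLTC | 0.62 | 0.62 | 0.62 | 0.84 |
| Recent_LH | L | 0.77 | 0.44 | 0.56 | 0.80 |
| Recent_LH | T | 0.42 | 0.61 | 0.50 | 0.71 |
| Recent_LH | LT | 0.65 | 0.57 | 0.61 | 0.81 |
| Recent_LH | CLT | 0.63 | 0.60 | 0.62 | 0.82 |
| Recent_LH | CLC | 0.64 | 0.60 | 0.62 | 0.82 |
| Recent_LH | CTC | 0.47 | 0.59 | 0.51 | 0.76 |
| Recent_LH | CLTC | 0.63 | 0.63 | 0.63 | 0.82 |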
Model abbreviations: L, single LSTM layer; T, single Transformer layer; LT, LSTM and Transformer layers; CLT, CNN, LSTM, and Transformer layers; CLC, CNN, LSTM, and CRF layers; CTC, CNN, Transformer, and CRF layers; CLTC, CNN, LSTM, Transformer, and CRF layers.
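As a reading aid for Table 1, the F1-score is the harmonic mean of precision and recall, so the reported columns can be cross-checked against each other. The sketch below recomputes F1 for a few rows copied from the table; the tolerance of 0.015 is an assumption that allows for the two-decimal rounding of the published precision and recall values.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (dataset, model, precision, recall, reported F1), copied from Table 1.
rows = [
    ("Span", "L", 0.78, 0.52, 0.63),
    ("Span", "T", 0.39, 0.58, 0.47),
    ("Span", "CLTC", 0.74, 0.67, 0.71),
    ("Recent", "CLTC", 0.62, 0.62, 0.62),
    ("Recent_LH", "CLTC", 0.63, 0.63, 0.63),
]

TOLERANCE = 0.015  # assumed slack for rounding of the published inputs

checks = []
for dataset, model, precision, recall, reported in rows:
    recomputed = f1_score(precision, recall)
    checks.append(abs(recomputed - reported) <= TOLERANCE)
    print(f"{dataset:10s} {model:5s} recomputed={recomputed:.3f} reported={reported:.2f}")
```

The single-LSTM rows illustrate why F1 is the more informative summary here: high precision paired with low recall still yields a lower F1 than the more balanced CLTC rows.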
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zheng, S. Exploring the Bottleneck in Cryo-EM Dynamic Disorder Feature and Advanced Hybrid Prediction Model. Biophysica 2025, 5, 39. https://doi.org/10.3390/biophysica5030039
