TruMPET: A New Method for Protein Secondary Structure Prediction Using Neural Networks Trained on Multiple Pre-Selected Physicochemical and Structural Features
Abstract
1. Introduction
- PSSP continues to be the foundation for understanding the tertiary structure of proteins [11].
- PSSP remains important when structural confidence is lower and for interpreting dynamic and disordered regions that tertiary structure prediction methods capture erroneously [18].
- PSSP continues to play a crucial role in elucidating protein functions and properties, because secondary structure is the basis for the formation of the tertiary structure [19].
- PSSP can be a low-cost and efficient alternative to wet-lab experiments, which makes it particularly valuable for large-scale proteomic studies and drug discovery applications, where experimental determination of protein structures can be prohibitively expensive and time-consuming [11]. In the same applications, when homologous structural data are unavailable, PSSP modeling is the only way to enable large-scale in silico screening [20].
2. Results
2.1. Descriptor Pre-Selection Can Substantially Improve Prediction Quality
- RMSD-based structural descriptors;
- Physicochemical descriptors that describe specific periodicities of the protein secondary structure;
- Physicochemical and structure-based descriptors that capture non-periodic properties influencing the formation of the protein backbone configuration.
2.1.1. RMSD-Based Structural Descriptors
- Nocc(seq)—the number of occurrences of seq among the sequences with known structures (i.e., the training dataset);
- —the mean distance between the structures with sequence seq and the PBj;
- —the average distance between PBj and all pentapeptides in the training dataset;
- —the sampling variance of PBj;
- —the variance for structures with sequence seq;
- N—the size of the training dataset.
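The quantities listed above are counts, means, and variances of distances between pentapeptide structures and a given protein block (PBj). The following is a minimal sketch of how such statistics could be gathered, not the authors' implementation. It assumes that each training pentapeptide is available as a `(sequence, distance_to_PBj)` pair, that N is the number of such pentapeptides, and that the variances are sample variances; the function name `distance_statistics` and the record layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean, variance


def distance_statistics(records):
    """Summarise distances to one protein block (PB).

    records: non-empty iterable of (pentapeptide_sequence, distance_to_pb)
    pairs, one pair per pentapeptide in the training dataset (assumed layout).
    """
    by_seq = defaultdict(list)
    for seq, dist in records:
        by_seq[seq].append(dist)

    all_distances = [d for dists in by_seq.values() for d in dists]

    def sample_variance(values):
        # The sample variance is undefined for a single observation.
        return variance(values) if len(values) > 1 else None

    overall = {
        "N": len(all_distances),              # size of the training dataset
        "mean": mean(all_distances),          # average distance to the PB
        "variance": sample_variance(all_distances),
    }
    per_sequence = {
        seq: {
            "n_occ": len(dists),              # Nocc(seq)
            "mean": mean(dists),              # mean distance for this sequence
            "variance": sample_variance(dists),
        }
        for seq, dists in by_seq.items()
    }
    return overall, per_sequence
```

Sequences that occur only once have no defined sample variance, which is one reason a descriptor built on these statistics would need to account for Nocc(seq).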
2.1.2. Descriptors, Describing Periodicities of PSSs
2.1.3. Descriptors, Capturing PSS Non-Periodic Properties
2.1.4. Results Obtained by the Combined Feature Set
2.2. Prediction Results
3. Discussion
3.1. Confusion Matrices and Their Analysis
3.2. Importance of Descriptor Selection for PSSP
3.3. Comparative Evaluation of PSSP Accuracy with AlphaFold 2
3.4. A Promising Avenue for Improving PSSP Accuracy
3.5. Discussion of CASP FM Target Performance
3.6. Performance Gain Relative to the Previous Method
4. Materials and Methods
4.1. Training Dataset Compilation
4.2. Description of Benchmarking Datasets
- The CB513 [75] dataset, a widely used benchmark designed specifically to evaluate the accuracy of secondary structure prediction methods. It consists of 513 nonhomologous protein domains drawn from 435 protein chains (some chains contain two or more domains). In this dataset, many protein chains are split into domains that are treated as separate targets. Our prediction method, however, accounts for the influence of the entire chain on every position: the ESM2 embeddings we employ yield different representations for a fragment depending on the length of its parent chain. We therefore perform predictions on complete chains, which is why the “Number of Chains” in Table 1 is less than 513, although all CB513 segments are included in our performance evaluation.
- The TS115 [76] dataset, which contains proteins that were released after 1 January 2016 and whose structures were determined by X-ray crystallography at a resolution ≤3.0 Å. Sequences with an identity >30% to those released before 2016 were removed. This dataset consists of 115 proteins.
- The TEST2018 [77] dataset, which consists of 250 proteins deposited between January 2018 and July 2018 with a resolution <2.5 Å and R-free < 0.25 that have a sequence similarity of less than 25% to all pre-2018 proteins.
- The TEST2020-HQ [78] dataset, which includes all proteins released between May 2018 and April 2020, after removal of homologues of all proteins released in the PDB [43] before 2018. Proteins longer than 1024 residues were also removed. Further filtering by resolution (<2.5 Å) and R-free (<0.25) left 124 proteins.
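The selection criteria quoted for TEST2020-HQ can be written as a simple predicate. The sketch below is illustrative only: the `Entry` record, its field names, and the exact date boundaries (1 May 2018 to 30 April 2020, inclusive) are assumptions, and homology to pre-2018 proteins is represented by a precomputed flag because the text does not state the threshold used.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Entry:
    """Hypothetical record describing one candidate protein chain."""
    pdb_id: str
    released: date
    resolution: float          # in angstroms
    r_free: float
    length: int                # number of residues
    has_older_homologue: bool  # homologous to a protein released before 2018


def passes_filters(
    entry,
    start=date(2018, 5, 1),    # assumed inclusive lower bound
    end=date(2020, 4, 30),     # assumed inclusive upper bound
    max_length=1024,
    max_resolution=2.5,
    max_r_free=0.25,
):
    """Return True if the entry satisfies all TEST2020-HQ-style criteria."""
    return (
        start <= entry.released <= end
        and not entry.has_older_homologue
        and entry.length <= max_length
        and entry.resolution < max_resolution
        and entry.r_free < max_r_free
    )
```

The same predicate covers the TEST2018 criteria if the date window is changed, since both datasets use the same resolution and R-free limits.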
4.3. Secondary Structures and Sequences Extraction
4.4. Feature Set Compilation
- Selecting an appropriate physicochemical (e.g., hydrophobicity scale) or structural property across the many available properties;
- Quantitatively encoding the structural periodicity via a procedure that attenuates the impact of residues depending on their distance from the target position.
- The optimal hydrophobicity scale (selected from all AAindex + AAindexNC properties);
- T—the periodicity described by this descriptor;
- The relevant weighting function that quantifies the attenuation of the contribution to the descriptor’s value with increasing sequence distance (in residues).
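A periodicity descriptor of this kind combines three ingredients: a per-residue property scale, a period T, and a weighting function that attenuates contributions with sequence distance. The sketch below shows one possible form under stated assumptions; it is not the authors' formula. The cosine modulation, the triangular weighting, and the function name `periodic_descriptor` are all assumptions, and the Kyte-Doolittle hydropathy scale is used only as a familiar stand-in for whichever AAindex or AAindexNC property is selected.

```python
import math

# Kyte-Doolittle hydropathy values, used here purely as an example scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def periodic_descriptor(seq, pos, period, half_window, scale=KYTE_DOOLITTLE):
    """Weighted periodic sum of a residue property around position `pos`.

    Each neighbour at offset k contributes
        weight(k) * scale[residue] * cos(2 * pi * k / period),
    where weight(k) decreases linearly with |k| (an assumed attenuation).
    Offsets that fall outside the sequence are skipped.
    """
    total = 0.0
    for k in range(-half_window, half_window + 1):
        j = pos + k
        if not 0 <= j < len(seq):
            continue
        weight = 1.0 - abs(k) / (half_window + 1)
        total += weight * scale[seq[j]] * math.cos(2.0 * math.pi * k / period)
    return total
```

With this form, a sequence whose property values alternate with period T produces a large value when the descriptor is evaluated at that period and a much smaller one at an unrelated period, which is the behaviour a periodicity descriptor is meant to capture.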
4.5. Descriptor Pre-Selection Procedure
4.6. Neural Network Architecture
4.7. Prediction Quality Evaluation Metrics
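The result tables report Q8 and Q3 values together with F1 scores. As a minimal sketch, the code below assumes that Q is the fraction of residues whose state is predicted correctly and that F1 is macro-averaged over the states present; neither definition is confirmed here. The 8-to-3 state reduction shown (H, G, I to H; E, B to E; everything else to C) is a common convention and may differ from the mapping used in this work. All function names are hypothetical.

```python
# A widely used DSSP 8-state to 3-state reduction (assumed, not confirmed).
DSSP8_TO_3 = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "C", "S": "C", "-": "C", "C": "C",
}


def q_accuracy(true, pred):
    """Fraction of positions where the predicted state equals the true state."""
    if len(true) != len(pred) or not true:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(t == p for t, p in zip(true, pred)) / len(true)


def macro_f1(true, pred):
    """Unweighted mean of per-class F1 over all states seen in true or pred."""
    classes = sorted(set(true) | set(pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(true, pred))
        fp = sum(t != c and p == c for t, p in zip(true, pred))
        fn = sum(t == c and p != c for t, p in zip(true, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def reduce_to_3_states(states):
    """Map an 8-state string to a 3-state string."""
    return "".join(DSSP8_TO_3[s] for s in states)
```

Because rare states contribute as much to a macro-averaged F1 as common ones, an F1 over eight states can sit well below the corresponding Q8 value even when most residues are predicted correctly.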
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| DSSP | Dictionary of Secondary Structure in Proteins |
| PSSP | Protein Secondary Structure Prediction |
| PSS | Protein Secondary Structure |
| SDA | Stepwise Discriminant Analysis |
| LDA | Linear Discriminant Analysis |
| ML | Machine Learning |
| ncAA | Non-canonical Amino Acid |
References
1. Freitas, R.A. Nanomedicine, Volume I: Basic Capabilities; CRC Press: Boca Raton, FL, USA, 1999.
2. Stollar, E.J.; Smith, D.P. Uncovering protein structure. Essays Biochem. 2020, 64, 649–680; Correction in Essays Biochem. 2021, 65, 407.
3. Price, W.N., 2nd; Chen, Y.; Handelman, S.K.; Neely, H.; Manor, P.; Karlin, R.; Nair, R.; Liu, J.; Baran, M.; Everett, J.; et al. Understanding the physical properties that control protein crystallization by analysis of large-scale experimental data. Nat. Biotechnol. 2009, 27, 51–57.
4. Slabinski, L.; Jaroszewski, L.; Rodrigues, A.P.; Rychlewski, L.; Wilson, I.A.; Lesley, S.A.; Godzik, A. The challenge of protein structure determination--lessons from structural genomics. Protein Sci. 2007, 16, 2472–2482.
5. Ismi, D.P.; Pulungan, R.; Afiahayati. Deep learning for protein secondary structure prediction: Pre and post-AlphaFold. Comput. Struct. Biotechnol. J. 2022, 20, 6271–6286.
6. Rennie, M.L.; Oliver, M.R. Emerging frontiers in protein structure prediction following the AlphaFold revolution. J. R. Soc. Interface 2025, 22, 20240886.
7. Huang, B.; Kong, L.; Wang, C.; Ju, F.; Zhang, Q.; Zhu, J.; Gong, T.; Zhang, H.; Yu, C.; Zheng, W.M.; et al. Protein structure prediction: Challenges, advances, and the shift of research paradigms. Genom. Proteom. Bioinform. 2023, 21, 913–925.
8. Lin, Z.; Akin, H.; Rao, R.; Hie, B.; Zhu, Z.; Lu, W.; Smetanin, N.; Verkuil, R.; Kabeli, O.; Shmueli, Y.; et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023, 379, 1123–1130.
9. Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Zidek, A.; Potapenko, A.; et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596, 583–589.
10. Abramson, J.; Adler, J.; Dunger, J.; Evans, R.; Green, T.; Pritzel, A.; Ronneberger, O.; Willmore, L.; Ballard, A.J.; Bambrick, J.; et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 2024, 630, 493–500.
11. Jiang, Q.; Jin, X.; Lee, S.J.; Yao, S. Protein secondary structure prediction: A survey of the state of the art. J. Mol. Graph. Model. 2017, 76, 379–402.
12. Fischer, D.; Eisenberg, D. Protein fold recognition using sequence-derived predictions. Protein Sci. 1996, 5, 947–955.
13. Zhou, Y.; Karplus, M. Interpreting the folding kinetics of helical proteins. Nature 1999, 401, 400–403.
14. Ozkan, S.B.; Wu, G.A.; Chodera, J.D.; Dill, K.A. Protein folding by zipping and assembly. Proc. Natl. Acad. Sci. USA 2007, 104, 11987–11992.
15. Zhou, J.; Wang, H.; Zhao, Z.; Xu, R.; Lu, Q. CNNH_PSS: Protein 8-class secondary structure prediction by convolutional neural network with highway. BMC Bioinform. 2018, 19, 60.
16. Sitbon, E.; Pietrokovski, S. Occurrence of protein structure elements in conserved sequence regions. BMC Struct. Biol. 2007, 7, 3.
17. Watkins, A.M.; Wuo, M.G.; Arora, P.S. Protein-protein interactions mediated by helical tertiary structure motifs. J. Am. Chem. Soc. 2015, 137, 11622–11630.
18. Wuyun, Q.; Chen, Y.; Shen, Y.; Cao, Y.; Hu, G.; Cui, W.; Gao, J.; Zheng, W. Recent progress of protein tertiary structure prediction. Molecules 2024, 29, 832.
19. Dong, B.; Liu, Z.; Xu, D.; Hou, C.; Niu, N.; Wang, G. Impact of multi-factor features on protein secondary structure prediction. Biomolecules 2024, 14, 1155.
20. Du, H.; Brender, J.R.; Zhang, J.; Zhang, Y. Protein structure prediction provides comparable performance to crystallographic structures in docking-based virtual screening. Methods 2015, 71, 77–84.
21. Elnaggar, A.; Heinzinger, M.; Dallago, C.; Rehawi, G.; Wang, Y.; Jones, L.; Gibbs, T.; Feher, T.; Angerer, C.; Steinegger, M.; et al. ProtTrans: Toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7112–7127.
22. Rao, R.; Liu, J.; Verkuil, R.; Meier, J.; Canny, J.; Abbeel, P.; Sercu, T.; Rives, A. MSA transformer. bioRxiv 2021.
23. Hekkelman, M.L.; Salmoral, D.A.; Perrakis, A.; Joosten, R.P. DSSP 4: Fair annotation of protein secondary structure. Protein Sci. 2025, 34, e70208.
24. Tunyasuvunakool, K.; Adler, J.; Wu, Z.; Green, T.; Zielinski, M.; Zidek, A.; Bridgland, A.; Cowie, A.; Meyer, C.; Laydon, A.; et al. Highly accurate protein structure prediction for the human proteome. Nature 2021, 596, 590–596.
25. Bertoline, L.M.F.; Lima, A.N.; Krieger, J.E.; Teixeira, S.K. Before and after AlphaFold2: An overview of protein structure prediction. Front. Bioinform. 2023, 3, 1120370.
26. Agarwal, V.; McShan, A.C. The power and pitfalls of AlphaFold2 for structure prediction beyond rigid globular proteins. Nat. Chem. Biol. 2024, 20, 950–959.
27. Feng, R.; Wang, X.; Xia, Z.; Han, T.; Wang, H.; Yu, W. MHTAPred-SS: A highly targeted autoencoder-driven deep multi-task learning framework for accurate protein secondary structure prediction. Int. J. Mol. Sci. 2024, 25, 13444.
28. Alanazi, W.; Meng, D.; Pollastri, G. Porter 6: Protein secondary structure prediction by leveraging pre-trained language models (PLMs). Int. J. Mol. Sci. 2024, 26, 130.
29. Zakharov, O.S.; Rudik, A.V.; Filimonov, D.A.; Lagunin, A.A. Prediction of protein secondary structures based on substructural descriptors of molecular fragments. Int. J. Mol. Sci. 2024, 25, 12525.
30. Dong, B.; Su, H.; Xu, D.; Hou, C.; Liu, Z.; Niu, N.; Wang, G. ILMCnet: A deep neural network model that uses PLM to process features and employs CRF to predict protein secondary structure. Genes 2024, 15, 1350.
31. Cheng, L.; Lu, W.; Xia, Y.; Lu, Y.; Shen, J.; Hui, Z.; Xu, Y.; Wu, H.; Chen, J.; Fu, Q.; et al. ProAttUnet: Advancing protein secondary structure prediction with deep learning via U-net dual-pathway feature fusion and ESM2 pretrained protein language model. Comput. Biol. Chem. 2025, 118, 108429.
32. Pinto Corujo, M.; Michal, P.; Ang, D.; Vivian, L.; Chmel, N.; Rodger, A. Prediction of secondary structure content of proteins using Raman spectroscopy and self-organizing maps. Appl. Spectrosc. 2025, 79, 1497–1507.
33. Zhao, L.; Li, J.; Zhang, B.; Jiang, X. Combining knowledge distillation and neural networks to predict protein secondary structure. Sci. Rep. 2025, 15, 32031.
34. Alanazi, W.; Meng, D.; Pollastri, G. DeepPredict: A state-of-the-art web server for protein secondary structure and relative solvent accessibility prediction. Front. Bioinform. 2025, 5, 1607402.
35. Das, S.; Ghosh, S.; Jana, N.D. TransConv: Convolution-infused transformer for protein secondary structure prediction. J. Mol. Model. 2025, 31, 37.
36. Wu, T.; Cheng, W.; Cheng, J. Improving protein secondary structure prediction by deep language models and transformer networks. Methods Mol. Biol. 2025, 2867, 43–53.
37. Dong, B.; Liu, Z.; Xu, D.; Hou, C.; Dong, G.; Zhang, T.; Wang, G. SERT-StructNet: Protein secondary structure prediction method based on multi-factor hybrid deep model. Comput. Struct. Biotechnol. J. 2024, 23, 1364–1375.
38. Sanjeevi, M.; Mohan, A.; Ramachandran, D.; Jeyaraman, J.; Sekar, K. CSSP-2.0: A refined consensus method for accurate protein secondary structure prediction. Comput. Biol. Chem. 2024, 112, 108158.
39. Chen, Y.; Chen, G.; Chen, C.Y. MFTrans: A multi-feature transformer network for protein secondary structure prediction. Int. J. Biol. Macromol. 2024, 267, 131311.
40. Sonsare, P.M.; Gunavathi, C. A novel approach for protein secondary structure prediction using encoder-decoder with attention mechanism model. Biomol. Concepts 2024, 15, 20220043.
41. Chen, Y.; Chen, G.; Chen, C.Y. PSSP-MFFNet: A multifeature fusion network for protein secondary structure prediction. ACS Omega 2024, 9, 5985–5994.
42. Peracha, O. PS4: A next-generation dataset for protein single-sequence secondary structure prediction. Biotechniques 2024, 76, 63–70.
43. Berman, H.M.; Kleywegt, G.J.; Nakamura, H.; Markley, J.L. The protein data bank at 40: Reflecting on the past to prepare for the future. Structure 2012, 20, 391–396.
44. Wu, Z.; Hou, Y.; Dai, Z.; Hu, C.A.; Wu, G. Metabolism, nutrition, and redox signaling of hydroxyproline. Antioxid. Redox Signal. 2019, 30, 674–682.
45. Bella, J.; Eaton, M.; Brodsky, B.; Berman, H.M. Crystal and molecular structure of a collagen-like peptide at 1.9 Å resolution. Science 1994, 266, 75–81.
46. Milchevskiy, Y.V.; Kravatskaya, G.I.; Kravatsky, Y.V. AAindexNC: Estimating the physicochemical properties of non-canonical amino acids, including those derived from the PDB and PDBeChem databank. Int. J. Mol. Sci. 2024, 25, 12555.
47. Milchevskiy, Y.V.; Milchevskaya, V.Y.; Nikitin, A.M.; Kravatsky, Y.V. Effective local and secondary protein structure prediction by combining a neural network-based approach with extensive feature design and selection without reliance on evolutionary information. Int. J. Mol. Sci. 2023, 24, 15656.
48. Yang, J.Y.; Peng, Z.L.; Chen, X. Prediction of protein structural classes for low-homology sequences based on predicted secondary structure. BMC Bioinform. 2010, 11 (Suppl. S1), S9.
49. DeBartolo, J.; Colubri, A.; Jha, A.K.; Fitzgerald, J.E.; Freed, K.F.; Sosnick, T.R. Mimicking the folding pathway to improve homology-free protein structure prediction. Proc. Natl. Acad. Sci. USA 2009, 106, 3734–3739.
50. Schmirler, R.; Heinzinger, M.; Rost, B. Fine-tuning protein language models boosts predictions across diverse tasks. Nat. Commun. 2024, 15, 7407.
51. Sun, X.; Wu, Z.; Su, J.; Li, C. GraphPBSP: Protein binding site prediction based on graph attention network and pre-trained model ProstT5. Int. J. Biol. Macromol. 2024, 282, 136933.
52. Fang, Y.; Jiang, Y.; Wei, L.; Ma, Q.; Ren, Z.; Yuan, Q.; Wei, D.Q. DeepProSite: Structure-aware protein binding site prediction using ESMfold and pretrained language model. Bioinformatics 2023, 39, btad718.
53. Jiao, S.; Ye, X.; Sakurai, T.; Zou, Q.; Han, W.; Zhan, C. Integration of pre-trained protein language models with equivariant graph neural networks for peptide toxicity prediction. BMC Biol. 2025, 23, 229.
54. Capela, J.; Zimmermann-Kogadeeva, M.; Dijk, A.; de Ridder, D.; Dias, O.; Rocha, M. Comparative assessment of protein large language models for enzyme commission number prediction. BMC Bioinform. 2025, 26, 68.
55. Milchevskiy, Y.V.; Milchevskaya, V.Y.; Kravatsky, Y.V. Method to generate complex predictive features for machine learning-based prediction of the local structure and functions of proteins. Mol. Biol. 2023, 57, 136–145.
56. Huberty, C.J. Applied Discriminant Analysis; Wiley-Interscience: New York, NY, USA, 1994.
57. Thompson, B. Stepwise regression and stepwise discriminant analysis need not apply here: A guidelines editorial. Educ. Psychol. Meas. 1995, 55, 525–534.
58. Tharwat, A.; Gaber, T.; Ibrahim, A.; Hassanien, A. Linear discriminant analysis: A detailed tutorial. AI Commun. 2017, 30, 169–190.
59. Zhao, S.; Zhang, B.; Yang, J.; Zhou, J.; Xu, Y. Linear discriminant analysis. Nat. Rev. Methods Primers 2024, 4, 70.
60. Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003, 3, 1157–1182.
61. Kabsch, W.; Sander, C. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22, 2577–2637.
62. De Brevern, A.G. New assessment of a structural alphabet. In Silico Biol. 2005, 5, 283–289.
63. Etchebest, C.; Benros, C.; Hazout, S.; de Brevern, A.G. A structural alphabet for local protein structures: Improved prediction methods. Proteins 2005, 59, 810–827.
64. De Brevern, A.G.; Etchebest, C.; Benros, C.; Hazout, S. “Pinning strategy”: A novel approach for predicting the backbone structure in terms of protein blocks from sequence. J. Biosci. 2007, 32, 51–70.
65. Chou, P.Y.; Fasman, G.D. Prediction of the secondary structure of proteins from their amino acid sequence. Adv. Enzymol. Relat. Areas Mol. Biol. 1978, 47, 45–148.
66. Wertz, D.H.; Scheraga, H.A. Influence of water on protein structure. An analysis of the preferences of amino acid residues for the inside or outside and for specific conformations in a protein molecule. Macromolecules 1978, 11, 9–15.
67. Kakraba, S.; Knisley, D. A graph theoretic model of single point mutations in the cystic fibrosis transmembrane conductance regulator. J. Adv. Biotechnol. 2016, 6, 780–786.
68. Kawashima, S.; Pokarowski, P.; Pokarowska, M.; Kolinski, A.; Katayama, T.; Kanehisa, M. AAindex: Amino acid index database, progress report 2008. Nucleic Acids Res. 2008, 36, D202–D205.
69. Munoz, V.; Serrano, L. Intrinsic secondary structure propensities of the amino acids, using statistical phi-psi matrices: Comparison with experimental scales. Proteins 1994, 20, 301–311.
70. Miyazawa, S.; Jernigan, R.L. Self-consistent estimation of inter-residue protein contact energies based on an equilibrium mixture approximation of residues. Proteins 1999, 34, 49–68.
71. Ptitsyn, O.B.; Finkelstein, A.V. Theory of protein secondary structure and algorithm of its prediction. Biopolymers 1983, 22, 15–25.
72. Mészáros, B.; Erdős, G.; Dosztányi, Z. IUPred2A: Context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic Acids Res. 2018, 46, W329–W337.
73. Ponnuswamy, P.K.; Prabhakaran, M.; Manavalan, P. Hydrophobic packing and spatial arrangement of amino acid residues in globular proteins. Biochim. Biophys. Acta 1980, 623, 301–316.
74. Ho, C.T.; Huang, Y.W.; Chen, T.R.; Lo, C.H.; Lo, W.C. Discovering the ultimate limits of protein secondary structure prediction. Biomolecules 2021, 11, 1627.
75. Cuff, J.A.; Barton, G.J. Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins 1999, 34, 508–519.
76. Yang, Y.; Gao, J.; Wang, J.; Heffernan, R.; Hanson, J.; Paliwal, K.; Zhou, Y. Sixty-five years of the long march in protein secondary structure prediction: The final stretch? Brief. Bioinform. 2018, 19, 482–494.
77. Hanson, J.; Paliwal, K.; Litfin, T.; Yang, Y.; Zhou, Y. Improving prediction of protein secondary structure, backbone angles, solvent accessibility and contact numbers by using predicted contact maps and an ensemble of recurrent and residual convolutional neural networks. Bioinformatics 2019, 35, 2403–2410.
78. Singh, J.; Paliwal, K.; Litfin, T.; Singh, J.; Zhou, Y. Reaching alignment-profile-based accuracy in predicting protein secondary and tertiary structural properties without alignment. Sci. Rep. 2022, 12, 7607.
79. Kryshtafovych, A.; Schwede, T.; Topf, M.; Fidelis, K.; Moult, J. Critical assessment of methods of protein structure prediction (CASP)-round XIII. Proteins 2019, 87, 1011–1020.
80. Kryshtafovych, A.; Schwede, T.; Topf, M.; Fidelis, K.; Moult, J. Critical assessment of methods of protein structure prediction (CASP)-round XIV. Proteins 2021, 89, 1607–1617.
81. Shapovalov, M.; Dunbrack, R.L., Jr.; Vucetic, S. Multifaceted analysis of training and testing convolutional neural networks for protein secondary structure prediction. PLoS ONE 2020, 15, e0232528.
82. Heffernan, R.; Paliwal, K.; Lyons, J.; Singh, J.; Yang, Y.; Zhou, Y. Single-sequence-based prediction of protein secondary structures and solvent accessibility by deep whole-sequence learning. J. Comput. Chem. 2018, 39, 2210–2216.
83. Fang, C.; Shang, Y.; Xu, D. MUFold-SS: New deep inception-inside-inception networks for protein secondary structure prediction. Proteins 2018, 86, 592–598.
84. Stapor, K.; Kotowski, K.; Smolarczyk, T.; Roterman, I. Lightweight ProteinUnet2 network for protein secondary structure prediction: A step towards proper evaluation. BMC Bioinform. 2022, 23, 100.
85. Singh, J.; Litfin, T.; Paliwal, K.; Singh, J.; Hanumanthappa, A.K.; Zhou, Y. SPOT-1D-Single: Improving the single-sequence-based prediction of protein secondary structure, backbone angles, solvent accessibility and half-sphere exposures using a large training set and ensembled deep learning. Bioinformatics 2021, 37, 3464–3472.
86. Yang, W.; Liu, Y.; Xiao, C. Deep metric learning for accurate protein secondary structure prediction. Knowl.-Based Syst. 2022, 242, 108356.
87. Klausen, M.S.; Jespersen, M.C.; Nielsen, H.; Jensen, K.K.; Jurtz, V.I.; Sonderby, C.K.; Sommer, M.O.A.; Winther, O.; Nielsen, M.; Petersen, B.; et al. NetSurfP-2.0: Improved prediction of protein structural features by integrated deep learning. Proteins 2019, 87, 520–527.
88. Wang, S.; Peng, J.; Ma, J.; Xu, J. Protein secondary structure prediction using deep convolutional neural fields. Sci. Rep. 2016, 6, 18962.
89. Milchevskaya, V.; Nikitin, A.M.; Lukshin, S.A.; Filatov, I.V.; Kravatsky, Y.V.; Tumanyan, V.G.; Esipova, N.G.; Milchevskiy, Y.V. Structural coordinates: A novel approach to predict protein backbone conformation. PLoS ONE 2021, 16, e0239793.
90. Garbuzynskiy, S.O.; Melnik, B.S.; Lobanov, M.Y.; Finkelstein, A.V.; Galzitskaya, O.V. Comparison of X-ray and NMR structures: Is there a systematic difference in residue contacts between X-ray- and NMR-resolved protein structures? Proteins 2005, 60, 139–147.
91. Wang, G.; Dunbrack, R.L., Jr. PISCES: Recent improvements to a PDB sequence culling server. Nucleic Acids Res. 2005, 33, W94–W98.
92. Kryshtafovych, A.; Schwede, T.; Topf, M.; Fidelis, K.; Moult, J. Critical assessment of methods of protein structure prediction (CASP)-round XV. Proteins 2023, 91, 1539–1549.
93. Touw, W.G.; Baakman, C.; Black, J.; te Beek, T.A.; Krieger, E.; Joosten, R.P.; Vriend, G. A series of PDB-related databanks for everyday needs. Nucleic Acids Res. 2015, 43, D364–D368.
94. Zhang, S.; Krieger, J.M.; Zhang, Y.; Kaya, C.; Kaynak, B.; Mikulska-Ruminska, K.; Doruker, P.; Li, H.; Bahar, I. ProDy 2.0: Increased scale and scope after 10 years of protein dynamics modelling with Python. Bioinformatics 2021, 37, 3657–3659.
95. Miller, A. Subset Selection in Regression; Chapman and Hall/CRC: New York, NY, USA, 2002.
96. Lee, J. Measures for the assessment of fuzzy predictions of protein secondary structure. Proteins 2006, 65, 453–462.
97. Taha, A.A.; Hanbury, A. Metrics for evaluating 3D medical image segmentation: Analysis, selection, and tool. BMC Med. Imaging 2015, 15, 29.
98. Ting, K.M. Confusion matrix. In Encyclopedia of Machine Learning; Sammut, C., Webb, G.I., Eds.; Springer: Boston, MA, USA, 2011; p. 209.



| LDA Index | Alphabet | Reduced Alphabet Length | PB Sequence | Offset | Power |
|---|---|---|---|---|---|
| 1 | Full | 5 | ACDDDFB | 0 | 1 |
| 2 | Full | 5 | PAP | 0 | 1 |
| 3 | Full | 5 | EHJIA | 0 | 1 |
| 4 | Full | 5 | MMMMM | 5 | 1 |
| 7 | PB_W7_tail_GP | 7 | MMMMMMM | 0 | 1 |
| 11 | Full | 5 | MMMMM | −4 | 1 |
| 14 | Full | 5 | PAFKL | 0 | 1 |
| 16 | Full | 5 | DDDDD | 0 | 1 |
| 17 | Full | 5 | KLMMM | 2 | 1 |
| 22 | PB_w11_tail | 11 | PGB | 0 | 2 |
| 24 | PB_w11_tail | 11 | JIA | 0 | 1 |
| 30 | PB_W3 | 3 | ACDDDFB | 0 | 1 |
| 31 | Full | 5 | DDFBG | 0 | 1 |
| 34 | Full | 5 | GHI | 0 | 1 |
| 36 | Full | 5 | KLN | −3 | 2 |

| LDA Index | Property | Period | Number of Periods | Power | Description |
|---|---|---|---|---|---|
| 25 | KARS160108 | 3.7 | 5 | 1 | Average weighted degree [67] |
| 51 | MUNV940103 | 10.0 | 3 | 2 | Free energy in β-strand conformation [69] |
| 63 | MIYS990102 | 3.6 | 7 | 1 | Optimized relative partition energies—method A [70] |
| 77 | MUNV940102 | 3.0 | 2 | 1 | Free energy in α-helical region [69] |

| LDA Index | Property | Left Offset | Right Offset | Power | Description |
|---|---|---|---|---|---|
| 6 | MUNV940103 | −5 | 6 | 1 | Free energy in β-strand conformation [69] |
| 8 | PTIO830101 | −5 | 6 | 1 | Helix–coil equilibrium constant [71] |
| 12 | IUPred2 short mode | | | 3 | IUPred2 [72] predicts intrinsically disordered regions and domain boundaries |
| 21 | MUNV940102 | −5 | 16 | 3 | Free energy in α-helical region [69] |
| 35 | PTIO830101 | −15 | 16 | 3 | Helix–coil equilibrium constant [71] |
| 60 | PONP800104 | −5 | 5 | 2 | Surrounding hydrophobicity in α-helix [73] |
| 73 | End_N_3 | | | 3 | The current amino acid position is ≤3 residues from the N-end |
| 87 | PTIO830101 | −5 | 5 | 3 | Helix–coil equilibrium constant [71] |
| 93 | MUNV940103 | −5 | 5 | 2 | Free energy in β-strand conformation [69] |

| Dataset | Q8, LDA | F1:8, LDA | Q3, LDA | F1:3, LDA | Q8, Mix | F1:8, Mix | Q3, Mix | F1:3, Mix | Dataset Size, Chains |
|---|---|---|---|---|---|---|---|---|---|
| Validation | 0.8343 | 0.6949 | 0.9015 | 0.8992 | 0.8446 | 0.7137 | 0.9085 | 0.9062 | 5325 |
| CB513 | 0.8557 | 0.7461 | 0.9148 | 0.9128 | 0.8624 | 0.7599 | 0.9175 | 0.9157 | 435 * |
| TS115 | 0.8548 | 0.7411 | 0.9153 | 0.9096 | 0.8647 | 0.7628 | 0.9175 | 0.9119 | 115 |
| TEST2018 | 0.8541 | 0.7518 | 0.9149 | 0.9122 | 0.8624 | 0.7671 | 0.9178 | 0.9150 | 245 |
| TEST2020-HQ | 0.8253 | 0.6882 | 0.8877 | 0.8841 | 0.8378 | 0.7091 | 0.8949 | 0.8911 | 124 |
| CASP13 | 0.7876 | 0.6456 | 0.8700 | 0.8683 | 0.7957 | 0.6619 | 0.8704 | 0.8684 | 40 |
| CASP14 | 0.7461 | 0.5438 | 0.8475 | 0.8472 | 0.7646 | 0.5668 | 0.8522 | 0.8513 | 34 |
| CASP15 | 0.7286 | 0.5600 | 0.8338 | 0.8366 | 0.7357 | 0.5674 | 0.8402 | 0.8426 | 39 |

| Method | CB513 | TS115 | TEST2018 | TEST2020-HQ | CASP13-FM | CASP14-FM |
|---|---|---|---|---|---|---|
| TruMPET, LDA | 0.8541 | 0.8219 | 0.8417 | 0.8103 | 0.7712 | 0.7281 |
| TruMPET, mix | 0.8608 | 0.8352 | 0.8486 | 0.8179 | 0.7799 | 0.7488 |
| MilchStruct | 0.7940 | 0.7491 | 0.7606 | 0.7537 | 0.6207 | 0.5140 |
| SPIDER3-Single | 0.5464 | 0.5965 | 0.5981 | 0.5823 | 0.6081 | 0.5914 |
| MUFOLD-SS | 0.7432 | 0.6257 | 0.7429 | — | 0.7185 | 0.6863 |
| ProteinUnet2 | — | — | 0.7460 | 0.5878 | 0.6081 | — |
| SPOT-1D-Single | 0.5464 | 0.5965 | 0.7217 | 0.6035 | 0.6093 | 0.6177 |
| SPOT-1D-Profile | — | — | 0.7541 | 0.7041 | 0.7122 | 0.6186 |
| SPOT-1D-LM | 0.6489 | 0.6832 | 0.7647 | 0.6773 | 0.7095 | 0.6318 |
| MHTAPred-SS | 0.7653 | — | 0.7728 | — | 0.7600 | 0.7254 |
| ProtTrans | 0.7450 | 0.7710 | — | — | — | — |
| DML_SSembed | 0.7554 | — | 0.7648 | — | 0.7417 | 0.7022 |

| Method | CB513 | TS115 | TEST2018 | TEST2020-HQ | CASP13-FM | CASP14-FM |
|---|---|---|---|---|---|---|
| TruMPET, LDA | 0.9136 | 0.8970 | 0.9064 | 0.8787 | 0.8595 | 0.8364 |
| TruMPET, mix | 0.9166 | 0.8993 | 0.9088 | 0.8824 | 0.8585 | 0.8420 |
| MilchStruct | 0.8599 | 0.8254 | 0.8336 | 0.8220 | 0.7183 | 0.6524 |
| SPIDER3-Single | 0.7367 | 0.7556 | 0.7257 | 0.7102 | 0.7512 | 0.7188 |
| MUFOLD-SS | 0.8444 | — | 0.8463 | — | 0.8343 | 0.7872 |
| ProteinUnet2 | — | — | 0.7257 | 0.7128 | 0.7439 | — |
| SPOT-1D-Single | 0.7367 | 0.7556 | 0.7428 | 0.7222 | 0.7321 | 0.7470 |
| SPOT-1D-Profile | — | — | 0.8618 | 0.8197 | 0.8355 | 0.7566 |
| SPOT-1D-LM | 0.8841 | 0.8786 | 0.8674 | 0.7970 | 0.8215 | 0.7654 |
| MHTAPred-SS | 0.8743 | — | 0.8742 | — | 0.8604 | 0.8321 |
| ProtTrans | 0.8620 | 0.8690 | — | — | — | — |
| DML_SSembed | 0.8641 | — | 0.8682 | — | 0.8495 | 0.8075 |

| Protein Chain | Q8-LDA, ncAA | ncAA | Number of ncAAs | Q8-LDA, AA | AA | Accuracy Gain, % |
|---|---|---|---|---|---|---|
| 1P9GA | 0.9024 | PCA | 1 | 0.7805 | GLU | 12.19 |
| 4XA6A | 0.8036 | MLY | 16 | 0.7202 | LYS | 8.34 |
| 3EJHE | 0.8000 | HYP | 3 | 0.7333 | PRO | 6.67 |
| 4WIDB | 0.8711 | MLY, MLZ | 6 | 0.8338 | LYS | 3.73 |
| 7RUPA | 0.9206 | CSO | 1 | 0.8889 | CYS | 3.17 |
| 5BQ8A | 0.7885 | MLY | 4 | 0.7596 | LYS | 2.89 |
| 3UFIA | 0.8237 | MLY | 14 | 0.7986 | LYS | 2.51 |
| 4QE0B | 0.8323 | MLY | 16 | 0.8084 | LYS | 2.39 |
| 4QE0A | 0.8708 | MLY | 16 | 0.8483 | LYS | 2.25 |
| 2VGXB | 0.9172 | MLY | 9 | 0.8966 | LYS | 2.06 |
| 3KV0A | 0.9381 | MLY | 17 | 0.9175 | LYS | 2.06 |
| 6S98A | 0.8986 | OCS | 1 | 0.8784 | CYS | 2.02 |

| Protein Chain | AlphaFold 2 Model | Q8, AlphaFold 2 | Q8, TruMPET | Q8 Difference (TruMPET − AlphaFold 2), % |
|---|---|---|---|---|
| CASP14 [78] Free Modeling Category Targets | | | | |
| 7D2OA | T1027TS427_1-D1 | 0.5906 | 0.6250 | +3.44 |
| 6VR4A | T1040TS427_1-D1 | 0.7538 | 0.7124 | −4.14 |
| 7BGLHXX | T1047s2TS427_1 | 0.6706 | 0.7026 | +3.20 |
| 7ZHJEA | T1061TS427_1-D1 | 0.4844 | 0.5880 | +10.36 |
| 7REJA | T1070TS427_1-D1 | 0.8026 | 0.8179 | +1.53 |
| 7UM1A | T1096TS427_1-D2 | 0.8246 | 0.8035 | −2.11 |
| Proteins from [26] | | | | |
| 1S4TA | AF-P23907-F1-model_v6 | 0.2381 | 0.3333 | +9.52 |
| 2KQPA | AF-P01308-F1-model_v6 | 0.4651 | 0.5465 | +8.14 |
| 6OFSA | AF-P31828-F1-model_v6 | 0.9171 | 0.8564 | −6.07 |
| 3T5OA | AF-P13671-F1-model_v6 | 0.7026 | 0.7061 | +0.35 |
| 1WVKA | AF-O64818-F1-model_v6 | 0.3718 | 0.4615 | +8.97 |
| 4ORWA | AF-P41222-F1-model_v6 | 0.9623 | 0.9497 | −1.26 |
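
The last column of the table above is consistent with the difference between the TruMPET and AlphaFold 2 Q8 values, expressed in percentage points. The short check below recomputes it for a few rows copied from the table; the function name `q8_difference_pct` is hypothetical, and rounding to two decimals is assumed from the precision shown.

```python
def q8_difference_pct(q8_alphafold, q8_trumpet):
    """Difference in Q8 (TruMPET minus AlphaFold 2) in percentage points."""
    return round((q8_trumpet - q8_alphafold) * 100, 2)


# (chain, Q8 AlphaFold 2, Q8 TruMPET) for selected rows of the table.
rows = [
    ("7D2OA", 0.5906, 0.6250),
    ("6VR4A", 0.7538, 0.7124),
    ("7ZHJEA", 0.4844, 0.5880),
    ("6OFSA", 0.9171, 0.8564),
]

for chain, q8_af, q8_tr in rows:
    print(f"{chain}: {q8_difference_pct(q8_af, q8_tr):+.2f}")
```

A positive value means that TruMPET assigns more residues to the correct 8-state class than the secondary structure derived from the AlphaFold 2 model for that chain.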

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Milchevskiy, Y.V.; Kravatskaya, G.I.; Kravatsky, Y.V. TruMPET: A New Method for Protein Secondary Structure Prediction Using Neural Networks Trained on Multiple Pre-Selected Physicochemical and Structural Features. Int. J. Mol. Sci. 2025, 26, 11284. https://doi.org/10.3390/ijms262311284

